Regular expressions | Extracting dates from text - regex

I am trying to extract dates from several articles. When I test the regular expression, the pattern matches only part of the information of interest, as you can see here:
https://regex101.com/r/ATgIeZ/2
This is a sample of the text file:
|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded two police officers with a knife in Brussels around...] 3004
[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016
, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Mo JULY 14, 2034
The extraction pattern that I am using and the code is this one:
import re
text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
pattern = ("[A-Z]+\.*\s(\d+)\,\s(\d+){4}")
result = re.findall(pattern,text_read)
print(result)
And the output from Anaconda is:
[('5', '6'), ('7', '5'), ('1', '6'), .....]
The expected output is:
OCT. 5, 2016, FEB. 8, 2016, JULY 14, 2034 .....

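The problem is the {4} quantifier, which is outside your last group: it repeats the whole group, and only the last repetition (a single digit) is captured. Also, the part of the regex that matches the month was not inside a group, so it was not returned.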
Fix it like this:
pattern = r"([A-Z]+)\.?\s(\d+)\,\s(\d{4})"
result with your data sample:
[('OCT', '5', '2016'), ('FEB', '8', '2016'), ('JULY', '14', '2034')]
Small extra fixes:
There can be 0 or 1 dot, so I replaced \.* with \.?
I used the "raw" prefix, which is always better when defining regex strings (it causes no problems here, but it matters with \b, for instance).
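The difference between the two patterns, and one way to get each date back as a single string, can be sketched as follows. This is a minimal example that assumes a shortened copy of the sample text; dropping the groups and reading the whole match with re.finditer is an addition for illustration, not part of the answer above.

```python
import re

# Shortened copy of the sample text from the question.
text = (
    "By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016\n"
    ", BRUSSELS — A man wounded two police officers\n"
    "By DAVID JOLLY FEB. 8, 2016\n"
    ", KABUL, Afghanistan — A Taliban suicide bomber JULY 14, 2034"
)

# Original pattern: {4} repeats the group, so only the last digit is kept.
broken = re.findall(r"[A-Z]+\.*\s(\d+)\,\s(\d+){4}", text)

# Fixed pattern: three groups, with {4} inside the last one.
fixed = re.findall(r"([A-Z]+)\.?\s(\d+),\s(\d{4})", text)

# Without groups, each match object holds the full text of one date.
dates = [m.group(0) for m in re.finditer(r"[A-Z]+\.?\s\d+,\s\d{4}", text)]

print(broken)  # [('5', '6'), ('8', '6'), ('14', '4')]
print(fixed)   # [('OCT', '5', '2016'), ('FEB', '8', '2016'), ('JULY', '14', '2034')]
print(dates)   # ['OCT. 5, 2016', 'FEB. 8, 2016', 'JULY 14, 2034']
```

When a pattern contains groups, re.findall returns only the groups, which is why the month disappeared from the original output.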

Thanks for the suggestions, they helped me understand the use of parentheses in regex.
I solved it myself with this:
pattern=("([A-Z]+\.*\s)(\d+)\,\s(\d{4})")

Related

Using regex to match groups that may not contain given text patterns

I'd like to use regex to extract birth dates and places, as well as (when they're defined) death dates and places, from a collection of encyclopedia entries. Here are some examples of such entries which illustrate the patterns that I'm trying to codify:
William Wordsworth, (born April 7, 1770, Cockermouth, Cumberland, England—died April 23, 1850, Rydal Mount, Westmorland), English poet...
Jane Goodall, in full Dame Jane Goodall, original name Valerie Jane Morris-Goodall, (born April 3, 1934, London, England), British ethologist...
Kenneth Wartinbee Spence, (born May 6, 1907, Chicago, Illinois, U.S.—died January 12, 1967, Austin, Texas), American psychologist...
I was hoping that the following regex pattern would identify the desired capture groups:
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)(?:—died )?(\w+ \d{1,2}, \d{4})?(?:, )?(.*?)\)
But sadly, it does not. (Use https://regexr.com/ to view the results.)
Note: When I restructure the pattern around the phrase —died in the following way, the pattern does produce the expected results for entries like the first and third above (those with given death dates/places), but it obviously does not work in all cases.
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?)\)
What am I missing?
In general, you can mark the whole —died section as optional:
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)(—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?))?\)
https://regexr.com/765fc
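A short Python sketch of how the optional section behaves on entries with and without a death date. The helper name parse and the choice to skip group 3 are assumptions made for the example; the pattern itself is the one given above.

```python
import re

# Pattern from the answer: the whole "—died ..." part is one optional group.
ENTRY = re.compile(
    r"\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)"
    r"(—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?))?\)"
)

entries = [
    "William Wordsworth, (born April 7, 1770, Cockermouth, Cumberland, "
    "England—died April 23, 1850, Rydal Mount, Westmorland), English poet...",
    "Jane Goodall, in full Dame Jane Goodall, original name Valerie Jane "
    "Morris-Goodall, (born April 3, 1934, London, England), British ethologist...",
]

def parse(entry):
    """Return (born_date, born_place, died_date, died_place), or None if no match."""
    m = ENTRY.search(entry)
    if m is None:
        return None
    # Group 3 is the whole optional section, so it is skipped here.
    return m.group(1), m.group(2), m.group(4), m.group(5)

results = [parse(e) for e in entries]
for r in results:
    print(r)
# ('April 7, 1770', 'Cockermouth, Cumberland, England', 'April 23, 1850', 'Rydal Mount, Westmorland')
# ('April 3, 1934', 'London, England', None, None)
```

Groups inside an optional section that did not take part in the match come back as None, so the caller can tell a missing death date from an empty one.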
You can use a negative lookahead, (?!pattern), to exclude text, where "pattern" is what must not appear at that position. Note that a lookahead only checks the position where it is written: "(?!a).*" only requires that the match not start with "a", and the .* can still run over later "a" characters. To match a whole string that contains no "a" anywhere, repeat the check before every character, for example "^(?:(?!a).)*$".
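As an illustration of where a negative lookahead is checked, the sketch below compares a single check at the start of the string with a check repeated before every character. The word list and variable names are made up for the example.

```python
import re

# The lookahead is checked once, at the start: only a leading "a" is rejected.
single_check = re.compile(r"(?!a).*")

# The lookahead is checked before every character: no "a" is allowed anywhere.
every_position = re.compile(r"(?:(?!a).)*")

words = ["banana", "apple", "hello"]

loose = [w for w in words if single_check.fullmatch(w)]
strict = [w for w in words if every_position.fullmatch(w)]

print(loose)   # ['banana', 'hello']
print(strict)  # ['hello']
```

For a single character, a negated class such as [^a]* does the same job more simply; the repeated lookahead is mainly useful when the excluded pattern is longer than one character.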

Modify Regex To Match String with Multiple Full Month Names And End on Hyphen

I am using a Regex to pull dates out of a series of strings. The format varies slightly, but it always contains the full month. The strings usually contain two dates to represent a range like so:
February 1, 2020 - March 18, 2020
or
February 1st 2020 - March 18th 2020
And this is working great until I come across dates like:
June 1 - July 22, 2018
where a year is not presented in the "starting" part of the range because it is the same as the "ending" year.
Below is the Regex I crudely copied and applied to my code. It is Javascript but I really think this is more of a Regex question...
const regex = /((\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?)(\d{1,2}(st|nd|rd|th)?)?((\s*[,.\-\/]\s*)\D?)?\s*((19[0-9]\d|20\d{2})|\d{2})*/gm;
var myDateString1 = "January 8, 2020 - January 27, 2020"; // THIS WORKS GREAT!
var myDateString2 = "January 8 - January 27, 2020"; // THIS DOES NOT WORK GREAT!
var dates = myDateString1.match(regex);
// returns ["January 8, 2020","January 27, 2020"]
var dates2 = myDateString2.match(regex);
// returns ["January 8 - J"]
Is there a way I can modify this so if it is met with a hyphen it discontinues that given match? So myDateString2 would return ["January 8", "January 27, 2020"]?
The strings sometimes have words before or after, like
Presented from January 8, 2020 - January 27, 2020 at such and such place
so I don't think simply having a regex based on the hyphen before/after would work.
You could use 2 capture groups and make the pattern more specific to match the format of the strings.
The /m flag can be omitted as there are no anchors in the pattern.
Note that the pattern matches a date-like string and does not validate the date itself.
\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?(?:,\s+\d{4})?)\s+[,./-]\s+\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?,\s+\d{4})\b
See a regex101 demo.
const regex = /\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?(?:,\s+\d{4})?)\s+[,./-]\s+\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?,\s+\d{4})\b/g;
const str = `January 8, 2020 - January 27, 2020
January 8 - January 27, 2020
Presented from January 8, 2020 - January 27, 2020 at such and such place
June 1 - July 22, 2018`;
console.log(Array.from(str.matchAll(regex), m => [m[1], m[2]]))
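The same two-group idea can be sketched in Python, building the pattern from one reusable month fragment so the long alternation is written only once. The names MONTH and DATE_RANGE are illustrative assumptions, and the pattern is meant to be equivalent to the JavaScript one above.

```python
import re

# One month-name fragment, reused for both ends of the range.
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?"
    r"|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)"
)

# Group 1: start date, year optional. Group 2: end date, year required.
DATE_RANGE = re.compile(
    r"\b(" + MONTH + r"\s*\d\d?(?:,\s+\d{4})?)"
    r"\s+[,./-]\s+"
    r"\b(" + MONTH + r"\s*\d\d?,\s+\d{4})\b"
)

text = (
    "January 8, 2020 - January 27, 2020\n"
    "January 8 - January 27, 2020\n"
    "Presented from January 8, 2020 - January 27, 2020 at such and such place\n"
    "June 1 - July 22, 2018"
)

# With two groups, findall returns one (start, end) tuple per range.
pairs = DATE_RANGE.findall(text)
for start, end in pairs:
    print(start, "|", end)
```

As in the JavaScript version, ordinal forms such as "February 1st 2020" are not covered by this pattern.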
Note: the original regex tried to match every date form at once, which is not possible this way. I reworked it to cover about 75% of the original intent, but in the end it is fool's gold.
The capture groups were used for debugging.
Simply taking the hyphen out of the character class and making the year optional at the end with a single ? should get you what you want.
/((\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?)(\d{1,2}(st|nd|rd|th)?)?(((\s*[,./]))?\s+(19[0-9]\d|20\d{2})|\d{2})?/
https://regex101.com/r/6NiNxy/1
Replacing the capture groups with non-capturing groups (?: ) and then factoring the alternation one more level will make it quicker.
/(?:\b\d{1,2}\D{0,3})?\b(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}(?:st|[nr]d|th)?)?(?:(?:\s*[,./])?\s+(?:19[0-9]\d|20\d{2})|\d{2})?/
https://regex101.com/r/NTR0WD/1
const regex = /(?:\b\d{1,2}\D{0,3})?\b(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}(?:st|[nr]d|th)?)?(?:(?:\s*[,./])?\s+(?:19[0-9]\d|20\d{2})|\d{2})?/g;
var myDateString1 = "January 8, 2020 - January 27, 2020";
var myDateString2 = "January 8 - January 27, 2020";
console.log(myDateString1.match(regex));
console.log(myDateString2.match(regex));

Matching location with regular expressions | Python

I scraped several articles from a website. Now I am trying to extract the location of each news item. The location is written in capitals, either as just the capital of the country (e.g. "BRUSSELS-") or, in some cases, along with the country (e.g. "BRUSSELS, Belgium-").
This is a sample of the articles:
|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded two police officers with a knife in Brussels around...]
[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016
, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Monday and wounded]
The regular expression I used is this one:
text_open = open("Training_News_6.csv")
text_read = text_open.read()
pattern = ("[A-Z]{1,}\w+\s\—")
result = re.findall(pattern,text_read)
print(result)
I used the dash (-) because it is a recurring pattern that follows the location.
However, this regular expression manages to extract "BRUSSELS -", but for "KABUL, Afghanistan -" it only extracts the last part, namely "Afghanistan -".
In the second case I would like to extract the whole location: the capital and the country. Any ideas?
You may use
([A-Z]+(?:\W+\w+)?)\s*—
See the regex demo
Details:
([A-Z]+(?:\W+\w+)?) - Capture Group 1 (the contents of which will be returned as the result of re.findall) capturing
[A-Z]+ - 1 or more ASCII uppercase letters
(?:\W+\w+)? - 1 or 0 occurrences (due to ? quantifier) of 1+ non-word chars (\W+) and 1+ word chars (\w+)
\s* - 0+ whitespaces
— - a — symbol
Python demo:
import re
rx = r"([A-Z]+(?:\W+\w+)?)\s*—"
s = "|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016 \n, BRUSSELS — A man wounded two police officers with a knife in Brussels around...] \n[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016 \n, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Mo"
print(re.findall(rx, s)) # => ['BRUSSELS', 'KABUL, Afghanistan']
One thing you could do is add , and \s to your first character class, and then strip all whitespace and commas from the left:
[A-Z,\s]{1,}\w+\s\—
Or even something simpler, like this:
(.+)\—
$1 would be your match, containing extra symbols. Another option which might work:
,\s*([A-Za-z]*[,\s]*[A-Za-z]*)\s\—
or a simplified version:
,\s*([A-Za-z,\s]*)\s\—
Once again, $1 is your match.
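A minimal Python sketch of the simplified capture-group variant, assuming the leading comma is part of the pattern and using a shortened copy of the sample articles. In Python, what other tools call $1 is the content of group 1, which re.findall returns directly.

```python
import re

# Shortened copy of the sample articles from the question.
s = (
    "By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016\n"
    ", BRUSSELS — A man wounded two police officers with a knife\n"
    "By DAVID JOLLY FEB. 8, 2016\n"
    ", KABUL, Afghanistan — A Taliban suicide bomber killed at least three people"
)

# Leading comma, optional whitespace, then letters, commas and whitespace
# up to the whitespace character that precedes the dash.
pattern = r",\s*([A-Za-z,\s]*)\s—"

locations = re.findall(pattern, s)
print(locations)  # ['BRUSSELS', 'KABUL, Afghanistan']
```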

Regex to find capitalized words and code string?

I'm looking to change a simple dash to an em dash in some obituaries we receive. But it's only after the City of Death where that em dash should go.
The text looks like this:
#M_DeathNoticeHed:Alex <\n>Ornelas
#M_DeathNoticeBod:ALAMO <\!-> Alex Ornelas <\n>, 25, died Tuesday, Aug. <\n>16, 2016 at Alamo. Me<\h>morial Funeral Home of <\n>San Juan is in charge of ar<\h>rangements.
#M_DeathNoticeHed:Almaquire Cadena
#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Almaquire <\n>Cadena , 87, died Tues<\h>day, Aug. 16, 2016 at Pax <\n>Villa Hospice, in McAllen, <\n>TX. Sanchez Funeral Home <\n> of Rio Grande City is in <\n>charge of arrangements.
#M_DeathNoticeHed:AnaRose <\n>Collazi
#M_DeathNoticeBod:MISSION <\!-> AnaRose <\n>Collazo , 44, died Wednes<\h>day, Aug. 17, 2016 at Mis<\h>sion Regional Medical Cen<\h>ter in Mission. Virgil Wilson <\n>Mortuary of Mission is in <\n>charge of arrangements.
#M_DeathNoticeHed:Andy Garza
#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Andy <\n>Garza , 21, died Tuesday, <\n>Aug. 16, 2016 at Chicago, <\n>IL. Rodriguez Funeral <\n>Home of Roma is in <\n>charge of arrangements.
Notice that after every "#M_DeathNoticeBod: CITY" is "<\!->" which symbolizes the dash that I need changed to an em dash.
My regex code is not getting the "<\!->" selected along with the preceding city and "#M_DeathNoticeHed:".
#M_DeathNoticeBod:([^A-Za-z]*?[A-Z][A-Za-z]*)([^A-Za-z]*?[A-Z][A-Za-z]*) [<\!->]
It is also not selecting cities with 3 names in it like "RIO GRANDE CITY". I'm selecting this because the dash appears in other spots in the file that I do not want replaced.
If I can select that section I can replace the dash here.
If the lines you care about always start with "#M_DeathNoticeBod:" followed by the city of death, followed by the <\!-> you wish to replace, I think something simple would do the job:
(#M_DeathNoticeBod:.*?)<\\!->
Capture group 1 will contain everything up to the first "<\!->" on the line (the lazy .*? stops at the first occurrence; a greedy .* would run to the last one). If you're doing a search and replace, you can replace each match with the contents of group 1 (usually denoted by '\1') followed by an em dash.
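The search and replace described here can be sketched in Python with re.sub. This is only an illustration on a shortened sample: the variable names are assumptions, and the pattern uses a lazy .*? so that only the first delimiter on each body line is replaced.

```python
import re

EM_DASH = "\u2014"

# Shortened sample; raw strings keep the backslashes literal.
lines = [
    r"#M_DeathNoticeHed:Alex <\n>Ornelas",
    r"#M_DeathNoticeBod:ALAMO <\!-> Alex Ornelas <\n>, 25, died Tuesday",
    r"#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Andy <\n>Garza , 21, died Tuesday",
]
text = "\n".join(lines)

# Group 1 keeps the tag and the city; the delimiter after it is replaced.
fixed = re.sub(r"(#M_DeathNoticeBod:.*?)<\\!->", r"\1" + EM_DASH, text)

print(fixed)
```

Headline lines are left untouched because they do not start with the body tag, and multi-word cities such as RIO GRANDE CITY need no special handling.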
This regex should do:
#M_DeathNoticeBod:([A-Z ]*) (<\\!->)
I think this is what you are actually looking for:
(?<=#M_DeathNoticeBod:).+<\\!->
To explain: the first portion, (?<=#M_DeathNoticeBod:), is a positive lookbehind. It is not part of the match, but it ensures that the rest of the pattern is always preceded by that expression.
The trailing portion, .+, matches any city name, whatever characters it contains, followed by your <\!-> delimiter, which is matched by <\\!->.

Matching everything but a time range using regex

I have a regular expression that matches time ranges:
(([0-2]?[0-9]:[0-5][0-9]\s*)-(\s*[0-2]?[0-9]:[0-5][0-9]))
However, I need a regex that can extract everything except the time range in a string such as: "June 12, 2015 13:30 - 14:00" (i.e., "June 12, 2015"), but I can't seem to do it.
I tried using both a lookahead and lookbehind, such as the following, but they don't seem to work (at least for me ;).
(?<!(([0-2]?[0-9]:[0-5][0-9]\s*)-(\s*[0-2]?[0-9]:[0-5][0-9])))
Thanks to #OozeMeister for the hint. I'm coding in Ruby, and substituting the match with an empty string is the solution.
t = "June 12, 2015 13:30 - 14:00"
reg = /(([0-2]?[0-9]:[0-5][0-9]\s*)-(\s*[0-2]?[0-9]:[0-5][0-9]))/
date = t.gsub(reg, '')
puts date
Output: "June 12, 2015 "
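For readers not using Ruby, the same substitution can be sketched in Python. The capture groups are dropped here because they are not needed for a plain removal, and the strip() call is an addition that removes the trailing space visible in the output above.

```python
import re

# Same time-range pattern as above, without the capture groups.
TIME_RANGE = re.compile(r"[0-2]?[0-9]:[0-5][0-9]\s*-\s*[0-2]?[0-9]:[0-5][0-9]")

t = "June 12, 2015 13:30 - 14:00"

# Remove the time range, then trim the whitespace left behind.
date = TIME_RANGE.sub("", t).strip()
print(date)  # June 12, 2015
```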