Google Sheets: Parse date from text - regex

Google Sheets - Parsing
From given text, how do you extract the date?
Given text | Extracted date (to be generated)
Graduation reunion on Saturday, September 10, 2022 at 123 Front Street | September 10, 2022
BBQ Party on Sunday October 1, 2022 at 213 South Street | October 1, 2022
I've tried
=regexextract(A2,"\w{9} \d{2}, \d{4}")*1
As shown in the Google Sheets, this only works for the first one which is September 10, 2022. However, not all months have the same number of characters.

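You may use either of the formulas below.
With this one, you have to drag the formula down to populate the rows below:
=REGEXEXTRACT(A2,", (.*?) at")
while this one expands automatically down the column:
=ARRAYFORMULA(IF(A2:A="","",REGEXEXTRACT(A2:A,", (.*?) at")))
The formula takes the characters after the first comma up to ' at'. Note that it relies on a comma following the weekday: in the second sample there is no comma after Sunday, so the first comma is the one before the year and only 2022 is returned.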

try:
=INDEX(IFNA(REGEXEXTRACT(A2:A, "(\w+ \d+, \d{4})")*1))
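For readers who want to check the pattern outside Sheets, here is a minimal Python sketch of the same idea. It reuses the `\w+ \d+, \d{4}` pattern from the formula above; the function name `extract_date` is hypothetical, and `strptime` stands in for the `*1` coercion to a date. It assumes English month names and the default C/English locale.

```python
import re
from datetime import datetime

# Same pattern as the Sheets formula: a word, a day number, a comma, a 4-digit year.
DATE_RE = re.compile(r"\w+ \d+, \d{4}")


def extract_date(text):
    """Return the first 'Month D, YYYY' date found in text, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    try:
        # %B expects a full English month name in the default locale.
        return datetime.strptime(match.group(0), "%B %d, %Y").date()
    except ValueError:
        # The pattern matched something that is not a real date.
        return None


if __name__ == "__main__":
    print(extract_date("Graduation reunion on Saturday, September 10, 2022 at 123 Front Street"))
    print(extract_date("BBQ Party on Sunday October 1, 2022 at 213 South Street"))
```

Because the month is matched with `\w+` rather than a fixed length, both sample rows are handled regardless of how long the month name is.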

Related

Modify Regex To Match String with Multiple Full Month Names And End on Hyphen

I am using a Regex to pull dates out of a series of strings. The format varies slightly, but it always contains the full month. The strings usually contain two dates to represent a range like so:
February 1, 2020 - March 18, 2020
or
February 1st 2020 - March 18th 2020
And this is working great until I come across dates like:
June 1 - July 22, 2018
where a year is not presented in the "starting" part of the range because it is the same as the "ending" year.
Below is the regex I crudely copied and applied to my code. It is JavaScript, but I really think this is more of a regex question...
const regex = /((\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?)(\d{1,2}(st|nd|rd|th)?)?((\s*[,.\-\/]\s*)\D?)?\s*((19[0-9]\d|20\d{2})|\d{2})*/gm;
var myDateString1 = "January 8, 2020 - January 27, 2020"; // THIS WORKS GREAT!
var myDateString2 = "January 8 - January 27, 2020"; // THIS DOES NOT WORK GREAT!
var dates = myDateString1.match(regex);
// returns ["January 8, 2020","January 27, 2020"]
var dates2 = myDateString2.match(regex);
// returns ["January 8 - J"]
Is there a way I can modify this so if it is met with a hyphen it discontinues that given match? So myDateString2 would return ["January 8", "January 27, 2020"]?
The strings sometimes have words before or after, like
Presented from January 8, 2020 - January 27, 2020 at such and such place
so I don't think simply having a regex based on the hyphen before/after would work.
You could use 2 capture groups and make the pattern more specific to match the format of the strings.
The /m flag can be omitted as there are no anchors in the pattern.
Note that the pattern matches a date-like string and does not validate the date itself.
\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?(?:,\s+\d{4})?)\s+[,./-]\s+\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?,\s+\d{4})\b
See a regex101 demo.
const regex = /\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?(?:,\s+\d{4})?)\s+[,./-]\s+\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?,\s+\d{4})\b/g;
const str = `January 8, 2020 - January 27, 2020
January 8 - January 27, 2020
Presented from January 8, 2020 - January 27, 2020 at such and such place
June 1 - July 22, 2018`;
console.log(Array.from(str.matchAll(regex), m => [m[1], m[2]]))
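The same two-group idea can be sketched in Python, with one extra step that is my own assumption about what the asker ultimately wants: when the start of the range has no year, it borrows the year from the end. The names `MONTH`, `RANGE_RE` and `extract_range` are hypothetical.

```python
import re

# Month alternation copied from the pattern above (full or abbreviated names).
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)"
)

# Group 1: start date (year optional). Group 2: end date (year required).
RANGE_RE = re.compile(
    r"\b(" + MONTH + r"\s*\d\d?(?:,\s+\d{4})?)"
    r"\s+[,./-]\s+"
    r"\b(" + MONTH + r"\s*\d\d?,\s+\d{4})\b"
)


def extract_range(text):
    """Return (start, end) date strings, or None if no range is found."""
    match = RANGE_RE.search(text)
    if match is None:
        return None
    start, end = match.group(1), match.group(2)
    if not re.search(r"\d{4}$", start):
        # Assumption: a start date without a year shares the end date's year.
        start = f"{start}, {end[-4:]}"
    return start, end


if __name__ == "__main__":
    for sample in (
        "January 8, 2020 - January 27, 2020",
        "January 8 - January 27, 2020",
        "June 1 - July 22, 2018",
    ):
        print(extract_range(sample))
```

Like the JavaScript version, this only recognizes date-like text; it does not check that the day is valid for the month.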
Note: the original regex tried to match every form of date at once, which is not possible this way. I reworked it to cover about 75% of the original intent, but in the end a catch-all pattern like this is fool's gold.
The capture groups were used for debug.
Simply taking out the hyphen in the class and making the year optional at the end with a single ? should get what you want.
/((\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?)(\d{1,2}(st|nd|rd|th)?)?(((\s*[,./]))?\s+(19[0-9]\d|20\d{2})|\d{2})?/
https://regex101.com/r/6NiNxy/1
Replacing the capture groups with non-capturing groups (?: ) and then factoring the month alternation one more level will make it quicker.
/(?:\b\d{1,2}\D{0,3})?\b(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}(?:st|[nr]d|th)?)?(?:(?:\s*[,./])?\s+(?:19[0-9]\d|20\d{2})|\d{2})?/
https://regex101.com/r/NTR0WD/1
const regex = /(?:\b\d{1,2}\D{0,3})?\b(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}(?:st|[nr]d|th)?)?(?:(?:\s*[,./])?\s+(?:19[0-9]\d|20\d{2})|\d{2})?/g;
var myDateString1 = "January 8, 2020 - January 27, 2020";
var myDateString2 = "January 8 - January 27, 2020";
console.log(myDateString1.match(regex));
console.log(myDateString2.match(regex));

How do I capture the Months with proper Regex?

How do I capture the days of months as numbers, excluding any suffixes. For instance - January 11th would be 11, and March 25th would be 25.
You could use the regex below and read only the 3rd capturing group.
It accepts 3-letter months (Jan 1st) as well as full names (January 1st), and allows a space, hyphen, comma or slash as separator, as in Jan 01, Jan-01, Jan,1st and Jan/31.
(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)(uary|ruary|ch|il|e|y|ust|tember|ober|ember)?[ \/,-]{0,2}([0-3]?[0-9])
You would do better to look for native time manipulation if possible.
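As a rough illustration of reading only the third group, here is a small Python sketch. The suffix list used here includes `ober` so that `October` matches; the names `DAY_RE` and `day_of_month` are hypothetical.

```python
import re

# Group 1: 3-letter month, group 2: optional rest of the name, group 3: day number.
DAY_RE = re.compile(
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"(uary|ruary|ch|il|e|y|ust|tember|ober|ember)?"
    r"[ /,-]{0,2}([0-3]?[0-9])"
)


def day_of_month(text):
    """Return the day as an int, ignoring suffixes like 'st' or 'th'."""
    match = DAY_RE.search(text)
    if match is None:
        return None
    return int(match.group(3))


if __name__ == "__main__":
    for sample in ("January 11th", "March 25th", "Jan-01", "Jan,1st", "Jan/31"):
        print(sample, "->", day_of_month(sample))
```

The ordinal suffix never needs to be stripped explicitly, because the third group stops at the last digit.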

Regex to find date string

I have many sites on which I need to find the date. All of these sites use different templates, so I need a regular expression. Here are examples of how dates are displayed on these sites:
Saturday, March 24, 2007
1 JANUARY 2016
31st December 2016
23 Agustus 2019
2012年5月7日
August 23, 2019
I tried to do something like this:
re.search(r"((\w+\s\w+(,\s|\s)\w+)|(\w+[0-9]\w))", text)
But during the test, I got this:
2014 jQuery Foundation
81vy4jRyxBHyxIhY67E
How to write a regular expression in my case?
You might have to write some custom expressions, then use alternation, maybe a bit similar to:
^[A-Z][A-Za-z]+[\s,]*[A-Z][A-Za-z]+[\s,]*\d+[\s,]*\d{4}|\d+[A-Za-z]*[\s,]*[A-Z][A-Za-z]+[\s,]*\d{4}|[A-Z][A-Za-z]*[\s,]*\d+[\s,]*\d{4}|\d{4}\D+\d+\D+\d+\D+$
It would likely fail for some inputs, which you might want to adjust for, and it would be much better to add more boundaries. Note that, as written, ^ anchors only the first alternative and $ only the last, because alternation binds more loosely than the anchors.
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
Test
import re
regex = r"^[A-Z][A-Za-z]+[\s,]*[A-Z][A-Za-z]+[\s,]*\d+[\s,]*\d{4}|\d+[A-Za-z]*[\s,]*[A-Z][A-Za-z]+[\s,]*\d{4}|[A-Z][A-Za-z]*[\s,]*\d+[\s,]*\d{4}|\d{4}\D+\d+\D+\d+\D+$"
test_str = """
Saturday, March 24, 2007
1 JANUARY 2016
31st December 2016
23 Agustus 2019
2012年5月7日
August 23, 2019
2014 jQuery Foundation
81vy4jRyxBHyxIhY67E
"""
print(re.findall(regex, test_str, re.M))
Output
['Saturday, March 24, 2007', '1 JANUARY 2016', '31st December 2016', '23 Agustus 2019', '2012年5月7日 ', 'August 23, 2019']
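One way to keep such an alternation manageable is to store each branch separately and try them in turn. The sketch below is only an assumption about how that could look: the four sub-patterns are taken from the expression above, the labels and the function name `classify_date_line` are hypothetical, and each candidate is assumed to be a single line that must match in full.

```python
import re

# The four branches of the expression above, each given a descriptive label.
ALTERNATIVES = {
    "weekday_month_day_year": r"[A-Z][A-Za-z]+[\s,]*[A-Z][A-Za-z]+[\s,]*\d+[\s,]*\d{4}",
    "day_month_year": r"\d+[A-Za-z]*[\s,]*[A-Z][A-Za-z]+[\s,]*\d{4}",
    "month_day_year": r"[A-Z][A-Za-z]*[\s,]*\d+[\s,]*\d{4}",
    "year_month_day_with_markers": r"\d{4}\D+\d+\D+\d+\D+",
}
COMPILED = {name: re.compile(pattern) for name, pattern in ALTERNATIVES.items()}


def classify_date_line(line):
    """Return the label of the first branch matching the whole line, or None."""
    candidate = line.strip()
    for name, pattern in COMPILED.items():
        # fullmatch anchors every branch at both ends of the line.
        if pattern.fullmatch(candidate):
            return name
    return None


if __name__ == "__main__":
    for sample in ("Saturday, March 24, 2007", "2012年5月7日", "2014 jQuery Foundation"):
        print(sample, "->", classify_date_line(sample))
```

Using `fullmatch` per line applies the start and end anchors to every branch, and keeping the branches separate makes it easier to add or tighten one format without touching the others.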

Python Regex. Capturing text between matches issue

Quite new to regex and having trouble getting the right match.
I have the following string:
The AGM will be held at the Company's registered office at Unity House, Telford Road, Basingstoke, Hampshire, RG21 6YJ on 13 January 2016 at 10.00 a.m.
The Company announces that its 2016 Annual General Meeting will be held on 11 February 2016 at 10.00 a.m. at Hangar 89, London Luton Airport, Luton, Bedfordshire, LU2 9PF.
I am trying to extract the address from the last occurrence of 'at' till the postcode. So Unity House, Telford Road, Basingstoke, Hampshire, RG21 6YJ and Hangar 89, London Luton Airport, Luton, Bedfordshire, LU2 9PF
This is what I use (at)(?!.*at)(.*)\s([A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2})
it extracts only the second address. Any thoughts?
Thanks
It looks like you want to use ((?:(?!at).)*) instead of (?!.*at)(.*) to avoid skipping over another at:
(at)((?:(?!at).)*)\s([A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2})
See demo at regex101
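If you use (at)(?!.*at)(.*) with the s flag, the lookahead only succeeds at the last at in the whole input, because only there is no other at ahead. So it is expected that only the last address matches. (at)((?:(?!at).)*), on the other hand, consumes one character at a time and stops before the next at, so it cannot skip over one.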
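A short Python sketch of the tempered pattern applied to the two sentences from the question follows. The names `ADDRESS_RE` and `extract_address` are hypothetical, and the postcode part is copied from the question as-is rather than being a complete UK postcode validator.

```python
import re

# 'at', then any run of characters that never starts another 'at',
# then whitespace and a UK-style postcode.
ADDRESS_RE = re.compile(
    r"(at)((?:(?!at).)*)\s([A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2})"
)


def extract_address(sentence):
    """Return the address between the relevant 'at' and the postcode, or None."""
    match = ADDRESS_RE.search(sentence)
    if match is None:
        return None
    # Group 2 is the address body, group 3 the postcode.
    return f"{match.group(2).strip()} {match.group(3)}"


if __name__ == "__main__":
    first = (
        "The AGM will be held at the Company's registered office at Unity House, "
        "Telford Road, Basingstoke, Hampshire, RG21 6YJ on 13 January 2016 at 10.00 a.m."
    )
    second = (
        "The Company announces that its 2016 Annual General Meeting will be held on "
        "11 February 2016 at 10.00 a.m. at Hangar 89, London Luton Airport, Luton, "
        "Bedfordshire, LU2 9PF."
    )
    print(extract_address(first))
    print(extract_address(second))
```

Since `at` is matched as a plain substring, it also blocks on words such as `that`; this works for the two samples, but an address containing `at` inside a word would be cut short.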

Parsing dates from string - regex

I'm terrible with regex and I can't seem to wrap my head around this simple task.
I need to parse out the two dates in a string which always has one of two formats:
"Inquiry at your property for December 29, 2013 - January 03, 2014"
OR
"Inquiry at your property for 29 December , 2013 - 03 January, 2014"
The two different date formats are throwing me off. Any insights would be appreciated!
/(\d+ \w+, \d+|\w+ \d+, \d+)/ for example. Try it out on Rubular.
It would certainly pick up more than dates, for example 2013 NotReallyAMonth, 12345. But if your input has nothing that looks like a date without actually being one, this might work.
You could make the regexp stricter by applying more restrictions on what is matched:
/(\d{2} (?:January|December), \d{4}|(?:January|December) \d{2}, \d{4})/
In this case the day is always two digits and the year always four. Months are listed explicitly (you would have to list all of them).
Update: For ranges it would be a different regexp:
/((?:Jan|Dec) \d+ - \d+, \d{4})/
Obviously they can all be combined together:
/(\d{2} (?:January|December), \d{4}|(?:January|December) \d{2}, \d{4}|(?:Jan|Dec) \d+ - \d+, \d{4})/
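To go from matched text to actual dates, a Python sketch along the same lines might look like this. It is an illustration under a few assumptions: the names `DATE_RE`, `FORMATS` and `parse_dates` are hypothetical, month names are English, and whitespace around the comma is tolerated because the second sample in the question contains `29 December , 2013` with a space before the comma.

```python
import re
from datetime import datetime

# Either 'DD Month, YYYY' or 'Month DD, YYYY', tolerating spaces around the comma.
DATE_RE = re.compile(
    r"\d{1,2} [A-Za-z]+\s*,\s*\d{4}"
    r"|[A-Za-z]+ \d{1,2}\s*,\s*\d{4}"
)
FORMATS = ("%B %d, %Y", "%d %B, %Y")


def parse_dates(text):
    """Return every date found in text as a datetime.date, in order."""
    results = []
    for raw in DATE_RE.findall(text):
        # Normalize 'December , 2013' to 'December, 2013' before parsing.
        cleaned = re.sub(r"\s*,\s*", ", ", raw)
        for fmt in FORMATS:
            try:
                results.append(datetime.strptime(cleaned, fmt).date())
                break
            except ValueError:
                continue  # Not this format (or not a real date); try the next.
    return results


if __name__ == "__main__":
    print(parse_dates("Inquiry at your property for December 29, 2013 - January 03, 2014"))
    print(parse_dates("Inquiry at your property for 29 December , 2013 - 03 January, 2014"))
```

Letting `strptime` do the validation means a loose regex is acceptable: text that merely looks like a date, such as a made-up month name, is dropped at the parsing step.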