I have many sites on which I need to find the date. These sites all use different templates, so I need a regular expression. Here are examples of how dates are displayed on these sites:
Saturday, March 24, 2007
1 JANUARY 2016
31st December 2016
23 Agustus 2019
2012年5月7日
August 23, 2019
I tried to do something like this:
re.search(r"((\w+\s\w+(,\s|\s)\w+)|(\w+[0-9]\w))", text)
But during the test, I got this:
2014 jQuery Foundation
81vy4jRyxBHyxIhY67E
How can I write a regular expression for my case?
You might have to write some custom expressions, then use alternation, maybe a bit similar to:
^[A-Z][A-Za-z]+[\s,]*[A-Z][A-Za-z]+[\s,]*\d+[\s,]*\d{4}|\d+[A-Za-z]*[\s,]*[A-Z][A-Za-z]+[\s,]*\d{4}|[A-Z][A-Za-z]*[\s,]*\d+[\s,]*\d{4}|\d{4}\D+\d+\D+\d+\D+$
This would likely fail for some inputs, so you may need to adjust it. Adding more boundaries would make it considerably more robust.
If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against some sample inputs.
Test
import re
regex = r"^[A-Z][A-Za-z]+[\s,]*[A-Z][A-Za-z]+[\s,]*\d+[\s,]*\d{4}|\d+[A-Za-z]*[\s,]*[A-Z][A-Za-z]+[\s,]*\d{4}|[A-Z][A-Za-z]*[\s,]*\d+[\s,]*\d{4}|\d{4}\D+\d+\D+\d+\D+$"
test_str = """
Saturday, March 24, 2007
1 JANUARY 2016
31st December 2016
23 Agustus 2019
2012年5月7日
August 23, 2019
2014 jQuery Foundation
81vy4jRyxBHyxIhY67E
"""
print(re.findall(regex, test_str, re.M))
Output
['Saturday, March 24, 2007', '1 JANUARY 2016', '31st December 2016', '23 Agustus 2019', '2012年5月7日 ', 'August 23, 2019']
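As a rough illustration of the "custom expressions plus alternation" idea with stricter boundaries, here is a minimal Python sketch. The names DATE_PATTERNS, DATE_RE and find_dates are hypothetical, and the three sub-patterns only cover the layouts listed in the question; other sites would likely need more alternatives.

```python
import re

# Hypothetical sub-patterns, one per date layout from the question.
DATE_PATTERNS = [
    # "Saturday, March 24, 2007" or "August 23, 2019" (the weekday is optional)
    r"(?:[A-Za-z]+,\s+)?[A-Za-z]+\s+\d{1,2},\s+\d{4}",
    # "1 JANUARY 2016", "31st December 2016", "23 Agustus 2019"
    r"\d{1,2}(?:st|nd|rd|th)?\s+[A-Za-z]+\s+\d{4}",
    # "2012年5月7日"
    r"\d{4}年\d{1,2}月\d{1,2}日",
]

# The lookarounds act as boundaries: a match may not start or end
# in the middle of a longer word or number.
DATE_RE = re.compile(r"(?<!\w)(?:" + "|".join(DATE_PATTERNS) + r")(?!\w)")


def find_dates(text):
    """Return every substring that looks like one of the known date layouts."""
    return DATE_RE.findall(text)


sample = """
Saturday, March 24, 2007
1 JANUARY 2016
31st December 2016
23 Agustus 2019
2012年5月7日
August 23, 2019
2014 jQuery Foundation
81vy4jRyxBHyxIhY67E
"""
print(find_dates(sample))
```

Keeping one sub-pattern per layout makes it easier to add or tighten a single format without touching the others.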
Related
I am using a Regex to pull dates out of a series of strings. The format varies slightly, but it always contains the full month. The strings usually contain two dates to represent a range like so:
February 1, 2020 - March 18, 2020
or
February 1st 2020 - March 18th 2020
And this is working great until I come across dates like:
June 1 - July 22, 2018
where a year is not presented in the "starting" part of the range because it is the same as the "ending" year.
Below is the regex I crudely copied and applied to my code. It is JavaScript, but I really think this is more of a regex question...
const regex = /((\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?)(\d{1,2}(st|nd|rd|th)?)?((\s*[,.\-\/]\s*)\D?)?\s*((19[0-9]\d|20\d{2})|\d{2})*/gm;
var myDateString1 = "January 8, 2020 - January 27, 2020"; // THIS WORKS GREAT!
var myDateString2 = "January 8 - January 27, 2020"; // THIS DOES NOT WORK GREAT!
var dates = myDateString1.match(regex);
// returns ["January 8, 2020","January 27, 2020"]
var dates2 = myDateString2.match(regex);
// returns ["January 8 - J"]
Is there a way I can modify this so if it is met with a hyphen it discontinues that given match? So myDateString2 would return ["January 8", "January 27, 2020"]?
The strings sometimes have words before or after, like
Presented from January 8, 2020 - January 27, 2020 at such and such place
so I don't think simply having a regex based on the hyphen before/after would work.
You could use 2 capture groups and make the pattern more specific to match the format of the strings.
The /m flag can be omitted as there are no anchors in the pattern.
Note that the pattern matches a date-like string and does not validate the date itself.
\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?(?:,\s+\d{4})?)\s+[,./-]\s+\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?,\s+\d{4})\b
See a regex101 demo.
const regex = /\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?(?:,\s+\d{4})?)\s+[,./-]\s+\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?,\s+\d{4})\b/g;
const str = `January 8, 2020 - January 27, 2020
January 8 - January 27, 2020
Presented from January 8, 2020 - January 27, 2020 at such and such place
June 1 - July 22, 2018`;
console.log(Array.from(str.matchAll(regex), m => [m[1], m[2]]))
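The same two-capture-group idea can be sketched in Python. This is only an approximation under stated assumptions: MONTH, DAY, RANGE_RE and find_ranges are hypothetical names, the separator is assumed to be a hyphen surrounded by whitespace, and (unlike the pattern above) ordinal suffixes and a missing comma before the year are tolerated.

```python
import re

# Month names, abbreviated or in full.
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)")
# Day of month with an optional ordinal suffix.
DAY = r"\d{1,2}(?:st|nd|rd|th)?"

# Group 1: start date, year optional. Group 2: end date, year required.
RANGE_RE = re.compile(
    rf"\b({MONTH}\s+{DAY}(?:,?\s+\d{{4}})?)\s+-\s+({MONTH}\s+{DAY},?\s+\d{{4}})\b"
)


def find_ranges(text):
    """Return a list of (start, end) tuples for every date range found."""
    return RANGE_RE.findall(text)


print(find_ranges("January 8 - January 27, 2020"))
print(find_ranges("Presented from January 8, 2020 - January 27, 2020 at such and such place"))
```

Like the original, this only recognises date-like text; it does not check that the dates are valid.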
Note: the original regex tried to match every form of date at once, which is not possible this way. I reworked it to cover about 75% of the original intent, but in the end it is fool's gold.
The capture groups were used for debugging.
Simply taking out the hyphen in the class and making the year optional at the end with a single ? should get what you want.
/((\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?)(\d{1,2}(st|nd|rd|th)?)?(((\s*[,./]))?\s+(19[0-9]\d|20\d{2})|\d{2})?/
https://regex101.com/r/6NiNxy/1
Replacing the capture groups with non-capturing groups (?: ) and then factoring the alternation one more level will make it quicker.
/(?:\b\d{1,2}\D{0,3})?\b(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}(?:st|[nr]d|th)?)?(?:(?:\s*[,./])?\s+(?:19[0-9]\d|20\d{2})|\d{2})?/
https://regex101.com/r/NTR0WD/1
const regex = /(?:\b\d{1,2}\D{0,3})?\b(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}(?:st|[nr]d|th)?)?(?:(?:\s*[,./])?\s+(?:19[0-9]\d|20\d{2})|\d{2})?/g;
var myDateString1 = "January 8, 2020 - January 27, 2020";
var myDateString2 = "January 8 - January 27, 2020";
console.log(myDateString1.match(regex));
console.log(myDateString2.match(regex));
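For readers who want to try the factored pattern outside JavaScript, here is a Python port as a sketch. The pattern text is copied unchanged and only split across lines; DATE_RE is a hypothetical name, and re.findall stands in for String.match with the g flag.

```python
import re

# Same pattern as above, split over several lines for readability.
DATE_RE = re.compile(
    r"(?:\b\d{1,2}\D{0,3})?"
    r"\b(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)"
    r"|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)"
    r"\D?(?:\d{1,2}(?:st|[nr]d|th)?)?"
    r"(?:(?:\s*[,./])?\s+(?:19[0-9]\d|20\d{2})|\d{2})?"
)

my_date_string_1 = "January 8, 2020 - January 27, 2020"
my_date_string_2 = "January 8 - January 27, 2020"

# With no capture groups, findall returns the full text of each match.
print(DATE_RE.findall(my_date_string_1))
print(DATE_RE.findall(my_date_string_2))
```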
Google Sheets - Parsing
From given text, how do you extract the date?
Given text | Extracted date (to be generated)
Graduation reunion on Saturday, September 10, 2022 at 123 Front Street | September 10, 2022
BBQ Party on Sunday October 1, 2022 at 213 South Street | October 1, 2022
Google Sheets link
--
I've tried
=regexextract(A2,"\w{9} \d{2}, \d{4}")*1
As shown in the Google Sheets, this only works for the first one which is September 10, 2022. However, not all months have the same number of characters.
You may use either of the formulas below.
With this one, you have to drag the formula down to populate the rows below:
=REGEXEXTRACT(A2,", (.*?) at")
whereas this one expands automatically down the column:
=ARRAYFORMULA(IF(A2:A="","",REGEXEXTRACT(A2:A,", (.*?) at")))
The formula takes the characters after the first comma up to " at". Note that it relies on a comma preceding the month: for the second example, which has no comma after "Sunday", it would return only 2022.
try:
=INDEX(IFNA(REGEXEXTRACT(A2:A, "(\w+ \d+, \d{4})")*1))
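Outside Sheets, the same pattern can be tried in Python. The sketch below is an analogue rather than a translation of the formula: extract_date is a hypothetical helper, datetime.strptime plays the role of the *1 conversion, and %B is assumed to resolve to English month names (the default locale).

```python
import re
from datetime import datetime

# Word, space, day number, comma, four-digit year: "September 10, 2022".
DATE_RE = re.compile(r"\w+ \d+, \d{4}")


def extract_date(text):
    """Return the first 'Month D, YYYY' date in text as a date object, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    # %B expects a full month name; this raises ValueError for anything else.
    return datetime.strptime(match.group(0), "%B %d, %Y").date()


print(extract_date("Graduation reunion on Saturday, September 10, 2022 at 123 Front Street"))
print(extract_date("BBQ Party on Sunday October 1, 2022 at 213 South Street"))
```

Because the pattern does not depend on a comma before the month, it handles both sample rows.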
I have an excel document with a cell with the following information:
A) Current to: Notice of 19 June 2014 Sent on: August 2012
B) Updated on: October 2018
C) Updated: 14 January 2009
I am using the following regular expression with some success:
(\b\d{1,2}\D{0,3})?\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?))\D?(\d{1,2}\D?)?\D?((19[7-9]\d|20\d{2})|\d{2})
the output I have is as follows:
A) 19 June2014August2012
B) October 2018
C) 14 January 2009
B and C extract fine; however, I am expecting 19 June 2014 for A.
I have tried adding .*? to make the expression less greedy, but depending on where I add the dot-star, I either get no result or an inaccurate one.
To match the first match on a line you may add ^.*? at the beginning of the pattern, wrap what you have with capturing parentheses and set the Multiline regex property to True. Your match is inside match.Submatches(0).
regEx.Pattern = "^.*?((?:\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}\D?)?\D?(?:(?:19[7-9]\d|20\d{2})|\d{2}))"
regEx.Multiline = True
See the regex demo.
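The "^.*? plus a capturing group plus multiline" technique can be sketched in Python as well. This is not the VBA code above: the date sub-pattern is deliberately simplified (optional day, month name, four-digit year), and MONTH, FIRST_DATE_PER_LINE and cell are hypothetical names.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)")

# ^.*? lazily skips to the first date on each line. Later dates on the same
# line are never matched because ^ only matches at the start of a line.
FIRST_DATE_PER_LINE = re.compile(
    rf"^.*?((?:\b\d{{1,2}}\s+)?\b{MONTH}\s+(?:19[7-9]\d|20\d{{2}}))",
    re.MULTILINE,
)

cell = (
    "A) Current to: Notice of 19 June 2014 Sent on: August 2012\n"
    "B) Updated on: October 2018\n"
    "C) Updated: 14 January 2009"
)

# findall returns the content of the single capture group for each line.
print(FIRST_DATE_PER_LINE.findall(cell))
```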
I am trying to extract dates from several articles. When I test the regular expression the pattern match only part of the information of interest. As you can see:
https://regex101.com/r/ATgIeZ/2
This is a sample of the text file:
|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded two police officers with a knife in Brussels around...] 3004
[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016
, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Mo JULY 14, 2034
The extraction pattern that I am using and the code is this one:
import re
text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
pattern = ("[A-Z]+\.*\s(\d+)\,\s(\d+){4}")
result = re.findall(pattern,text_read)
print(result)
And the output from Anaconda is:
[('5', '6'), ('7', '5'), ('1', '6'), .....]
The expected output is:
OCT. 5, 2016, FEB. 8, 2016, JULY 14, 2034 .....
The problem is the {4} quantifier, which sits outside your last group, so the group itself is repeated and only its last repetition is captured. Also, the part of the regex that matches the month was not inside a group.
Fix it like this:
pattern = r"([A-Z]+)\.?\s(\d+)\,\s(\d{4})"
result with your data sample:
[('OCT', '5', '2016'), ('FEB', '8', '2016'), ('JULY', '14', '2034')]
Small extra fixes:
There can be 0 or 1 dot, so I replaced \.* with \.?
I used the "raw" prefix, which is always better when defining regex strings (it causes no problems here, but it can with \b, for instance)
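To make the two points concrete, here is a small side-by-side sketch. The sample string is one line taken from the question's data; the variable names broken and fixed are only for illustration.

```python
import re

sample = "By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016"

# Quantifier outside the group: (\d+) is repeated four times, and a
# repeated group only keeps what it matched on its last repetition.
broken = re.findall(r"[A-Z]+\.*\s(\d+),\s(\d+){4}", sample)

# Quantifier inside the group, plus a group around the month.
fixed = re.findall(r"([A-Z]+)\.?\s(\d+),\s(\d{4})", sample)

print(broken)
print(fixed)

# Without the raw prefix, "\b" is a single backspace character,
# not the two-character regex token for a word boundary.
print(len("\b"), len(r"\b"))
```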
Thanks for the suggestions, it helped to understand the use of parenthesis in regex.
I solved it myself with this:
pattern=("([A-Z]+\.*\s)(\d+)\,\s(\d{4})")
I know there are a load of similar "combine 2 regular expressions" posts on here, but I've tried the solutions and keep getting errors.
I've got the regular expressions to parse a description such as:
Org Biomol Chem. 2011 May 7;9(9):3549-59. doi: 10.1039/c1ob05128h. Epub 2011 Mar 28.
to extract the DOI (Digital Object Identifier):
([^:]+$) --> 10.1039/c1ob05128h. Epub 2011 Mar 28.
([^\s]+) --> 10.1039/c1ob05128h.
But am pretty clueless as to how to combine these. If it's difficult, then it's not necessary, but would simplify my calculations.
I also can't figure out how to get rid of that last "." which is not part of the DOI string (for the record there may be more than 2 full stops in a DOI so the regex can't simply be "after the 2nd full stop").
Some other examples as requested:
Chem Soc Rev. 2008 Nov;37(11):2413-21. doi: 10.1039/b719548f. Epub 2008 Sep 16.
Small. 2010 Dec 20;6(24):2796-820. doi: 10.1002/smll.201001881. Review.
Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.
Chemistry. 2010 Dec 27;16(48):14285-9. doi: 10.1002/chem.201002111. No abstract available.
All the attempts I've made so far give a result much the same as this:
Some of the exceptions to Dukeling's suggestion of "doi: ([^\s]+).? ([^:]+).?", for reasons unknown, were:
Chem Commun (Camb). 2012 Dec 25;48(99):12094-6. doi: 10.1039/c2cc35588d.
Org Biomol Chem. 2013 Jan 7;11(1):27-30. doi: 10.1039/c2ob26587g.
Chem Commun (Camb). 2013 Jan 25;49(7):671-3. doi: 10.1039/c2cc37953h.
Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.
Chemistry. 2010 Jul 26;16(28):8537-44. doi: 10.1002/chem.201000773.
If you only want the . gone, this seems to work:
"doi: ([^\s]+)\."
So we're just putting the . outside the brackets, so it doesn't get grouped with the string.
If you want to extract 10.1039/c1ob05128h and Epub 2011 Mar 28 in 2 separate strings, you can do this with groups. You can make the regex something like:
"doi: ([^\s]+)\.(?: ([^:]+)\.)?"
Given that the second part appears to be optional, we need to surround it with brackets, which we mark as optional with ? (the ?: makes it a non-capturing group, so that the second cell receives the inner group you want rather than the whole optional part).
And Google seems to automatically fill =CONTINUE(..., 1, 2) into the next cell, which gives you the two groups next to one another.
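To see what the two groups capture, the same pattern can be tried in Python. This is a sketch, not the Sheets formula: split_citation is a hypothetical helper, and Python's re module is assumed to behave like the Sheets regex engine for this particular pattern.

```python
import re

# Group 1: the DOI without its trailing dot.
# Group 2: the optional trailing note, also without its dot.
DOI_RE = re.compile(r"doi: ([^\s]+)\.(?: ([^:]+)\.)?")


def split_citation(text):
    """Return (doi, note); note is None when nothing follows the DOI."""
    match = DOI_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_citation(
    "Org Biomol Chem. 2011 May 7;9(9):3549-59. doi: 10.1039/c1ob05128h. Epub 2011 Mar 28."
))
print(split_citation("Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b."))
```

Because the inner group is non-capturing, only the DOI and the note are exposed as groups 1 and 2.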
The pursuit to make the .'s optional
At first I tried just saying \.?, but obviously the [^\s]+ will then consume the . (which is not desired).
So you need to include something inside the brackets to prevent this. Specifically, you need to check the last character and make sure it's not a dot.
This led me to:
"doi: ([^\s]*[^.\s])\.?(?: ([^:]*[^.:])\.?)?"
This allows for optional dots, but it won't work if there is more than one dot at the end. Assuming we want none of these in our output, it's easily fixed by changing the \.? to \.*.
"doi: ([^\s]*[^.\s])\.*(?: ([^:]*[^.:])\.*)?"
I believe this may do the trick:
/doi: ((\S+)(?:\. .+)?)\.$/
The outermost group (which captures the longer string) is capture group 1, and the innermost group is capture group 2.
=REGEXEXTRACT(cell;"doi: ([.\d]+\/[\w\.]+)\.(?: |$)")
--> it extracts 10.1039/c1ob05128h
No need to combine regular expressions, it can be done at once.
I tried it on all of your examples and it works.
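A quick way to check this pattern against the examples from the question is a short Python loop. This is a sketch under the assumption that Python's re module treats the pattern the same way as REGEXEXTRACT; extract_doi and examples are hypothetical names.

```python
import re

# Prefix digits and dots, a slash, then word characters and dots,
# stopping before the final dot that is followed by a space or the end.
DOI_ONLY_RE = re.compile(r"doi: ([.\d]+/[\w.]+)\.(?: |$)")


def extract_doi(text):
    """Return the DOI without its trailing dot, or None if none is found."""
    match = DOI_ONLY_RE.search(text)
    return match.group(1) if match else None


examples = [
    "Org Biomol Chem. 2011 May 7;9(9):3549-59. doi: 10.1039/c1ob05128h. Epub 2011 Mar 28.",
    "Small. 2010 Dec 20;6(24):2796-820. doi: 10.1002/smll.201001881. Review.",
    "Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.",
]
for line in examples:
    print(extract_doi(line))
```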