Regex to extract non standard date format


I need a regex to grab the date section (e.g. 23 March 2014) from a string with the following format:
23 March 2014 and includes 50,000 67 A Part 500
I have tried the pattern below, but it doesn't work. Any ideas?
preg_match("/[0-9]+[ ][a-zA-Z][ ][201][0-9]$/",$str,$matchx);
I am using PHP.

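You need the right quantifiers: [a-zA-Z] without a + matches only a single letter. Don't use a character class for the literal string 201 or for spaces (the former is the real problem, since [201] matches just one of the characters 2, 0 or 1; the latter is merely unnecessary). You also have to allow for text that comes after the date, so get rid of the $: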
preg_match("/[1-9][0-9]? [A-Z][a-zA-Z]+ 201[0-9]/",$str,$matchx);
By the way, the form I chose does not allow 09 March (no leading zero on the day), and it expects the first letter of the month to be capitalized. You can easily change that, of course.
Test:
<?php
$str="23 March 2014 and includes 50,000 67 A Part 500";
preg_match("/[1-9][0-9]? [A-Z][a-zA-Z]+ 201[0-9]/",$str,$matchx);
print_r($matchx);
?>
Result:
Array
(
[0] => 23 March 2014
)
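For readers outside PHP, the same check can be sketched in Python. This is only an illustration: the variable names are made up here, and Python's re module is assumed to treat these two simple patterns the same way PCRE does.

```python
import re

text = "23 March 2014 and includes 50,000 67 A Part 500"

# Pattern from the question: one letter only, [201] is a single character, $ anchors at the end.
original = re.compile(r"[0-9]+[ ][a-zA-Z][ ][201][0-9]$")
# Pattern from the answer: quantified month name, literal 201, no end anchor.
fixed = re.compile(r"[1-9][0-9]? [A-Z][a-zA-Z]+ 201[0-9]")

original_match = original.search(text)
fixed_match = fixed.search(text)

print(original_match)        # no match for the question's pattern
print(fixed_match.group(0))  # the date section
```

The fixed pattern still only covers the years 2010 to 2019, exactly as in the answer.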

Related

Regular Expression that matches a set of characters or nothing

I have a set of text coming in as:
1 240R 15 Apr 2021 240 Litre Sulo Bin Recycling - Dkt#11053610 O/N: ONE DENTAL $10.00.
1 1.5cm 06 Apr 2021 1.5m Co-Mingle Recycling Bin - Dkt#11028471 $18.00.
1 1.5m 12 Apr 2021 Service 1.5m Front Lift bin - Dkt#11028421 $24.00
1 660 14 Apr 2021 660L Rear Lift Bin - Dkt#11156377 O/N: YOUR CAR SOLD $22.50
I am trying this regex, PCRE (PHP < 7.3):
^(\d+) [^\$]+ (\d+ \w+ \d+) (\d.+ \w.+) \S (Dkt\S\d+)[^\$]+ (\$.*)$
It parses the lines above. However, I am unable to capture "O/N: ONE DENTAL" or "O/N: YOUR CAR SOLD" from the list.
How can I make it capture whatever follows Dkt#xxxx if something is there, and nothing otherwise?

Since you already have the pattern, I made a few slight modifications to it. This is based on the example strings provided and may need further changes if your data has other variations.
Example:
^(\d) (\S+) (\d{2} \w{3} \d{4}) (.*)\s+-\s(Dkt\S+)\s*(.*?)\s*(\$\d+(?:\.\d+)?)\.?\s*$
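A quick way to see what the optional group captures is to run the pattern over the four sample lines. The sketch below uses Python rather than PHP, on the assumption that this pattern behaves the same in both engines; the names LINE, rows and parsed are illustrative.

```python
import re

LINE = re.compile(
    r"^(\d) (\S+) (\d{2} \w{3} \d{4}) (.*)\s+-\s(Dkt\S+)"
    r"\s*(.*?)\s*(\$\d+(?:\.\d+)?)\.?\s*$"
)

rows = [
    "1 240R 15 Apr 2021 240 Litre Sulo Bin Recycling - Dkt#11053610 O/N: ONE DENTAL $10.00.",
    "1 1.5cm 06 Apr 2021 1.5m Co-Mingle Recycling Bin - Dkt#11028471 $18.00.",
    "1 1.5m 12 Apr 2021 Service 1.5m Front Lift bin - Dkt#11028421 $24.00",
    "1 660 14 Apr 2021 660L Rear Lift Bin - Dkt#11156377 O/N: YOUR CAR SOLD $22.50",
]

parsed = [LINE.match(row).groups() for row in rows]
for fields in parsed:
    # fields[5] is the optional text after the docket number ("" when absent)
    print(fields[4], repr(fields[5]), fields[6])
```

The lazy (.*?) between the docket number and the price is what makes the "something or nothing" part work: it yields an empty string when the price follows the docket number directly.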

Is there a way to make this Regex expression less greedy. Using Excel VBA

I have an Excel document with a cell containing the following information:
A) Current to: Notice of 19 June 2014 Sent on: August 2012
B) Updated on: October 2018
C) Updated: 14 January 2009
I am using the following regular expression with some success:
(\b\d{1,2}\D{0,3})?\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?))\D?(\d{1,2}\D?)?\D?((19[7-9]\d|20\d{2})|\d{2})
The output I get is as follows:
A) 19 June2014August2012
B) October 2018
C) 14 January 2009
B and C extract fine; however, I am expecting 19 June 2014 for A.
I have tried adding .*? to make the expression less greedy, but depending on where I add it, I either get no result or an inaccurate one.

To return only the first match on each line, add ^.*? at the beginning of the pattern, wrap what you have in capturing parentheses (with the inner groups made non-capturing) and set the Multiline regex property to True. Your match is then in match.Submatches(0).
regEx.Pattern = "^.*?((?:\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}\D?)?\D?(?:(?:19[7-9]\d|20\d{2})|\d{2}))"
regEx.Multiline = True
See the regex demo.
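The effect of the leading ^.*? can be checked outside VBA as well. The following is a rough Python sketch, assuming re.MULTILINE plays the role of the Multiline property and group(1) that of Submatches(0); MONTH, FIRST_DATE and cell are names invented for the example.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|"
         r"Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)")

# Same pattern as in the answer, split into pieces for readability.
FIRST_DATE = re.compile(
    r"^.*?((?:\b\d{1,2}\D{0,3})?\b" + MONTH +
    r"\D?(?:\d{1,2}\D?)?\D?(?:(?:19[7-9]\d|20\d{2})|\d{2}))",
    re.MULTILINE,
)

cell = (
    "A) Current to: Notice of 19 June 2014 Sent on: August 2012\n"
    "B) Updated on: October 2018\n"
    "C) Updated: 14 January 2009"
)

# Because every match must start at a line start, each line yields at most one date.
first_dates = [m.group(1) for m in FIRST_DATE.finditer(cell)]
print(first_dates)
```

Anchoring at ^ is what drops the second date on line A: once the first date is consumed, no later position on that line is a line start.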

Regex for capturing different date formats

I'm tasked with capturing dates for itineraries in email messages, but the dates come in many different formats. Is there any way to capture all of the following?
02 APR
APR 02
2 APR
APR 2
2nd APR
APR 2nd
2nd April
April 2nd
APR 12th
April 12th
12th April
April 13-16
13-16 April
APR 13-16
13-16 APR
April 13th-16th
13th-16th April
APR 13th-16th
13th-16th APR
I've tried numerous ways but just could not fathom it, as I'm a newbie to regex.
The closest I could get was using this:
(\d*)-(\d*) APR|April \d*\d*
EDIT: I found out that I've missed some more formats.
13th - 16th APR
13~16 April
13/16 APR
I've tried using the following:
(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\ *\d+(?:[nr]d|th|st)?(?: * \d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?: . \d+(?:[nr]d|th|st)?)?\ *(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
It captures dates either with spaces or without them, but not both.
Is there a way to capture all formats, split the dates on '-', '/' or '~', and write them out in a single standardized format?
(Group 1 Date)-Month (Group 2 Date)-Month, e.g. 13-Apr 16-Apr
I'd appreciate any suggestions and comments.

You need to account for optional values. Here is an enhanced version matching your sample input:
/(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?\s*Apr(?:il)?|Apr(?:il)?\s*(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?/i
See the regex demo (note you need to use a case-insensitive modifier to match any variants of April)
Basically, there are 2 alternatives matching April and date ranges:
(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?\s*Apr(?:il)? - 1+ digits followed with an optional st, nd, rd, th, followed with an optional hyphen, followed with 0+ digits, followed with optional st, etc. followed with 0+ whitespace and then Apr or April (case insensitive due to /i modifier)
| - or
Apr(?:il)?\s*(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)? - the same as above but swapped.
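To get from the two alternatives to the requested 13-Apr 16-Apr output, the four capture groups have to be merged. Here is a minimal Python sketch of that step; normalize_april is a hypothetical helper, not part of the answer, and it only handles April and a plain hyphen, just like the pattern itself.

```python
import re

APRIL_RANGE = re.compile(
    r"(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?\s*Apr(?:il)?"
    r"|Apr(?:il)?\s*(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?",
    re.IGNORECASE,
)

def normalize_april(text):
    """Return e.g. '13-Apr 16-Apr' for the first April date or range, else None."""
    m = APRIL_RANGE.search(text)
    if m is None:
        return None
    # Groups 1/2 belong to the "day first" alternative, 3/4 to "month first".
    start = m.group(1) or m.group(3)
    end = m.group(2) or m.group(4)
    days = [start] + ([end] if end else [])
    return " ".join(f"{int(day)}-Apr" for day in days)

for sample in ["02 APR", "April 2nd", "13-16 APR", "April 13th-16th"]:
    print(sample, "->", normalize_april(sample))
```

The separators added in the question's edit (a spaced hyphen, ~ and /) are not covered by this pattern and would need the -? part widened.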

I came up with this Regex:
(?:APR|April)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:APR|April)
See details here: Regex101
Maybe it's overkill, but I came up with this regex that will match with any month:
(?:January|JAN|February|FEB|March|MAR|April|APR|May|MAY|June|JUN|July|JUL|August|AUG|September|SEP|October|OCT|November|NOV|December|DEC)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:January|JAN|February|FEB|March|MAR|April|APR|May|MAY|June|JUN|July|JUL|August|AUG|September|SEP|October|OCT|November|NOV|December|DEC)
Unreadable, check here if you want details: Regex101
Improved version using Wiktor Stribiżew's trick:
(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
See details here: Regex101
It matches every month and takes fewer steps (so it is more efficient),
BUT you need to make sure matching is case-insensitive.
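Since the full pattern is hard to read on one line, one option is to assemble it from named pieces. The sketch below does this in Python with the case-insensitive flag set; MONTH, DAY, DAYS and PATTERN are illustrative names, and the sample sentence is invented.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
         r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
DAY = r"\d+(?:[nr]d|th|st)?"    # a day with an optional ordinal suffix
DAYS = rf"{DAY}(?:-{DAY})?"     # a single day or a hyphenated range

# Month first, or day(s) first: the same two alternatives as the answer's pattern.
PATTERN = re.compile(rf"{MONTH}\ *{DAYS}|{DAYS}\ *{MONTH}", re.IGNORECASE)

found = PATTERN.findall("Depart APR 13th-16th, return 2nd May")
print(found)
```

The assembled expression is the same as the one-line version; only the way it is written down changes.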

I came up with this:
(\d+(?:th|st|[nr]d)?(?:-\d+(?:th|st|[nr]d)?)?\s*(?:APR|April))|((?:APR|April)\s*\d+(?:th|st|[nr]d)?(?:-\d+(?:th|st|[nr]d)?)?)
Live Demo

Parsing dates from string - regex

I'm terrible with regex and I can't seem to wrap my head around this simple task.
I need to parse out the two dates in a string which always has one of two formats:
"Inquiry at your property for December 29, 2013 - January 03, 2014"
OR
"Inquiry at your property for 29 December , 2013 - 03 January, 2014"
The two different date formats are throwing me off. Any insights would be appreciated!

/(\d+ \w+, \d+|\w+ \d+, \d+)/ for example. Try it out on Rubular.
Of course, it would also pick up other things, like 2013 NotReallyAMonth, 12345. But if your input doesn't contain text that looks like a date without being one, this might work.
You could make the regexp stricter by applying more restrictions on what is matched:
/(\d{2} (?:January|December), \d{4}|(?:January|December) \d{2}, \d{4})/
In this case the day is always two digits and the year always four. Months are listed explicitly (you would have to list all of them).
Update: For ranges it would be a different regexp:
/((?:Jan|Dec) \d+ - \d+, \d{4})/
Obviously they can all be combined together:
/(\d{2} (?:January|December), \d{4}|(?:January|December) \d{2}, \d{4}|(?:Jan|Dec) \d+ - \d+, \d{4})/
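As a rough illustration of the stricter variant with every month listed, here is a Python sketch run against the two sample strings from the question. It assumes the stray space before the comma in "29 December , 2013" is real input, so it allows optional whitespace around the comma; MONTHS, DATE and samples are names made up for the example.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Either "29 December" or "December 29", then a comma (spaces optional), then a 4-digit year.
DATE = re.compile(
    rf"\b(?:\d{{1,2}} (?:{MONTHS})|(?:{MONTHS}) \d{{1,2}})\s*,\s*\d{{4}}\b"
)

samples = [
    "Inquiry at your property for December 29, 2013 - January 03, 2014",
    "Inquiry at your property for 29 December , 2013 - 03 January, 2014",
]

results = [DATE.findall(sample) for sample in samples]
for dates in results:
    print(dates)
```

Each string yields its two dates as matched text; turning them into date objects would be a separate step.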

Regular expression: "between" : and a space

I know there are a load of similar "combine 2 regular expressions" posts on here, but I've tried the solutions and keep getting errors.
I've got the regular expressions to parse a description such as:
Org Biomol Chem. 2011 May 7;9(9):3549-59. doi: 10.1039/c1ob05128h. Epub 2011 Mar 28.
to extract the DOI (Digital Object Identifier):
([^:]+$) --> 10.1039/c1ob05128h. Epub 2011 Mar 28.
([^\s]+) --> 10.1039/c1ob05128h.
But am pretty clueless as to how to combine these. If it's difficult, then it's not necessary, but would simplify my calculations.
I also can't figure out how to get rid of that last "." which is not part of the DOI string (for the record there may be more than 2 full stops in a DOI so the regex can't simply be "after the 2nd full stop").
Some other examples as requested:
Chem Soc Rev. 2008 Nov;37(11):2413-21. doi: 10.1039/b719548f. Epub 2008 Sep 16.
Small. 2010 Dec 20;6(24):2796-820. doi: 10.1002/smll.201001881. Review.
Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.
Chemistry. 2010 Dec 27;16(48):14285-9. doi: 10.1002/chem.201002111. No abstract available.
All the attempts I've made so far give a result much the same as this:
Some of the exceptions to Dukeling's suggestion of "doi: ([^\s]+).? ([^:]+).?", for reasons unknown, were:
Chem Commun (Camb). 2012 Dec 25;48(99):12094-6. doi: 10.1039/c2cc35588d.
Org Biomol Chem. 2013 Jan 7;11(1):27-30. doi: 10.1039/c2ob26587g.
Chem Commun (Camb). 2013 Jan 25;49(7):671-3. doi: 10.1039/c2cc37953h.
Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.
Chemistry. 2010 Jul 26;16(28):8537-44. doi: 10.1002/chem.201000773.

If you only want the . gone, this seems to work:
"doi: ([^\s]+)\."
So we're just putting the . outside the brackets, so it doesn't get grouped with the string.
If you want to extract 10.1039/c1ob05128h and Epub 2011 Mar 28 in 2 separate strings, you can do this with groups. You can make the regex something like:
"doi: ([^\s]+)\.(?: ([^:]+)\.)?"
Given that the second part appears to be optional, we need to surround it with brackets which we mark as optional with ?. The ?: makes that outer group non-capturing, so your second cell gets the inner group (what you want) rather than the whole optional part.
And Google seems to automatically fill =CONTINUE(..., 1, 2) into the next cell, which gives you the two groups next to one another.
The pursuit to make the .'s optional
At first I tried just saying \.?, but the greedy [^\s]+ then consumes the . itself (which is not desired).
So you need to include something inside the brackets to prevent this. Specifically, you need to check the last character and make sure it's not a dot.
This led me to:
"doi: ([^\s]*[^.\s])\.?(?: ([^:]*[^.:])\.?)?"
This allows for optional .'s, but if there is more than one . at the end, it won't work. Assuming we want none of these in our output, it's easily fixed by changing the \.? to \.*.
"doi: ([^\s]*[^.\s])\.*(?: ([^:]*[^.:])\.*)?"

I believe this may do the trick:
/doi: ((\S+)(?:\. .+)?)\.$/
The outermost group (which captures the longer string) is capture group 1, and the innermost group is capture group 2.

=REGEXEXTRACT(cell;"doi: ([.\d]+\/[\w\.]+)\.(?: |$)")
--> it extracts 10.1039/c1ob05128h
No need to combine regular expressions, it can be done at once.
I tried it on all of your examples and it works.
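To try these patterns outside the spreadsheet, a small harness helps. The Python sketch below uses the final pattern from the first answer (the one ending in \.*) and assumes Python's re module treats it the same way as the spreadsheet's regex engine; extract_doi is a hypothetical helper name.

```python
import re

DOI = re.compile(r"doi: ([^\s]*[^.\s])\.*(?: ([^:]*[^.:])\.*)?")

def extract_doi(citation):
    """Return (doi, trailing_note); trailing_note is None when nothing follows the DOI."""
    m = DOI.search(citation)
    if m is None:
        return None
    return m.group(1), m.group(2)

citations = [
    "Org Biomol Chem. 2011 May 7;9(9):3549-59. doi: 10.1039/c1ob05128h. Epub 2011 Mar 28.",
    "Small. 2010 Dec 20;6(24):2796-820. doi: 10.1002/smll.201001881. Review.",
    "Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.",
]

for citation in citations:
    print(extract_doi(citation))
```

The other patterns in this thread can be swapped into DOI to compare their results on the same citations.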
