Regex greedy match failed? - regex

I have the below 2 lines out of other lines that I need to match.
2015 - 2016 2016 - 2017 2017 - 2018 2018 - 2019 2019 - 2020 2020
2015 - 2016 2016 - 2017 2017 - 2018 2018 - 2019 2019 - 2020 APR 2021
The top line should match and yield the year value '2020'; the bottom line should match and yield the month and the year, 'APR 2021'.
I'm able to match both using this regex (^\d.*(?<month>[a-zA-Z]{3})?\s(?<year>\d{4})$) to match these 2 lines but for the bottom matched line, it does not give me the month value.
Any idea?

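The greedy .* consumes the month, and because the month group is optional, the regex still succeeds without capturing it. If every character before the month is going to be a digit, dash or space, explicitly match those (e.g. [0-9\- ]*) rather than anything (e.g. .*):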
# the explicit character class stands where the 'anything goes' .* was matched
^\d[0-9 \-]*(?<month>[a-zA-Z]{3})?\s(?<year>\d{4})$
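To see the difference, here is a minimal Python sketch that runs both patterns on the two sample lines. Note that Python's re module spells named groups (?P<name>...) rather than (?<name>...); the names GREEDY, EXPLICIT and extract are made up for this illustration.

```python
import re

LINES = [
    "2015 - 2016 2016 - 2017 2017 - 2018 2018 - 2019 2019 - 2020 2020",
    "2015 - 2016 2016 - 2017 2017 - 2018 2018 - 2019 2019 - 2020 APR 2021",
]

# Original pattern: the greedy .* can swallow the optional month.
GREEDY = re.compile(r"^\d.*(?P<month>[a-zA-Z]{3})?\s(?P<year>\d{4})$")
# Suggested fix: only digits, spaces and dashes may precede the month.
EXPLICIT = re.compile(r"^\d[0-9 \-]*(?P<month>[a-zA-Z]{3})?\s(?P<year>\d{4})$")


def extract(pattern, line):
    """Return (month, year) for a matching line, or None if it does not match."""
    m = pattern.match(line)
    return (m.group("month"), m.group("year")) if m else None


for line in LINES:
    print(extract(GREEDY, line), extract(EXPLICIT, line))
# (None, '2020') (None, '2020')
# (None, '2021') ('APR', '2021')
```

With the explicit class, the prefix cannot run past the letters of the month, so the optional group gets its chance to match.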

Related

Regular Expression that matches a set of characters or nothing

I have a set of text coming in as:
1 240R 15 Apr 2021 240 Litre Sulo Bin Recycling - Dkt#11053610 O/N: ONE DENTAL $10.00.
1 1.5cm 06 Apr 2021 1.5m Co-Mingle Recycling Bin - Dkt#11028471 $18.00.
1 1.5m 12 Apr 2021 Service 1.5m Front Lift bin - Dkt#11028421 $24.00
1 660 14 Apr 2021 660L Rear Lift Bin - Dkt#11156377 O/N: YOUR CAR SOLD $22.50
I am trying the regex: PCRE(PHP<7.3)
^(\d+) [^\$]+ (\d+ \w+ \d+) (\d.+ \w.+) \S (Dkt\S\d+)[^\$]+ (\$.*)$
This parses the lines above. However, I am unable to capture "O/N: ONE DENTAL" or "O/N: YOUR CAR SOLD" from them.
How can I make it so that anything after Dkt#xxxx is also captured when present, and nothing is captured otherwise?
Since you already have the pattern, I made a few slight modifications to it. This is based on the example strings provided and may need further modification if your data has other variations.
Example:
^(\d) (\S+) (\d{2} \w{3} \d{4}) (.*)\s+-\s(Dkt\S+)\s*(.*?)\s*(\$\d+(?:\.\d+)?)\.?\s*$
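As a rough check, the pattern can be tried on the four sample lines in Python (the question targets PCRE, but this pattern uses no PCRE-only syntax). The field names and the parse_row helper below are labels invented for this sketch, not part of the original answer.

```python
import re

ROW = re.compile(
    r"^(\d) (\S+) (\d{2} \w{3} \d{4}) (.*)\s+-\s(Dkt\S+)\s*(.*?)\s*(\$\d+(?:\.\d+)?)\.?\s*$"
)

LINES = [
    "1 240R 15 Apr 2021 240 Litre Sulo Bin Recycling - Dkt#11053610 O/N: ONE DENTAL $10.00.",
    "1 1.5cm 06 Apr 2021 1.5m Co-Mingle Recycling Bin - Dkt#11028471 $18.00.",
    "1 1.5m 12 Apr 2021 Service 1.5m Front Lift bin - Dkt#11028421 $24.00",
    "1 660 14 Apr 2021 660L Rear Lift Bin - Dkt#11156377 O/N: YOUR CAR SOLD $22.50",
]


def parse_row(line):
    """Split one line into named fields; 'note' is None when nothing follows the docket."""
    m = ROW.match(line)
    if m is None:
        return None
    qty, code, date, description, docket, note, price = m.groups()
    return {
        "qty": qty,
        "code": code,
        "date": date,
        "description": description,
        "docket": docket,
        "note": note or None,  # the lazy (.*?) matches an empty string when absent
        "price": price,
    }


for line in LINES:
    row = parse_row(line)
    print(row["docket"], row["note"], row["price"])
```

The lazy (.*?) between the docket and the price is what makes the "something or nothing" part work: it expands only as far as needed for the price to match at the end of the line.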

Is there a way to make this Regex expression less greedy. Using Excel VBA

I have an excel document with a cell with the following information:
A) Current to: Notice of 19 June 2014 Sent on: August 2012
B) Updated on: October 2018
C) Updated: 14 January 2009
I am using the following regular expression with some success:
(\b\d{1,2}\D{0,3})?\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?))\D?(\d{1,2}\D?)?\D?((19[7-9]\d|20\d{2})|\d{2})
the output I have is as follows:
A) 19 June2014August2012
B) October 2018
C) 14 January 2009
B and C extract fine, however I am expecting 19 June 2014 for A
I have tried adding .*? to make the expression less greedy, but (depending where I add the dot-star), I either get no result or I get an inaccurate response
To get only the first match on each line, add ^.*? at the beginning of the pattern, wrap what you have in capturing parentheses, and set the Multiline regex property to True. Your match is then inside match.Submatches(0).
regEx.Pattern = "^.*?((?:\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}\D?)?\D?(?:(?:19[7-9]\d|20\d{2})|\d{2}))"
regEx.Multiline = True
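The same idea can be sketched outside VBA. The Python snippet below is only an illustration of the ^.*? technique, assuming the three sample lines sit in one multi-line string; MONTH, DATE, FIRST_DATE_PER_LINE and CELL are names invented here.

```python
import re

MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
    r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)"
)
DATE = (
    r"(?:\b\d{1,2}\D{0,3})?\b" + MONTH
    + r"\D?(?:\d{1,2}\D?)?\D?(?:(?:19[7-9]\d|20\d{2})|\d{2})"
)
# ^.*? lazily skips to the first date on a line; with MULTILINE, ^ matches at
# every line start, so each line contributes at most one match.
FIRST_DATE_PER_LINE = re.compile(r"^.*?(" + DATE + r")", re.MULTILINE)

CELL = (
    "A) Current to: Notice of 19 June 2014 Sent on: August 2012\n"
    "B) Updated on: October 2018\n"
    "C) Updated: 14 January 2009"
)

first_dates = [m.group(1) for m in FIRST_DATE_PER_LINE.finditer(CELL)]
print(first_dates)  # ['19 June 2014', 'October 2018', '14 January 2009']
```

Group 1 here plays the role of Submatches(0) in the VBA version; the second date on line A is skipped because a new match can only begin at the start of a line.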

Regular expression that returns two dates

I'm struggling to create a regex that, given this string:
20 - 29 APR 2017, 9 nights
returns two groups, first:
20 APR 2017
and second:
29 APR 2017
You can try this regex:
(\d+)\s*-\s*(\d+)\s*([a-zA-Z]+)\s*(\d+).*
And replace by:
$1 $3 $4\n$2 $3 $4
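A minimal Python sketch of this search-and-replace, for illustration only: Python writes backreferences as \1 rather than $1, and the split_range helper is a made-up name.

```python
import re

RANGE = re.compile(r"(\d+)\s*-\s*(\d+)\s*([a-zA-Z]+)\s*(\d+).*")


def split_range(text):
    """Turn '20 - 29 APR 2017, ...' into two full dates."""
    # Groups: 1 = first day, 2 = second day, 3 = month, 4 = year.
    replaced = RANGE.sub(r"\1 \3 \4\n\2 \3 \4", text)
    return replaced.split("\n")


print(split_range("20 - 29 APR 2017, 9 nights"))  # ['20 APR 2017', '29 APR 2017']
```

The trailing .* in the pattern swallows the rest of the line (", 9 nights"), so only the two rebuilt dates remain after the substitution.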

Regex for capturing different date formats

I'm tasked with capturing dates for itineraries in email messages, but the dates come in many different formats. Is there any way to capture the following formats?
02 APR
APR 02
2 APR
APR 2
2nd APR
APR 2nd
2nd April
April 2nd
APR 12th
April 12th
12th April
April 13-16
13-16 April
APR 13-16
13-16 APR
April 13th-16th
13th-16th April
APR 13th-16th
13th-16th APR
I've tried numerous approaches but could not get any of them to work, as I'm a newbie to regex.
The closest I could get was using this:
(\d*)-(\d*) APR|April \d*\d*
EDIT: I found out that I've missed some more formats:
13th - 16th APR
13~16 April
13/16 APR
I've tried using the following:
(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\ *\d+(?:[nr]d|th|st)?(?: * \d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?: . \d+(?:[nr]d|th|st)?)?\ *(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
It can capture dates either with spaces or without spaces around the separator, but not both.
Is there a way to capture all formats, split the dates on '-', '/' or '~', and write them out in a single standardized format?
(Group 1 Date)-Month (Group 2 Date)-Month eg: 13-Apr 16-Apr
I'd appreciate any suggestions and comments.
You need to account for optional values. Here is an enhanced version matching your sample input:
/(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?\s*Apr(?:il)?|Apr(?:il)?\s*(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?/i
Note that you need a case-insensitive modifier to match every variant of April.
Basically, there are 2 alternatives matching April and date ranges:
(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?\s*Apr(?:il)? - 1+ digits followed by an optional st, nd, rd or th, then an optional hyphen, then 0+ digits followed by an optional st, etc., then 0+ whitespace and finally Apr or April (case insensitive due to the /i modifier)
| - or
Apr(?:il)?\s*(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)? - the same as above but swapped.
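To get from a match to the standardized output the question asks for (e.g. 13-Apr 16-Apr), something along these lines could work. This is a sketch in Python, not the answer's exact regex: it keeps the same two-alternative structure but, as an assumption of mine, accepts any month and allows '-', '/' or '~' with optional spaces as the range separator. ORD, SEP, DAYS, MONTH and normalize are hypothetical names.

```python
import re

ORD = r"(?:st|[nr]d|th)?"   # optional ordinal suffix
SEP = r"\s*[-/~]\s*"        # assumed range separators: '-', '/', '~'
DAYS = rf"(\d+){ORD}(?:{SEP}(\d+){ORD})?"
MONTH = (
    r"(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
    r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)
# Alternative 1: days then month (groups 1, 2, 3).
# Alternative 2: month then days (groups 4, 5, 6).
PATTERN = re.compile(
    rf"\b{DAYS}\s*{MONTH}\b|\b{MONTH}\s*{DAYS}\b", re.IGNORECASE
)


def normalize(text):
    """Return 'D-Mon' or 'D1-Mon D2-Mon' for the first date found, else None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    day1 = m.group(1) or m.group(5)
    day2 = m.group(2) or m.group(6)
    month = (m.group(3) or m.group(4))[:3].title()
    days = [day1] + ([day2] if day2 else [])
    return " ".join(f"{int(d)}-{month}" for d in days)


for sample in ["02 APR", "April 2nd", "13th-16th April", "13th - 16th APR", "13~16 April"]:
    print(sample, "->", normalize(sample))
```

Truncating the month to three letters and title-casing it gives one spelling regardless of how the month was written in the input.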
I came up with this Regex:
(?:APR|April)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:APR|April)
Maybe it's overkill, but I came up with this regex that will match with any month:
(?:January|JAN|February|FEB|March|MAR|April|APR|May|MAY|June|JUN|July|JUL|August|AUG|September|SEP|October|OCT|November|NOV|December|DEC)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:January|JAN|February|FEB|March|MAR|April|APR|May|MAY|June|JUN|July|JUL|August|AUG|September|SEP|October|OCT|November|NOV|December|DEC)
Unreadable, check here if you want details: Regex101
Improved version using Wiktor Stribiżew's trick:
(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
It matches every month and uses fewer steps (more efficient),
BUT you need to make sure the match is case-insensitive.
I came up with this:
(\d+(?:th|st|[nr]d)?(?:-\d+(?:th|st|[nr]d)?)?\s*(?:APR|April))|((?:APR|April)\s*\d+(?:th|st|[nr]d)?(?:-\d+(?:th|st|[nr]d)?)?)

Regular expression: "between" : and a space

I know there are a load of similar "combine 2 regular expressions" posts on here, but I've tried the solutions and keep getting errors.
I've got the regular expressions to parse a description such as:
Org Biomol Chem. 2011 May 7;9(9):3549-59. doi: 10.1039/c1ob05128h. Epub 2011 Mar 28.
to extract the DOI (Digital Object Identifier):
([^:]+$) --> 10.1039/c1ob05128h. Epub 2011 Mar 28.
([^\s]+) --> 10.1039/c1ob05128h.
But am pretty clueless as to how to combine these. If it's difficult, then it's not necessary, but would simplify my calculations.
I also can't figure out how to get rid of that last "." which is not part of the DOI string (for the record there may be more than 2 full stops in a DOI so the regex can't simply be "after the 2nd full stop").
Some other examples as requested:
Chem Soc Rev. 2008 Nov;37(11):2413-21. doi: 10.1039/b719548f. Epub 2008 Sep 16.
Small. 2010 Dec 20;6(24):2796-820. doi: 10.1002/smll.201001881. Review.
Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.
Chemistry. 2010 Dec 27;16(48):14285-9. doi: 10.1002/chem.201002111. No abstract available.
All the attempts I've made so far give a result much the same as this:
Some of the exceptions to Dukeling's suggestion of "doi: ([^\s]+).? ([^:]+).?", for reasons unknown, were:
Chem Commun (Camb). 2012 Dec 25;48(99):12094-6. doi: 10.1039/c2cc35588d.
Org Biomol Chem. 2013 Jan 7;11(1):27-30. doi: 10.1039/c2ob26587g.
Chem Commun (Camb). 2013 Jan 25;49(7):671-3. doi: 10.1039/c2cc37953h.
Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.
Chemistry. 2010 Jul 26;16(28):8537-44. doi: 10.1002/chem.201000773.
If you only want the . gone, this seems to work:
"doi: ([^\s]+)\."
So we're just putting the . outside the brackets, so it doesn't get grouped with the string.
If you want to extract 10.1039/c1ob05128h and Epub 2011 Mar 28 in 2 separate strings, you can do this with groups. You can make the regex something like:
"doi: ([^\s]+)\.(?: ([^:]+)\.)?"
Given that the second part appears to be optional, we need to surround it with brackets which we mark as optional with ? (the ?: makes the wrapper a non-capturing group, so it doesn't end up in your second cell in place of what you want).
And Google seems to automatically fill =CONTINUE(..., 1, 2) into the next cell, which gives you the two groups next to one another.
The pursuit to make the .'s optional
At first I tried just saying \.?, but obviously the [^\s]+ will then consume the . (which is not desired).
So you need to include something inside the brackets to prevent this. Specifically, you need to check the last character and make sure it's not a '.'.
This led me to:
"doi: ([^\s]*[^.\s])\.?(?: ([^:]*[^.:])\.?)?"
This allows for optional .'s, but if there is more than one . at the end, it won't work. Assuming we want none of these in our output, it's easily fixed by changing each \.? to \.*:
"doi: ([^\s]*[^.\s])\.*(?: ([^:]*[^.:])\.*)?"
I believe this may do the trick:
/doi: ((\S+)(?:\. .+)?)\.$/
The outermost group (which captures the longer string) is capture group 1, and the innermost group is capture group 2.
=REGEXEXTRACT(cell;"doi: ([.\d]+\/[\w\.]+)\.(?: |$)")
--> it extracts 10.1039/c1ob05128h
No need to combine regular expressions, it can be done at once.
I tried it on all of your examples and it works.
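For comparison, the final two-group pattern from the first answer and the REGEXEXTRACT pattern can both be tried outside the spreadsheet. This is a minimal Python sketch, assuming the patterns behave the same in Python's re module as in Google Sheets; TWO_PART, DOI_ONLY and SAMPLES are names made up here.

```python
import re

# DOI plus the optional trailing note, with closing full stops stripped.
TWO_PART = re.compile(r"doi: ([^\s]*[^.\s])\.*(?: ([^:]*[^.:])\.*)?")
# DOI only, as used in the REGEXEXTRACT formula.
DOI_ONLY = re.compile(r"doi: ([.\d]+\/[\w\.]+)\.(?: |$)")

SAMPLES = [
    "Org Biomol Chem. 2011 May 7;9(9):3549-59. doi: 10.1039/c1ob05128h. Epub 2011 Mar 28.",
    "Small. 2010 Dec 20;6(24):2796-820. doi: 10.1002/smll.201001881. Review.",
    "Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.",
]

for text in SAMPLES:
    doi, note = TWO_PART.search(text).groups()
    print(doi, "|", note, "|", DOI_ONLY.search(text).group(1))
```

In both patterns the greedy part first runs over the closing full stop and then backtracks by one character, which is what keeps the final "." out of the captured DOI.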