Regular expression: "between" : and a space

I know there are a load of similar "combine 2 regular expressions" posts on here, but I've tried the solutions and keep getting errors.
I've got the regular expressions to parse a description such as:
Org Biomol Chem. 2011 May 7;9(9):3549-59. doi: 10.1039/c1ob05128h. Epub 2011 Mar 28.
to extract the DOI (Digital Object Identifier):
([^:]+$) --> 10.1039/c1ob05128h. Epub 2011 Mar 28.
([^\s]+) --> 10.1039/c1ob05128h.
But I'm pretty clueless as to how to combine these. If it's difficult, it's not necessary, but it would simplify my calculations.
I also can't figure out how to get rid of that last "." which is not part of the DOI string (for the record there may be more than 2 full stops in a DOI so the regex can't simply be "after the 2nd full stop").
Some other examples as requested:
Chem Soc Rev. 2008 Nov;37(11):2413-21. doi: 10.1039/b719548f. Epub 2008 Sep 16.
Small. 2010 Dec 20;6(24):2796-820. doi: 10.1002/smll.201001881. Review.
Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.
Chemistry. 2010 Dec 27;16(48):14285-9. doi: 10.1002/chem.201002111. No abstract available.
All the attempts I've made so far give a result much the same as this:
Some of the exceptions to Dukeling's suggestion of "doi: ([^\s]+).? ([^:]+).?", for reasons unknown, were:
Chem Commun (Camb). 2012 Dec 25;48(99):12094-6. doi: 10.1039/c2cc35588d.
Org Biomol Chem. 2013 Jan 7;11(1):27-30. doi: 10.1039/c2ob26587g.
Chem Commun (Camb). 2013 Jan 25;49(7):671-3. doi: 10.1039/c2cc37953h.
Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.
Chemistry. 2010 Jul 26;16(28):8537-44. doi: 10.1002/chem.201000773.

If you only want the . gone, this seems to work:
"doi: ([^\s]+)\."
So we're just putting the . outside the brackets, so it doesn't get grouped with the string.
If you want to extract 10.1039/c1ob05128h and Epub 2011 Mar 28 in 2 separate strings, you can do this with groups. You can make the regex something like:
"doi: ([^\s]+)\.(?: ([^:]+)\.)?"
Given that the second part appears to be optional, we surround it with brackets and mark them as optional with ?. The ?: makes the outer brackets a non-capturing group, so your second cell gets the inner capture you want rather than the whole optional part.
And Google seems to automatically fill =CONTINUE(..., 1, 2) into the next cell, which gives you the two groups next to one another.
The pursuit to make the .'s optional
At first I tried just saying \.?, but obviously the [^\s]+ will then consume the . (which is not desired).
So you need to include something inside the brackets to prevent this. Specifically, you need to check the last character and make sure it's not a full stop.
This led me to:
"doi: ([^\s]*[^.\s])\.?(?: ([^:]*[^.:])\.?)?"
This allows for optional .'s, but if there is more than one . at the end, it won't work. Assuming we want none of these in our output, it's easily fixed by changing the \.? to \.*.
"doi: ([^\s]*[^.\s])\.*(?: ([^:]*[^.:])\.*)?"

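As a rough check outside the spreadsheet, the final pattern above (the one using \.*) can be tried in Python against the sample descriptions from the question. This is only a sketch: the name extract_doi is made up for illustration, and it assumes Python's re module treats this particular pattern the same way the spreadsheet's regex engine does.

```python
import re

# Final pattern from the answer above: group 1 is the DOI, group 2 is the
# optional trailing note, both without their trailing full stops.
DOI_PATTERN = re.compile(r"doi: ([^\s]*[^.\s])\.*(?: ([^:]*[^.:])\.*)?")


def extract_doi(description):
    """Return (doi, note); note is None when nothing follows the DOI.

    extract_doi is a hypothetical helper name, not part of any library.
    """
    match = DOI_PATTERN.search(description)
    if match is None:
        return None
    return match.group(1), match.group(2)


samples = [
    "Org Biomol Chem. 2011 May 7;9(9):3549-59. doi: 10.1039/c1ob05128h. Epub 2011 Mar 28.",
    "Small. 2010 Dec 20;6(24):2796-820. doi: 10.1002/smll.201001881. Review.",
    "Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b.",
]
for sample in samples:
    print(extract_doi(sample))
```

The second element comes back as None for lines that end right after the DOI, which mirrors the optional second group in the pattern.
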
I believe this may do the trick:
/doi: ((\S+)(?:\. .+)?)\.$/
The outermost group (which captures the longer string) is capture group 1, and the innermost group is capture group 2.
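To see which group ends up holding what, here is a small Python sketch of the same pattern. The helper name nested_groups is an assumption for illustration, and the sketch assumes the description ends with a full stop, as the \.$ in the pattern requires.

```python
import re

# Same pattern as above, without the / delimiters.
# Group 1: DOI plus any trailing note; group 2: the DOI alone.
NESTED = re.compile(r"doi: ((\S+)(?:\. .+)?)\.$")


def nested_groups(description):
    """Return (group 1, group 2), or None if the pattern does not match."""
    match = NESTED.search(description)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(nested_groups(
    "Org Biomol Chem. 2011 May 7;9(9):3549-59. doi: 10.1039/c1ob05128h. Epub 2011 Mar 28."
))
print(nested_groups("Org Lett. 2010 Oct 1;12(19):4248-51. doi: 10.1021/ol101920b."))
```

When nothing follows the DOI, both groups hold the same string.
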

=REGEXEXTRACT(cell;"doi: ([.\d]+\/[\w\.]+)\.(?: |$)")
--> it extracts 10.1039/c1ob05128h
No need to combine regular expressions, it can be done at once.
I tried it on all of your examples and it works.

Related

Regular Expression that matches a set of characters or nothing

I have a set of text coming in as:
1 240R 15 Apr 2021 240 Litre Sulo Bin Recycling - Dkt#11053610 O/N: ONE DENTAL $10.00.
1 1.5cm 06 Apr 2021 1.5m Co-Mingle Recycling Bin - Dkt#11028471 $18.00.
1 1.5m 12 Apr 2021 Service 1.5m Front Lift bin - Dkt#11028421 $24.00
1 660 14 Apr 2021 660L Rear Lift Bin - Dkt#11156377 O/N: YOUR CAR SOLD $22.50
I am trying this regex (PCRE, PHP < 7.3):
^(\d+) [^\$]+ (\d+ \w+ \d+) (\d.+ \w.+) \S (Dkt\S\d+)[^\$]+ (\$.*)$
This parses the lines above. However, I am unable to capture "O/N: ONE DENTAL" or "O/N: YOUR CAR SOLD" from the list.
How can I make it so that anything after Dkt#xxxx is also captured if it is present, and nothing otherwise?
Since you already have the pattern, I made a few slight modifications to it. This is based on the example strings provided and may need further changes if your data has other variations.
Example:
^(\d) (\S+) (\d{2} \w{3} \d{4}) (.*)\s+-\s(Dkt\S+)\s*(.*?)\s*(\$\d+(?:\.\d+)?)\.?\s*$
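The pattern can be checked quickly in Python. This is a sketch, not the PHP setup from the question: parse_line is a made-up helper name, and it assumes Python's re module handles this pattern the same way PCRE does (it uses no PCRE-only syntax).

```python
import re

# The modified pattern from the answer. Group 6 is the optional text between
# the docket number and the price; it is an empty string when nothing is there.
LINE = re.compile(
    r"^(\d) (\S+) (\d{2} \w{3} \d{4}) (.*)\s+-\s(Dkt\S+)"
    r"\s*(.*?)\s*(\$\d+(?:\.\d+)?)\.?\s*$"
)


def parse_line(line):
    """Return the seven captured fields as a tuple, or None on no match."""
    match = LINE.match(line)
    return match.groups() if match else None


print(parse_line(
    "1 240R 15 Apr 2021 240 Litre Sulo Bin Recycling - Dkt#11053610 O/N: ONE DENTAL $10.00."
))
print(parse_line(
    "1 1.5m 12 Apr 2021 Service 1.5m Front Lift bin - Dkt#11028421 $24.00"
))
```

The lazy (.*?) is what lets the "O/N: ..." part be present or absent without breaking the price capture.
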

Is there a way to make this Regex expression less greedy. Using Excel VBA

I have an excel document with a cell with the following information:
A) Current to: Notice of 19 June 2014 Sent on: August 2012
B) Updated on: October 2018
C) Updated: 14 January 2009
I am using the following regular expression with some success:
(\b\d{1,2}\D{0,3})?\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?))\D?(\d{1,2}\D?)?\D?((19[7-9]\d|20\d{2})|\d{2})
the output I have is as follows:
A) 19 June2014August2012
B) October 2018
C) 14 January 2009
B and C extract fine; however, I am expecting 19 June 2014 for A.
I have tried adding .*? to make the expression less greedy, but depending on where I add the dot-star, I either get no result or an inaccurate one.
To match the first match on a line you may add ^.*? at the beginning of the pattern, wrap what you have with capturing parentheses and set the Multiline regex property to True. Your match is inside match.Submatches(0).
regEx.Pattern = "^.*?((?:\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}\D?)?\D?(?:(?:19[7-9]\d|20\d{2})|\d{2}))"
regEx.Multiline = True
See the regex demo.
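For readers without Excel at hand, the same idea (a lazy ^.*? prefix, one capturing group, multiline mode) can be sketched in Python. This is not the VBA code: first_dates is a hypothetical helper, the pattern is split into MONTH and DATE pieces only for readability, and the cell text is assumed to have one entry per line.

```python
import re

# Month alternation and date body copied from the pattern in the answer.
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
    r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)"
)
DATE = (
    r"(?:\b\d{1,2}\D{0,3})?\b" + MONTH
    + r"\D?(?:\d{1,2}\D?)?\D?(?:(?:19[7-9]\d|20\d{2})|\d{2})"
)
# ^.*? skips as little as possible from the start of each line, so only the
# first date on a line is captured.
FIRST_DATE_PER_LINE = re.compile(r"^.*?(" + DATE + r")", re.MULTILINE)


def first_dates(cell_text):
    """Return the first date found on each line of cell_text."""
    return FIRST_DATE_PER_LINE.findall(cell_text)


cell = (
    "A) Current to: Notice of 19 June 2014 Sent on: August 2012\n"
    "B) Updated on: October 2018\n"
    "C) Updated: 14 January 2009"
)
print(first_dates(cell))
```

Because ^ can only match at the start of a line, the second date on line A ("August 2012") is never picked up.
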

Regex for capturing different date formats

I'm tasked with capturing dates for itineraries in email messages, but the dates come in many different formats. I need help finding out whether there's a way to capture the following formats:
02 APR
APR 02
2 APR
APR 2
2nd APR
APR 2nd
2nd April
April 2nd
APR 12th
April 12th
12th April
April 13-16
13-16 April
APR 13-16
13-16 APR
April 13th-16th
13th-16th April
APR 13th-16th
13th-16th APR
I've tried numerous approaches but just could not work it out, as I'm a newbie to regex.
The closest I could get was using this:
(\d*)-(\d*) APR|April \d*\d*
EDIT: I found out that I've missed some more formats.
13th - 16th APR
13~16 April
13/16 APR
I've tried using the following:
(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\ *\d+(?:[nr]d|th|st)?(?: * \d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?: . \d+(?:[nr]d|th|st)?)?\ *(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
It can capture dates either with spaces or without them.
Is there a way to capture all formats, split the dates on '-', '/', or '~', and write them out in a single standardized format?
(Group 1 Date)-Month (Group 2 Date)-Month, e.g. 13-Apr 16-Apr
I'd appreciate any suggestions and comments.
You need to account for optional values. Here is an enhanced version matching your sample input:
/(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?\s*Apr(?:il)?|Apr(?:il)?\s*(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?/i
See the regex demo (note you need to use a case-insensitive modifier to match any variants of April)
Basically, there are 2 alternatives matching April and date ranges:
(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?\s*Apr(?:il)? - 1+ digits followed with an optional st, nd, rd, th, followed with an optional hyphen, followed with 0+ digits, followed with optional st, etc. followed with 0+ whitespace and then Apr or April (case insensitive due to /i modifier)
| - or
Apr(?:il)?\s*(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)? - the same as above but swapped.
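Building on the two-alternative idea above, here is a Python sketch that also produces the standardized "13-Apr 16-Apr" output asked for in the question. It goes beyond the pattern in this answer in three ways, all of which are assumptions: it accepts any month, it allows '-', '/' or '~' (with optional spaces) between the two days, and the function name normalize is made up for illustration.

```python
import re

MONTH = (
    r"(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
    r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)
DAY = r"(\d{1,2})(?:st|[nr]d|th)?"
# One day, optionally followed by a separator and a second day.
RANGE = DAY + r"(?:\s*[-/~]\s*" + DAY + r")?"
# Alternative 1: days then month (groups 1-3).
# Alternative 2: month then days (groups 4-6).
PATTERN = re.compile(
    r"\b(?:" + RANGE + r"\s*" + MONTH + r"|" + MONTH + r"\s*" + RANGE + r")\b",
    re.IGNORECASE,
)


def normalize(text):
    """Return e.g. '13-Apr 16-Apr' for the first date found, else None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    if match.group(3):
        first, second, month = match.group(1), match.group(2), match.group(3)
    else:
        month, first, second = match.group(4), match.group(5), match.group(6)
    month = month[:3].capitalize()
    days = [first] + ([second] if second else [])
    # int() drops leading zeros, so "02 APR" becomes "2-Apr".
    return " ".join(f"{int(day)}-{month}" for day in days)


for sample in ["13th - 16th APR", "April 13-16", "13~16 April", "02 APR", "APR 2nd"]:
    print(sample, "->", normalize(sample))
```

Dropping the leading zero is a design choice in this sketch; keep the captured string as-is if "02-Apr" is preferred.
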
I came up with this Regex:
(?:APR|April)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:APR|April)
See details here: Regex101
Maybe it's overkill, but I came up with this regex that will match with any month:
(?:January|JAN|February|FEB|March|MAR|April|APR|May|MAY|June|JUN|July|JUL|August|AUG|September|SEP|October|OCT|November|NOV|December|DEC)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:January|JAN|February|FEB|March|MAR|April|APR|May|MAY|June|JUN|July|JUL|August|AUG|September|SEP|October|OCT|November|NOV|December|DEC)
Unreadable, check here if you want details: Regex101
Improved version using Wiktor Stribiżew's trick:
(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
See details here: Regex101
It matches every month and uses fewer steps (more efficient),
BUT you need to make sure the match is case-insensitive.
I came up with this:
(\d+(?:th|st|[nr]d)?(?:-\d+(?:th|st|[nr]d)?)?\s*(?:APR|April))|((?:APR|April)\s*\d+(?:th|st|[nr]d)?(?:-\d+(?:th|st|[nr]d)?)?)
Live Demo

Regex expression matches without specifying comma (,) in regex

I am using this website to learn regex and I am stuck at this particular lesson. It looks like the regex is wrong there.
When I write (\w+\s\d+)((\,\d+)?), "text" and "capture" go green but "result" shows as wrong (cross marks).
But if I write (\w+ (\d+)), it gives the result below.
your task | text | capture | result
capture text | Jan 1987 | Jan 1987, 1987 | ✓
capture text | May 1969 | May 1969, 1969 | ✓
capture text | Aug 2011 | Aug 2011, 2011 | ✓
Now, the question is: (\w+ (\d+)) doesn't look like it is going to capture a comma, yet it is the right answer. In (\w+\s\d+)((\,\d+)?) I have specified the comma, but it comes out wrong. Why?
That's because the capture column tells you what you should capture. For example, Jan 1987, 1987 means you should capture two groups: 1) Jan 1987 2) 1987
They use the comma as a divider between the groups. So it's not part of the string you should capture, just a divider to tell you where the next expected capture group starts.
If you step to the next lesson http://regexone.com/lesson/13 my example will be much clearer. In the text column there isn't any comma (e.g. 1280x720) but in the capture column you're asked for "1280, 720". So this supports my theory.
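The difference between the two patterns shows up directly in the captured groups. Here is a quick Python sketch (the lesson site itself may use a different regex engine, so treat the exact empty-group values as Python's behaviour):

```python
import re

text = "Jan 1987"

# Nested groups: the outer group captures "Jan 1987", the inner one "1987".
nested = re.fullmatch(r"(\w+ (\d+))", text).groups()

# The other attempt: group 1 is "Jan 1987", but groups 2 and 3 look for a
# literal comma that is not in the text, so they capture nothing useful.
with_comma = re.fullmatch(r"(\w+\s\d+)((\,\d+)?)", text).groups()

print(nested)
print(with_comma)
```

The first pattern yields exactly the two groups the lesson expects; the second yields three groups, and the year never lands in a group of its own.
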

Regex to extract non standard date format

I need a regex to grab the date section (i.e. 23 March 2014) from a string with the following format:
23 March 2014 and includes 50,000 67 A Part 500
I have tried the below but it doesn't work. Any ideas?
preg_match("/[0-9]+[ ][a-zA-Z][ ][201][0-9]$/",$str,$matchx);
I am using PHP
You need to have the right quantifiers; don't use a character class for the literal string 201 or for spaces (although the former is more troublesome than the latter, since [201] matches just one of those characters), and you have to allow for things that come after the date (so get rid of the $):
preg_match("/[1-9][0-9]? [A-Z][a-zA-Z]+ 201[0-9]/",$str,$matchx);
By the way - the form I chose does not allow for 09 March - it won't let you have a leading zero, and it expects the first letter of the month to be capitalized. You can easily change that back of course.
Test:
<?php
$str="23 March 2014 and includes 50,000 67 A Part 500";
preg_match("/[1-9][0-9]? [A-Z][a-zA-Z]+ 201[0-9]/",$str,$matchx);
print_r($matchx);
?>
Result:
Array
(
[0] => 23 March 2014
)