Matching possible date elements in range - regex

I'm having difficulty matching other cases for a date range. The end goal will be to extract each group to build an ISO 8601 date format.
Test cases
May 8th – 14th, 2019
November 25th – December 2nd
November 5th, 2018 – January 13th, 2019
September 17th – 23rd
Regex
(\w{3,9})\s([1-9]|[12]\d|3[01])(?:st|nd|rd|th),\s(19|20)\d{2}\s–\s(\w{3,9})\s([1-9]|[12]\d|3[01])(?:st|nd|rd|th),\s(19|20)\d{2}
regexr
I would like to be able to capture each group regardless of whether it is present in the input.
For example, May 8th – 14th, 2019
Group 1 May
Group 2 8th
Group 3
Group 4
Group 5 14th
Group 6 2019
And November 5th, 2018 – January 13th, 2019
Group 1 November
Group 2 5th
Group 3 2018
Group 4 January
Group 5 13th
Group 6 2019

To capture the empty string when an optional part is absent, the general idea is to use (<characters to match>|)
Try this one:
([A-Za-z]{3,9})\s((?:[1-9]|[12]\d|3[01])(?:st|nd|rd|th))(?:, (?=19|20))?(\d{4}|)\s–\s([A-Za-z]{3,9}|)\s?((?:[1-9]|[12]\d|3[01])(?:st|nd|rd|th))(?:, (?=19|20))?(\d{4}|)
https://regex101.com/r/4UY0WE/1/
When trying to capture the month (the first group), make sure to use [A-Za-z]{3,9} rather than \w{3,9}, otherwise you might match, e.g., 23rd rather than a month string. Write the class as [A-Za-z], not [A-z]: the latter also matches the punctuation characters that sit between Z and a in ASCII, such as [ and _.
Separated out:
([A-Za-z]{3,9}) # Month ("January")
\s
((?:[1-9]|[12]\d|3[01])(?:st|nd|rd|th)) # Day of month, including suffix ("23rd")
(?:, (?=19|20))? # Comma and space, if followed by year
(\d{4}|) # Year, or empty
\s–\s # Range separator (en dash surrounded by whitespace)
([A-Za-z]{3,9}|) # same as first line, or empty
\s?
# same as third to fifth lines:
((?:[1-9]|[12]\d|3[01])(?:st|nd|rd|th))
(?:, (?=19|20))?
(\d{4}|)
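To see what the (x|) trick buys you, here is a minimal sketch in Python (my choice of language; the question does not name one). It uses the pattern from this answer with the month class written as [A-Za-z], and the helper name groups is only for illustration. Because every optional part is written as (x|), an absent part comes back as an empty string rather than as None.

```python
import re

# Optional parts are written as (x|), so they always participate in the
# match and yield "" instead of None when the text is absent.
PATTERN = re.compile(
    r"([A-Za-z]{3,9})\s"
    r"((?:[1-9]|[12]\d|3[01])(?:st|nd|rd|th))"
    r"(?:, (?=19|20))?(\d{4}|)"
    r"\s–\s"
    r"([A-Za-z]{3,9}|)\s?"
    r"((?:[1-9]|[12]\d|3[01])(?:st|nd|rd|th))"
    r"(?:, (?=19|20))?(\d{4}|)"
)


def groups(text):
    """Return the six captured groups, or None if the text is not a range."""
    m = PATTERN.search(text)
    return m.groups() if m else None


for sample in (
    "May 8th – 14th, 2019",
    "November 25th – December 2nd",
    "November 5th, 2018 – January 13th, 2019",
    "September 17th – 23rd",
):
    print(sample, "->", groups(sample))
```

With a plain optional group such as (\d{4})? the same positions would be None in Python instead of "", which is the difference the question is asking about.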

This one saves some space by consolidating some of the groupings.
Try it here
Full regex:
([A-Za-z]{3,9}) ((?:[1-9]|[12]\d|3[01])(?:st|nd|rd|th))(?:, ((?:19|20)\d{2}))? [–-] ([A-Za-z]{3,9}\s)?((?:[1-9]|[12]\d|3[01])(?:st|nd|rd|th))(?:, ((?:19|20)\d{2}))?
Separated by group (spaces replaced by \s for readability):
1. ([A-Za-z]{3,9})
\s
2. ((?:[1-9]|[12]\d|3[01])(?:st|nd|rd|th))
3. (?:,\s((?:19|20)\d{2}))?
\s[–-]\s
4. ([A-Za-z]{3,9}\s)?
5. ((?:[1-9]|[12]\d|3[01])(?:st|nd|rd|th))
6. (?:,\s((?:19|20)\d{2}))?
Note that group 4, when present, includes the trailing whitespace after the month name.
This pattern does not use lookarounds, so it should work in most regex engines.
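Since the stated end goal is ISO 8601, here is a rough Python sketch of how the six groups could be turned into a pair of dates. It uses the consolidated pattern with the month class written as [A-Za-z]. The fill-in rules are my assumptions, not something the question specifies: a missing second month falls back to the first month, a missing first year falls back to the second year, and when neither year is present the caller supplies default_year. Only full English month names are handled, and the names to_iso_range and MONTHS are hypothetical.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

DAY = r"((?:[1-9]|[12]\d|3[01])(?:st|nd|rd|th))"
YEAR = r"(?:, ((?:19|20)\d{2}))?"
RANGE = re.compile(
    r"([A-Za-z]{3,9}) " + DAY + YEAR + r" [–-] ([A-Za-z]{3,9}\s)?" + DAY + YEAR
)


def to_iso_range(text, default_year):
    """Return (start, end) as ISO 8601 strings, or None if nothing matches."""
    m = RANGE.search(text)
    if m is None:
        return None
    month1, day1, year1, month2, day2, year2 = m.groups()
    # Assumed fill-in rules for the parts that were left out of the text.
    month2 = (month2 or month1).strip()   # group 4 carries a trailing space
    year2 = year2 or str(default_year)
    year1 = year1 or year2

    def iso(month, day, year):
        # day[:-2] drops the two-letter ordinal suffix ("14th" -> "14").
        return date(int(year), MONTHS.index(month) + 1, int(day[:-2])).isoformat()

    return iso(month1, day1, year1), iso(month2, day2, year2)


print(to_iso_range("May 8th – 14th, 2019", 2000))
print(to_iso_range("November 5th, 2018 – January 13th, 2019", 2000))
```

A range that crosses New Year without stating any year (say, December into January) would need an extra rule; this sketch gives both ends the same year.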

Related

Regex to match string with optional section preceded by space

I need help figuring out how to match a string that will have optional sections. The issue is that the optional section will be prefixed by a space and < and suffixed by >.
This is what I have currently
^([\\w,:\\s.]+)\\s-\\s<([A-Z]+)>\\s<([\\s\\w-]+)>\\s<([\\w-]+)>(\\s<[\\d.]+>)?(\\s<[\\d]+>)?\\s<([\\w.]+)>\\s-\\s<(.+?(?=>$|$))
This is the string I am trying to match
Jul 09, 2022 03:05:12.570 AM - <DEBUG> <Default Executor-thread-26> <logging-poc> <100.99.88.1> <123456> <myco> - <Inside getDebugLog()>
The sections (\\s<[\\d.]+>)? and (\\s<[\\d]+>)? correspond to the IP address and the account. As currently written, the match ends up including the leading space and the < > delimiters.
Record map{logdatetime=Jul 09, 2022 03:05:12.570 AM, severity=DEBUG, thread=Default Executor-thread-26, application=logging-poc, ip= <100.99.88.1>, account= <123456>, module=myco, message=Inside getDebugLog()>}
I only want the value for the ip and the account like below
Record map{logdatetime=Jul 09, 2022 03:05:12.570 AM, severity=DEBUG, thread=Default Executor-thread-26, application=logging-poc, ip=100.99.88.1, account=123456, module=myco, message=Inside getDebugLog()>}
The pattern should also work for this line of log (where the optional sections have been removed)
Jul 09, 2022 03:05:12.570 AM - <DEBUG> <Default Executor-thread-26> <logging-poc> <myco> - <Inside getDebugLog()>
Thank you.
You have an issue in your 5th and 6th capturing groups.
5th group: (\s<[\d.]+>)?
6th group: (\s<[\d]+>)?
These capture the leading space and the < > delimiters along with the value. The solution is to capture the information you want in a subgroup:
5th (and now 6th) group: (\s<([\d.]+)>)?
The now-7th (and 8th) group: (\s<([\d]+)>)?
The old group 7 is now group 9, and the old group 8 is now group 10.
Per your comments below, since you need to keep the same group indices, you can make use of a non-capturing group, using the syntax (?:pattern).
5th group (inner parenthesis): (?:\s<([\d.]+)>)?
6th group (inner parenthesis): (?:\s<([\d]+)>)?
The end-result regex (using Java escapes):
^([\\w,:\\s.]+)\\s-\\s<([A-Z]+)>\\s<([\\s\\w-]+)>\\s<([\\w-]+)>(?:\\s<([\\d.]+)>)?(?:\\s<([\\d]+)>)?\\s<([\\w.]+)>\\s-\\s<(.+?(?=>$|$))
This will not change the indices of the groups you have captured, and will allow you to capture only the parts that you are interested in, while still being able to mark the outer group as optional (via (?:...)?). Note that the leading \\s must stay inside each optional group; if it is moved outside, lines without the optional sections no longer match.
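As a quick check of the non-capturing-group approach, here is a small Python sketch (Python rather than Java, so the backslashes are single inside a raw string). The two optional sections are written as (?:\s<(...)>)? with the whitespace inside the optional group. The field names are taken from the Record map in the question; the helper name parse is made up for the example.

```python
import re

LOG = re.compile(
    r"^([\w,:\s.]+)\s-\s<([A-Z]+)>\s<([\s\w-]+)>\s<([\w-]+)>"
    r"(?:\s<([\d.]+)>)?(?:\s<(\d+)>)?"      # optional ip and account
    r"\s<([\w.]+)>\s-\s<(.+?(?=>$|$))"
)
FIELDS = ("logdatetime", "severity", "thread", "application",
          "ip", "account", "module", "message")


def parse(line):
    """Map the eight capture groups to field names; absent groups are None."""
    m = LOG.match(line)
    return dict(zip(FIELDS, m.groups())) if m else None


full = ("Jul 09, 2022 03:05:12.570 AM - <DEBUG> <Default Executor-thread-26> "
        "<logging-poc> <100.99.88.1> <123456> <myco> - <Inside getDebugLog()>")
short = ("Jul 09, 2022 03:05:12.570 AM - <DEBUG> <Default Executor-thread-26> "
         "<logging-poc> <myco> - <Inside getDebugLog()>")

print(parse(full))
print(parse(short))
```

The group indices are the same for both lines: ip and account are groups 5 and 6 either way, and they are simply unset when the sections are missing.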

How do I capture the Months with proper Regex?

How do I capture the day of the month as a number, excluding any suffix? For instance, January 11th would be 11, and March 25th would be 25.
You could use the regex below and read only the 3rd capturing group.
It accepts 3-letter months (Jan 1st) and full names (January 1st), with a space, hyphen, comma or slash as the separator, as in Jan 01, Jan-01, Jan,1st and Jan/31.
(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)(uary|ruary|ch|il|e|y|ust|tember|ober|ember)?[ \/,-]{0,2}([0-3]?[0-9])
If possible, you would do better to use your platform's native date parsing.
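For what it's worth, here is a short Python sketch of reading just the 3rd group. The suffix list in this sketch includes ober, so that October followed by a day is matched as well; treat that as my adjustment. The function name day_of_month is hypothetical.

```python
import re

MONTH_DAY = re.compile(
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"(uary|ruary|ch|il|e|y|ust|tember|ober|ember)?"
    r"[ /,-]{0,2}([0-3]?[0-9])"
)


def day_of_month(text):
    """Return the day as an int (ordinal suffix ignored), or None."""
    m = MONTH_DAY.search(text)
    return int(m.group(3)) if m else None


for sample in ("January 11th", "March 25th", "Jan-01", "Jan/31", "October 5th"):
    print(sample, "->", day_of_month(sample))
```

The pattern does not check that the day is valid for the month (it would accept 39), which is one reason a real date parser is the safer choice.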

REGEXP_EXTRACT specific string to extract year or month in Google Data Studio

I'm trying to categorize my sites but they have not always the same uri-structure so I want to extract the year in one column and in the second one I want to extract the month.
The results should be the year and the months in separate columns/fields:
url | year | months
/www.site.com/path1/resort/2021/02/sitename | 2021 | 02
/www.site.com/path1/2021/02 | 2021 | 02
/www.site.com/path1/2020/11-12 | 2020 | 11-12
/www.site.com/path1/2020/07-08 | 2020 | 07-08
/www.site.com/path1/resort/ | null | null
The following regex for the year worked:
REGEXP_EXTRACT(url,'([0-9]{4})') >> result: 2020, null etc.
but the regex for the month didn't extract only the months:
REGEXP_EXTRACT(url,'((?:[0-9]{4}/)[0-9]+.?[0-9]*/)') >> result: 2020/11-12/,2021/02/, null etc.
Thanks for the help in advance.
You can use
(?:^|/)((?:19|20)[0-9]{2})/((?:0?[1-9]|1[0-2])(?:-(?:0?[1-9]|1[0-2]))?)(?:/|$)
See the regex demo.
If you need only one capture per match, replace the other capturing group with a non-capturing one, or remove the extra part of the pattern:
REGEXP_EXTRACT(col_url, '(?:^|/)((?:19|20)[0-9]{2})(?:/|$)') as Year
REGEXP_EXTRACT(col_url, '(?:^|/)((?:0?[1-9]|1[0-2])(?:-(?:0?[1-9]|1[0-2]))?)(?:/|$)') as Month
Details:
(?:^|/) - string start or /
((?:19|20)[0-9]{2}) - Group 1: a year, 19 or 20 followed with any two digits
/ - a / char
((?:0?[1-9]|1[0-2])(?:-(?:0?[1-9]|1[0-2]))?) - Group 2 (month): an optional 0 and then 1 to 9, or 1 and then 0 to 2 (i.e. 1-12 or 01-12), and then an optional occurrence of - and the same month pattern
(?:/|$) - / or end of string.
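To sanity-check the combined pattern outside Data Studio, here is a small Python sketch that runs it against the URLs from the question. Python's re module stands in for the REGEXP_EXTRACT engine here, and the helper name year_and_month is only for illustration.

```python
import re

YEAR_MONTH = re.compile(
    r"(?:^|/)((?:19|20)[0-9]{2})/"
    r"((?:0?[1-9]|1[0-2])(?:-(?:0?[1-9]|1[0-2]))?)(?:/|$)"
)


def year_and_month(url):
    """Return (year, months) as strings, or (None, None) when absent."""
    m = YEAR_MONTH.search(url)
    return m.groups() if m else (None, None)


for url in (
    "/www.site.com/path1/resort/2021/02/sitename",
    "/www.site.com/path1/2021/02",
    "/www.site.com/path1/2020/11-12",
    "/www.site.com/path1/2020/07-08",
    "/www.site.com/path1/resort/",
):
    print(url, "->", year_and_month(url))
```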

Complex Regex finding date and time

Is there someone to help me with the following:
I'm trying to find specific date and time strings in a text (to be used within VBA Word).
Currently working with the following RegEx string:
(?:([0-9]{1,2})[ |-])?(?:(jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|jun(?:i)?|jul(?:i)?|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?))?(?: |-)?(?(3)(?: around | at | ))?(?:([0-9]{1,2}:[0-9]{1,2})?(?: uur| u|u)?)?
Tested output on following text:
date with around time: 26 sep 2016 around 09:00u
date with at time: 1 sep 2016 at 09:00 uur
date and time u: 1 sep 2018 09:00 u
time without date: 08:30 uur
date with time u: 1 sep 2016 at 09:00u
only time: 09:00
only month: jan
month and year: feb 2019
only day: 02
only day with '-': 2-
day and month: 2 jan
month year: jan 2018
date with '-': 2-feb-2018 09:00
other month: 01 sept 2016
full month: 1 september 2018
shortened year: jul '18
Rules:
a date followed by time is valid
a date followed by text 'around' or 'at', followed by time is valid
a date without day number is valid
a date without year is valid
a date, month only is not valid
a day, without month or year not valid
a date may contain dashes '-'
a year may be shortened with ', like jun '18
month name can be short or long
full match includes ' uur' or 'u' (to highlight the text in ms-Word)
submatches text from capture are without prepending or trailing spaces
example at: [https://regex101.com/r/6CFgBP/1/]
Expected output (when using in VBA Word):
A regex Matches collection object in which each Match.SubMatches contains the individual items d, m, y, hh:mm from the capture groups in the regex search string.
So for example 1, the SubMatches (or capture groups) contain the values '26', 'sep', '2016', '09:00'.
The RegEx works fine, but some false-positives need to be excluded:
In case there is a day without month/year, should be excluded from Regex (example 9 and 10)
In case there is a month without day, should be excluded (example 7)
(I was trying with some lookahead and the references \1 and ?(1), but was not able to get it running properly...)
Any advice highly appreciated!
As I understand it, you require that each date/time part (day, month, year, hour and minute) be present.
So you should remove the ? after the relevant groups (they are not optional).
It is also good practice to capture each part in its own capturing group.
There is no need to write something like jun(?:i)?. It is enough (and easier to read) to write just juni? (the ? applies only to the preceding i).
Another hint: since the regex language has the \d character class, use it instead of [0-9] (the regex is shorter and easier to read).
The optional words (at / around) should go in an optional, non-capturing group.
Anything after the minute part is not needed in the regex.
So I propose a regex like below (for readability, I divided it into rows):
(\d{1,2})[ -](jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|juni?
|juli?|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?)
[ -](\d{4}) (?:around |at )?(\d{1,2}:\d{1,2})
Details:
(\d{1,2}) - Day.
[ -] - A separator after the day (either a space or a minus).
(jan(?:uari)?|...dec(?:ember)?) - Month.
[ -] - A separator after the month.
(\d{4}) - year.
(?:around |at )? - Actually, 3 variants of a separator between year
and hour (space / around / at), note the space before (...)?.
(\d{1,2}:\d{1,2}) - Hour and minute.
It matches variants 1, 2, 3, 5 and 13.
All remaining fail to contain each required part, so they are not matched.
If you allow e.g. that the hour/minute part is optional, change the respective fragment
into:
( (?:around |at )?(\d{1,2}:\d{1,2}))?
i.e. surround the space/around/at / hour / minute part with ( and )?,
making this part an optional group. Then, variants 14 and 15 will also
be matched.
One more extension: if you also allow the hour/minute part alone, add |(\d{1,2}:\d{1,2}) to the regex (everything before it is the first variant, and the added part is the second variant, for just hour/minute).
Then your variants No 4 and 6 will also be matched.
For a working example see https://regex101.com/r/33t1ps/1
Edit
Following your list of rules, I propose the following regex:
(\d{1,2}[ -])? - Day + separator, optional.
(jan(?:uari)?|...|dec(?:ember)?) - Month.
(?:[ -](\d{4}|'\d{2}))? - Separator + year (either 4 or 2 digits with "'").
( (?:around |at )?(\d{1,2}:\d{1,2}))? - Separator + hour/minute -
optional end of variant 1.
|(\d{1,2}:\d{1,2}) - Variant 2 - only hour and minute.
The only variants it fails to match are your No 9 and 10.
For full regex, including also "uur" see https://regex101.com/r/33t1ps/3
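Here is a rough Python sketch of the rule-based version, assembled by simply concatenating the fragments listed above. It leaves out the "uur" part, which is only in the linked demo. It is meant for checking the pattern logic with Python's re module, not as VBA code, and the helper name parts is hypothetical. The separator captured together with the day is stripped afterwards.

```python
import re

MONTH = (r"(jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|juni?|juli?"
         r"|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?)")

PATTERN = re.compile(
    r"(\d{1,2}[ -])?"                       # 1: day + separator, optional
    + MONTH +                               # 2: month
    r"(?:[ -](\d{4}|'\d{2}))?"              # 3: year, 4 digits or '+2 digits
    r"( (?:around |at )?(\d{1,2}:\d{1,2}))?"  # 4: separator + time, 5: time
    r"|(\d{1,2}:\d{1,2})"                   # 6: variant 2, time only
)


def parts(text):
    """Return (day, month, year, time) with absent parts as None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    day, month, year, _, time, time_only = m.groups()
    return (day.rstrip(" -") if day else None, month, year, time or time_only)


for sample in ("26 sep 2016 around 09:00u", "2-feb-2018 09:00",
               "jul '18", "08:30 uur", "02", "2-"):
    print(repr(sample), "->", parts(sample))
```

The pattern has no word boundaries, so in running text a month abbreviation inside a longer word would also match; adding \b around the month alternation is one way to tighten that.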
Finally I found something that helps me using the month properly :-)
\b(?:([1-3]|[0-3]\d)[ |-](?'month'(?:[1-9]|\d[12])|(?:jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|jun(?:i)?|jul(?:i)?|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?))?)?(?:(\g'month')[ |-]((?:19|20|\')(?:\d{2})))?\b(?: omstreeks | om | )?(?:(\d{1,2}[:]\d{2}(?: uur|u)?|[0-2]\d{3}(?: uur|u)))?\b
It uses a named group that is called again as a subroutine (\g'month'). Found here:
https://www.regular-expressions.info/subroutine.html

Regex for capturing different date formats

I'm tasked to capture date for itineraries in email message, but the dates given were all in different formats, I guess I need help to find out if there's any way to capture the following formats:
02 APR
APR 02
2 APR
APR 2
2nd APR
APR 2nd
2nd April
April 2nd
APR 12th
April 12th
12th April
April 13-16
13-16 April
APR 13-16
13-16 APR
April 13th-16th
13th-16th April
APR 13th-16th
13th-16th APR
I've tried numerous ways but just could not understand or fathom as I'm a
newbie to regex.
The closest I could get was using this:
(\d*)-(\d*) APR|April \d*\d*
EDIT - I found out that I've missed some more formats.
13th - 16th APR
13~16 April
13/16 APR
I've tried using the following:
(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\ *\d+(?:[nr]d|th|st)?(?: * \d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?: . \d+(?:[nr]d|th|st)?)?\ *(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
It could capture dates either with spaces or without spaces, but not both.
Is there a way to capture all formats, split the dates on '-', '/' or '~', and write the output in a single standardized format?
(Group 1 Date)-Month (Group 2 Date)-Month eg: 13-Apr 16-Apr
Appreciate for your kind suggestions and comments.
You need to account for optional values. Here is an enhanced version matching your sample input:
/(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?\s*Apr(?:il)?|Apr(?:il)?\s*(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?/i
See the regex demo (note you need to use a case-insensitive modifier to match any variants of April)
Basically, there are 2 alternatives matching April and date ranges:
(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?\s*Apr(?:il)? - 1+ digits followed with an optional st, nd, rd, th, followed with an optional hyphen, followed with 0+ digits, followed with optional st, etc. followed with 0+ whitespace and then Apr or April (case insensitive due to /i modifier)
| - or
Apr(?:il)?\s*(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)? - the same as above but swapped.
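To illustrate the two-alternative idea and the standardized output asked for in the question, here is a minimal Python sketch. It goes a little beyond the pattern above: the separator between the two days is widened to '-', '~' or '/' with optional spaces around it, which is my assumption based on the formats added in the question's edit. Only April is handled, and the name normalize is made up for the example.

```python
import re

DAY = r"(\d+)(?:st|[nr]d|th)?"
SEP = r"\s*[-~/]\s*"          # assumed separators: '-', '~' or '/'
APRIL = r"Apr(?:il)?"

PATTERN = re.compile(
    rf"{DAY}(?:{SEP}{DAY})?\s*{APRIL}"      # groups 1-2: days before month
    rf"|{APRIL}\s*{DAY}(?:{SEP}{DAY})?",    # groups 3-4: days after month
    re.IGNORECASE,
)


def normalize(text):
    """Return e.g. '13-Apr 16-Apr' (or '2-Apr' for a single day), else None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    first = m.group(1) or m.group(3)
    second = m.group(2) or m.group(4)
    days = [first] + ([second] if second else [])
    return " ".join(f"{int(d)}-Apr" for d in days)


for sample in ("13th - 16th APR", "April 13-16", "13~16 April",
               "13/16 APR", "2nd APR", "APR 02"):
    print(sample, "->", normalize(sample))
```

Extending this to all months would mean replacing APRIL with a month alternation in a capturing group and reading the month name back from the match.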
I came up with this Regex:
(?:APR|April)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:APR|April)
See details here: Regex101
Maybe it's overkill, but I came up with this regex that will match with any month:
(?:January|JAN|February|FEB|March|MAR|April|APR|May|MAY|June|JUN|July|JUL|August|AUG|September|SEP|October|OCT|November|NOV|December|DEC)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:January|JAN|February|FEB|March|MAR|April|APR|May|MAY|June|JUN|July|JUL|August|AUG|September|SEP|October|OCT|November|NOV|December|DEC)
Unreadable, check here if you want details: Regex101
Improved version using Wiktor Stribiżew's trick:
(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
See details here: Regex101
It matches every month and uses fewer steps (so it is more efficient),
BUT you need to make sure the match is case-insensitive
I came up with this:
(\d+(?:th|st|[nr]d)?(?:-\d+(?:th|st|[nr]d)?)?\s*(?:APR|April))|((?:APR|April)\s*\d+(?:th|st|[nr]d)?(?:-\d+(?:th|st|[nr]d)?)?)
Live Demo