VBA Regex - match entire string unless repeating pattern - regex

I am vexed - and I suspect there is an easy solution to this but after a fair amount of research, I'm reaching out to the community.
I'm using the regex method in VBA to try to split strings. What I want is for the entire string to match the pattern unless there is another name in the string. The name can be described by:
"\s?[a-zA-Z-]*,\s[a-zA-Z]*:\s.*"
I would expect that the method would return everything after the name is matched - until another name is matched. This would be the desired outcome.
The strings I'm applying that pattern to are:
Meck, Mary: Fri 6/14/2019 5:00 PM -- 10:00 PM CLERKPETRO Flinstone, Fred: Fri 6/14/2019 10:00 AM -- 4:00 PM CLERKPETRO Powers, Kenny: Fri 6/14/2019 10:00 PM -- 11:00 PM
Rhodes, Randy: Sat 6/15/2019 10:15 AM -- 11:30 AM SERVCNTR Sat 6/15/2019 11:30 AM -- 12:45 PM CLICK AND PICK Sat 6/15/2019 12:45 PM -- 2:15 PM SERVCNTR
When I apply the pattern to either string, the entire string is returned. This is not optimal because I'm trying to split on names using matches(0), matches(1), etc. The first string should therefore produce these matches:
Meck, Mary: Fri 6/14/2019 5:00 PM -- 10:00 PM CLERKPETRO
Flinstone, Fred: Fri 6/14/2019 10:00 AM -- 4:00 PM CLERKPETRO
Powers, Kenny: Fri 6/14/2019 10:00 PM -- 11:00 PM
yet the second string should match on the entire string (as it currently does) because there is not a second name in that string.
How do I solve this problem?

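This is one way to do it: match a name, then lazily consume text until a lookahead sees either the next name or the end of the string. With the VBScript RegExp object, set Global = True so that Execute returns every match rather than only the first.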
\b[a-zA-Z-]+,\s?[a-zA-Z]+:.*?(?=\b[a-zA-Z-]+,\s?[a-zA-Z]+:|$)
https://regex101.com/r/ccj6ea/1
Expanded
\b
[a-zA-Z-]+
,
\s?
[a-zA-Z]+
:
.*?
(?=
\b
[a-zA-Z-]+
,
\s?
[a-zA-Z]+
:
|
$
)
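For readers who want to try the pattern outside VBA, here is a minimal sketch in Python (an assumption on my part; the question itself uses VBA). The helper name split_on_names is hypothetical. The lazy `.*?` stops just before the next name, so each match keeps a trailing space that the sketch strips off.

```python
import re

# Same pattern as above: a name, then lazily everything up to the next name or the end.
NAME = r"\b[a-zA-Z-]+,\s?[a-zA-Z]+:"
PATTERN = re.compile(NAME + r".*?(?=" + NAME + r"|$)")


def split_on_names(text):
    """Return one chunk per name, with surrounding whitespace stripped."""
    return [m.group(0).strip() for m in PATTERN.finditer(text)]


first = ("Meck, Mary: Fri 6/14/2019 5:00 PM -- 10:00 PM CLERKPETRO "
         "Flinstone, Fred: Fri 6/14/2019 10:00 AM -- 4:00 PM CLERKPETRO "
         "Powers, Kenny: Fri 6/14/2019 10:00 PM -- 11:00 PM")
second = ("Rhodes, Randy: Sat 6/15/2019 10:15 AM -- 11:30 AM SERVCNTR "
          "Sat 6/15/2019 11:30 AM -- 12:45 PM CLICK AND PICK "
          "Sat 6/15/2019 12:45 PM -- 2:15 PM SERVCNTR")

for chunk in split_on_names(first):
    print(chunk)
print(len(split_on_names(second)))  # the second string has a single name
```

The first string yields three chunks and the second yields one chunk containing the whole string, which is the behaviour the question asks for.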

RegEx 1
I'm guessing that we wish to capture three parts of the strings listed in the question. If that is the case, we can start by slightly modifying the original expression:
(?:\s+)?([a-zA-Z-]+),?(?:\s+)?([a-zA-Z]+):(.+?[A-Z]{3,}).*
where our desired outputs are in these three groups:
([a-zA-Z-]+)
([a-zA-Z]+)
(.+?[A-Z]{3,})
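To see what the three groups hold, here is a small sketch in Python (an assumption; the question uses VBA) run against the start of the first sample string. Because the expression ends in `.*`, a single match consumes the rest of the line, so only the first name is captured per call, and an entry with no run of three or more capitals after the name would not match at all.

```python
import re

PATTERN = re.compile(
    r"(?:\s+)?([a-zA-Z-]+),?(?:\s+)?([a-zA-Z]+):(.+?[A-Z]{3,}).*"
)

sample = ("Meck, Mary: Fri 6/14/2019 5:00 PM -- 10:00 PM CLERKPETRO "
          "Flinstone, Fred: Fri 6/14/2019 10:00 AM -- 4:00 PM CLERKPETRO")

match = PATTERN.match(sample)
last_name, first_name, shift = match.groups()
print(last_name)       # surname
print(first_name)      # given name
print(shift.strip())   # shift text up to and including the first all-caps code
```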
RegEx 2
If we wish to split them on names, we would simplify our expression to:
(?:\s+)?([A-Z][a-zA-Z-]+),?(?:\s+)?([A-Z][a-zA-Z]+):

Related

Regular Expression that matches a set of characters or nothing

I have a set of text coming in as:
1 240R 15 Apr 2021 240 Litre Sulo Bin Recycling - Dkt#11053610 O/N: ONE DENTAL $10.00.
1 1.5cm 06 Apr 2021 1.5m Co-Mingle Recycling Bin - Dkt#11028471 $18.00.
1 1.5m 12 Apr 2021 Service 1.5m Front Lift bin - Dkt#11028421 $24.00
1 660 14 Apr 2021 660L Rear Lift Bin - Dkt#11156377 O/N: YOUR CAR SOLD $22.50
I am trying this regex with PCRE (PHP < 7.3):
^(\d+) [^\$]+ (\d+ \w+ \d+) (\d.+ \w.+) \S (Dkt\S\d+)[^\$]+ (\$.*)$
This parses the lines above. However, I am unable to capture "O/N: ONE DENTAL" or "O/N: YOUR CAR SOLD" from them.
How can I make the pattern capture whatever follows Dkt#xxxx when it is present, and nothing otherwise?
Since you already have the pattern, I made a few slight modifications to it. This is based on the example strings provided and may need further modification if you have other variations of the rules used to match.
Example:
^(\d) (\S+) (\d{2} \w{3} \d{4}) (.*)\s+-\s(Dkt\S+)\s*(.*?)\s*(\$\d+(?:\.\d+)?)\.?\s*$
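As a quick check of the modified pattern, here is a minimal sketch in Python rather than PHP (an assumption; the pattern uses only syntax that both engines share). The helper name parse is hypothetical. Group 6 holds the optional text between the docket number and the price, and is an empty string when nothing is there.

```python
import re

PATTERN = re.compile(
    r"^(\d) (\S+) (\d{2} \w{3} \d{4}) (.*)\s+-\s(Dkt\S+)\s*(.*?)\s*(\$\d+(?:\.\d+)?)\.?\s*$"
)

lines = [
    "1 240R 15 Apr 2021 240 Litre Sulo Bin Recycling - Dkt#11053610 O/N: ONE DENTAL $10.00.",
    "1 1.5cm 06 Apr 2021 1.5m Co-Mingle Recycling Bin - Dkt#11028471 $18.00.",
    "1 1.5m 12 Apr 2021 Service 1.5m Front Lift bin - Dkt#11028421 $24.00",
    "1 660 14 Apr 2021 660L Rear Lift Bin - Dkt#11156377 O/N: YOUR CAR SOLD $22.50",
]


def parse(line):
    """Return the seven captured groups as a tuple, or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groups() if m else None


for line in lines:
    print(parse(line))
```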

Regex for capturing different date formats

I'm tasked with capturing dates for itineraries in email messages, but the dates come in many different formats. I need help finding out whether there is a way to capture the following formats:
02 APR
APR 02
2 APR
APR 2
2nd APR
APR 2nd
2nd April
April 2nd
APR 12th
April 12th
12th April
April 13-16
13-16 April
APR 13-16
13-16 APR
April 13th-16th
13th-16th April
APR 13th-16th
13th-16th APR
I've tried numerous ways but just could not understand or fathom it, as I'm a newbie to regex.
The closest I could get was using this:
(\d*)-(\d*) APR|April \d*\d*
EDIT - I found out that I've missed some more formats:
13th - 16th APR
13~16 April
13/16 APR
I've tried using the following:
(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\ *\d+(?:[nr]d|th|st)?(?: * \d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?: . \d+(?:[nr]d|th|st)?)?\ *(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
This could capture dates either with spaces or without them.
Is there a way to capture all formats, split the dates on '-', '/', or '~', and write them out in a single standardized format?
(Group 1 date)-Month (Group 2 date)-Month, e.g. 13-Apr 16-Apr
I would appreciate your suggestions and comments.
You need to account for optional values. Here is an enhanced version matching your sample input:
/(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?\s*Apr(?:il)?|Apr(?:il)?\s*(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?/i
See the regex demo (note you need to use a case-insensitive modifier to match any variants of April)
Basically, there are 2 alternatives matching April and date ranges:
(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)?\s*Apr(?:il)? - 1+ digits followed with an optional st, nd, rd, th, followed with an optional hyphen, followed with 0+ digits, followed with optional st, etc. followed with 0+ whitespace and then Apr or April (case insensitive due to /i modifier)
| - or
Apr(?:il)?\s*(\d+)(?:st|[nr]d|th)?-?(\d*)(?:st|[nr]d|th)? - the same as above but swapped.
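The two-alternative idea above can be extended to any month, to the extra separators from the question's edit ('-', '/', '~', with optional spaces), and to the standardized output the question asks for. The following is a minimal sketch in Python under those assumptions; the names ORD, DAYS, MONTH, and normalize are hypothetical, and dropping leading zeros from the day is my own choice.

```python
import re

ORD = r"(?:st|[nr]d|th)?"                      # optional ordinal suffix
DAYS = r"(\d{1,2})" + ORD + r"(?:\s*[-/~]\s*(\d{1,2})" + ORD + r")?"
MONTH = (r"(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|"
         r"Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")

# Alternative 1: days then month (groups 1, 2, 3).
# Alternative 2: month then days (groups 4, 5, 6).
PATTERN = re.compile(
    r"\b(?:" + DAYS + r"\s*" + MONTH + r"|" + MONTH + r"\s*" + DAYS + r")\b",
    re.IGNORECASE,
)


def normalize(text):
    """Return each date found as 'D-Mon' or 'D-Mon D-Mon' for a range."""
    results = []
    for m in PATTERN.finditer(text):
        if m.group(3):
            first, second, month = m.group(1), m.group(2), m.group(3)
        else:
            first, second, month = m.group(5), m.group(6), m.group(4)
        month = month[:3].capitalize()
        days = [first] + ([second] if second else [])
        results.append(" ".join(f"{int(day)}-{month}" for day in days))
    return results


for sample in ["02 APR", "April 2nd", "13th - 16th APR", "13~16 April", "APR 13th-16th"]:
    print(sample, "->", normalize(sample))
```

Ambiguous input such as a day on both sides of a month is not handled; the leftmost alternative that matches wins.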
I came up with this Regex:
(?:APR|April)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:APR|April)
See details here: Regex101
Maybe it's overkill, but I came up with this regex that will match with any month:
(?:January|JAN|February|FEB|March|MAR|April|APR|May|MAY|June|JUN|July|JUL|August|AUG|September|SEP|October|OCT|November|NOV|December|DEC)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:January|JAN|February|FEB|March|MAR|April|APR|May|MAY|June|JUN|July|JUL|August|AUG|September|SEP|October|OCT|November|NOV|December|DEC)
Unreadable, check here if you want details: Regex101
Improved version using Wiktor Stribiżew's trick:
(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\ *\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?|\d+(?:[nr]d|th|st)?(?:-\d+(?:[nr]d|th|st)?)?\ *(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)
See details here: Regex101
It matches every month and takes fewer steps (so it is more efficient), but you need to make sure matching is case-insensitive.
I came up with this:
(\d+(?:th|st|[nr]d)?(?:-\d+(?:th|st|[nr]d)?)?\s*(?:APR|April))|((?:APR|April)\s*\d+(?:th|st|[nr]d)?(?:-\d+(?:th|st|[nr]d)?)?)
Live Demo

Search upto a delimiter using regular expression

In the string below I need to use re.findall to find '(any day)' and print everything back to the preceding ',' delimiter:
rr='PU3lserver1^server2|ABAP|Revisions|true|null|Weekend
only,ATN|server3|ABAP|Revisions|true|null|1:00 AM to 3:00 AM CET (any
day),B4P|server4^server5|ABAP|Revisions|true|Generic AFL|8:00 PM to 3:00 AM
CET (any day),C8B|server6|ABAP|Revisions|true|Generic AFL|8:00 PM to 3:00 AM
CET (any day),QU8|testserver|ABAP|Revisions|true|null|1:00 AM to 3:00 AM CET
(any day),S77|testserver|ABAP|Revisions|true|null|Weekend only'
works well as expected:
re.findall(r'[^\s,]+Weekend\s\bonly' ,rr, re.M)
Does not work as expected:
re.findall(r'[\s,]+\(any\s\bday\)' ,rr, re.M)
Any help or suggestion where I am going wrong.
You were almost correct
All you need to do is to change the character class to negation as
r'[^,]+\(any\s\bday\)'
[^,]+ Negated character class, matches anything other than , till any day is found
re.M can be dropped because the pattern uses neither ^ nor $, which are the only anchors that flag affects
Test
>>> re.findall(r'[^,]+\(any\s\bday\)' ,rr)
['B4P|server4^server5|ABAP|Revisions|true|Generic AFL|8:00 PM to 3:00 AM \nCET (any day)',
'C8B|server6|ABAP|Revisions|true|Generic AFL|8:00 PM to 3:00 AM \nCET (any day)',
'QU8|testserver|ABAP|Revisions|true|null|1:00 AM to 3:00 AM CET \n(any day)']
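To make the difference between the two character classes concrete, here is a small self-contained sketch. The string is a shortened, single-line stand-in for the one in the question (an assumption made for brevity), not the original data.

```python
import re

# Shortened stand-in for the question's string: three comma-separated records.
rr = ("PU3|server1^server2|ABAP|Revisions|true|null|Weekend only,"
      "ATN|server3|ABAP|Revisions|true|null|1:00 AM to 3:00 AM CET (any day),"
      "S77|testserver|ABAP|Revisions|true|null|Weekend only")

# [\s,]+ matches only whitespace or commas, so it grabs just the space before "(any day)".
broken = re.findall(r"[\s,]+\(any\s\bday\)", rr)

# [^,]+ matches everything except commas, so it reaches back to the previous delimiter.
fixed = re.findall(r"[^,]+\(any\s\bday\)", rr)

print(broken)
print(fixed)
```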

Perl Date conversion with Perl's built-in functions

I have a date in the following format which is how it is stored in an external application
06/12/2014 6:31 PM IST
I want to convert it to the format below. The conversion needs to take AM and PM into account, and the seconds and offset are to be hardcoded as :00+05:30:
2014-06-12 18:31:00+05:30
Since it is not a simple GMT to PDT change I tried with epoch, but the problem here is that the date is in different format.
You can just do it with a regex and a printf. In your particular case I suggest:
my $yourstring = '06/12/2014 6:31 PM IST';
my ($mm,$dd,$yy,$hh,$mi,$ampm) = $yourstring =~ m~(\d+)/(\d+)/(\d+)\s+(\d+):(\d+)\s+(AM|PM)\s+~;
# Convert the 12-hour clock to 24-hour: 1-11 PM become 13-23, 12 AM becomes 00
if ($ampm eq 'PM' && $hh < 12) { $hh += 12 }
elsif ($ampm eq 'AM' && $hh == 12) { $hh = 0 }
printf('%04d-%02d-%02d %02d:%02d:00+05:30', $yy, $mm, $dd, $hh, $mi);
or look into DateTime::Format::DateParse
The best thing to do is to use Perl's now built in Date/Time functions in Time::Piece. Once you convert your time into a true Perl time, you can format it any way you please.
The problem is that Time::Piece doesn't do a very good job at handling timezones. The easiest way of handling timezones on strings is to simply remove them:
use strict;
use warnings;
use Time::Piece;
# This is what you have given
my $input_time="6/12/2014 6:31 PM IST";
# Remove timezone (the last four characters)
my $munged_time = substr $input_time, 0, -4;
# This is a description of your "format" above
# You can find this by doing a "man date" on Unix.
# Sometimes it's "man strftime".
#
# The various %x are bits that represent the time
# in various formats.
# %m = month %d = month date %Y = four digit year
# %I = hour 01-12 %M = Minutes %p = AM/PM
my $time_format="%m/%d/%Y %I:%M %p";
# Now convert the time from the format to a Time::Piece
# object.
my $time = Time::Piece->strptime($munged_time, $time_format);
# Now that the time is in a Time::Piece object, we can easily
# manipulate it to do whatever we want. We could add or subtract
# date times from each other, add or subtract hours, months, etc.
# Here, I'm just printing out the various bits and pieces of
# "datetime" that I am interested in. Note I'm using time->second
# even though I never specified seconds. It's okay, Time::Piece
# simply assumes that seconds are zero.
#
printf "%04d-%02d-%02d %02d:%02d:%02d+05:30\n",
$time->year, $time->mon, $time->mday,
$time->hour, $time->minute, $time->second;
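For comparison, the same parse-then-format approach can be sketched in Python's standard library (an assumption; the question is about Perl). The function name convert is hypothetical, "IST" is assumed to mean a fixed +05:30 offset, and the zone abbreviation is dropped before parsing, just as in the Perl code above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "IST" always means a fixed offset of +05:30.
IST = timezone(timedelta(hours=5, minutes=30))


def convert(stamp):
    """Turn '06/12/2014 6:31 PM IST' into '2014-06-12 18:31:00+05:30'."""
    # Drop the trailing zone abbreviation (everything after the last space).
    without_zone = stamp.rsplit(" ", 1)[0]
    # %I is the 12-hour clock and %p is AM/PM, so strptime does the 24-hour conversion.
    parsed = datetime.strptime(without_zone, "%m/%d/%Y %I:%M %p")
    return parsed.replace(tzinfo=IST).isoformat(sep=" ")


print(convert("06/12/2014 6:31 PM IST"))
```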
The question does not contain any information about which time formats are possible at all. Are day of month and month always with 2 digits? Is the hour in range 0 to 9 always without a leading 0? Is it possible that the minutes 0 to 9 are without a leading 0? Which time zones are possible, just IST or others as well?
Here is a list of Perl regular expression search and replace strings, all of which must be applied, to reformat date and time strings for time zone IST into the format YYYY-MM-DD hh:mm:00+05:30
Reformat all times with AM in string. (Note that this rule leaves 12:xx AM as 12:xx rather than converting it to 00:xx.)
Search: ([01]?\d)/([0-3]?\d)/([12][09]\d\d) +([01]?\d:[0-5]?\d) AM IST
Replace: $3-$1-$2 $4:00+05:30
Reformat all times with 12:xx PM in string.
Search: ([01]?\d)/([0-3]?\d)/([12][09]\d\d) +(12:[0-5]?\d) PM IST
Replace: $3-$1-$2 $4:00+05:30
Reformat all times with 01:xx PM in string.
Search: ([01]?\d)/([0-3]?\d)/([12][09]\d\d) +0?1(:[0-5]?\d) PM IST
Replace: $3-$1-$2 13$4:00+05:30
Reformat all times with 02:xx PM in string.
Search: ([01]?\d)/([0-3]?\d)/([12][09]\d\d) +0?2(:[0-5]?\d) PM IST
Replace: $3-$1-$2 14$4:00+05:30
Reformat all times with 03:xx PM in string.
Search: ([01]?\d)/([0-3]?\d)/([12][09]\d\d) +0?3(:[0-5]?\d) PM IST
Replace: $3-$1-$2 15$4:00+05:30
Reformat all times with 04:xx PM in string.
Search: ([01]?\d)/([0-3]?\d)/([12][09]\d\d) +0?4(:[0-5]?\d) PM IST
Replace: $3-$1-$2 16$4:00+05:30
Reformat all times with 05:xx PM in string.
Search: ([01]?\d)/([0-3]?\d)/([12][09]\d\d) +0?5(:[0-5]?\d) PM IST
Replace: $3-$1-$2 17$4:00+05:30
Reformat all times with 06:xx PM in string.
Search: ([01]?\d)/([0-3]?\d)/([12][09]\d\d) +0?6(:[0-5]?\d) PM IST
Replace: $3-$1-$2 18$4:00+05:30
Reformat all times with 07:xx PM in string.
Search: ([01]?\d)/([0-3]?\d)/([12][09]\d\d) +0?7(:[0-5]?\d) PM IST
Replace: $3-$1-$2 19$4:00+05:30
Reformat all times with 08:xx PM in string.
Search: ([01]?\d)/([0-3]?\d)/([12][09]\d\d) +0?8(:[0-5]?\d) PM IST
Replace: $3-$1-$2 20$4:00+05:30
Reformat all times with 09:xx PM in string.
Search: ([01]?\d)/([0-3]?\d)/([12][09]\d\d) +0?9(:[0-5]?\d) PM IST
Replace: $3-$1-$2 21$4:00+05:30
Reformat all times with 10:xx PM in string.
Search: ([01]?\d)/([0-3]?\d)/([12][09]\d\d) +10(:[0-5]?\d) PM IST
Replace: $3-$1-$2 22$4:00+05:30
Reformat all times with 11:xx PM in string.
Search: ([01]?\d)/([0-3]?\d)/([12][09]\d\d) +11(:[0-5]?\d) PM IST
Replace: $3-$1-$2 23$4:00+05:30
Insert leading 0 where missing in date (month, day of month) or time (hour, minute) string using an expression with a positive lookbehind and a positive lookahead:
Search: (?<=[ \-:])(\d)(?=[ \-:])
Replace: 0$1
If the expression above does not work because lookbehind or lookahead is unsupported, a less restrictive regular expression can be used instead:
Search: \b(\d)\b
Replace: 0$1
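Where the regex engine allows a replacement callback, the separate per-hour rules and the zero-padding step can be collapsed into one substitution. The following is a minimal sketch in Python rather than Perl (an assumption); the names STAMP, _to_24h, and reformat are hypothetical, and the year is simplified to any four digits.

```python
import re

# One pattern for every hour; the callback does the AM/PM arithmetic and the padding.
STAMP = re.compile(
    r"([01]?\d)/([0-3]?\d)/(\d{4}) +([01]?\d):([0-5]?\d) (AM|PM) IST"
)


def _to_24h(match):
    month, day, year, hour, minute = (int(g) for g in match.groups()[:5])
    # 12 AM -> 0 and 12 PM -> 12; every other PM hour gains 12.
    hour = hour % 12 + (12 if match.group(6) == "PM" else 0)
    return f"{year:04d}-{month:02d}-{day:02d} {hour:02d}:{minute:02d}:00+05:30"


def reformat(text):
    """Rewrite every 'M/D/YYYY h:mm AM|PM IST' stamp found in text."""
    return STAMP.sub(_to_24h, text)


print(reformat("06/12/2014 6:31 PM IST"))
print(reformat("6/2/2014 12:05 AM IST"))
```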

regex to skip intermediate characters

Given the following string "Mon Feb 01 02:42:27 +0000 2013", what should the regex be to get the following "Feb 01 2013".
The following regex "\w{3}\s\w{3}\s\d{1,2}" will yield "Mon Feb 01". How do I write a regex to ignore the day (Mon) and the time and secs ?
I had asked a related question in an earlier post, which was kindly answered by the community - unable to parse whitespace in regex
This could work; the month, day, and year are captured in groups 2, 3, and 8:
(Mon|Tue|Wed|Thu|Fri|Sat|Sun) ([a-zA-Z]{3}) ([0-9]{2}) ([0-9]{2})\:([0-9]{2})\:([0-9]{2}) \+([0-9]{4}) ([0-9]{4})
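A regex match is always one contiguous span, so the usual way to "skip" the weekday and the time is to capture only the wanted parts and join them afterwards. Here is a minimal sketch in Python (the question does not name a language, so this is an assumption); the function name month_day_year is hypothetical.

```python
import re

# Group 1 is "Feb 01", group 2 is "2013"; the weekday, time, and offset are matched but not captured.
PATTERN = re.compile(r"\w{3}\s(\w{3}\s\d{1,2})\s[\d:]+\s[+-]\d{4}\s(\d{4})")


def month_day_year(stamp):
    """Return 'Mon DD YYYY' from a stamp such as 'Mon Feb 01 02:42:27 +0000 2013'."""
    m = PATTERN.search(stamp)
    return f"{m.group(1)} {m.group(2)}" if m else None


print(month_day_year("Mon Feb 01 02:42:27 +0000 2013"))
```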