Something concerns me about regular expression - regex

I want to match some sub-string like date in month as "21st" or "22nd" or "23rd" in a string, so I made a regular expression using this pattern:
((\d{1,2})(st)|(nd)|(rd)|(th)).
I made these group because I want to do replace. But when I match some string like "Monday March 21st 2012", it always matches two sub-string: Mo'nd'ay March '21st' 2012.
So I am confused why it matches "Mo'nd'ay"?

Because you have a missing set of parenthesis. Try:
((\d{1,2})((st)|(nd)|(rd)|(th)))
What you had, matched:
(\d{1,2})(st)
OR (nd)
OR (rd)
OR (th)

You don't have correct parenthesis around your |s. You have ((\d{1,2})(st)|(nd)|(rd)|(th)), but you should have: (\d{1,2})(st|nd|rd|th).
You're matching the strings nd, rd, th, or (one or two digits followed by st).

Related

Is it possible to negate a group in a regular expression?

Let's say that we have this text:
2020-09-29
2020-09-30
2020-10-01
2020-10-02
2020-10-12
2020-10-16
2020-11-12
2020-11-23
2020-11-15
2020-12-01
2020-12-11
2020-12-30
I want to do something like this:
\d\d\d\d-(NOT10)-(30)
So i want to get all dates of any year, but not of the 10th month and it is important, that the day is 30.
I tried a lot to do this using negative lookahead asserations but i did not come up with any working regexes.
You can use negative lookaheads:
\d\d\d\d-(?!10)\d\d-30
The Part (?!10) ensures that no 10 follows at the point where it is inserted into the regex. Notice that you still need to match the following digits afterwards, thus the \d\d part.
Generally speaking you can not (to my knowledge) negate a part that then also matches parts of the string. But with negative lookaheads you can simulate this as I did above. The generalized idea looks something like:
(?!<special-exclusion-pattern>)<general-inclusion-pattern>
Where the special-exclusion-pattern matches a subset of the general-inclusion-pattern. In the above case the general inclusion pattern is \d\d and the special exclusion pattern ins 10.
Try :
/20\d{2}-(?:0[1-9]|1[12])-30/
Explanation :
20\d{2} it will match 20XX
(?:0[1-9]|1[12]) it will match 0X or 11, 12
30 it will match 30
Demo :https://regex101.com/r/O2F1eV/1
It's easiest to simply convert the substring (if present) that matches /^\d{4}-10-30$/ to an empty string, then split the resulting string on one or more newlines.
If your string were
2020-10-16
2020-10-30
2020-11-12
2020-11-23
and was held by the variable str, then in Ruby, for example,
str.sub(/^\d{4}-10-30$/,'')
#=> "2020-10-16\n\n2020-11-12\n2020-11-23\n"
so
str.sub(/^\d{4}-10-30$/,'').split
#=> ["2020-10-16", "2020-11-12", "2020-11-23"]
Whatever language you are using undoubtedly has similar methods.

How to creating a regex pattern in VBA to extract dates from string and exclude false matches

I am trying to use Regex to parse a series of strings to extract one or more text dates that may be in multiple formats. The strings will look something like the following:
24 Aug 2016: nno-emvirt010a/b; 16 Aug 2016 nnt-emvirt010a/b nnd-emvirt010a/b COSI-1.6.5
24.16 nno-emvirt010a/b nnt-emvirt010a/b nnd-emvirt010a/b EI.01.02.03\
9/23/16: COSI-1.6.5 Logs updated at /vobs/COTS/1.6.5/files/Status_2016-07-27.log, Status_2016-07-28.log, Status_2016-08-05.log, Status_2016-08-08.log
I am not concerned about validating the individual date fields; just extracting the date string. The part I am unable to figure out is how to not match on number sequences that match the pattern but aren’t dates (‘1.6.5’ in ex. (1) and 01.02.03 in ex. (2)) and dates that are part of a file name (2016-07-27 in ex. (3)). In each of these exception cases in my input data, the initial numbers are preceded by either a period(.), underscore (_) or dash (-), but I cannot determine how to use this to edit the pattern syntax to not match these strings.
The pattern I have that partially works is below. It will only ignore the non date matches if it starts with 1 digit as in example 1.
/[^_\.\(\/]\d{1,4}[/\-\.\s*]([1-9]|0[1-9]|[12][0-9]|3[01]|[a-z]{3})[/\-\.\s*]\d{1,4}/ig`
I am not sure about vba check if this works . seems they have given so much options : https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9781449327453/ch04s04.html
^(?:(1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])|↵
(3[01]|[12][0-9]|0?[1-9])/(1[0-2]|0?[1-9]))/(?:[0-9]{2})?[0-9]{2}$
^(?:
# m/d or mm/dd
(1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])
|
# d/m or dd/mm
(3[01]|[12][0-9]|0?[1-9])/(1[0-2]|0?[1-9])
)
# /yy or /yyyy
/(?:[0-9]{2})?[0-9]{2}$
According to the test strings you've presented, you can use the following regex
See this regex in use here
(?<=[^a-zA-Z\d.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:-\d{2}){2})|\d{2}\.\d{2})(?=[^a-zA-Z\d.])
This regex ensures that specific date formats are met and are preceded by nothing (beginning of the string) or by a non-word character (specifically a-z, A-Z, 0-9) or dot .. The date formats that will be matched are:
24 Aug 2016
24.16
9/23/16
The regex could be further manipulated to ensure numbers are in the proper range according to days/month, etc., however, I don't feel that is really necessary.
Edits
Edit 1
Since VBA doesn't support lookbehinds, you can use the following. The date is in capture group 1.
(?:[^a-zA-Z\d.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:-\d{2}){2})|\d{2}\.\d{2})(?=[^a-zA-Z\d.])
Edit 2
As per bulbus's comment below
(?:[^\w.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d{2,4})|(?:(?:\d{‌1,2}\/){2}\d{2,4})|(‌​?:\d{2,4}(?:-\d{2}){‌​2})|\d{2}\.\d{2})
Took liberty to edit that a bit.
replaced [^a-zA-Z\d.] with [^\w.], comes with added advantage of excluding dates with _2016-07-28.log
Due to 1 removed trailing condition (?=[^a-zA-Z\d.]).
Forced year digits from \d+ to \d{2,4}
Edit 3
Due to added conditions of the regex, I've made the following edits (to improve upon both previous edits). As per the OP:
The edited pattern above works in all but 2 cases:
it does not find dates with the year first (ex. 2016/07/11)
if the date is contained within parenthesis in the string, it returns the left parenthesis as part of the date (ex. match = (8/20/2016)
Can you provide the edit to fix these?
In the below regexes, I've changed years to \d+ in order for it to work on any year greater than or equal to 0.
See the code in use here
(?:[^\w.]|^)((?:\d{1,2}\s+[A-Z][a-z]{2}\s+\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:\/\d{1,2}){2})|(?:\d+(?:-\d{2}){2})|\d{2}\.\d+)
This regex adds the possibility of dates in the XXXX/XX/XX format where the date may appear first.
The reason you are getting ( as a match before the regex is the nature of the Full Match. You need to, instead, grab the value of the first capture group and not the whole regex result. See this answer on how to grab submatches from a regex pattern in VBA.
Also, note that any additional date formats you need to catch need to be explicitly set in the regex. Currently, the regex supports the following date formats:
\d{1,2}\s+[A-Z][a-z]{2}\s+\d+
12 Apr 17
12 Apr 2017
(?:\d{1,2}\/){2}\d+
1/4/17
01/04/17
1/4/2017
01/04/2017
\d+(?:\/\d{1,2}){2}
17/04/01
2017/4/1
2017/04/01
17/4/1
\d+(?:-\d{2}){2}
17-04-01
2017-04-01
\d{2}\.\d+ - Although I'm not sure what this date format is even used for and how it could be considered efficient if it's missing month
24.16

RegEx not matching date

Why does the date RegEx:
^(19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])$
Does not match 1999/01-01?
Can't figure this out. Is it because of delimiters?
This regex uses a reference to the second captured group using \2. Your second captured group is:
^(19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])$*
^^^^^^^^
Which is the delimiter of your date, it could be any of the following - /.. As of now, your second delimiter have to be the same than the first one (because of the reference to the second capturing group) and that's why you can't match string of the format xxxxZxxYxx if Z != Y.
If you want such case to be matched, you can change your regex to:
^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$*
But note that if this is a sort of correct regex to find a date in the range 1900-2099, this will not allow you to test if the date is correct or not (it doesn't check the correct pairing of days number with month, ie: you can have 31 days in February).

perl regular expression digit match

Doing the below regex match to verify whether date is in the YYYY_MM_DD Format. But the regular expression gives an error message if i have a value of 2012_07_7. Date part and month should be exactly 2 digits according to the regex pattern. Not sure why it's not working.
if ($cmdParams{RunId} !~ m/^\d{4}_\d{2}_\d{2}$/)
{
print "Not a valid date in the format YYYY_MM_DD";
}
Your regex specifies exactly 2 digits for the day component, if you want to allow either 1 or 2 digits you should use {1,2} rather than {2}
Well if you look at your data that you have: 2012_07_7 you can see that the day-part is not of two digits.
Obviously. Your pattern dictates that the last numeric chunk should be of two digits, whereas you are providing 1. So if you want your pattern to match this text, try something like:
if ($cmdParams{RunId} !~ m/^\d{4}_\d{2}_\d\d?$/)
My solution: ^\d{4}_(?:1[0-2]|0?[1-9])_(?:3[01]|[1-2]\d|0?[1-9])$
this pattern match: 2000_12_01 or 2001_1_1 or 2001_02_1

How do I write a Regular Expression to match any three digit number value?

I'm working with some pretty funky HTML markup that I inherited, and I need to remove the following attributes from about 72 td elements.
sdval="285"
I know I can do this with find/replace in my code editor, except since the value of each attribute is different by 5 degree increments, I can't match them all without a Regular Expression. (FYI I'm using Esspress and it does support RegExes in it's Find/Replace tool)
Only trouble is, I really can't figure out how to write a RegEx for this value. I understand the concept of RegExes, but really don't know how to use them.
So how would I write the following with a Regular Expression in place of the digits so that it would match any three digit value?
sdval="285"
/sdval="\d{3}"/
EDIT:
To answer your comment, \d in regular expressions means match any digit, and the {n} construct means repeat the previous item n times.
Easiest, most portable: [0-9][0-9][0-9]
More "modern": \d{3}
This should do (ignores leading zeros):
[1-9][0-9]{0,2}
import re
data = "719"
data1 = "79"
# This expression will match any single, double or triple digit Number
expression = '[\d]{1,3}'
print(re.search(expression, data).string)
# This expression will match only triple digit Number
expression1 = '[\d]{3}'
print(re.search(expression1, data1).string)
Output :
expression : 719
expression1 : 79
It sounds like you're trying to do a find / replace in Visual Studio of a 3 digit number (references to Express and Find/Replace tool). If that's the case the regex to find a 3 digit number in Visual Studio is the following
<:d:d:d>
Breakdown
The < and > establish a word boundary to make sure we don't match a number subset.
Each :d entry matches a single digit.