Datetime Regular Expression Doesn't Work - regex

I am having a hard time reading in a string that has date and time in the format:
YYYYMMDDHHmmSS.FFFF[+|-]ZZzz
YYYY is year,
MM is month (starting at 01 to 12),
DD is day (01-31),
HH is hour (00-23),
mm is minute (00-59),
SS is second (00-59),
FFFF is the fraction of the second (0000-9999),
ZZzz is "difference in hours (ZZ – values from +14 to –12) and
minutes (zz – values 00 to 59) from the Coordinated Universal Time
(UTC)."
This is the standard for transferring date time information in HL7, but don't worry about that. The problem I am having is the system handling the regular expression I have written for this standard refuses to let me add the dot following the second field. It also will not allow for the plus or minus prior to the ZZ field.
Here is the regular expression I have written:
/^(1|2)\\d{3}(0[1-9]|1[0-2])(0[1-9]|(1|2)[0-9]|3(0|1))((0|1)[0-9]|2[0-3])[0-5][0-9][0-5][0-9]\\.\\d{4}((\\+|\\-)0[0-9]|\\-1[0-2]|\\+1[0-4])[0-5][0-9]$/
It's for LimeSurvey, for the validation field of a given question. If you don't know what that is, just know that its regular expressions use Perl conventions.
Note that if I remove the \., or the \+ \-, it works just fine (except that the regex no longer enforces the standard). I've also tried not escaping the backslash, but that doesn't do anything either.
If anyone could point to why this isn't working, I would appreciate it. Note that if anything looks odd or redundant in the regex, that's most likely from me logically breaking it into the various fields for easier readability.

I barely changed your regular expression up until the +14 through -12 part. You can see it working here: http://www.regex101.com/r/jF1bA9
Final Regular Expression:
^(1|2)[0-9]{3}(0[1-9]|1[0-2])((0[1-9])|((1|2)[0-9])|3(0|1))((0|1)[0-9]|2[0-3])([0-5][0-9])([0-5][0-9])\.[0-9]{4}(\+0[0-9]|\+1[0-4]|-0[0-9]|-1[0-2])[0-5][0-9]$
Regular Expression explained:
Start of the line:
^ // start of line
Year:
(1|2)[0-9]{3}
Month:
(0[1-9]|1[0-2])
Day:
((0[1-9])|((1|2)[0-9])|3(0|1))
Hour:
((0|1)[0-9]|2[0-3])
Minutes:
([0-5][0-9])
Seconds:
([0-5][0-9])
Period:
\.
Fraction of the second:
[0-9]{4}
Matches +14 through -12 (What you probably need to change)
(\+0[0-9]|\+1[0-4]|-0[0-9]|-1[0-2])
Matches:
+14 +13 +12 +11 +10 +09 +08 +07 +06 +05 +04 +03 +02 +01 +00 -00 -01 -02 -03 -04 -05 -06 -07 -08 -09 -10 -11 -12
00 - 59:
[0-5][0-9]
End of Line:
$
You may need to change it to work with your specific language (I saw you had double backslashes in some areas like \\d)
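As a quick check outside LimeSurvey, the final expression can be exercised in Python. This is only a sketch: the name HL7_TS and the sample timestamps are made up for illustration, and it assumes Python's re module treats this particular pattern the same way a Perl-style engine does.

```python
import re

# Pattern copied from the answer above; a raw string keeps each backslash single.
HL7_TS = re.compile(
    r"^(1|2)[0-9]{3}(0[1-9]|1[0-2])((0[1-9])|((1|2)[0-9])|3(0|1))"
    r"((0|1)[0-9]|2[0-3])([0-5][0-9])([0-5][0-9])\.[0-9]{4}"
    r"(\+0[0-9]|\+1[0-4]|-0[0-9]|-1[0-2])[0-5][0-9]$"
)

# Hypothetical sample inputs, not taken from the question.
samples = [
    "20140131235959.1234+0530",  # well-formed
    "20140131235959.1234-1300",  # offset below -12
    "20141331235959.1234+0000",  # month 13
    "201401312359591234+0530",   # missing the dot
]

results = {s: bool(HL7_TS.match(s)) for s in samples}
for s, ok in results.items():
    print(s, "matches" if ok else "does not match")
```

Only the first sample should be accepted. If the host system wants the pattern inside a quoted string rather than a regex literal, the backslashes may need doubling there, which is a property of the string syntax and not of the regex.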

Related

substitution with eval and repeat the character by grouping string length?

My input as follow
my $s = '<B>Estimated:</B>
The N-terminal of the sequence considered is M (Met).
The estimated half-life is: 30 hours (mammalian reticulocytes, in vitro).
>20 hours (yeast, in vivo).
>10 hours (Escherichia coli, in vivo).
<B>Instability index:</B>
The instability index (II) is computed to be 31.98
This classifies the protein as stable.';
I want to remove the <B></B> tags from the string and underline the text that was inside them.
My expected output is
Estimated:
----------
The N-terminal of the sequence considered is M (Met).
The estimated half-life is: 30 hours (mammalian reticulocytes, in vitro).
>20 hours (yeast, in vivo).
>10 hours (Escherichia coli, in vivo).
Instability index:
------------------
The instability index (II) is computed to be 31.98
This classifies the protein as stable.
For this I tried the following regex, but I don't know what the problem is.
$s=~s/<B>(.+?)<\/B>/"$1\n";"-" x length($1)/seg; # $1\n is not working
In the above regex, I don't know how to emit "$1\n". And how can I chain statements in the substitution, separated by ; or anything else?
How can I fix it?
With the e modifier, the replacement is the value of the last statement executed, so
$s=~s/<B>(.+?)<\/B>/"$1\\n";"-" x length($1)/seg;
throws away the "$1\\n" (which should really be "$1\n")
This works:
$s=~s/<B>(.+?)<\/B>/"$1\n" . "-" x length($1)/seg;
The reason I was asking about your Perl version was to assess if it was possible to do what is effectively a variable-length lookbehind with \K:
$s=~s/<B>(.+?)<\/B>\K/ "\n" . "-" x length($1)/seg;
\K is available for Perl versions 5.10+.
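For comparison, the same "compute the replacement from the match" idea can be sketched in Python, where a function passed to re.sub plays the role of Perl's /e. The helper name underline and the shortened input are illustrative assumptions, not part of the original question.

```python
import re

# Shortened version of the input from the question.
s = """<B>Estimated:</B>
The N-terminal of the sequence considered is M (Met).
<B>Instability index:</B>
The instability index (II) is computed to be 31.98"""

def underline(match):
    # Heading text, a newline, then one dash per heading character.
    heading = match.group(1)
    return heading + "\n" + "-" * len(heading)

# re.S lets "." span newlines (like Perl's /s); re.sub replaces every match (like /g).
result = re.sub(r"<B>(.+?)</B>", underline, s, flags=re.S)
print(result)
```

As in the Perl fix, the heading and its underline are joined into a single value, so nothing is thrown away.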

Regular expression to match time range 00:00:00

I have dabbled with regex before for simple matches, however I think this is out of my league. I am using Google Analytics (GA) and I want to match Session Durations that come in the format of 00:00:00.
I found some articles similar to what I need but it does not match the range:
(^([0-1]?\d|2[0-9]):([0-9]?\d):([0-9]?\d)$)|(^([0-9]?\d):([0-9]?\d)$)|(^[0-9]?\d$)
The problem is I have had many visits that lasted 1 second and some for 1hr in between real visits that lasted say between 10sec and 10mins. Due to the quantity of invalid visits my average is skewed. So I want to add a filter in GA via regex to match times between 00:00:10 and 00:10:00.
You can use
/^[0-9]{2}:[0-9]{2}:[0-9]{2}$/
OR
/^\d{2}:\d{2}:\d{2}$/
if you want to match only from 00:00:00 to 99:99:99
Here '^' specifies start of pattern and '$' specifies end of pattern.
If you don't use them, the pattern will also match '99:99:99:99999', which is not the intended result. So specify them to mention the start and end of the pattern.
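The effect of the anchors can be seen with a small Python sketch. It assumes Python's re module as a stand-in engine; the variable names are illustrative.

```python
import re

text = "99:99:99:99999"

# Without anchors, search() finds the pattern somewhere inside the string.
loose = re.search(r"\d{2}:\d{2}:\d{2}", text)

# With ^ and $, the whole string has to fit the pattern.
strict = re.search(r"^\d{2}:\d{2}:\d{2}$", text)

print(loose.group(0) if loose else None)
print(strict)
```

The unanchored search reports a hit on the leading 99:99:99, while the anchored one finds nothing.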
If you also want to match single-digit fields greater than zero, like 9:9:96 and 01:8:20, then use
/^([1-9]{1}|[0-9]{2}):([1-9]{1}|[0-9]{2}):([1-9]{1}|[0-9]{2})$/
This may help... an answer without using groups and easy to maintain:
00:10:00|00:0[1-9]:[0-5][\d]|00:00:[1-5]\d
Works with
00:00:00 ignore
00:00:01 ignore
00:00:10 accept
00:00:11 accept
00:00:59 accept
00:00:60 ignore
00:05:03 accept
00:09:59 accept
00:10:00 accept
00:10:01 ignore
00:10:50 ignore
01:20:00 ignore
It accepts everything from 10 seconds to 10 minutes inclusive and ignores everything else.
Due to the quantity of invalid visits my average is skewed. So I want
to add a filter in GA via regex to match times between 00:00:10 and
00:10:00.
Interpreting this need, try something like this:
^00:(10:00|0[1-9]:[0-5][0-9]|00:[1-5][0-9])$
which is saying:
The "hours" part must be 00
The "minutes" and "seconds" parts can be: EITHER exactly 10:00, OR minutes 01 to 09 followed by any second (00 to 59), OR minutes 00 followed by any second between 10 and 59.
The result for a few test values:
00:00:00 NO
00:00:01 NO
00:00:10 YES
00:00:11 YES
00:00:20 YES
00:05:03 YES
00:09:59 YES
00:10:00 YES
00:10:50 NO
01:00:00 NO
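The table above can be reproduced with a short Python sketch. It assumes the Google Analytics filter accepts the same syntax as Python's re module for this pattern; the names DURATION, cases and accepted are illustrative.

```python
import re

# Pattern from the answer above: 00:00:10 through 00:10:00 inclusive.
DURATION = re.compile(r"^00:(10:00|0[1-9]:[0-5][0-9]|00:[1-5][0-9])$")

cases = [
    "00:00:00", "00:00:01", "00:00:10", "00:00:11", "00:00:20",
    "00:05:03", "00:09:59", "00:10:00", "00:10:50", "01:00:00",
]

accepted = [c for c in cases if DURATION.match(c)]
for c in cases:
    print(c, "YES" if c in accepted else "NO")
```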

Simplify regular expression for time literals (like "10h50m")

I am writing lexer rules for a custom description language using pyLR1 which shall include time literals like for example:
10h30m # meaning 10 hours + 30 minutes
5m30s # meaning 5 minutes + 30 seconds
10h20m15s # meaning 10 hours + 20 minutes + 15 seconds
15.6s # meaning 15.6 seconds
The order of specification for hour, minute and second parts shall be fixed to h, m, s. To specify this in detail, I want the following valid combinations hms, hm, h, ms, m and s (with numbers between the different segments of course).
As a bonus the regex should check for decimal (i.e. non-natural) numbers in the segments and only allow these in the segment with least significance.
So I have for all but the last group a number match like:
([0-9]+)
And for the last group even:
([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?) # to allow for .5 and 0.5 and 5.0 and 5
Going through all the combinations of h, m and s a cute little python script gives me the following regex:
(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)h|([0-9]+)h([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)h([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s)
Obviously, this is a little bit of horror expression. Is there any way to simplify this? The answer must work with pythons re module and I will also accept answers which do not work with pyLR1 if its due to its restricted subset of regular expressions.
You can factorise your regular expression. Using the notation h, m, s to denote each of the subregexes, the most basic version is:
h|hm|hms|ms|m|s
which is what you have currently. You can break this into:
(h|hm|hms)|(ms|m)|s
and then pulling out h from the first expression and m from the second we get (using (x|) == x?):
h(m|ms)?|ms?|s
Continuing on we get to
h(ms?)?|ms?|s
which is probably simpler (and probably the simplest).
Adding in the regex d to denote decimals (as in \.[0-9]+), this could be written as
h(d|m(d|sd?)?)?|m(d|sd?)?|sd?
(i.e. at each stage optionally have either decimals, or a continuation to the next of h m or s.)
This would result in something like (for just hours and minutes):
[0-9]+((\.[0-9]+)?h|h[0-9]+(\.[0-9]+)?m)|[0-9]+(\.[0-9]+)?m
Looking at this, it might not be possible to get it into a form amenable to pyLR1, so parsing with decimals allowed in every spot and then doing a secondary check might be the best way to do this.
the representation below should be understandable. I don't know the exact regex syntax you're using, so you will have to "translate" it to valid syntax yourself.
your hours
[0-9]{1,2}h
your minutes
[0-9]{1,2}m
your seconds
[0-9]{1,2}(\.[0-9]{1,3})?s
you want all those in order, and able to omit any of those (wrap with ?)
([0-9]{1,2}h)?([0-9]{1,2}m)?([0-9]{1,2}(\.[0-9]{1,3})?s)?
this however matches things like: 10h30s
that is valid combinations are hms, hm, hs, h, ms, m and s
or in other words, minutes can be omitted while hours and seconds are still present.
the other problem is if the empty string is given, it is matched, as all three ? make that valid. so you have to work around this somehow. hmm
looking at @dbaupp's h(ms?)?|ms?|s you can take the above and match:
h: [0-9]{1,2}h
m: [0-9]{1,2}m
s: [0-9]{1,2}(\.[0-9]{1,3})?s
so you get to:
h(ms?)?: [0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?
ms? : [0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?
s : [0-9]{1,2}(\.[0-9]{1,3})?s
all those OR'd together give you a big but easy to break down regex:
([0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?|[0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?|[0-9]{1,2}(\.[0-9]{1,3})?s)
which get you away with both the empty string problem and the match of hs.
looking at @Donal Fellows' comment on @dbaupp's answer, I'll also do (h?m)?s|h?m|h
(h?m)?s: (([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s
h?m : ([0-9]{1,2}h)?[0-9]{1,2}m
h : [0-9]{1,2}h
and merged together, you end up with something smaller than the above:
(([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s|([0-9]{1,2}h)?[0-9]{1,2}m|[0-9]{1,2}h
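The merged (h?m)?s|h?m|h form can be assembled from its three parts and tried out in Python. This is a sketch under the assumption that Python's re module is the target engine; the names H, M, S and TIME_LITERAL are illustrative.

```python
import re

# Building blocks from the answer above.
H = r"[0-9]{1,2}h"
M = r"[0-9]{1,2}m"
S = r"[0-9]{1,2}(\.[0-9]{1,3})?s"

# (h?m)?s | h?m | h
TIME_LITERAL = re.compile(rf"(({H})?{M})?{S}|({H})?{M}|{H}")

cases = ["10h30m", "5m30s", "10h20m15s", "15.6s", "10h", "10h30s", "", "30m10h"]

# fullmatch() requires the whole string to fit, so no ^ or $ is needed.
results = {c: bool(TIME_LITERAL.fullmatch(c)) for c in cases}
for c, ok in results.items():
    print(repr(c), ok)
```

The last three cases (hours followed directly by seconds, the empty string, and the wrong order) should be rejected, which is the point of this factorisation.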
now we have to find a way to match the .xx decimal representation
Here is a short Python expression that works:
(\d+h)?(\d+m)?(\d*\.\d+|\d+(\.\d*)?)(?(2)s|(?(1)m|[hms]))
Inspired by Cameron Martin's answer, which is based on conditionals.
Explained:
(\d+h)? # optional int "h" (capture 1)
(\d+m)? # optional int "m" (capture 2)
(\d*\.\d+|\d+(\.\d*)?) # int or decimal
(?(2) # if "m" (capture 2) was matched:
s # "s"
| (?(1) # else if "h" (capture 1) was matched:
m # "m"
| # else (nothing matched):
[hms])) # any of the "h", "m" or "s"
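A small Python sketch for trying the conditional expression against the examples from the question. The name CONDITIONAL and the list of rejected inputs are assumptions added for illustration; fullmatch is used so that trailing text is not silently ignored.

```python
import re

# Same expression as above, written on one line.
CONDITIONAL = re.compile(
    r"(\d+h)?(\d+m)?(\d*\.\d+|\d+(\.\d*)?)(?(2)s|(?(1)m|[hms]))"
)

valid = ["10h30m", "5m30s", "10h20m15s", "15.6s", "10.5h"]
invalid = ["10h30s", "1.5h30m", "10h20", "30m10h"]

accepted = [t for t in valid + invalid if CONDITIONAL.fullmatch(t)]
print(accepted)
```

With re.match instead of fullmatch, an input like 10h30s would still report a match on its leading 10h, so the choice of matching function matters here.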
You may have hours, minutes, and seconds.
/(\d{1,2}h)*(\d{1,2}m)*(\d{1,2}(\.\d+)*s)*/
should do the work. Depending on the regex library, you will get your items in order, or you will have to parse them further to check for h, m or s.
In this latter case, see also what is returned by
/(\d{1,2}(h))*(\d{1,2}(m))*(\d{1,2}(\.\d+)*(s))*/
The last group should be:
([0-9]*\.[0-9]+|[0-9]+(\.[0-9]+)?)
unless you want to match 5.
You could use regex ifs, like so:
(([0-9]+h)?([0-9]+m)?([0-9]+s)?)(?(?<=h)(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m)?|(?(?<=m)(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s)?|\b(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)[hms])?))
Here - http://regexr.com?31dmj
I haven't checked that this works, but it tries to match just integers for hours, minutes, then seconds first, then if the last thing matched is hours, it allows fractional minutes, otherwise if the last thing matched is minutes, it allows fractional seconds.

Regular expression for matching HH:MM time format

I want a regexp for matching time in HH:MM format. Here's what I have, and it works:
^[0-2][0-3]:[0-5][0-9]$
This matches everything from 00:00 to 23:59.
However, I want to change it so 0:00 and 1:00, etc are also matched as well as 00:00 and 01:30. I.e to make the leftmost digit optional, to match HH:MM as well as H:MM.
Any ideas how to make that change? I need this to work in javascript as well as php.
Your original regular expression has flaws: it wouldn't match 04:00 for example.
This may work better:
^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$
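To see the difference between the two expressions, here is a short Python sketch. The question targets JavaScript and PHP, so this assumes the patterns behave the same in Python's re module; the names ORIGINAL and REVISED are illustrative.

```python
import re

ORIGINAL = re.compile(r"^[0-2][0-3]:[0-5][0-9]$")           # from the question
REVISED = re.compile(r"^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$")  # from this answer

times = ["0:00", "04:00", "19:59", "23:59", "24:00", "12:60"]

original_hits = [t for t in times if ORIGINAL.match(t)]
revised_hits = [t for t in times if REVISED.match(t)]

print("original:", original_hits)
print("revised: ", revised_hits)
```

The original only accepts hours whose second digit is 0 to 3, which is why 04:00 and 19:59 fall through.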
Regular Expressions for Time
HH:MM 12-hour format, optional leading 0
/^(0?[1-9]|1[0-2]):[0-5][0-9]$/
HH:MM 12-hour format, optional leading 0, mandatory meridiems (AM/PM)
/((1[0-2]|0?[1-9]):([0-5][0-9]) ?([AaPp][Mm]))/
HH:MM 24-hour with leading 0
/^(0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]$/
HH:MM 24-hour format, optional leading 0
/^([0-9]|0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]$/
HH:MM:SS 24-hour format with leading 0
/(?:[01]\d|2[0-3]):(?:[0-5]\d):(?:[0-5]\d)/
None of the above worked for me.
In the end I used:
^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$ (js engine)
Logic:
The first number (hours) is either:
a number between 0 and 19 --> [0-1]?[0-9] (allowing single digit number)
or
a number between 20 - 23 --> 2[0-3]
the second number (minutes) is always a number between 00 and 59 --> [0-5][0-9] (not allowing a single digit)
You can use this one 24H, seconds are optional
^([0-1]?[0-9]|[2][0-3]):([0-5][0-9])(:[0-5][0-9])?$
The best would be for HH:MM without taking any risk.
^(0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]$
Amazingly I found actually all of these don't quite cover it, as they don't work for shorter format midnight of 0:0 and a few don't work for 00:00 either, I used and tested the following:
^([0-9]|0[0-9]|1?[0-9]|2[0-3]):[0-5]?[0-9]$
You can use this regular expression:
^(2[0-3]|[01]?[0-9]):([1-5]{1}[0-9])$
If you want to exclude 00:00, you can use this expression
^(2[0-3]|[01]?[0-9]):(0[1-9]{1}|[1-5]{1}[0-9])$
The second expression is the better option because it accepts every time from 0:01 to 23:59, while the first rejects any time with minutes 00 to 09. You can use either of these depending on your requirement.
As you asked for the leftmost digit to be optional, I have made both the leftmost and the rightmost digits optional. Check it out:
^([0-9]|0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]?$
It matches with
0:0
00:00
00:0
0:00
23:59
01:00
00:59
None of the above answers worked for me, the following one worked.
"[0-9]{2}:[0-9]{2}"
To validate 24h time, use:
^([0-1]?[0-9]|2?[0-3]|[0-9])[:\-\/]([0-5][0-9]|[0-9])$
This accepts:
22:10
2:10
2/1
...
But does not accept:
25:12
12:61
...
Description
hours:minutes with:
Mandatory am|pm or AM|PM
Mandatory leading zero 05:01 instead of 5:1
Hours from 01 up to 12
Hours does not accept 00 as in 00:16 am
Minutes from 00 up to 59
01:16 am ✅
01:16 AM ✅
01:16 ❌ (misses am|pm)
01:16 Am❌ (am must all be either lower or upper case)
1:16 am ❌ (Hours misses leading zero)
00:16 ❌ (Invalid hours value 00)
Regular Expression
To match single occurrence:
^(0[1-9]|1[0-2]):([0-5][0-9]) ((a|p)m|(A|P)M)$
To match multiple occurrences:
Remove ^ $
(0[1-9]|1[0-2]):([0-5][0-9]) ((a|p)m|(A|P)M)
You can use following regex:
^[0-1][0-9]:[0-5][0-9]$|^[2][0-3]:[0-5][0-9]$|^[2][3]:[0][0]$
Declare
import java.util.regex.Pattern;
// Compile once and reuse; Pattern objects are immutable and thread-safe.
private static final Pattern TIME24HOURS_PATTERN = Pattern.compile("([01]?[0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]");
public boolean validate(final String time) {
    // matches() requires the whole string to fit the pattern.
    return TIME24HOURS_PATTERN.matcher(time).matches();
}
This method returns true when the string matches the regular expression.
A slight modification to Manish M Demblani's contribution above
handles 4am
(I got rid of the seconds section as I don't need it in my application)
^(([0-1]{0,1}[0-9]( )?(AM|am|aM|Am|PM|pm|pM|Pm))|(([0]?[1-9]|1[0-2])(:|\.)[0-5][0-9]( )?(AM|am|aM|Am|PM|pm|pM|Pm))|(([0]?[0-9]|1[0-9]|2[0-3])(:|\.)[0-5][0-9]))$
handles:
4am
4 am
4:00
4:00am
4:00 pm
4.30 am
etc..
The below regex will help to validate hh:mm format
^([0-1][0-9]|2[0-3]):[0-5][0-9]$
Your code will not work properly, as it does not handle times such as 04:00. You can modify it as follows.
pattern = r"^(0?[1-9]|1[0-2]):[0-5][0-9]$"
To make it less complicated, we can use a variable to define the hour limit. We can also add meridiems for more accurate results.
import re
hours_limit = 12
# A character class cannot express a numeric range, so build the alternation 1|2|...|12.
hours = "|".join(str(h) for h in range(1, hours_limit + 1))
pattern = rf"^({hours}):[0-5][0-9]\s?[AaPp][Mm]$"
print(re.search(pattern, "2:59 pm"))
Check this one
/^([0-1]?[0-9]|2[0-3]):([0-5]?[0-9]|5[0-9])$/
Mine is:
^([01]?[0-9]|2[0-3]):[0-5][0-9]$
This is much shorter.
I tested it with several examples.
Match:
00:00
7:43
07:43
19:00
18:23
And doesn't match any invalid instance such as 25:76 etc ...
You can try the following
^\d{1,2}([:.]?\d{1,2})?([ ]?[ap]m)?$
It can detect the following patterns :
2300
23:00
4 am
4am
4pm
4 pm
04:30pm
04:30 pm
4:30pm
4:30 pm
04.30pm
04.30 pm
4.30pm
4.30 pm
23:59
0000
00:00
Check this masterful timestamp detector regex I built to look for a user-specified timestamp. Examples of what it will pick up include, but are most definitely NOT limited to:
8:30-9:40
09:40-09 : 50
09 : 40-09 : 50
09:40 - 09 : 50
08:00to05:00
08 : 00to05 : 00
08:00 to 05:00
8am-09pm
08h00 till 17h00
8pm-5am
08h00,21h00
06pm untill 9am
It'll also pick up many more, as long as the times include digits
Try the following
^([0-2][0-3]:[0-5][0-9])|(0?[0-9]:[0-5][0-9])$
Note: I was assuming the javascript regex engine. If it's different than that please let me know.
You can use following regex :
^[0-2]?[0-3]:[0-5][0-9]$
Only modification I have made is leftmost digit is optional. Rest of the regex is same.

Regular Expression to match valid dates

I'm trying to write a regular expression that validates a date. The regex needs to match the following
M/D/YYYY
MM/DD/YYYY
Single digit months can start with a leading zero (eg: 03/12/2008)
Single digit days can start with a leading zero (eg: 3/02/2008)
CANNOT include February 30 or February 31 (eg: 2/31/2008)
So far I have
^(([1-9]|1[012])[-/.]([1-9]|[12][0-9]|3[01])[-/.](19|20)\d\d)|((1[012]|0[1-9])(3[01]|2\d|1\d|0[1-9])(19|20)\d\d)|((1[012]|0[1-9])[-/.](3[01]|2\d|1\d|0[1-9])[-/.](19|20)\d\d)$
This matches properly EXCEPT it still includes 2/30/2008 & 2/31/2008.
Does anyone have a better suggestion?
Edit: I found the answer on RegExLib
^((((0[13578])|([13578])|(1[02]))[\/](([1-9])|([0-2][0-9])|(3[01])))|(((0[469])|([469])|(11))[\/](([1-9])|([0-2][0-9])|(30)))|((2|02)[\/](([1-9])|([0-2][0-9]))))[\/]\d{4}$|^\d{4}$
It matches all valid months that follow the MM/DD/YYYY format.
Thanks everyone for the help.
This is not an appropriate use of regular expressions. You'd be better off using
[0-9]{2}/[0-9]{2}/[0-9]{4}
and then checking ranges in a higher-level language.
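A minimal sketch of that two-step approach in Python: the regex only checks the shape, and the standard library's datetime.date does the range checking. The names MDY_SHAPE and is_valid_mdy are illustrative, and the pattern is loosened to one or two digits so that it covers the M/D/YYYY form from the question.

```python
import re
from datetime import date

# Shape check only: 1-2 digits, slash, 1-2 digits, slash, 4 digits.
MDY_SHAPE = re.compile(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})")

def is_valid_mdy(text):
    """Return True if text looks like M/D/YYYY and names a real calendar date."""
    m = MDY_SHAPE.fullmatch(text)
    if m is None:
        return False
    month, day, year = (int(part) for part in m.groups())
    try:
        date(year, month, day)  # raises ValueError for e.g. February 30
    except ValueError:
        return False
    return True

for sample in ["3/12/2008", "02/29/2008", "2/29/2007", "2/31/2008"]:
    print(sample, is_valid_mdy(sample))
```

Leap years come for free this way, since date() already knows that 2008 has a 29 February and 2007 does not.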
Here is the regex that matches all valid dates, including leap years. Accepted formats: mm/dd/yyyy, mm-dd-yyyy or mm.dd.yyyy
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
courtesy Asiq Ahamed
I landed here because the title of this question is broad and I was looking for a regex that I could use to match on a specific date format (like the OP). But I then discovered, as many of the answers and comments have comprehensively highlighted, there are many pitfalls that make constructing an effective pattern very tricky when extracting dates that are mixed-in with poor quality or non-structured source data.
In my exploration of the issues, I have come up with a system that enables you to build a regular expression by arranging together four simpler sub-expressions that match on the delimiter, and valid ranges for the year, month and day fields in the order you require.
These are :-
Delimiters
[^\w\d\r\n:]
This will match anything that is not a word character, digit character, carriage return, new line or colon. The colon has to be there to prevent matching on times that look like dates (see my test Data)
You can optimise this part of the pattern to speed up matching, but this is a good foundation that detects most valid delimiters.
Note however; It will match a string with mixed delimiters like this 2/12-73 that may not actually be a valid date.
Year Values
(\d{4}|\d{2})
This matches a group of two or 4 digits, in most cases this is acceptable, but if you're dealing with data from the years 0-999 or beyond 9999 you need to decide how to handle that because in most cases a 1, 3 or >4 digit year is garbage.
Month Values
(0?[1-9]|1[0-2])
Matches any number between 1 and 12 with or without a leading zero - note: 0 and 00 are not matched.
Date Values
(0?[1-9]|[12]\d|30|31)
Matches any number between 1 and 31 with or without a leading zero - note: 0 and 00 are not matched.
This expression matches Date, Month, Year formatted dates
(0?[1-9]|[12]\d|30|31)[^\w\d\r\n:](0?[1-9]|1[0-2])[^\w\d\r\n:](\d{4}|\d{2})
But it will also match some of the Year, Month, Date ones. It should also be bookended with the boundary operators (\b) to ensure the whole date string is selected and to prevent valid sub-dates being extracted from data that is not well-formed, i.e. without boundary operators 20/12/194 matches as 20/12/19 and 101/12/1974 matches as 01/12/1974
Compare the results of the next expression to the one above with the test data in the nonsense section (below)
\b(0?[1-9]|[12]\d|30|31)[^\w\d\r\n:](0?[1-9]|1[0-2])[^\w\d\r\n:](\d{4}|\d{2})\b
There's no validation in this regex so a well-formed but invalid date such as 31/02/2001 would be matched. That is a data quality issue, and as others have said, your regex shouldn't need to validate the data.
Because you (as a developer) can't guarantee the quality of the source data you do need to perform and handle additional validation in your code, if you try to match and validate the data in the RegEx it gets very messy and becomes difficult to support without very concise documentation.
Garbage in, garbage out.
Having said that, if you do have mixed formats where the date values vary, and you have to extract as much as you can; You can combine a couple of expressions together like so;
This (disastrous) expression matches DMY and YMD dates
(\b(0?[1-9]|[12]\d|30|31)[^\w\d\r\n:](0?[1-9]|1[0-2])[^\w\d\r\n:](\d{4}|\d{2})\b)|(\b(0?[1-9]|1[0-2])[^\w\d\r\n:](0?[1-9]|[12]\d|30|31)[^\w\d\r\n:](\d{4}|\d{2})\b)
BUT you won't be able to tell if dates like 6/9/1973 are the 6th of September or the 9th of June. I'm struggling to think of a scenario where that is not going to cause a problem somewhere down the line, it's bad practice and you shouldn't have to deal with it like that - find the data owner and hit them with the governance hammer.
Finally, if you want to match a YYYYMMDD string with no delimiters you can take some of the uncertainty out and the expression looks like this
\b(\d{4})(0[1-9]|1[0-2])(0[1-9]|[12]\d|30|31)\b
But note again, it will match on well-formed but invalid values like 20010231 (31st Feb!) :)
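The way the four sub-expressions are arranged into a full pattern, and the effect of the \b boundaries, can be sketched in Python as follows. The constant names and the first_match helper are hypothetical, and Python's re module is assumed as the engine.

```python
import re

# The four building blocks described above.
DELIM = r"[^\w\d\r\n:]"
YEAR = r"(\d{4}|\d{2})"
MONTH = r"(0?[1-9]|1[0-2])"
DAY = r"(0?[1-9]|[12]\d|30|31)"

# Day, Month, Year order, without and with word boundaries.
DMY_LOOSE = re.compile(DAY + DELIM + MONTH + DELIM + YEAR)
DMY_BOUNDED = re.compile(r"\b" + DAY + DELIM + MONTH + DELIM + YEAR + r"\b")

def first_match(pattern, text):
    """Return the first matched substring, or None if there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None

for text in ["20/12/194", "101/12/1974", "31/02/2001"]:
    print(text, first_match(DMY_LOOSE, text), first_match(DMY_BOUNDED, text))
```

The bounded pattern drops the two malformed inputs instead of extracting a sub-date from them, but it still accepts 31/02/2001, so calendar validation has to happen elsewhere.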
Test data
In experimenting with the solutions in this thread I ended up with a test data set that includes a variety of valid and non-valid dates and some tricky situations where you may or may not want to match i.e. Times that could match as dates and dates on multiple lines.
I hope this is useful to someone.
Valid Dates in various formats
Day, month, year
2/11/73
02/11/1973
2/1/73
02/01/73
31/1/1973
02/1/1973
31.1.2011
31-1-2001
29/2/1973
29/02/1976
03/06/2010
12/6/90
month, day, year
02/24/1975
06/19/66
03.31.1991
2.29.2003
02-29-55
03-13-55
03-13-1955
12\24\1974
12\30\1974
1\31\1974
03/31/2001
01/21/2001
12/13/2001
Match both DMY and MDY
12/12/1978
6/6/78
06/6/1978
6/06/1978
using whitespace as a delimiter
13 11 2001
11 13 2001
11 13 01
13 11 01
1 1 01
1 1 2001
Year Month Day order
76/02/02
1976/02/29
1976/2/13
76/09/31
YYYYMMDD sortable format
19741213
19750101
Valid dates before Epoch
12/1/10
12/01/660
12/01/00
12/01/0000
Valid date after 2038
01/01/2039
01/01/39
Valid date beyond the year 9999
01/01/10000
Dates with leading or trailing characters
12/31/21/
31/12/1921AD
31/12/1921.10:55
12/10/2016 8:26:00.39
wfuwdf12/11/74iuhwf
fwefew13/11/1974
01/12/1974vdwdfwe
01/01/99werwer
12321301/01/99
Times that look like dates
12:13:56
13:12:01
1:12:01PM
1:12:01 AM
Dates that runs across two lines
1/12/19
74
01/12/19
74/13/1946
31/12/20
08:13
Invalid, corrupted or nonsense dates
0/1/2001
1/0/2001
00/01/2100
01/0/2001
0101/2001
01/131/2001
31/31/2001
101/12/1974
56/56/56
00/00/0000
0/0/1999
12/01/0
12/10/-100
74/2/29
12/32/45
20/12/194
2/12-73
Maintainable Perl 5.10 version
/
(?:
(?<month> (?&mon_29)) [\/] (?<day>(?&day_29))
| (?<month> (?&mon_30)) [\/] (?<day>(?&day_30))
| (?<month> (?&mon_31)) [\/] (?<day>(?&day_31))
)
[\/]
(?<year> [0-9]{4})
(?(DEFINE)
(?<mon_29> 0?2 )
(?<mon_30> 0?[469] | (11) )
(?<mon_31> 0?[13578] | 1[02] )
(?<day_29> 0?[1-9] | [1-2]?[0-9] )
(?<day_30> 0?[1-9] | [1-2]?[0-9] | 30 )
(?<day_31> 0?[1-9] | [1-2]?[0-9] | 3[01] )
)
/x
You can retrieve the elements by name in this version.
say "Month=$+{month} Day=$+{day} Year=$+{year}";
( No attempt has been made to restrict the values for the year. )
To control a date validity under the following format :
YYYY/MM/DD or YYYY-MM-DD
I would recommend using the following regular expression:
(((19|20)([2468][048]|[13579][26]|0[48])|2000)[/-]02[/-]29|((19|20)[0-9]{2}[/-](0[469]|11)[/-](0[1-9]|[12][0-9]|30)|(19|20)[0-9]{2}[/-](0[13578]|1[02])[/-](0[1-9]|[12][0-9]|3[01])|(19|20)[0-9]{2}[/-]02[/-](0[1-9]|1[0-9]|2[0-8])))
Matches
2016-02-29 | 2012-04-30 | 2019/09/30
Non-Matches
2016-02-30 | 2012-04-31 | 2019/09/35
You can customise it if you want to allow only '/' or only '-' separators.
This RegEx strictly controls the validity of the date and verifies 28-, 30- and 31-day months, as well as 29 February in leap years.
Try it, it works very well and protects your code from a lot of bugs!
FYI : I made a variant for the SQL datetime. You'll find it there (look for my name) : Regular Expression to validate a timestamp
Feedback is welcome :)
Sounds like you're overextending regex for this purpose. What I would do is use a regex to match a few date formats and then use a separate function to validate the values of the date fields so extracted.
Perl expanded version
Note use of /x modifier.
/^(
(
( # 31 day months
(0[13578])
| ([13578])
| (1[02])
)
[\/]
(
([1-9])
| ([0-2][0-9])
| (3[01])
)
)
| (
( # 30 day months
(0[469])
| ([469])
| (11)
)
[\/]
(
([1-9])
| ([0-2][0-9])
| (30)
)
)
| ( # 29 day month (Feb)
(2|02)
[\/]
(
([1-9])
| ([0-2][0-9])
)
)
)
[\/]
# year
\d{4}$
| ^\d{4}$ # year only
/x
Original
^((((0[13578])|([13578])|(1[02]))[\/](([1-9])|([0-2][0-9])|(3[01])))|(((0[469])|([469])|(11))[\/](([1-9])|([0-2][0-9])|(30)))|((2|02)[\/](([1-9])|([0-2][0-9]))))[\/]\d{4}$|^\d{4}$
If you didn't get the above suggestions working, I use this. I ran this expression through 50 links, and it got all the dates (in the 20YY-Mon-DD form) on each page.
^20\d\d-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(0[1-9]|[1-2][0-9]|3[01])$
This regex validates dates between 01-01-1900 and 12-31-2099 with matching separators.
^(0[1-9]|1[012])([- /.])(0[1-9]|[12][0-9]|3[01])\2(19|20)\d\d$
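The \2 backreference is what forces both separators to be the same character. A short Python sketch, assuming Python's re module and with illustrative names and sample inputs:

```python
import re

# \2 must repeat whatever the second group (the first separator) captured.
DATE = re.compile(r"^(0[1-9]|1[012])([- /.])(0[1-9]|[12][0-9]|3[01])\2(19|20)\d\d$")

cases = ["12-31-2099", "01.01.1900", "12/31-2099", "13-01-2000", "02/30/2000"]

results = {c: bool(DATE.match(c)) for c in cases}
for c, ok in results.items():
    print(c, ok)
```

Mixed separators and month 13 are rejected, while 02/30/2000 is still accepted because the pattern checks field ranges and not the calendar.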
// Wrapper name is illustrative. Expects YYYY-MM-DD.
function looksLikeDate(date) {
  var dtRegex = /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/;
  if (dtRegex.test(date)) {
    var evalDate = date.split('-');
    if (evalDate[0] != '0000' && evalDate[1] != '00' && evalDate[2] != '00') {
      return true;
    }
  }
  return false;
}
Regex was not meant to validate number ranges (this number must be from 1 to 5 when the number preceding it happens to be a 2 and the number preceding that happens to be below 6).
Just look for the pattern of placement of numbers in the regex. If you need to validate the qualities of a date, put it in a date object (JS/C#/VB) and interrogate the numbers there.
I know this does not answer your question, but why don't you use a date handling routine to check if it's a valid date? Even if you modify the regexp with a negative lookahead assertion like (?!31/0?2) (ie, do not match 31/2 or 31/02) you'll still have the problem of accepting 29 02 on non leap years and about a single separator date format.
The problem is not easy if you want to really validate a date, check this forum thread.
For an example or a better way, in C#, check this link
If you are using another platform/language, let us know
Perl 6 version
rx{
^
$<month> = (\d ** 1..2)
{ $<month> <= 12 or fail }
'/'
$<day> = (\d ** 1..2)
{
given( +$<month> ){
when 1|3|5|7|8|10|12 {
$<day> <= 31 or fail
}
when 4|6|9|11 {
$<day> <= 30 or fail
}
when 2 {
$<day> <= 29 or fail
}
default { fail }
}
}
'/'
$<year> = (\d ** 4)
$
}
After you use this to check the input the values are available in $/ or individually as $<month>, $<day>, $<year>. ( those are just syntax for accessing values in $/ )
No attempt has been made to check the year, or to reject the 29th of February in non-leap years.
If you're going to insist on doing this with a regular expression, I'd recommend something like:
( (0?1|0?3| <...> |10|11|12) / (0?1| <...> |30|31) |
0?2 / (0?1| <...> |28|29) )
/ (19|20)[0-9]{2}
This might make it possible to read and understand.
/(([1-9]{1}|0[1-9]|1[0-2])\/(0[1-9]|[1-9]{1}|[12]\d|3[01])\/[12]\d{3})/
This validates the following (the regex expects the month first, then the day, then the year):
Single and 2 digit day with range from 1 to 31. Eg, 1, 01, 11, 31.
Single and 2 digit month with range from 1 to 12. Eg. 1, 01, 12.
4 digit year. Eg. 2021, 1980.
A slightly different approach that may or may not be useful for you.
I'm in php.
The project this relates to will never have a date prior to the 1st of January 2008. So, I take the 'date' entered and use strtotime(). If the answer is >= 1199167200 then I have a date that is useful to me. If something that doesn't look like a date is entered, -1 is returned (false as of PHP 5.1). If null is entered it does return today's date number, so you do need a check for a non-null entry first.
Works for my situation, perhaps yours too?