RegEx to match specific sentence plus date and time - regex

I've tried figuring out how to make a regex match a specific sentence followed by a date and a time. I cannot for the life of me figure it out!
I want to match the following sentence, where the date and time of course may be random:
Den 25/01/2013 kl. 14.03 skrev
So it should match like this: Den dd/mm/yyyy kl. hh.mm skrev
Note that time is in 24-hour format.
Can anyone help here? I can easily find an example that matches a date or time, but I don't know how to combine it with this specific sentence :(
Thanks in advance

Use it just by combining them as:
Den (0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/(0{3}[1-9]|((?!0{3}\d)\d{4})) kl\. ([01][0-9]|2[0-3])\.([0-5][0-9]) skrev
Note : Date not validated properly. Will match 30/02/2000
Den matches Den as such.
(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2]) matches the day and month. (0{3}[1-9]|((?!0{3}\d)\d{4})) matches a four-digit year but avoids 0000.
kl\. matches kl. The \ before the . escapes the ., which is a special character in regex.
([01][0-9]|2[0-3])\.([0-5][0-9]) matches time from 00.00 to 23.59
skrev matches skrev as such.
The following validates the date a bit better:
Den ((0[1-9]|[12][0-9]|3[01])/(?=(0[13578]|1[02]))(0[13578]|1[02])|(0[1-9]|[12][0-9]|30)/(?=(0[469]|11))(0[469]|11)|(0[1-9]|[12][0-9])/(?=(02))(02))/(0{3}[1-9]|((?!0{3}\d)\d{4})) kl\. ([01][0-9]|2[0-3])\.([0-5][0-9]) skrev
Still matches 29/02/1999 - No validation for leap year or not
To match single digit days and months also, replace the date part with the following:
(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[0-2])/(0{3}[1-9]|((?!0{3}\d)\d{4}))
The ? makes the preceding part optional i.e. the 0 becomes optional.
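As a rough illustration of how the combined pattern behaves, here is a minimal Python sketch. It assumes the hour alternative is written as `2[0-3]`; the names `HEADER` and `parse_header` are made up for this example and are not part of the answer.

```python
import re

# Combined pattern: literal words around a dd/mm/yyyy date and an hh.mm time.
# Group 4 is the inner year alternative, so the useful groups are 1, 2, 3, 5, 6.
HEADER = re.compile(
    r"Den (0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/(0{3}[1-9]|((?!0{3}\d)\d{4}))"
    r" kl\. ([01][0-9]|2[0-3])\.([0-5][0-9]) skrev"
)


def parse_header(line):
    """Return (day, month, year, hour, minute) as strings, or None if no match."""
    m = HEADER.search(line)
    if m is None:
        return None
    return m.group(1, 2, 3, 5, 6)


print(parse_header("Den 25/01/2013 kl. 14.03 skrev"))
print(parse_header("Den 25/01/2013 kl. 24.03 skrev"))  # hour 24 is rejected
```

As the answer notes, the calendar is not checked, so a line containing 30/02/2000 is still accepted.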

`Den ([0-3]\d)/([0-1]\d)/(\d{4}) kl\. ([0-2]\d)\.([0-5]\d) skrev`
to catch the values in order to facilitate validation.

Maybe not the smartest solution, but this expression should fit your request:
Den\s[0-3][0-9]/[0-1][0-9]/[0-9][0-9][0-9][0-9]\skl\.\s[0-2][0-9]\.[0-5][0-9]\sskrev

Turns out all I actually needed was:
(Den ../../.... kl. ..... skrev)
Since . just matches random characters, and because this sentence is auto-generated by an e-mail client, there is no need to actually validate if it's a date, but merely look for this pattern and discard everything after. Nobody would ever write that so specifically in the middle of regular text.
In case anybody is wondering, this is for SpiceWorks reply header filtering.
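How SpiceWorks applies the pattern is not shown here, so the following is only a sketch of the "find the header and discard what follows" idea in Python. The function name `strip_quoted_reply` and the sample message are invented for illustration, and cutting from the start of the header is an assumption.

```python
import re

# The loose pattern from above: '.' stands in for every date and time character.
REPLY_HEADER = re.compile(r"Den ../../.... kl. ..... skrev")


def strip_quoted_reply(body):
    """Return the text before the first reply header, or the whole body if none."""
    m = REPLY_HEADER.search(body)
    return body if m is None else body[: m.start()]


message = "Tak for hjælpen!\nDen 25/01/2013 kl. 14.03 skrev Support:\n> gammel tekst"
print(repr(strip_quoted_reply(message)))
```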

Related

Regex to validate date format

I'm figuring out a way to validate my input date in Altova MapForce, which is in the format "YYYYMMDD".
I know to verify year I can use [0-9]{4} but I'm having trouble figuring out a way to "restrict" date range to "01-31" and month to "01-12". Please note "01" is valid while "1" should be invalid.
Can someone please provide a regex expression to validate this sort of input?
From searching the internet I got one for the day: ([1-9]|[12]\d|3[01]) but this one is valid for the range 1-31. I want 01-31 and so on.
For a month from 1-12 adding a zero for a single digit 1-9:
(?:0[1-9]|1[012])
For a day 1-31 adding a zero for a single digit 1-9:
(?:0[1-9]|[12]\d|3[01])
Putting it all together with 4 digits to match a year (note that \d{4} can also match 0000 and 9999), enclosed in word boundaries \b to prevent a partial match with leading or trailing digits / word characters:
\b\d{4}(?:0[1-9]|1[012])(?:0[1-9]|[12]\d|3[01])\b
A variation, limiting the scope for a year to for example 1900 - 2099
\b(?:19|20)\d{2}(?:0[1-9]|1[012])(?:0[1-9]|[12]\d|3[01])\b
But note that this does not validate the date itself, it can for example also match 20210231. To validate a date, use a designated api for handling a date in the tool or code.
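One way to combine the two steps, sketched in Python rather than MapForce: the regex checks the shape (zero-padded fields, eight digits) and the standard library's `datetime.strptime` checks the calendar. The function name `is_valid_yyyymmdd` is an assumption made for this example.

```python
import re
from datetime import datetime

# Shape check only: 4-digit year, month 01-12, day 01-31 (ASCII digits).
SHAPE = re.compile(r"\d{4}(?:0[1-9]|1[012])(?:0[1-9]|[12]\d|3[01])", re.ASCII)


def is_valid_yyyymmdd(text):
    """True if text looks like YYYYMMDD and names a real calendar date."""
    if SHAPE.fullmatch(text) is None:
        return False
    try:
        datetime.strptime(text, "%Y%m%d")  # raises ValueError for e.g. 20210231
    except ValueError:
        return False
    return True


for sample in ("20210228", "20210231", "20240229", "2021011"):
    print(sample, is_valid_yyyymmdd(sample))
```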

Is it possible to negate a group in a regular expression?

Let's say that we have this text:
2020-09-29
2020-09-30
2020-10-01
2020-10-02
2020-10-12
2020-10-16
2020-11-12
2020-11-23
2020-11-15
2020-12-01
2020-12-11
2020-12-30
I want to do something like this:
\d\d\d\d-(NOT10)-(30)
So I want to get all dates of any year, but not those in the 10th month, and it is important that the day is 30.
I tried a lot to do this using negative lookahead assertions, but I did not come up with any working regexes.
You can use negative lookaheads:
\d\d\d\d-(?!10)\d\d-30
The Part (?!10) ensures that no 10 follows at the point where it is inserted into the regex. Notice that you still need to match the following digits afterwards, thus the \d\d part.
Generally speaking you can not (to my knowledge) negate a part that then also matches parts of the string. But with negative lookaheads you can simulate this as I did above. The generalized idea looks something like:
(?!<special-exclusion-pattern>)<general-inclusion-pattern>
Where the special-exclusion-pattern matches a subset of the general-inclusion-pattern. In the above case the general inclusion pattern is \d\d and the special exclusion pattern is 10.
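A quick way to see the lookahead at work is to run it over a list of dates. This Python sketch makes two assumptions not in the answer: the pattern is anchored per line with `^` and `$` under `re.MULTILINE`, and the line 2020-10-30 is added to the sample so the exclusion is actually exercised.

```python
import re

# Part of the question's sample, plus 2020-10-30 (added here for illustration).
text = """2020-09-29
2020-09-30
2020-10-01
2020-10-30
2020-11-15
2020-12-30
"""

# (?!10) rejects month 10 without consuming anything; \d\d then matches the month.
pattern = re.compile(r"^\d{4}-(?!10)\d\d-30$", re.MULTILINE)
matches = pattern.findall(text)
print(matches)  # ['2020-09-30', '2020-12-30']
```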
Try :
/20\d{2}-(?:0[1-9]|1[12])-30/
Explanation :
20\d{2} it will match 20XX
(?:0[1-9]|1[12]) it will match 0X or 11, 12
30 it will match 30
Demo :https://regex101.com/r/O2F1eV/1
It's easiest to simply convert the substring (if present) that matches /^\d{4}-10-30$/ to an empty string, then split the resulting string on one or more newlines.
If your string were
2020-10-16
2020-10-30
2020-11-12
2020-11-23
and was held by the variable str, then in Ruby, for example,
str.sub(/^\d{4}-10-30$/,'')
#=> "2020-10-16\n\n2020-11-12\n2020-11-23\n"
so
str.sub(/^\d{4}-10-30$/,'').split
#=> ["2020-10-16", "2020-11-12", "2020-11-23"]
Whatever language you are using undoubtedly has similar methods.

Regex expression for date within dates range

I need to validate with regex a date in format yyyy-mm-dd (2019-12-31) that should be within the range 2019-12-20 - 2020-01-10.
What would be the regex for this?
Thanks
Regexes only deal with characters, so we have to work out, at each position in the date, which characters are valid.
The first part is easy: the first two characters have to be 20.
Now it gets complicated. The next character can be a 1 or a 2, but what follows depends on the value of that character, so we split the rest of the regex into two sections: the first applies if the third character is a 1 and the second if it is a 2.
We know that if the third character is a 1 then what must follow is the characters 9-12-, as the range starts at 2019-12-20. Now for the day part. The 9th character is the tens digit of the day; this can only be 2 or 3, as we are already in the last month and the minimum date is 20. The last character can be any digit 0-9. This gives us a day match of [23][0-9]. Putting this together, we now have a pattern for years starting 2019 as 19-12-[23][0-9]
If the third character is a 2 then we can match up to the day part of the date again, as the range ends in January. This gives us a partial match of 20-01-, leaving us to work on the day part. Here we know that the first character of the day can be either a 1 or a 0; however, if it's a 1 then the last character must be a 0, and if it's a 0 then the last character can only be in the range 1 to 9. This gives us another alternation, (?:0[1-9]|10). Putting the second part together we get 20-01-(?:0[1-9]|10).
Combining these together gives the final regex 20(?:19-12-[23][0-9]|20-01-(?:0[1-9]|10))
Note that I'm assuming that the date you are testing against is a validly formatted date.
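The derivation can be checked by brute force. The Python sketch below runs every real date of 2019 and 2020 through the final regex and compares the result with an ordinary date comparison; `in_range` and `mismatches` are names chosen for this example.

```python
import re
from datetime import date, timedelta

RANGE = re.compile(r"20(?:19-12-[23][0-9]|20-01-(?:0[1-9]|10))")


def in_range(text):
    """True if the whole string matches the range regex."""
    return RANGE.fullmatch(text) is not None


start, end = date(2019, 12, 20), date(2020, 1, 10)
mismatches = []
day = date(2019, 1, 1)
while day <= date(2020, 12, 31):
    expected = start <= day <= end
    if in_range(day.isoformat()) != expected:
        mismatches.append(day.isoformat())
    day += timedelta(days=1)

print(mismatches)              # empty: regex and comparison agree on real dates
print(in_range("2019-12-32"))  # True: an invalid date slips through, as noted above
```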
Try this:
(2019|2020)\-(12|01)\-([0-3][0-9]|[0-9])
But be aware that this will allow any dd value whose first digit is between zero and three and whose second digit is between zero and nine. You could specify all numbers you want to allow (from 20 to 10) like this (20|21|22|23|24|25|26|27|28|29|30|31|01|1|02|2|03|3|04|4|05|5|06|6|07|7|08|8|09|9|10).
(2019|2020)\-(12|01)\-(20|21|22|23|24|25|26|27|28|29|30|31|01|1|02|2|03|3|04|4|05|5|06|6|07|7|08|8|09|9|10)
But honestly... Regular-Expressions are not the right tool for this. RegExp gives a mask to something, not a logical context. Use regex to extract the data/value from a string and validate those values using another language.
The second regex above will, for example, match your dates, but also values outside of this range, since there is no connection between the first group 2019|2020 and the second group 12|01: it matches 2019-12-21 but also 2020-12-21.
To only match the values you want this will be a really large regex like this (inner brackets only if you need them) ((2019)-(12)-(20)|(2019)-(12)-(21)|(2019)-(12)-(22)|...) and continue with all possible dates - and ask yourself: what would you do if you find such a regex in a project you have to work with ;)
Better solution (quick and dirty, there might be better solutions):
(?<yyyy>20[0-9]{2})\-(?<mm>[01][0-9]|[0-9])\-(?<dd>[0-3][0-9]|[0-9])
This way you have three named groups (yyyy, mm, dd) you can access and validate the matched values... The regex is smaller, you have a better association between code and regex and both are easier to maintain.
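A sketch of that split in Python, where named groups are spelled `(?P<name>...)`: the regex only extracts the three fields and the range check is done on real date objects. The function name `date_in_range` and its default bounds are assumptions for this example.

```python
import re
from datetime import date

# Same idea as the regex above, using Python's named-group syntax.
DATE = re.compile(
    r"(?P<yyyy>20[0-9]{2})-(?P<mm>[01][0-9]|[0-9])-(?P<dd>[0-3][0-9]|[0-9])"
)


def date_in_range(text, start=date(2019, 12, 20), end=date(2020, 1, 10)):
    """Extract yyyy-mm-dd with the regex, then validate and compare as dates."""
    m = DATE.fullmatch(text)
    if m is None:
        return False
    try:
        parsed = date(int(m["yyyy"]), int(m["mm"]), int(m["dd"]))
    except ValueError:  # e.g. month 13 or 30 February
        return False
    return start <= parsed <= end


for sample in ("2019-12-31", "2020-1-5", "2020-12-11", "2020-02-30"):
    print(sample, date_in_range(sample))
```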

How to create a regex pattern in VBA to extract dates from a string and exclude false matches

I am trying to use Regex to parse a series of strings to extract one or more text dates that may be in multiple formats. The strings will look something like the following:
24 Aug 2016: nno-emvirt010a/b; 16 Aug 2016 nnt-emvirt010a/b nnd-emvirt010a/b COSI-1.6.5
24.16 nno-emvirt010a/b nnt-emvirt010a/b nnd-emvirt010a/b EI.01.02.03\
9/23/16: COSI-1.6.5 Logs updated at /vobs/COTS/1.6.5/files/Status_2016-07-27.log, Status_2016-07-28.log, Status_2016-08-05.log, Status_2016-08-08.log
I am not concerned about validating the individual date fields; just extracting the date string. The part I am unable to figure out is how to not match on number sequences that match the pattern but aren’t dates (‘1.6.5’ in ex. (1) and 01.02.03 in ex. (2)) and dates that are part of a file name (2016-07-27 in ex. (3)). In each of these exception cases in my input data, the initial numbers are preceded by either a period(.), underscore (_) or dash (-), but I cannot determine how to use this to edit the pattern syntax to not match these strings.
The pattern I have that partially works is below. It will only ignore the non date matches if it starts with 1 digit as in example 1.
/[^_\.\(\/]\d{1,4}[/\-\.\s*]([1-9]|0[1-9]|[12][0-9]|3[01]|[a-z]{3})[/\-\.\s*]\d{1,4}/ig
I am not sure about VBA, so check whether this works. They give quite a few options here: https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9781449327453/ch04s04.html
^(?:(1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])/(1[0-2]|0?[1-9]))/(?:[0-9]{2})?[0-9]{2}$
^(?:
# m/d or mm/dd
(1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])
|
# d/m or dd/mm
(3[01]|[12][0-9]|0?[1-9])/(1[0-2]|0?[1-9])
)
# /yy or /yyyy
/(?:[0-9]{2})?[0-9]{2}$
According to the test strings you've presented, you can use the following regex
See this regex in use here
(?<=[^a-zA-Z\d.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:-\d{2}){2})|\d{2}\.\d{2})(?=[^a-zA-Z\d.])
This regex ensures that specific date formats are met and are preceded by nothing (beginning of the string) or by a character that is not a letter (a-z, A-Z), a digit (0-9) or a dot (.). The date formats that will be matched are:
24 Aug 2016
24.16
9/23/16
The regex could be further manipulated to ensure numbers are in the proper range according to days/month, etc., however, I don't feel that is really necessary.
Edits
Edit 1
Since VBA doesn't support lookbehinds, you can use the following. The date is in capture group 1.
(?:[^a-zA-Z\d.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:-\d{2}){2})|\d{2}\.\d{2})(?=[^a-zA-Z\d.])
Edit 2
As per bulbus's comment below
(?:[^\w.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d{2,4})|(?:(?:\d{1,2}\/){2}\d{2,4})|(?:\d{2,4}(?:-\d{2}){2})|\d{2}\.\d{2})
Took liberty to edit that a bit.
Replaced [^a-zA-Z\d.] with [^\w.], which has the added advantage of excluding dates such as the one in _2016-07-28.log
Because of the previous change, removed the trailing condition (?=[^a-zA-Z\d.]).
Forced year digits from \d+ to \d{2,4}
Edit 3
Due to added conditions of the regex, I've made the following edits (to improve upon both previous edits). As per the OP:
The edited pattern above works in all but 2 cases:
it does not find dates with the year first (ex. 2016/07/11)
if the date is contained within parentheses in the string, it returns the left parenthesis as part of the date (ex. match = (8/20/2016)
Can you provide the edit to fix these?
In the below regexes, I've changed years to \d+ in order for it to work on any year greater than or equal to 0.
See the code in use here
(?:[^\w.]|^)((?:\d{1,2}\s+[A-Z][a-z]{2}\s+\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:\/\d{1,2}){2})|(?:\d+(?:-\d{2}){2})|\d{2}\.\d+)
This regex adds the possibility of dates in the XXXX/XX/XX format where the year may appear first.
The reason you are getting ( as a match before the regex is the nature of the Full Match. You need to, instead, grab the value of the first capture group and not the whole regex result. See this answer on how to grab submatches from a regex pattern in VBA.
Also, note that any additional date formats you need to catch need to be explicitly set in the regex. Currently, the regex supports the following date formats:
\d{1,2}\s+[A-Z][a-z]{2}\s+\d+
12 Apr 17
12 Apr 2017
(?:\d{1,2}\/){2}\d+
1/4/17
01/04/17
1/4/2017
01/04/2017
\d+(?:\/\d{1,2}){2}
17/04/01
2017/4/1
2017/04/01
17/4/1
\d+(?:-\d{2}){2}
17-04-01
2017-04-01
\d{2}\.\d+ - Although I'm not sure what this date format is even used for and how it could be considered efficient if it's missing month
24.16
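To illustrate the "take capture group 1, not the full match" point, here is a small Python sketch using the Edit 3 regex. It is not VBA (where the submatch would be read from the match object instead), and `extract_dates` is a name invented for this example. The last test string is made up to cover the parenthesis and year-first cases.

```python
import re

# Edit 3 regex: the prefix (?:[^\w.]|^) is consumed but not captured,
# so only group 1 holds the date itself.
DATE = re.compile(
    r"(?:[^\w.]|^)("
    r"(?:\d{1,2}\s+[A-Z][a-z]{2}\s+\d+)"
    r"|(?:(?:\d{1,2}\/){2}\d+)"
    r"|(?:\d+(?:\/\d{1,2}){2})"
    r"|(?:\d+(?:-\d{2}){2})"
    r"|\d{2}\.\d+)"
)


def extract_dates(text):
    """Return group 1 of every match, i.e. the date without the leading character."""
    return [m.group(1) for m in DATE.finditer(text)]


print(extract_dates("24 Aug 2016: nno-emvirt010a/b; 16 Aug 2016 nnt-emvirt010a/b COSI-1.6.5"))
print(extract_dates("9/23/16: COSI-1.6.5 Logs updated at /vobs/COTS/1.6.5/files/Status_2016-07-27.log"))
print(extract_dates("see (8/20/2016) and 2016/07/11"))
```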

VIN Validation RegEx

I have written a VIN validation RegEx based on the http://en.wikipedia.org/wiki/Vehicle_identification_number but then when I try to run some tests it is not accepting some valid VIN Numbers.
My RegEx:
^[A-HJ-NPR-Za-hj-npr-z\\d]{8}[\\dX][A-HJ-NPR-Za-hj-npr-z\\d]{2}\\d{6}$
VIN Number Not Working:
1ftfw1et4bfc45903
WP0ZZZ99ZTS392124
VIN Numbers Working:
19uya31581l000000
1hwa31aa5ae006086
(I think the problem occurs with the numbers at the end, Wikipedia made it sound like it would end with only 6 numbers and the one that is not working but is a valid number only ends with 5)
Any Help Correcting this issue would be greatly appreciated!
I can't help you with a perfect regex for VIN numbers -- but I can explain why this one is failing in your example of 1ftfw1et4bfc45903:
^[A-HJ-NPR-Za-hj-npr-z\d]{8}[\dX][A-HJ-NPR-Za-hj-npr-z\d]{2}\d{6}$
Explanation:
^[A-HJ-NPR-Za-hj-npr-z\d]{8}
This allows for 8 characters, composed of any digits and any letters except I, O, and Q; it properly finds the first 8 characters:
1ftfw1et
[\dX]
This allows for 1 character, either a digit or a capital X; it properly finds the next character:
4
[A-HJ-NPR-Za-hj-npr-z\d]{2}
This allows for 2 characters, composed of any digits and any letters except I, O, and Q; it properly finds the next 2 characters:
bf
\d{6}$
This allows for exactly 6 digits, and is the reason the regex fails; because the final 6 characters are not all digits:
c45903
Dan is correct - VINs have a checksum. You can't utilize that in regex, so the best you can do with regex is casting too wide of a net. By that I mean that your regex will accept all valid VINs, and also around a trillion (rough estimate) non-VIN 17-character strings.
If you are working in a language with named capture groups, you can extract that data as well.
So, if your goal is:
Only to not reject valid VINs (letting in invalid ones is ok) then use Fransisco's answer, [A-HJ-NPR-Z0-9]{17}.
Not reject valid VINs, and grab info like model year, plant code, etc, then use this (note, you must use a language that can support named capture groups - off the top of my head: Perl, Python (which spells them (?P<name>...)), Elixir, almost certainly others; JavaScript only since ES2018): /^(?<wmi>[A-HJ-NPR-Z\d]{3})(?<vds>[A-HJ-NPR-Z\d]{5})(?<check>[\dX])(?<vis>(?<year>[A-HJ-NPR-Z\d])(?<plant>[A-HJ-NPR-Z\d])(?<seq>[A-HJ-NPR-Z\d]{6}))$/ where the names are defined at the end of this answer.
Not reject valid VINs, and prevent some but not all invalid VINs, you can get specific like Pedro does.
Only accept valid VINs: you need to write code (just kidding, GitHub exists).
Capture group name key:
wmi - World manufacturer identifier
vds - Vehicle descriptor section
check - Check digit
vis - Vehicle identifier section
year - Model year
plant - Plant code
seq - Production sequence number
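For reference, the named-group pattern might be used like this in Python. This is a sketch: the groups are rewritten in Python's `(?P<name>...)` syntax, the input is upper-cased first, and `split_vin` is a hypothetical helper name.

```python
import re

# Same structure as the pattern above, in Python's named-group syntax.
VIN = re.compile(
    r"(?P<wmi>[A-HJ-NPR-Z\d]{3})"
    r"(?P<vds>[A-HJ-NPR-Z\d]{5})"
    r"(?P<check>[\dX])"
    r"(?P<vis>(?P<year>[A-HJ-NPR-Z\d])(?P<plant>[A-HJ-NPR-Z\d])(?P<seq>[A-HJ-NPR-Z\d]{6}))",
    re.ASCII,
)


def split_vin(vin):
    """Return a dict of the named sections, or None if the shape does not fit."""
    m = VIN.fullmatch(vin.upper())
    return None if m is None else m.groupdict()


print(split_vin("1ftfw1et4bfc45903"))
print(split_vin("WP0ZZZ99ZTS392124"))
```

Note that the second VIN from the question has a Z in position 9, so the `check` group `[\dX]` does not accept it and the function returns None for that input.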
This regular expression is working fine for validating US VINs, including the one you described:
[A-HJ-NPR-Z0-9]{17}
Remember to make it case insensitive with flag i
Source: https://github.com/rfink/angular-vin
VIN should have only A-Z, 0-9 characters, but not I, O, or Q
Last 6 characters of VIN should be a number
VIN should be 17 characters long
You didn't specify which language you're using but the following regex can be used to validate a US VIN with php:
/^(?:([A-HJ-NPR-Z]){3}|\d{3})(?1){2}\d{2}(?:(?1)|\d)(?:\d|X)(?:(?1)+\d+|\d+(?1)+)\d{6}$/i
I feel regex is not the ideal validation. VINs have a built in check digit. https://en.wikibooks.org/wiki/Vehicle_Identification_Numbers_(VIN_codes)/Check_digit or http://www.vsource.org/VFR-RVF_files/BVINcalc.htm
I suggest you build an algorithm using this. (Untested algorithm example)
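A minimal Python sketch of such an algorithm, based on the commonly documented North American scheme (letters transliterated to digits, fixed position weights, sum modulo 11, remainder 10 written as X). The function names are invented for this example, and VINs from regions that do not use the check digit will simply fail this test.

```python
import re

# Letter-to-value transliteration; I, O and Q never appear in a VIN.
LETTERS = "ABCDEFGHJKLMNPRSTUVWXYZ"
VALUES = "12345678123457923456789"
TRANSLIT = {c: int(v) for c, v in zip(LETTERS, VALUES)}
TRANSLIT.update({str(d): d for d in range(10)})

# Position weights; position 9 (the check digit itself) has weight 0.
WEIGHTS = (8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2)


def check_digit(vin):
    """Compute the expected 9th character for a 17-character VIN."""
    vin = vin.upper()
    if re.fullmatch(r"[A-HJ-NPR-Z0-9]{17}", vin) is None:
        raise ValueError("not a 17-character VIN")
    total = sum(TRANSLIT[c] * w for c, w in zip(vin, WEIGHTS))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)


def has_valid_check_digit(vin):
    """True if position 9 matches the computed check digit."""
    try:
        return check_digit(vin) == vin.upper()[8]
    except ValueError:
        return False


print(check_digit("1M8GDM9AXKP042788"))            # X
print(has_valid_check_digit("11111111111111111"))  # True
```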
This should work. It is from a Splunk search, so there are some additional exclusions:
(?i)(?<VIN>[A-HJ-NPR-Z0-9]{11}\d{6})
The NHTSA website provides the method used to calculate the 9th character checksum, if you're interested. It also provides lots of other useful data, such as which characters are allowed in which position, or how to determine whether the 10th character, if alphabetic, refers to a model year up to 1999 or a model year from 2010.
NHTSA VIN eCFR
Hope that helps.
Please, use this regular expression. It is shorter and works with all VIN types
(?=.*\d|[A-Z])(?=.*[A-Z])[A-Z0-9]{17}
I replaced the formula above with the new one below:
(?=.*\d|=.*[A-Z])(?=.*[A-Z])[A-Z0-9]{17}
This regular expression accepts any letter but requires at least one digit, and 17 characters in total.