Regex to extract date with negative lookahead - regex

I am using this pattern to extract confirmation dates from a text file and converting them to a date object (see my post here Extract/convert date from string in MS Access).
The current pattern matches all strings that look like a date, but may not be the confirmation date (which is always preceded by Confirmed by), and moreover, may not have complete date information (e.g. no AM or PM).
Pattern: (\d+/\d+/\d+\s+\d+:\d+:\d+\s+\w+|\d+-\w+-\d+\s+\d+:\d+:\d+)
Sample text:
WHEN COMPARED WITH RESULT OF 7/13/12 09:06:42 NO SIGNIFICANT
CHANGE; Confirmed by SMITH, MD, JOHN (2242) on 7/14/2012 3:46:21 PM;
The above pattern matches the following:
WHEN COMPARED WITH RESULT OF 7/13/12 09:06:42 NO SIGNIFICANT
^^^^^^^^^^^^^^^^^^^^
CHANGE; Confirmed by SMITH, MD, JOHN (2242) on 7/14/2012 3:46:21 PM;
^^^^^^^^^^^^^^^^^^^^
I want the pattern to look for the date in the segment of the text file that begins with Confirmed by and ends with a semi-colon. Also, in order to properly convert the time, the pattern should match only AM or PM at the end. How can I restrict the pattern to this segment and add the additional AM or PM criteria?
Can anyone help?

In order to match the end of the string, use $ at the end of your regex. To match the entire phrase "Confirmed by <someone> on <date>", use plain text (remember that plain text can be used in a regex as well -- if you aren't using special characters, the matcher will match your query verbatim). You need to use a negative look-ahead to exclude entire words. So maybe something like this:
Confirmed by (?!\ on\ )(\d+/\d+/\d+\s+\d+:\d+:\d+\s+\w+|\d+-\w+-\d+\s+\d+:\d+:\d+)$
Which will allow you to match a string that starts with "Confirmed by", followed by anything except for " on ", followed by the date that you capture, and the end of the string.
Edit: the negative look-ahead part is tricky, look at the answer below for more reference:
A regular expression to exclude a word/string

I don't see any need for a lookahead here, positive or negative. This works correctly on your sample string:
Confirmed by [^;]*(\d+/\d+/\d+\s+\d+:\d+:\d+(?:\s+(?:AM|PM))?|\d+-\w+-\d+\s+\d+:\d+:\d+);
The [^;]* effectively corrals the match between a Confirmed by sequence and its closing semicolon. (I'm assuming the semicolon will always be present.)
(?:\s+(?:AM|PM))? makes the AM/PM optional, along with its leading whitespace.
The actual date will be stored in capturing group #1.
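A minimal Python sketch of how this pattern behaves on the sample text from the question. The question targets MS Access, so the variable names and the strptime format string here are assumptions made only for illustration.

```python
import re
from datetime import datetime

# Sample text copied from the question.
sample = (
    "WHEN COMPARED WITH RESULT OF 7/13/12 09:06:42 NO SIGNIFICANT\n"
    "CHANGE; Confirmed by SMITH, MD, JOHN (2242) on 7/14/2012 3:46:21 PM;"
)

# The pattern from the answer: [^;]* keeps the match inside one
# "Confirmed by ... ;" segment, and group 1 holds the date.
confirmed = re.compile(
    r"Confirmed by [^;]*"
    r"(\d+/\d+/\d+\s+\d+:\d+:\d+(?:\s+(?:AM|PM))?|\d+-\w+-\d+\s+\d+:\d+:\d+);"
)

match = confirmed.search(sample)
date_text = match.group(1) if match else None
print(date_text)

# Assumed layout (month/day/year, 12-hour clock with AM/PM) for conversion.
parsed = None
if date_text is not None:
    parsed = datetime.strptime(date_text, "%m/%d/%Y %I:%M:%S %p")
    print(parsed)
```

The earlier date on the first line is ignored because it does not follow a Confirmed by prefix.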

Try this:
(\d+/\d+/\d+\s+\d+:\d+:\d+\s+(?:AM|PM));

The simplest answer is, more often than not, a good enough solution. By turning off the default greedy behavior (using the question mark: .*?), the regular expression will instead try to find the shortest match that satisfies the pattern. A pattern never matches the same part of the string more than once, which means that each Confirmed by can only be coupled with one date, in this case the next one to follow.
Confirmed by.*?(\d+/\d+/\d+\s+\d+:\d+:\d+\s+(?:AM|PM));
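The difference between the lazy and greedy forms shows up when the month has two digits. The following Python sketch uses a hypothetical line modelled on the question's sample (the 12/14/2012 date is an assumption, not data from the question) and compares the two quantifiers.

```python
import re

# Hypothetical line with a two-digit month, modelled on the question's sample.
line = "CHANGE; Confirmed by SMITH, MD, JOHN (2242) on 12/14/2012 3:46:21 PM;"

# Shared tail: the date, a mandatory AM/PM, and the closing semicolon.
date_part = r"(\d+/\d+/\d+\s+\d+:\d+:\d+\s+(?:AM|PM));"

lazy = re.compile(r"Confirmed by.*?" + date_part)
greedy = re.compile(r"Confirmed by.*" + date_part)

# The lazy form expands one character at a time, so the date starts
# at the first position where the date pattern can match.
lazy_date = lazy.search(line).group(1)

# The greedy form backtracks from the right, so it stops as soon as the
# remaining text still looks like a date, dropping the leading digit.
greedy_date = greedy.search(line).group(1)

print(lazy_date)
print(greedy_date)
```

Because \d+ accepts a single digit, the greedy version is satisfied with a shortened month; the lazy version avoids this on such input.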

Related

Prxmatch in SAS - using $ to limit results doesn't work

I'm trying to use prxmatch to verify if postcode format (UK) is correct. The ('/^[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2}$/') bit covers (I think) all the possible post code formats used in UK, however I only want exact and not partial matches and no additional chars before or after match.
data pc_flag ; set abc ;
format pc_correct_flag $1. compressed_postcode $100.;
compressed_postcode = compress(postcode);
pc_regex = prxparse('/^[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2}$/');
if prxmatch(pc_regex,compressed_postcode)>0
then pc_correct_flag='Y';
else pc_correct_flag='N';
run;
I was expecting 'Y' only on exact matches on full string, i.e. with no additional characters before and after regex. However, I'm also getting false positives, where a part of 'compressed_postcode' matches regex, but there are additional characters after the match, which I thought using $ would prevent.
I.e. I'd expect only something like AA11AA to match, but not AA11AAAA. I suspect this has to do with $ positioning but can't figure out exactly what's wrong. Any idea what I've missed?
SAS character variables contain trailing spaces out to the length of the variable. Either trim the value to be examined, or add \s*$ as the pattern termination.
if prxmatch(pc_regex,TRIM(compressed_postcode))>0 then …
Your regex is quite permissive - it allows every letter of the alphabet in every valid character position, so it matches quite a lot of strings that look like valid postcodes but do not exist as such, e.g. ZZ1 1ZZ.
I provided a more specific SAS-compatible postcode regex as an answer to another question - here's a link in case this proves useful to you:
https://stackoverflow.com/a/43793562/667489
That one still matches some non-postcode strings, but it filters out any with characters on Royal Mail's blacklists for each position within the postcode.
As per Richard's answer, you need to trim the string being matched before applying the regex, or amend the regex to match extra trailing blanks.
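One further point worth checking, not raised in the answers above: in Perl-style regex syntax the | operator has lower precedence than the anchors, so in the question's pattern ^ applies only to the first alternative and $ only to the second. The Python sketch below illustrates this, along with the trailing-blank handling via \s*$. It assumes SAS PRX follows the same precedence rules as Python's re module here; the helper name and the padded value are made up for the illustration.

```python
import re

# The pattern exactly as written in the question.
original = re.compile(r"^[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2}$")

# Same alternatives, grouped so both anchors apply to each one,
# with \s* to tolerate trailing blanks as suggested in the answer.
grouped = re.compile(
    r"^(?:[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2})\s*$"
)


def is_postcode(pattern, value):
    """Return True if the pattern finds a match anywhere in value."""
    return pattern.search(value) is not None


# Simulate a fixed-length character variable padded with blanks.
padded = "AA11AA".ljust(100)

print(is_postcode(original, "AA11AAAA"))  # first alternative has no $ anchor
print(is_postcode(grouped, "AA11AAAA"))
print(is_postcode(grouped, "AA11AA"))
print(is_postcode(grouped, padded))
```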

regex to match specific pattern of string followed by digits

Sample input:
___file___name___2000___ed2___1___2___3
DIFFERENT+FILENAME+(2000)+1+2+3+ed10
Desired output (e.g., all letters, 4-digit numbers, and a literal 'ed' followed immediately by a number of arbitrary length):
file name 2000 ed2
DIFFERENT FILENAME 2000 ed10
I am using:
[A-Za-z]+|[\d]{4}|ed\d+ which only returns:
file name 2000 ed
DIFFERENT FILENAME 2000 ed
I see that there is a related Q+A here: Regular Expression to match specific string followed by number?
eg using ed[0-9]* would match ed#, but unsure why it does not match in the above.
As written, your regex is correct. Remember, however, that the regex engine tries the alternatives from left to right. Your ed\d+ is never going to match, because the ed was already consumed by your [A-Za-z]+ alternative. Reorder your regex and it'll work just fine:
ed\d+|[a-zA-Z]+|\d{4}
Demo
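A short Python sketch of the ordering effect, using the two sample inputs from the question. The variable names are illustrative.

```python
import re

inputs = [
    "___file___name___2000___ed2___1___2___3",
    "DIFFERENT+FILENAME+(2000)+1+2+3+ed10",
]

# Original order: the letters-only alternative wins and consumes "ed".
original = re.compile(r"[A-Za-z]+|[\d]{4}|ed\d+")

# Reordered: "ed" followed by digits is tried first.
reordered = re.compile(r"ed\d+|[a-zA-Z]+|\d{4}")

original_results = [original.findall(text) for text in inputs]
reordered_results = [reordered.findall(text) for text in inputs]

for tokens in reordered_results:
    print(" ".join(tokens))
```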
Nick's answer is right, but because in-order matching can be a less readable "gotcha", the best (order-insensitive) ways to do this kind of search are 1) with specified delimiters, and 2) by making each search term unique.
Jan's answer handles #1 well. But you would have to specify each specific delimiter, including its length (e.g. ___). It sounds like you may have some unusual delimiters, so this may not be ideal.
For #2, then, you can make each search term unique. (That is, you want the thing matching "file" and "name" to be distinct from the thing matching "2000", and both to be distinct from the thing matching "ed2".)
One way to do this is [A-Za-z]+(?![0-9a-zA-Z])|[\d]{4}|ed\d+. This is saying that for the first type of search term, you want an alphabetic string that is not followed by an alphanumeric character. This keeps it distinct from the third search term, which is an alphabetic string followed by some digit(s). This also allows you to specify any range of delimiters inside of that negative lookahead.
demo
You might very well use (just grab the first capturing group):
(?:^|___|[+(]) # delimiter before
([a-zA-Z0-9]{2,}) # the actual content
(?=$|___|[+)]) # delimiter afterwards
See a demo on regex101.com
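As a rough illustration, the commented pattern above can be used as-is in Python with re.VERBOSE, which ignores the whitespace and the # comments. The variable names are assumptions; note that this pattern accepts any alphanumeric run of two or more characters, so it does not enforce the 4-digit rule on its own.

```python
import re

delimited = re.compile(
    r"""
    (?:^|___|[+(])          # delimiter before
    ([a-zA-Z0-9]{2,})       # the actual content
    (?=$|___|[+)])          # delimiter afterwards
    """,
    re.VERBOSE,
)

inputs = [
    "___file___name___2000___ed2___1___2___3",
    "DIFFERENT+FILENAME+(2000)+1+2+3+ed10",
]

# With a single capturing group, findall returns the group contents.
results = [delimited.findall(text) for text in inputs]

for tokens in results:
    print(" ".join(tokens))
```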

Regular expression: matching part of words [duplicate]

I'm trying to make a Regex that matches this string {Date HH:MM:ss}, but here's the trick: HH, MM and ss are optional, but it needs to be "HH", not just "H" (the same thing applies to MM and ss). If a single "H" shows up, the string shouldn't be matched.
I know I can use H{2} to match HH, but I can't seem to use that functionality plus the ? to match zero or one time (zero because it's optional, and one time max).
So far I'm doing this (which is obviously not working):
Regex dateRegex = new Regex(@"\{Date H{2}?:M{2}?:s{2}?\}");
Next question. Now that I have the match on the first string, I want to take only the HH:MM:ss part and put it in another string (that will be the format for a TimeStamp object).
I used the same approach, like this:
Regex dateFormatRegex = new Regex(@"(HH)?:?(MM)?:?(ss)?");
But when I try that on "{Date HH:MM}" I don't get any matches. Why?
If I add a space like this Regex dateFormatRegex = new Regex(@" (HH)?:?(MM)?:?(ss)?");, I have the result, but I don't want the space...
I thought that the first parenthesis needed to be escaped, but \( won't work in this case. I guess because it's not a parenthesis that is part of the string to match, but a key-character.
(H{2})? matches zero or two H characters.
However, in your case, writing it twice would be more readable:
Regex dateRegex = new Regex(@"\{Date (HH)?:(MM)?:(ss)?\}");
Besides that, make sure there are no functions available for whatever you are trying to do. Parsing dates is pretty common and most programming languages have functions in their standard library - I'd almost bet 1k of my reputation that .NET has such functions, too.
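A small Python sketch of the optional-group idea. The question uses .NET, so this assumes the quantifier behaviour shown is the same in both engines; the helper name is made up.

```python
import re

# H{2}? is a lazy "exactly two", so it still requires two H characters.
lazy_two = re.compile(r"H{2}?")
# (H{2})? makes the whole pair optional: zero or two H characters.
optional_pair = re.compile(r"(H{2})?")

# The pattern from the answer, with each pair written out and made optional.
date_tag = re.compile(r"\{Date (HH)?:(MM)?:(ss)?\}")


def is_date_tag(text):
    """Return True if the whole text matches the date tag pattern."""
    return date_tag.fullmatch(text) is not None


print(lazy_two.fullmatch("") is not None)
print(optional_pair.fullmatch("") is not None)
print(is_date_tag("{Date HH:MM:ss}"))
print(is_date_tag("{Date H:MM:ss}"))   # a single H is rejected
print(is_date_tag("{Date HH:MM}"))     # both colons are still required
```

As the last line shows, this pattern keeps both colons mandatory, so a string that drops a trailing part together with its colon is not matched.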
In your edit you mention an unwanted leading space in the result. To check a leading or trailing condition together with your regex, without including it in the result, you can use the lookaround feature of regex.
new Regex(@"(?<=Date )(HH)?:?(MM)?:?(ss)?")
(?<=...) is a lookbehind pattern.
Regex test site with this example.
For input Date HH:MM:ss, it will match both regexes (with or without lookbehind).
But input FooBar HH:MM:ss will still match a simple regex, but the lookbehind will fail here. Lookaround doesn't change the content of the result, but it prevents false matches (e.g., this second input that is not a Date).
Find more information on regex and lookaround here.
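A Python sketch of the two patterns side by side, assuming lookbehind behaves the same way as in .NET for this fixed-width case. It also suggests why the pattern without the space appeared to return nothing: every part is optional, so the first match is an empty string at the very start of the input.

```python
import re

simple = re.compile(r"(HH)?:?(MM)?:?(ss)?")
with_lookbehind = re.compile(r"(?<=Date )(HH)?:?(MM)?:?(ss)?")

text = "{Date HH:MM}"

# All parts are optional, so the first match is empty, at position 0.
simple_match = simple.search(text)
print(repr(simple_match.group(0)), simple_match.start())

# The lookbehind forces the match to start right after "Date ",
# without including "Date " (or the space) in the result.
anchored_match = with_lookbehind.search(text)
print(repr(anchored_match.group(0)), anchored_match.start())

# Input that is not a Date: the lookbehind version finds nothing.
other = with_lookbehind.search("FooBar HH:MM:ss")
print(other)
```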

Regular expression for repeated sequence

I am a learner of regular expressions. I am trying to find the dates in the string below. The element <ext:serviceitem> can be repeated up to 20 times in the actual XML. I need to take out only the date strings (that is, for any element ending with Date in its name, I need that element's value, which is a date). For example and . I want all those dates (only) to be printed out.
<ext:serviceitem><ext:name>EnhancedSupport</ext:name><ext:serviceItemData><ext:serviceItemAttribute name="Name">E69D7F93-81F4-09E2-E043-9D3226AD8E1D-1</ext:serviceItemAttribute><ext:serviceItemAttribute name="ProductionDatabase">P1APRD</ext:serviceItemAttribute><ext:serviceItemAttribute name="SupportType">Monthly</ext:serviceItemAttribute><ext:serviceItemAttribute name="Environment">DV1</ext:serviceItemAttribute><ext:serviceItemAttribute name="StartDate">2013-11-04 10:02</ext:serviceItemAttribute><ext:serviceItemAttribute name="EndDate">2013-11-12 10:02</ext:serviceItemAttribute><ext:serviceItemAttribute name="No_of_WeeksSupported"></ext:serviceItemAttribute><ext:serviceItemAttribute name="Cost"></ext:serviceItemAttribute><ext:serviceItemAttribute name="SupportNotes"></ext:serviceItemAttribute><ext:serviceItemAttribute name="FiscalQuarterNumber"></ext:serviceItemAttribute><ext:subscription><ext:loginID>kbasavar</ext:loginID><ext:ouname>020072748</ext:ouname></ext:subscription></ext:serviceItemData></ext:serviceitem><ext:serviceitem><ext:name>EnhancedSupport</ext:name><ext:serviceItemData><ext:serviceItemAttribute name="Name">E69D7F93-81F4-09E2-E043-9D3226AD8E1D-2</ext:serviceItemAttribute><ext:serviceItemAttribute name="ProductionDatabase">P1BPRD</ext:serviceItemAttribute><ext:serviceItemAttribute name="SupportType">Quarterly</ext:serviceItemAttribute><ext:serviceItemAttribute name="Environment">TS2</ext:serviceItemAttribute><ext:serviceItemAttribute name="StartDate">2013-11-11 10:03</ext:serviceItemAttribute><ext:serviceItemAttribute name="EndDate">2013-11-28 10:03</ext:serviceItemAttribute><ext:serviceItemAttribute name="No_of_WeeksSupported"></ext:serviceItemAttribute><ext:serviceItemAttribute name="Cost"></ext:serviceItemAttribute><ext:serviceItemAttribute name="SupportNotes"></ext:serviceItemAttribute><ext:serviceItemAttribute 
name="FiscalQuarterNumber"></ext:serviceItemAttribute><ext:subscription><ext:loginID>kbasavar</ext:loginID><ext:ouname>020072748</ext:ouname></ext:subscription></ext:serviceItemData></ext:serviceitem>
I tried the regex below, but it's returning the rest of the string after the first occurrence.
(?<=Date\"\>).*(?=\<\/ext\:serviceItemAttribute\>)
Any help would be highly appreciated.
Your problem is that .* is greedy, meaning that it will grab from the first instance of Date to the last instance of </ext:ser..... Replace the .* with .*? and it will alter the behaviour to what you're after.
#(?<=Date">).*?(?=</ext:serviceItemAttribute>)#i
You should have .*? in a capture group: (.*?).
#(?<=Date">)(.*?)(?=</ext:serviceItemAttribute>)#i
You could also do it - more simply - like:
#Date">(.*?)</ext#i
Update
As was pointed out in a comment, the solution above relies on the use of non-greedy matching.
To get around this you could use the following: ([^<]*) instead of (.*?)
NOTE: This does not impact the alternatives below.
Alternatives
/(\d{4}-\d{2}-\d{2})/
/(\d{4}-\d{2}-\d{2} \d{2}:\d{2})/
The above patterns will match dates in the format YYYY-XX-XX and YYYY-XX-XX HH:MM respectively
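A Python sketch comparing the greedy, lazy, and [^<]* variants. The XML here is a shortened excerpt of the question's input (two attributes only), and the variable names are illustrative.

```python
import re

# Shortened excerpt of the XML from the question.
xml = (
    '<ext:serviceItemAttribute name="StartDate">2013-11-04 10:02'
    "</ext:serviceItemAttribute>"
    '<ext:serviceItemAttribute name="EndDate">2013-11-12 10:02'
    "</ext:serviceItemAttribute>"
)

# Greedy: .* runs from the first Date"> to the last closing tag.
greedy = re.findall(r'(?<=Date">).*(?=</ext:serviceItemAttribute>)', xml)

# Lazy: .*? stops at the nearest closing tag.
lazy = re.findall(r'(?<=Date">).*?(?=</ext:serviceItemAttribute>)', xml)

# Negated class: [^<]* cannot cross a tag boundary at all.
negated = re.findall(r'Date">([^<]*)</ext', xml)

print(len(greedy))
print(lazy)
print(negated)
```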

Matching anything else but a date

I'm using the REGEX below to effectively check if a string is a YYYY-MM-DD date.
[0-9]{4}-[0-9]{2}-[0-9]{2}
How do I do the reverse and use a similar REGEX to check a string is NOT a date in this format.
You could use a negative look-ahead to solve this:
(?![0-9]{4}-[0-9]{2}-[0-9]{2})
Edit: After reading the post by Lucasus, I've made a new regex with stricter validation.
Year can be any combination of 4 digits, which e.g. allows dates pre 1900
The month is in the range of 1-12
Days in the range of 1-31,
Validates in the format YYYY-MM-DD
New Regex:
(?!([0-9]{4})-([1-9]|0[1-9]|1[012])-([1-9]|0[1-9]|[12][0-9]|3[01]))
Your regex matches more than just valid dates (for example "3333-99-99"). You can use a longer expression:
^(19[0-9][0-9]|20[0-9][0-9])-([1-9]|0[1-9]|1[012])-([1-9]|0[1-9]|[12][0-9]|3[01])$
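If you want to match everything but the date, use a negative look-ahead, as Marcus wrote. The $ belongs inside the look-ahead and must be followed by .* so that the rest of the string is consumed; otherwise the expression can only match an empty string:
^(?!(19[0-9][0-9]|20[0-9][0-9])-([1-9]|0[1-9]|1[012])-([1-9]|0[1-9]|[12][0-9]|3[01])$).*$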
The regex is from this link
import datetime
try:
    datetime.datetime.strptime('2011-10-27', '%Y-%m-%d')
except ValueError:
    print('The string is NOT in the right format')
else:
    print('The string is in the right format')
Indeed this isn't regex, but it might perform better - may be worth benchmarking...
First, your regex doesn't ensure that the string is a date, just that it contains one. If you wanted to make sure the string contains nothing but a date (according to your formulation), you would need to anchor it:
^[0-9]{4}-[0-9]{2}-[0-9]{2}$
...and the simplest way to ensure that the string is not a date is to try to match it with that regex and negate the result. How you do that depends on the language; in C# you could do this:
if ( !Regex.IsMatch(s, "^[0-9]{4}-[0-9]{2}-[0-9]{2}$") )
If you must do the check with the regex itself (for example, if you're using a simple regex-based validation control), you can use a negative lookahead, as other responders advised:
^(?![0-9]{4}-[0-9]{2}-[0-9]{2}$).*
Starting from the beginning of the string (because of the ^), the lookahead tries to match a date followed by the end of the string ($). If that fails, the match position is reset to the beginning, and the .* goes ahead and consumes the whole string.
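A Python sketch of the anchored negative lookahead, contrasted with a bare, unanchored lookahead. The helper name and test strings are assumptions for illustration.

```python
import re

# Anchored form from the answer: reject strings that are exactly a date.
not_a_date = re.compile(r"^(?![0-9]{4}-[0-9]{2}-[0-9]{2}$).*")

# Bare lookahead with no anchors: it can succeed at any later position.
bare = re.compile(r"(?![0-9]{4}-[0-9]{2}-[0-9]{2})")


def is_not_date(text):
    """Return True if text is anything other than a YYYY-MM-DD string."""
    return not_a_date.match(text) is not None


print(is_not_date("2011-10-27"))
print(is_not_date("2011-10-27 extra"))
print(is_not_date("hello"))

# The bare lookahead fails at position 0 but succeeds (with an empty
# match) at position 1, so it "matches" even a well-formed date.
bare_match = bare.search("2011-10-27")
print(bare_match.start())
```

This is why the anchors matter: without ^ the engine is free to slide forward until the lookahead no longer sees a date.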