Prxmatch in SAS - using $ to limit results doesn't work - regex

I'm trying to use prxmatch to verify if postcode format (UK) is correct. The ('/^[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2}$/') bit covers (I think) all the possible post code formats used in UK, however I only want exact and not partial matches and no additional chars before or after match.
data pc_flag ; set abc ;
format pc_correct_flag $1. compressed_postcode $100.;
compressed_postcode = compress(postcode);
pc_regex = prxparse('/^[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2}$/');
if prxmatch(pc_regex,compressed_postcode)>0
then pc_correct_flag='Y';
else pc_correct_flag='N';run;
I was expecting 'Y' only on exact matches on full string, i.e. with no additional characters before and after regex. However, I'm also getting false positives, where a part of 'compressed_postcode' matches regex, but there are additional characters after the match, which I thought using $ would prevent.
I.e. I'd expect only something like AA11AA to match, but not AA11AAAA. I suspect this has to do with $ positioning but can't figure out exactly what's wrong. Any idea what I've missed?

SAS character variables contain trailing spaces out to the length of the variable. Either trim the value to be examined, or add \s*$ as the pattern termination.
if prxmatch(pc_regex,TRIM(compressed_postcode))>0 then …
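A second thing worth checking, shown here as a rough sketch in Python's re module rather than SAS PRX (the assumption being that both engines give ^, $ and | the same precedence for this pattern): alternation binds more loosely than the anchors, so in the original pattern ^ applies only to the first alternative and $ only to the second. Wrapping both alternatives in a non-capturing group and ending with \s*$ anchors the whole match and tolerates trailing padding. The names ORIGINAL, ANCHORED and is_match are illustrative only.

```python
import re

# Pattern as posted: ^ belongs to the first alternative only, $ to the second only.
ORIGINAL = re.compile(r'^[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2}$')

# Grouped variant: both anchors apply to both alternatives, and \s* absorbs
# the trailing blanks that a fixed-length character variable would carry.
ANCHORED = re.compile(
    r'^(?:[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2})\s*$'
)


def is_match(pattern, value):
    """Return True if the pattern is found anywhere in value."""
    return pattern.search(value) is not None


if __name__ == "__main__":
    for candidate in ["AA11AA", "AA11AAAA", "W1A1AA   "]:
        print(repr(candidate), is_match(ORIGINAL, candidate), is_match(ANCHORED, candidate))
```

With the grouped pattern, AA11AAAA is rejected because the $ now also constrains the first alternative, and a blank-padded value is still accepted.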

Your regex is quite permissive - it allows every letter of the alphabet in every valid character position, so it matches quite a lot of strings that look like valid postcodes but do not exist as such, e.g. ZZ1 1ZZ.
I provided a more specific SAS-compatible postcode regex as an answer to another question - here's a link in case this proves useful to you:
https://stackoverflow.com/a/43793562/667489
That one still matches some non-postcode strings, but it filters out any with characters on Royal Mail's blacklists for each position within the postcode.
As per Richard's answer, you need to trim the string being matched before applying the regex, or amend the regex to match extra trailing blanks.

Related

Regex Expression to extract county and zipcode

Given the below text, I want to extract county, state and zipcode i.e. BROWNSBURG, IN 46112.
With my current Regex Expression --
text = "BROWNSBURG, IN 46112 10 Other income (loss) 15 Alternative minimum tax (AMT) items"
regex = ([A-z]*[\S][\s]{1}[A-z]{2}[\d\s]+)
output = BROWNSBURG, IN 46112 10
It is extracting BROWNSBURG, IN 46112 10, I don't want this redundant 10. Can anyone please suggest the change in the above regex as it is working fine for most of the documents?
With only one example provided, I will start by assuming that the match you're looking for is always at the beginning of the line.
If so, it would be much safer to add the ^ anchor. Otherwise, you should remove it.
^[A-Z\s]+,\s[A-Z]{2}\s\d{5}
When we break down the pattern, you will see why this works:
^ asserts the beginning of the line (remove if necessary)
[A-Z\s]+ will match any letter or space that comes prior to the ,\s. The space is important in the event of counties/cities that contain more than one word.
[A-Z]{2} must match a 2-letter state code
Then finally, \d{5} will match on the 5-digit zip code.
Here is your custom view of your pattern in action.
Placing your pattern in a capturing group is unnecessary. You can simply return the full match, as it will be the same as the submatch. And while this one seems to be pretty simple, please understand that there are different implementations of Regular Expressions in different languages, so specifying the language in your question tags may prove to be useful in the future.
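Since the question does not name a language, here is a minimal sketch of the anchored pattern in Python, assuming the location always sits at the start of the text and is written in upper case as in the sample. The names CITY_STATE_ZIP and extract_location are made up for illustration.

```python
import re

# Anchored pattern from the answer: city/county words, comma, state code, 5-digit zip.
CITY_STATE_ZIP = re.compile(r'^[A-Z\s]+,\s[A-Z]{2}\s\d{5}')


def extract_location(text):
    """Return the leading 'CITY, ST 12345' part of text, or None if absent."""
    match = CITY_STATE_ZIP.match(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    sample = ("BROWNSBURG, IN 46112 10 Other income (loss) "
              "15 Alternative minimum tax (AMT) items")
    print(extract_location(sample))
```

Because \d{5} takes exactly five digits, the trailing " 10" from the sample is left out of the match.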

regex to match specific pattern of string followed by digits

Sample input:
___file___name___2000___ed2___1___2___3
DIFFERENT+FILENAME+(2000)+1+2+3+ed10
Desired output (i.e., all letter runs, 4-digit numbers, and the literal 'ed' followed immediately by a number of arbitrary length):
file name 2000 ed2
DIFFERENT FILENAME 2000 ed10
I am using:
[A-Za-z]+|[\d]{4}|ed\d+ which only returns:
file name 2000 ed
DIFFERENT FILENAME 2000 ed
I see that there is a related Q+A here: Regular Expression to match specific string followed by number?
eg using ed[0-9]* would match ed#, but unsure why it does not match in the above.
As written, the individual pieces of your regex are fine. Remember, however, that a regex tries its alternatives from left to right. Your ed\d+ is never going to match, because the ed was already consumed by your [A-Za-z]+ alternative. Reorder your regex and it'll work just fine:
ed\d+|[a-zA-Z]+|\d{4}
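To see the effect of the ordering, here is a small sketch in Python (the question does not name a language, so the re module is an assumption). The names ORIGINAL, REORDERED and tokens are illustrative.

```python
import re

# Alternatives are tried left to right at each position.
ORIGINAL = re.compile(r'[A-Za-z]+|[\d]{4}|ed\d+')   # letters win, 'ed' is consumed alone
REORDERED = re.compile(r'ed\d+|[a-zA-Z]+|\d{4}')    # 'ed' plus digits is tried first


def tokens(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    for line in ["___file___name___2000___ed2___1___2___3",
                 "DIFFERENT+FILENAME+(2000)+1+2+3+ed10"]:
        print(tokens(ORIGINAL, line), tokens(REORDERED, line))
```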
Nick's answer is right, but because in-order matching can be a less readable "gotcha", the best (order-insensitive) ways to do this kind of search are 1) with specified delimiters, and 2) by making each search term unique.
Jan's answer handles #1 well. But you would have to specify each specific delimiter, including its length (e.g. ___). It sounds like you may have some unusual delimiters, so this may not be ideal.
For #2, then, you can make each search term unique. (That is, you want the thing matching "file" and "name" to be distinct from the thing matching "2000", and both to be distinct from the thing matching "ed2".)
One way to do this is [A-Za-z]+(?![0-9a-zA-Z])|[\d]{4}|ed\d+. This says that for the first type of search term, you want an alphabetic string that is not followed by an alphanumeric character. This keeps it distinct from the third search term, which is an alphabetic string followed by some digit(s). It also allows you to specify any range of delimiters inside that negative lookahead.
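As a quick check of the order-insensitivity claim, this Python sketch (language assumed, names illustrative) runs the lookahead pattern with its alternatives in two different orders and compares the results.

```python
import re

# Same three alternatives, two different orders. The negative lookahead stops
# the letters-only alternative from accepting 'ed' when a digit follows.
LETTERS_FIRST = re.compile(r'[A-Za-z]+(?![0-9a-zA-Z])|[\d]{4}|ed\d+')
ED_FIRST = re.compile(r'ed\d+|[\d]{4}|[A-Za-z]+(?![0-9a-zA-Z])')


def tokens(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    line = "___file___name___2000___ed2___1___2___3"
    print(tokens(LETTERS_FIRST, line))
    print(tokens(ED_FIRST, line))
```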
You might very well use (just grab the first capturing group):
(?:^|___|[+(]) # delimiter before
([a-zA-Z0-9]{2,}) # the actual content
(?=$|___|[+)]) # delimiter afterwards
See a demo on regex101.com
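The commented layout above needs free-spacing mode to run. A minimal sketch in Python, assuming re.VERBOSE as the equivalent of that mode; DELIMITED and contents are illustrative names.

```python
import re

# Free-spacing version of the delimiter-based pattern; findall returns group 1.
DELIMITED = re.compile(r"""
    (?:^|___|[+(])          # delimiter before
    ([a-zA-Z0-9]{2,})       # the actual content (at least two characters)
    (?=$|___|[+)])          # delimiter afterwards (not consumed)
""", re.VERBOSE)


def contents(text):
    """Return the content captured between recognised delimiters."""
    return DELIMITED.findall(text)


if __name__ == "__main__":
    print(contents("___file___name___2000___ed2___1___2___3"))
    print(contents("DIFFERENT+FILENAME+(2000)+1+2+3+ed10"))
```

The {2,} quantifier is what drops the single-digit items, and the trailing delimiter sits in a lookahead so it can also serve as the leading delimiter of the next item.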

regex needed for parsing string

I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the FDA.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours to find a regex solution to retrieve both the delimiter and the value that follows it and, though there seem to be posts that handle this, the code found in those posts hasn't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance: delimiters are used such as "=", "=,", "/=". In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>"
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
The idea is that the first group contains the delimiters and the 2nd group the content. I assume the content can be word characters (a to z and digits with underscore). You have then to grab the content of every capturing group.
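Here is one way to read the two groups, sketched in Python (the question does not state a language, so this is an assumption; TOKEN and tokenize are illustrative names). For each match exactly one of the two groups is set, which tells you whether you hit a delimiter or a value.

```python
import re

# Group 1 holds a delimiter, group 2 holds a run of word characters.
TOKEN = re.compile(r'(=/|=>|&,1)|(\w+)')


def tokenize(text):
    """Return (kind, text) pairs in the order they appear."""
    result = []
    for match in TOKEN.finditer(text):
        if match.group(1) is not None:
            result.append(("delimiter", match.group(1)))
        else:
            result.append(("value", match.group(2)))
    return result


if __name__ == "__main__":
    for kind, token in tokenize("=/A9999XYZ=>100T0479&,1Blah"):
        print(kind, token)
```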
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
See live demo.
If your values can contain non-alphanumeric characters, it gets complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
See live demo.
Code the delimiters longest first, eg "=," before "=", otherwise the alternation, which matches left to right, will match "=" and the comma will become part of the value.
This uses a negative look ahead to stop matching past the next delimiter.
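The longest-first rule can be automated. The sketch below, in Python and with a hypothetical helper build_pattern, sorts the delimiters by length before building the same delimiter-plus-lookahead pattern; NAIVE is a deliberately mis-ordered pattern for comparison. Note that, as in the pattern above, a value needs at least two characters.

```python
import re


def build_pattern(delimiters):
    """Build the (delimiter)(value) pattern with delimiters ordered longest first."""
    ordered = sorted(delimiters, key=len, reverse=True)
    alternation = "|".join(re.escape(d) for d in ordered)
    # Each value character must not be followed by a delimiter, except the last one.
    return re.compile(rf"({alternation})((?:.(?!{alternation}))+.)")


# Short delimiter listed first: "=" wins and the comma leaks into the value.
NAIVE = re.compile(r"(=|=,)((?:.(?!=|=,))+.)")

if __name__ == "__main__":
    pattern = build_pattern(["=", "=,", "=/", "=>", "&,1"])
    print(pattern.findall("=/A9999XYZ=>100T0479&,1Blah"))
    print(NAIVE.findall("=,AB=CD"))
    print(build_pattern(["=", "=,"]).findall("=,AB=CD"))
```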

Simple regex - finding words including numbers but only on occasion

I'm really bad at regex, I have:
/(#[A-Za-z-]+)/
which finds words after the # symbol in a textbox, however I need it to ignore email addresses, like:
foo#things.com
however it finds #things
I also need it to include numbers, like:
#He2foo
however it only finds the #He part.
Help is appreciated, and if you feel like explaining regex in simple terms, that'd be great :D
/(?:^|(?<=\s))#([A-Za-z0-9]+)(?=[.?]?\s)/
#This (matched) regex ignores#this but matches on #separate tokens as well as tokens at the end of a sentence like #this. or #this? (without picking the . or the ?) And yes email#addresses.com are ignored too.
The regex while matching on # also lets you quickly access what's after it (like userid in #userid) by picking up the regex group(1). Check PHP documentation on how to work with regex groups.
You can just add 0-9 to your regex, like so:
/(#[A-Za-z0-9-]+)/
Don't think any more explanation is needed since you've been able to come this far by yourself. 0-9 is just like a-z (though numeric of course).
In order to ignore email addresses you will need to provide more specific requirements. You could try preceding # with (^| ) which basically states that your value MUST be preceded by either the start of the string (so nothing really, though at the start) or a space.
Extending this you can also use ($| ) on the end to require the value to be followed by the end of the string or a space (which means there's no period allowed, which is a requirement for a valid email address).
Update
$subject = "#a #b a#b a# #b";
preg_match_all("/(^| )#[A-Za-z0-9-]+/", $subject, $matches);
print_r($matches[0]);
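For comparison, a sketch of the same idea in Python with both the leading and the trailing condition applied. This variant is an assumption rather than the code above: it puts the trailing condition in a lookahead so that the space between two adjacent tags is not used up, and it captures only the text after the #. HASHTAG and find_tags are illustrative names.

```python
import re

# Tag must start the string or follow a space, and must be followed by a space
# or the end of the string. The lookahead leaves that trailing space in place.
HASHTAG = re.compile(r'(?:^| )#([A-Za-z0-9-]+)(?=$| )')


def find_tags(text):
    """Return the tag names (without '#') found in text."""
    return HASHTAG.findall(text)


if __name__ == "__main__":
    print(find_tags("#a #b a#b a# #b"))
    print(find_tags("mail foo#things.com or #He2foo"))
```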

Regex to extract date with negative lookahead

I am using this pattern to extract confirmation dates from a text file and converting them to a date object (see my post here Extract/convert date from string in MS Access).
The current pattern matches all strings that look like a date, but may not be the confirmation date (which is always preceded by Confirmed by), and moreover, may not have complete date information (e.g. no AM or PM).
Pattern: (\d+/\d+/\d+\s+\d+:\d+:\d+\s+\w+|\d+-\w+-\d+\s+\d+:\d+:\d+)
Sample text:
WHEN COMPARED WITH RESULT OF 7/13/12 09:06:42 NO SIGNIFICANT
CHANGE; Confirmed by SMITH, MD, JOHN (2242) on 7/14/2012 3:46:21 PM;
The above pattern matches the following:
WHEN COMPARED WITH RESULT OF 7/13/12 09:06:42 NO SIGNIFICANT
                             ^^^^^^^^^^^^^^^^^^^
CHANGE; Confirmed by SMITH, MD, JOHN (2242) on 7/14/2012 3:46:21 PM;
                                               ^^^^^^^^^^^^^^^^^^^^
I want the pattern to look for the date in the segment of the text file that begins with Confirmed by and ends with a semi-colon. Also, in order to properly convert the time, the pattern should match only AM or PM at the end. How can I restrict the pattern to this segment and add the additional AM or PM criteria?
Can anyone help?
In order to match the end of the string, use $ at the end of your regex. To match the entire phrase "Confirmed by <someone> on <date>", use plain text (remember that plain text can be used in a regex as well -- if you aren't using special characters, the matcher will match your query verbatim). You need to use a negative look-ahead to exclude entire words. So maybe something like this:
Confirmed by (?!\ on\ )(\d+/\d+/\d+\s+\d+:\d+:\d+\s+\w+|\d+-\w+-\d+\s+\d+:\d+:\d+)$
Which will allow you to match a string that starts with "Confirmed by", followed by anything except for " on ", followed by the date that you capture, and the end of the string.
Edit: the negative look-ahead part is tricky, look at the answer below for more reference:
A regular expression to exclude a word/string
I don't see any need for a lookahead here, positive or negative. This works correctly on your sample string:
Confirmed by [^;]*(\d+/\d+/\d+\s+\d+:\d+:\d+(?:\s+(?:AM|PM))?|\d+-\w+-\d+\s+\d+:\d+:\d+);
The [^;]* effectively corrals the match between a Confirmed by sequence and its closing semicolon. (I'm assuming the semicolon will always be present.)
(?:\s+(?:AM|PM))? makes the AM/PM optional, along with its leading whitespace.
The actual date will be stored in capturing group #1.
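One caveat may be worth testing before relying on this pattern. Because [^;]* is greedy, the engine backs off from the semicolon and accepts the rightmost position where a date can start, so a two-digit month can lose its first digit. The sketch below shows this in Python's re module; that the engine called from MS Access backtracks the same way is an assumption, and GREEDY, LAZY and DATE are illustrative names. Making the quantifier lazy ([^;]*?) avoids the problem.

```python
import re

DATE = r'(\d+/\d+/\d+\s+\d+:\d+:\d+(?:\s+(?:AM|PM))?|\d+-\w+-\d+\s+\d+:\d+:\d+)'

GREEDY = re.compile(r'Confirmed by [^;]*' + DATE + ';')   # pattern as given above
LAZY = re.compile(r'Confirmed by [^;]*?' + DATE + ';')    # same, with a lazy quantifier


def confirmed_date(pattern, text):
    """Return the captured date string, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    one_digit = "CHANGE; Confirmed by SMITH, MD, JOHN (2242) on 7/14/2012 3:46:21 PM;"
    two_digit = "Confirmed by SMITH on 12/14/2012 3:46:21 PM;"
    for text in (one_digit, two_digit):
        print(confirmed_date(GREEDY, text), "|", confirmed_date(LAZY, text))
```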
Try this:
(\d+/\d+/\d+\s+\d+:\d+:\d+\s+(?:AM|PM));
The simplest answer is more often than not a good enough solution. By turning off the default greedy behavior (using the question mark: .*?), the regular expression will instead try to find the shortest match that fits the pattern. A pattern never matches the same part of the string more than once, which means that each Confirmed by can only be coupled with one date, which in this case is the next one to follow.
Confirmed by.*?(\d+/\d+/\d+\s+\d+:\d+:\d+\s+(?:AM|PM));
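A short sketch of how the lazy pattern pairs each Confirmed by with the next AM/PM date, written in Python as an assumption (the original target is MS Access). The second record in the test text is invented for illustration, as are the names CONFIRMED and confirmation_dates.

```python
import re

# Lazy .*? stops at the first AM/PM date that follows each 'Confirmed by'.
CONFIRMED = re.compile(r'Confirmed by.*?(\d+/\d+/\d+\s+\d+:\d+:\d+\s+(?:AM|PM));')


def confirmation_dates(text):
    """Return every confirmation date string found in text."""
    return CONFIRMED.findall(text)


if __name__ == "__main__":
    text = (
        "WHEN COMPARED WITH RESULT OF 7/13/12 09:06:42 NO SIGNIFICANT "
        "CHANGE; Confirmed by SMITH, MD, JOHN (2242) on 7/14/2012 3:46:21 PM; "
        "Confirmed by DOE, MD, JANE (1111) on 11/02/2012 10:05:00 AM;"  # invented record
    )
    print(confirmation_dates(text))
```

The earlier 7/13/12 timestamp is skipped because it neither follows Confirmed by nor ends in AM or PM.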