Implementing a negative lookbehind using PRX syntax in SAS - regex

I've begun using PRX code within SAS to identify free text phrases entered in a database I'm using. A typical phrase I'm identifying is: 'positive modified hodge test' or 'positive for modified hodge test'. These phrases are embedded within large strings of text at times. What I don't want to flag are phrases that say 'previous positive hodge test'. I've read some documentation to implement a negative lookbehind to NOT flag phrases that include "previous" but it's not doing what I had anticipated.
if prxmatch("/pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i") > 0 then hodge_id = 1;
The PRX code above will match all phrases below:
"positive modified hodge"
"previous positive hodge test"
"confirmed positive hodge carbapenemase"
"positive for modified hodge test"
"positive by the modified hodge"
if prxmatch("/pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i") > 0 then hodge_id = 1; /* Without lookbehind */
if prxmatch("/(?<!previous)\s*pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i") > 0 then hodge_id = 1; /* With lookbehind */
Using the negative lookbehind, I expect to flag:
"positive modified hodge"
"confirmed positive hodge carbapenemase"
"positive for modified hodge test"
"positive by the modified hodge"
but not:
"previous positive hodge test"
What happens is that it omits the phrase including "previous" but also the first phrase "positive modified hodge".
My PRX is in the beginning stages, so any advice in cleaning/simplifying it is appreciated.

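You were pretty close. The lookbehind has to account for the space between "previous" and "positive". With (?<!previous)\s*, the \s* is allowed to match zero characters, so the lookbehind can be tested immediately before "pos", where the preceding text ends in a space rather than in the word "previous".
/*
you need to have
(?<!previous\s) or (?<!previous)\s
instead of (?<!previous)\s*
(the second form requires whitespace before "pos", so it will not
match a phrase at the very start of the string)
*/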
data have;
length string $200.;
infile datalines;
input string & $ ;
datalines;
this is cool and nice positive modified hodge wow so nice
this is wow confirmed positive hodge carbapenemase
now this positive for modified hodge test and later
cool is my name positive by the modified hodge hello
wow and wow previous positive hodge test
Mr cool
;
data want;
set have;
if _N_ = 1 then
do;
retain patternID;
pattern = "/(?<!previous\s)pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i";
patternID = prxparse(pattern);
end;
if prxmatch(patternID,string) > 0 then
hodge_id = 1;
else hodge_id =0;
drop pattern patternid;
run;
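To see why the placement of the whitespace matters, the two patterns can be compared side by side. The sketch below uses Python's re module as a stand-in for SAS PRX, on the assumption that both treat these particular lookbehinds the same way; the names LOOSE, FIXED, and phrases are made up for illustration.

```python
import re

# Pattern from the question: \s* may match zero characters, so the lookbehind
# can be tested right before "pos", where the preceding text ends in a space.
LOOSE = re.compile(r"(?<!previous)\s*pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)", re.I)

# Pattern from the answer: the whitespace is part of the lookbehind itself.
FIXED = re.compile(r"(?<!previous\s)pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)", re.I)

phrases = [
    "positive modified hodge",
    "previous positive hodge test",
    "confirmed positive hodge carbapenemase",
    "positive for modified hodge test",
    "positive by the modified hodge",
]

# Map each phrase to (matched by LOOSE, matched by FIXED).
flags = {p: (bool(LOOSE.search(p)), bool(FIXED.search(p))) for p in phrases}
for phrase, (loose, fixed) in flags.items():
    print(f"{phrase!r}: loose={loose} fixed={fixed}")
```

In this sketch only the "previous positive hodge test" phrase differs between the two patterns: the loose version still flags it, the fixed version does not.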

Can regex queries be combined in SAS?

I've successfully implemented a negative lookbehind in my regex code in SAS. However, there are multiple words that, when they precede the string I'm looking for, should negate it. Specifically, I'm looking for phrases (from medical notes) such as "carbapenmase producing" or "carbapenamase confirmed". At times these phrases are preceded by a qualifier, as in "not carbapenemase producing" or "possible carbapenamase producing", and those I don't want. Having learned that the alternatives inside a negative lookbehind must all be the same length, I need to create separate regular expressions to capture "not" and "possible", as in:
*!!! Create template to identify key phrases in the comment/note;
retain carba1 carba2 carba3;
if _n_ = 1 then do; /*probable*/
carba1 = prxparse("/(?<!not\s)ca[bepr]\w*?\s*?(conf|posi|prod|\+)/i");
carba2 = prxparse("/(?<!possible\s|probable\s)ca[bepr]\w*?\s*?(conf|posi|prod|\+)/i");
carba3 = prxparse("/(?<!not a\s)ca[bepr]\w*?\s*?(conf|posi|prod|\+)/i");
end;
if prxmatch(carba1,as_comments) > 0 or prxmatch(carba2,as_comments) > 0 or prxmatch(carba3,as_comments) > 0;
Is there a workaround for this that would shorten execution time, or am I stuck with this? Any advice/comments are appreciated.
If there are just 4 scenarios and they are straightforward, you can do this simply by using contains and not contains.
data have;
length string $200.;
infile datalines;
input string & $ ;
datalines;
this is cool and carbapenmase producing or wow so nice
this is wow confirmed carbapenamase confirmed hello
now this positive for modified hodge test and later
cool is my name not carbapenemase producing" or "the modified hodge hello
wow and wow previous possible carbapenamase producing hello
Mr cool is hello
;
data want;
set have;
where (string contains "carbapenmase producing" or
string contains "carbapenamase confirmed")
and not (string contains "not carbapenemase producing" or
string contains "possible carbapenamase producing");
run;
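Another option, if the regex route is preferred, is to chain several negative lookbehinds in one pattern. Each lookbehind then has its own fixed width, so the excluded words do not need to be the same length. Note also that OR-ing the separate patterns from the question appears to let "not carbapenemase producing" through, because the patterns that only exclude "possible" or "not a" still match it. The sketch below uses Python's re module as a stand-in for PRXPARSE/PRXMATCH (an assumption, not tested in SAS); CARBA, comments, and kept are illustrative names.

```python
import re

# Each negative lookbehind has its own fixed width, so qualifiers of
# different lengths can all be excluded by a single compiled pattern.
CARBA = re.compile(
    r"(?<!not\s)(?<!not a\s)(?<!possible\s)(?<!probable\s)"
    r"ca[bepr]\w*?\s*?(conf|posi|prod|\+)",
    re.I,
)

comments = [
    "this is cool and carbapenmase producing or wow so nice",
    "this is wow confirmed carbapenamase confirmed hello",
    "now this positive for modified hodge test and later",
    "cool is my name not carbapenemase producing or the modified hodge hello",
    "wow and wow previous possible carbapenamase producing hello",
    "Mr cool is hello",
]

# Keep only the comments where the phrase appears without a negating word.
kept = [c for c in comments if CARBA.search(c)]
for c in kept:
    print(c)
```

With a single pattern there is one match attempt per comment instead of three, which is the part most likely to help execution time.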

sas, regex, numbers, substring, prxchange

I need help with the below code. I do not see how this is extracting the number from this address line text. When it (the pattern) says s/\D/ / I thought this replaces the digits with a space. I know the second part here is taking the substring up to the first space in the address line text. But, then I do not see how this is extracting the numbers. I pulled up the data set and it looks like this does work. Please help me understand how this is working.
DATA OUT.REQ_1_2_03;
SET OUT.REQ_1_2_02;
/* GET STREET NUMBER*/
PRE_RCV_ST_NB=PRXCHANGE('s/\D/ /',-1,SUBSTR(PRE_RCV_ADDRESSS_LINE_1,1,PRXMATCH('/\s/',PRE_RCV_ADDRESSS_LINE_1)));
POST_RCV_ST_NB=PRXCHANGE('s/\D/ /',-1,SUBSTR(POST_RCV_ADDRESSS_LINE_1,1,PRXMATCH('/\s/',POST_RCV_ADDRESSS_LINE_1)));
PRE_HOST_ST_NB=PRXCHANGE('s/\D/ /',-1,SUBSTR(PRE_HOST_ADDR_LINE_1,1,PRXMATCH('/\s/',PRE_HOST_ADDR_LINE_1)));
POST_HOST_ST_NB=PRXCHANGE('s/\D/ /',-1,SUBSTR(POST_HOST_ADDR_LINE_1,1,PRXMATCH('/\s/',POST_HOST_ADDR_LINE_1)));
RUN;
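Try to understand it using an example:
PRE_RCV_ADDRESSS_LINE_1 = "123hello Village st"
Read the expression from the inside out.
First, PRXMATCH finds the position of the first whitespace character (\s), which is the space right after 123hello.
SUBSTR then takes everything up to and including that position, which gives "123hello " (with the trailing space).
Finally, PRXCHANGE replaces every \D, that is, every character that is not a digit, so only the digits 123 are left. Your pattern s/\D/ / replaces each non-digit with a space, so the result is 123 followed by blanks; s/\D// (used in the step below) removes the non-digits entirely.
To sum it up by example:
"123hello Village st" -- PRXMATCH finds the first space (\s), and SUBSTR up to that space gives "123hello "
"123hello " -- PRXCHANGE replaces everything other than a digit (\D), leaving "123"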
/* run this step to understand it better*/
data want ;
PRE_RCV_ADDRESSS_LINE_1 = "123hello Village st";
test1= SUBSTR(PRE_RCV_ADDRESSS_LINE_1,1,PRXMATCH('/\s/',PRE_RCV_ADDRESSS_LINE_1));
PRE_RCV_ST_NB= PRXCHANGE('s/\D//',-1,SUBSTR(PRE_RCV_ADDRESSS_LINE_1,1,PRXMATCH('/\s/',PRE_RCV_ADDRESSS_LINE_1)));
run;
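The same three steps can be traced outside SAS. The following is a rough Python equivalent, not the SAS functions themselves: re.search stands in for PRXMATCH, slicing for SUBSTR, and re.sub for PRXCHANGE with a repeat count of -1. The variable names are invented for the example.

```python
import re

address = "123hello Village st"  # example value from the answer

# PRXMATCH('/\s/', ...) returns the 1-based position of the first whitespace.
first_space = re.search(r"\s", address).start() + 1

# SUBSTR(x, 1, n) keeps the first n characters, so the space itself is included.
head = address[:first_space]

# PRXCHANGE(..., -1, ...) replaces every match in the string.
padded = re.sub(r"\D", " ", head)  # the question's s/\D/ / : non-digits become spaces
digits = re.sub(r"\D", "", head)   # the answer's s/\D//   : non-digits are removed

print(first_space, repr(head), repr(padded), repr(digits))
```

Either substitution leaves the same visible street number; the version that replaces with a space only leaves extra blanks after it.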

Retrieving certain numbers with regex

I need help getting certain numbers with a regex.
As I don't know much about regex, I have only managed to check whether the first two characters are in the range 95 - 99: ^([0-9][5-9]{4})
I have the numbers 00000 through 99999.
I want to exclude all the numbers that start with 95 and up.
So 00000 - 94999 is ok, 95000 - 99999 is not ok.
To match the range 00000 - 94999, including leading zeroes, you might try it like this:
^ From the beginning of the string
(?=\d{5}$) A positive lookahead that makes sure the string consists of exactly 5 digits up to the end of the string
0* Match zero or more leading zeroes
(?:9[0-4][0-9]{3}|[1-8][0-9]{4}|[1-9][0-9]{1,3}|[0-9]) Match the range of numbers
$ The end of the string
Your regex could look like:
^(?=\d{5}$)0*(?:9[0-4][0-9]{3}|[1-8][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$
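A pattern like this is easy to get subtly wrong, so it can be worth checking it against every value in the range. The sketch below does that in Python. SHORT is a hypothetical shorter alternative that is not from the answer (first digit 0-8, or 9 followed by 0-4), and disagreements is an illustrative helper name.

```python
import re

# Pattern from the answer above.
ANSWER = re.compile(
    r"^(?=\d{5}$)0*(?:9[0-4][0-9]{3}|[1-8][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$"
)

# Hypothetical shorter alternative: first digit 0-8, or 9 followed by 0-4.
SHORT = re.compile(r"^(?:[0-8][0-9]{4}|9[0-4][0-9]{3})$")


def disagreements(pattern):
    """Return the 5-digit strings where the pattern disagrees with n < 95000."""
    bad = []
    for n in range(100000):
        text = f"{n:05d}"
        if bool(pattern.match(text)) != (n < 95000):
            bad.append(text)
    return bad


print(disagreements(ANSWER))
print(disagreements(SHORT))
```

An empty list for a pattern means it accepts exactly 00000 through 94999 among the five-digit strings.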
A regex is for validating if a string matches the set pattern. It is not for comparing numbers to see if they are within range. Convert the text (if that is what you are starting from) to a number and then use comparison operators in an if statement.
Regex is not well suited to performing numeric comparisons, from both a readability and performance standpoint. It would be much more sensible to extract the number and perform a numeric compare afterwards.
You've not mentioned a language, so I'll demonstrate with Python3.
import re

# input data
lines = [
    "my line #1 :94995: message #1",
    "my line #2 :95005: message #2"
]
for i, line in enumerate(lines):
    # extract a 5-digit number wrapped in colons
    match = re.search(':([0-9]{5}):', line)
    if match is None:
        continue
    # convert to a number, and verify
    num = int(match.group(1))
    if num >= 95000:
        continue
    # print any lines that meet our criteria
    print("line %d meets our criteria! (%d)" % ( i, num ))
Will output:
line 0 meets our criteria! (94995)

Replacing any single digit in string with leading 0 in SAS

I have a variable with the values as t14-1-1, t14-1-1A, t14-2-1-1, t14-2-4-15A, etc as mentioned in the cards statement below.
What I need is to pad any single digit in the string with a leading 0, as we do with the SAS format z2.
data test01;
input have $40.;
want02=prxchange('s/(^|-)\d($|-)*/\10\2/',-1,strip(have));
want03=prxchange('s/(^|-)\d($|-)*(.+)/\10\2/',-1,strip(have));
cards;
t14-1-1
t14-1-1A
t14-2-1-1
t14-2-1-1A
t14-2-4-15A
t14-2-4-15B
t14-2-4-16
t14-2-4-17
t14-2-4-17A
t14-2-4-17B
l16-2-9-1-1
l16-2-9-2-1
l16-2-9-2-2
;
run;
What I need is the following:
t14-01-01
t14-01-01A
t14-02-01-01
t14-02-01-01A
t14-02-04-15A
t14-02-04-15B
t14-02-04-16
t14-02-04-17
t14-02-04-17A
t14-02-04-17B
l16-02-09-01-01
l16-02-09-02-01
l16-02-09-02-02
I know I can do this with an array and the scan, length, and tranwrd functions. I was just wondering if this can be done through prxchange (regular expression) in a few steps with less complexity.
I have tried a lot of different permutations and combinations with no luck.
Thanks for the help in advance!
I don't know if the SAS regex flavour supports lookaround, but if it does, this should do the job:
search: (?<=-)(\d)(?!\d)
replace: 0$1
Where:
(?<=-) is a lookbehind that makes sure there is a dash before
(\d) is a single digit captured in group 1
(?!\d) is a negative lookahead that makes sure no digit follows
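The search and replace pair above can be tried on the sample values from the question. This is a Python sketch rather than SAS, so the replacement is written 0\1 instead of 0$1; pad_single_digits and samples are names made up for the example.

```python
import re


def pad_single_digits(value: str) -> str:
    """Prefix a 0 to every digit that follows a dash and has no digit after it."""
    return re.sub(r"(?<=-)(\d)(?!\d)", r"0\1", value)


samples = ["t14-1-1", "t14-1-1A", "t14-2-1-1", "t14-2-4-15A", "t14-2-4-17B", "l16-2-9-1-1"]
for s in samples:
    print(s, "->", pad_single_digits(s))
```

In PRXCHANGE the same idea would presumably be written as a substitution, for example 's/(?<=-)(\d)(?!\d)/0$1/' with -1 as the repeat count; that form is not tested here.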

How to modify string to a character value in which each character of the string is separated by a comma?

I came across this question this morning and I am still trying to figure out how it can be done. The following dataset has a character variable CAT.
CAT
A
AB
B
ABCD
CB
.
.
.
and so on.
We need to write a SAS program that inserts a comma between each character of the string if the length of the string is more than 1. I used the length() function and a do loop to create different variables, and it just got messy. How do I tackle this?
Regular expression solution:
data have;
input CAT $;
datalines;
A
AB
B
ABCD
CB
;;;;
run;
data want;
set have;
cat_c = prxchange('s/(?<=[[:alpha:]])([[:alpha:]])/,$1/io',-1,CAT);
put cat_c=;
run;
The first parenthetical group is a look-behind for an alpha character; then the captured alpha character. Then replace with comma and character. If you want something other than [[:alpha:]] (ie, A-Z) then supply that as a class.
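The lookbehind-plus-capture idea can be checked quickly outside SAS. The sketch below is in Python, whose re module does not support the [[:alpha:]] POSIX class, so [A-Za-z] is used instead (an assumption that the data is plain ASCII letters); comma_separate is an illustrative name.

```python
import re


def comma_separate(value: str) -> str:
    # A letter that has another letter directly before it gets a comma in front.
    return re.sub(r"(?<=[A-Za-z])([A-Za-z])", r",\1", value)


for cat in ["A", "AB", "B", "ABCD", "CB"]:
    print(cat, "->", comma_separate(cat))
```

Because the lookbehind inspects the original text rather than the text already replaced, every letter after the first is prefixed with a comma and single-character values are left unchanged.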
The solution using length and do loop isn't bad, honestly, if you want something that is more readable to novice programmers. Just use SUBSTR left of the equal sign.
data want2;
set have;
length cat_c $15; /* room for 8 characters plus 7 commas */
if length(cat) > 1 then
do _t = 1 to length(cat)-1;
substr(cat_c,2*_t-1,2)=substr(cat,_t,1)||',';
end;
substr(cat_c,2*length(cat)-1,1)=substr(cat,length(cat),1);
put cat_c=;
run;