I've successfully implemented a negative lookbehind in my regex code in SAS. However, several different 'words' can negate the string I'm looking for. Specifically, I'm looking for phrases (from medical notes) such as "carbapenmase producing" or "carbapenamase confirmed"; at times these are preceded by a qualifier, as in "not carbapenemase producing" or "possible carbapenamase producing", and those I don't want. Having learned that the alternatives inside a negative lookbehind must all be the same length, I need to create separate regular expressions to cover "not" and "possible", as in:
*!!! Create template to identify key phrases in the comment/note;
retain carba1 carba2 carba3;
if _n_ = 1 then do; /*probable*/
carba1 = prxparse("/(?<!not\s)ca[bepr]\w*?\s*?(conf|posi|prod|\+)/i");
/* group the alternatives so that \s applies to both and both branches have the same length */
carba2 = prxparse("/(?<!(?:possible|probable)\s)ca[bepr]\w*?\s*?(conf|posi|prod|\+)/i");
carba3 = prxparse("/(?<!not a\s)ca[bepr]\w*?\s*?(conf|posi|prod|\+)/i");
end;
/* AND rather than OR: with OR, "not carbapenemase producing" fails carba1 but still passes carba2 */
if prxmatch(carba1,as_comments) > 0 and prxmatch(carba2,as_comments) > 0 and
prxmatch(carba3,as_comments) > 0;
Is there a work-around for this that would shorten execution time, or am I stuck with this? Any advice/comments are appreciated.
If there are just four scenarios and they are straightforward, you can do this simply with contains and not contains.
data have;
length string $200.;
infile datalines;
input string & $ ;
datalines;
this is cool and carbapenmase producing or wow so nice
this is wow confirmed carbapenamase confirmed hello
now this positive for modified hodge test and later
cool is my name not carbapenemase producing" or "the modified hodge hello
wow and wow previous possible carbapenamase producing hello
Mr cool is hello
;
data want;
set have;
where (string contains "carbapenmase producing" or
string contains "carbapenamase confirmed")
and not (string contains "not carbapenemase producing" or
string contains "possible carbapenamase producing");
run;
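If a single regular expression is still preferred, one possible sketch is to chain several negative lookbehinds in front of the same match. Each lookbehind is fixed-length on its own, so the qualifiers may differ in length from one lookbehind to the next. The dataset and variable names below are illustrative, and the list of qualifiers is taken from the question rather than from any exhaustive set.

```sas
data have;
  length string $200;
  infile datalines truncover;
  input string $200.;
  datalines;
this is cool and carbapenmase producing or wow so nice
this is wow confirmed carbapenamase confirmed hello
cool is my name not carbapenemase producing hello
wow and wow previous possible carbapenamase producing hello
Mr cool is hello
;

data want;
  set have;
  retain carba;
  /* compile once; every lookbehind must pass at the position where "ca..." starts */
  if _n_ = 1 then carba = prxparse(
    "/(?<!not\s)(?<!not a\s)(?<!possible\s)(?<!probable\s)ca[bepr]\w*?\s*?(conf|posi|prod|\+)/i");
  /* subsetting IF: keep only rows where the unqualified phrase is found */
  if prxmatch(carba, string) > 0;
  drop carba;
run;
```

With this sample data only the first two rows should be kept. Because there is one compiled pattern and one prxmatch call per row, this may also be somewhat cheaper than testing three separate patterns.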
Using SAS, I have a table of sentences and want to find the rows where a keyword occurs in the sentence, using fuzzy matching (the complev function). Is there a way in SAS to find the keyword string within the sentences? I know how to use complev, but I can only use it to compare complete strings, not a string that is part of a larger string. For this example table the keyword would be 'example', and the result of the comparison would be in the column Result.
Thanks for your ideas!
Sentence : Result
This is an Example sentence : 1
Here is another one : 0
Also an exmple : 1
The examples keep coming : 1
No worries : 0
See if you can use this as a template. I compare the Complev value to three, but you can set it to any fitting value.
data have;
input string $ 1-25;
datalines;
Example sentence
Here is another one
Also an exmple
The examples keep coming
No worries
;
data want;
set have;
result = 0;
do _N_ = 1 to countw(string);
if complev('example', scan(string, _N_)) < 3 then do;
result=1; leave;
end;
end;
run;
EDIT: Use complev('example', scan(string, _N_), 'i') if you want the comparison to be case insensitive.
Not sure if this possible in SAS; although I'm slowly learning pretty much anything is possible in SAS...
I have a data-set of 600 patients and within that data-set I have a comment variable. The comment variable contains a few sentences each patient stated about his/her care. So for example, the data set looks like this:
ID Comment
1 Today we have great service. everyone was really nice.
2 The customer service team did not know what they were talking about and was rude.
3 Everyone was very helpful 5 stars.
4 Not very helpful at all.
5 Staff was nice.
6 All the people was really nice.
Let's say I identify a number of key words I'm interested in, for example nice, rude and helpful.
Is there a way to pull the 2 strings that come before each of these words and produce a frequency table?
WORD Frequency
Was Really Nice 2
And Was Rude 1
Was Very Helpful 1
Not very helpful 1
I already have code that helps me identify the key words; it counts the frequency of each word within the comment variable.
data PG_2 / view=PG_2;
length word $20;
set PG_1;
do i = 1 by 1 until(missing(word));
word = upcase(scan(COMMENT, i));
if not missing(word) then output;
end;
keep word;
run;
proc freq data=PG_2 order=freq;
table word / out=wordfreq(drop=percent);
run;
Have you looked at the Perl regular expression (PRX) functions in SAS? I think they might solve your issue.
You can use RegEx capture groups to pull out the two words directly before your keyword using prxparse and prxposn. The code below should grab any two words before the word nice in the comment variable and store them in the firstTwoStrings variable.
data firstTwoStrings;
length firstTwoStrings $200;
retain re;
if _N_ = 1 then
re = prxparse('/(\w+ \w+) nice/'); /*change 'nice' to your desired keyword*/
set comments;
if prxmatch(re, COMMENT) then
do;
firstTwoStrings = prxposn(re, 1, COMMENT);
end;
run;
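To get from there to the frequency table the question asks for, a rough sketch is to put all keywords into one alternation, walk through every match with call prxnext, and feed the result to proc freq. It assumes the same comments dataset with a COMMENT variable as above; the output names phrases, phrase and phrasefreq are made up for illustration.

```sas
data phrases;
  set comments;                 /* assumed: same input dataset as in the answer above */
  length phrase $60;
  retain re;
  /* two words, then one of the keywords; /i makes the match case insensitive */
  if _n_ = 1 then re = prxparse('/(\w+\s+\w+)\s+(nice|rude|helpful)\b/i');
  start = 1;
  stop  = length(COMMENT);
  call prxnext(re, start, stop, COMMENT, pos, len);
  do while (pos > 0);
    /* upcase so that "Was really nice" and "was really nice" are counted together */
    phrase = upcase(substr(COMMENT, pos, len));
    output;
    call prxnext(re, start, stop, COMMENT, pos, len);
  end;
  keep phrase;
run;

proc freq data=phrases order=freq;
  tables phrase / out=phrasefreq(drop=percent);
run;
```

A comment with several keywords contributes one row per match, which is why the loop uses prxnext rather than a single prxmatch.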
I've begun using PRX code within SAS to identify free text phrases entered in a database I'm using. A typical phrase I'm identifying is: 'positive modified hodge test' or 'positive for modified hodge test'. These phrases are embedded within large strings of text at times. What I don't want to flag are phrases that say 'previous positive hodge test'. I've read some documentation to implement a negative lookbehind to NOT flag phrases that include "previous" but it's not doing what I had anticipated.
if prxmatch("/pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i") > 0 then hodge_id = 1;
The PRX code above will match all phrases below:
"positive modified hodge"
"previous positive hodge test"
"confirmed positive hodge carbapenemase"
"positive for modified hodge test"
"positive by the modified hodge"
if prxmatch("/pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i") > 0 then
hodge_id = 1; /* Without lookbehind */
if prxmatch("/(?<!previous)\s*pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i") > 0 then hodge_id = 1; /* With lookbehind */
Using the negative lookbehind, I expect to flag:
"positive modified hodge"
"confirmed positive hodge carbapenemase"
"positive for modified hodge test"
"positive by the modified hodge"
but not:
"previous positive hodge test"
What happens is that it omits the phrase including "previous" but also the first phrase "positive modified hodge".
My PRX is in the beginning stages, so any advice in cleaning/simplifying it is appreciated.
You were pretty close.
/*
you need to have
(?<!previous\s) or (?<!previous)\s
instead of (?<!previous)\s*
*/
data have;
length string $200.;
infile datalines;
input string & $ ;
datalines;
this is cool and nice positive modified hodge wow so nice
this is wow confirmed positive hodge carbapenemase
now this positive for modified hodge test and later
cool is my name positive by the modified hodge hello
wow and wow previous positive hodge test
Mr cool
;
data want;
set have;
if _N_ = 1 then
do;
retain patternID;
pattern = "/(?<!previous\s)pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i";
patternID = prxparse(pattern);
end;
if prxmatch(patternID,string) > 0 then
hodge_id = 1;
else hodge_id =0;
drop pattern patternid;
run;
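To see why the position of \s matters, it can help to run both variants against the "previous" sentence alone. This is only a small sketch; the variable names loose and strict are made up. With \s* outside the lookbehind, the engine is free to let \s* match nothing and start at the "p" of "positive", where the eight preceding characters are "revious " rather than "previous", so the lookbehind no longer blocks the match.

```sas
data _null_;
  length string $60;
  string = "wow and wow previous positive hodge test";
  /* \s* outside the lookbehind: the lookbehind is checked against the wrong eight characters */
  loose  = prxmatch("/(?<!previous)\s*pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i", string);
  /* \s inside the lookbehind: the nine characters before "pos" must not be "previous " */
  strict = prxmatch("/(?<!previous\s)pos\w+ (for)?(by)?\s?(the)?\s?(modi|hod|mht)/i", string);
  put loose= strict=;
run;
```

The log should show a nonzero position for loose and 0 for strict, so only the second pattern leaves this sentence unflagged.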
I have a table that looks similar to this:
A | B
1234|A1B2C
1124|$1n7
1342|*6675
1189|966
I need to create a column C where it takes the data from column B and replaces all non numeric characters with a "9" and makes each one 5 characters long by adding 0's to the front. It should come out like this:
91929
09197
96675
00966
Any assistance would be much appreciated, Thank you!
Edit: Sorry, this is my first time posting on a forum like this and I got a bit ahead of myself. I created the table using SQL to pull data from 3 other tables; I am more familiar with SQL than with SAS, which I have only been using for a few weeks. I tried COMPRESS, but as I read more about it, it seems it only removes values. So I tried TRANWRD, but from what I could figure out I would need an entry for each letter and symbol that could appear, i.e.:
data Work.temp;
str = b;
Alpha=tranwrd(str, "a", "9");
Alpha=tranwrd(str, "b", "9");
put Alpha;
run;
So then I researched some more and found "SAS replace character in ALL columns".
Based on that I used this code:
data temp;
set work.temp;
array vars [*] _character_;
do i = 1 to dim(vars);
vars[i] = compress(tranwrd(vars[i],"a","9"));
end;
drop i;
run;
That just returns:
|Str|B|Alpha|
|---.|-.|.-------|
(sorry about the bad formatting there, spent 30 min trying to figure out how to make the table look right with spaces but kept coming out wrong. Please imagine the -'s are spaces)
Again, any help would be appreciated. Thank you!
Try this.
data test;
input var1 $5.;
datalines;
A1B2C
$1n7
*6675
966
;
run;
data test1;
set test;
length var2 $5.;
regex = prxparse ("s/[^0-9\s]/9/"); /*substitutes every character that is neither a digit nor whitespace; a | inside [] would be a literal pipe, so it is left out*/
var2 = prxchange (regex, -1, var1); /*use this function to substitute all instances of the pattern*/
var3 = put (input (var2, best5.), z5.); /*use input and put to pad the front of the variable with 0s*/
run;
Good luck.
Keeping only the digits is simple. Use the modifiers on the COMPRESS() function.
c=compress(b,,'kd');
There are a number of ways to pad on the left with zeros.
You could convert the digits to a number and then write it back to a string using the Z format.
c=put(input(c,??5.),Z5.);
You could add the zeros yourself, using an IF statement:
if length(c) < 5 then c=repeat('0',5-length(c)-1)||c ;
Or using the SUBSTRN() function.
c=substrn('00000',1,5-length(c))||c;
Or have some fun with the REVERSE() function.
c=reverse(substr(reverse(cats('00000',c)),1,5));
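Note that COMPRESS with 'kd' drops the non-digits, whereas the expected output in the question turns each of them into a 9. A minimal sketch that combines the replacement with the Z format padding might look like this; the dataset names have and want are assumed, and B is taken to hold at most 5 characters.

```sas
data have;
  input a $ b $;
  datalines;
1234 A1B2C
1124 $1n7
1342 *6675
1189 966
;

data want;
  set have;
  length c $5;
  /* replace every character that is neither a digit nor whitespace with 9 */
  c = prxchange('s/[^0-9\s]/9/', -1, b);
  /* read the digits as a number, then write them back with leading zeros */
  c = put(input(c, 5.), z5.);
run;
```

Whitespace is excluded from the replacement so that the trailing blanks SAS pads character values with are not turned into 9s.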
I came across this question this morning and I am still trying to figure out how it can be done. The following dataset is present and has a character variable CAT.
CAT
A
AB
B
ABCD
CB
.
.
.
and so on.
We need to write a SAS program to introduce commas between the characters of the string if the length of the string is more than 1. I used the length() function and a do loop to create different variables, and it just got messy. How do I tackle this?
Regular expression solution:
data have;
input CAT $;
datalines;
A
AB
B
ABCD
CB
;;;;
run;
data want;
set have;
cat_c = prxchange('s/(?<=[[:alpha:]])([[:alpha:]])/,$1/io',-1,CAT);
put cat_c=;
run;
The first parenthetical group is a look-behind for an alpha character; then the captured alpha character. Then replace with comma and character. If you want something other than [[:alpha:]] (ie, A-Z) then supply that as a class.
The solution using length and do loop isn't bad, honestly, if you want something that is more readable to novice programmers. Just use SUBSTR left of the equal sign.
data want2;
set have;
length cat_c $15; /* CAT is $8 here, so the result needs at most 2*8-1 characters */
if length(cat) > 1 then
do _t = 1 to length(cat)-1;
substr(cat_c,2*_t-1,2)=substr(cat,_t,1)||',';
end;
substr(cat_c,2*length(cat)-1,1)=substr(cat,length(cat),1);
put cat_c=;
run;