Not sure if this possible in SAS; although I'm slowly learning pretty much anything is possible in SAS...
I have a data-set of 600 patients and within that data-set I have a comment variable. The comment variable contains a few sentences each patient stated about his/her care. So for example, the data set looks like this:
ID Comment
1 Today we have great service. everyone was really nice.
2 The customer service team did not know what they were talking about and was rude.
3 Everyone was very helpful 5 stars.
4 Not very helpful at all.
5 Staff was nice.
6 All the people was really nice.
Lets say I identify a number of key words I'm interested in; for example nice, rude and helpful.
Is there a way to pull 2 strings that come before these words and produce a frequency table?
WORD Frequency
Was Really Nice 2
And Was Rude 1
Was Very Helpful 1
Not very helpful 1
I have a code written already which will help me to identify the key words, this code creates a count of the freq of each word within the comment variable.
data PG_2 / view=PG_2;
length word $20;
set PG_1;
do i = 1 by 1 until(missing(word));
word = upcase(scan(COMMENT, i));
if not missing(word) then output;
end;
keep word;
run;
proc freq data=PG_2 order=freq;
table word / out=wordfreq(drop=percent);
run;
Have you looked at the perl regular expression (PRX) functions in SAS. I think they might solve your issue.
You can use RegEx capture groups to pull out the two words directly before your keyword using prxparse and prxposn. The below should grab any two words before the word nice in the comment variable and add them to the firstTwoStrings variable.
data firstTwoStrings;
length firstTwoStrings $200;
retain re;
if _N_ = 1 then
re = prxparse('/(\w+ \w+) nice/'); /*change 'nice' to your desired keyword*/
set comments;
if prxmatch(re, COMMENT) then
do;
firstTwoStrings = prxposn(re, 1, COMMENT);
end;
run;
Related
Using SAS, I have a table with sentences and I am looking to find the rows in the table where the keyword is found in the sentence making use of fuzzy matching (complev function). Is there a way in SAS to find the keyword string in the sentences? I know how to use complev, but I only can use it to compare complete strings, not a string as a part of a larger string. For this example table the keyword would be 'example' and the result of the comparison would be in the column Result.
Thanks for your ideas!
This is an Example sentence : 1
Here is another one : 0
Also an exmple : 1
The examples keep coming : 1
No worries : 0
See if you can use this as a template. I compare the Complev value to three, but you can set it to any fitting value.
data have;
input string $ 1-25;
datalines;
Example sentence
Here is another one
Also an exmple
The examples keep coming
No worries
;
data want;
set have;
result = 0;
do _N_ = 1 to countw(string);
if complev('example', scan(string, _N_)) < 3 then do;
result=1; leave;
end;
end;
run;
EDIT: Use complev('example', scan(string, _N_), 'i') if you want the comparison the be case insensitive.
I've successfully implemented a negative lookback in my regex code in SAS. However, there are multiple 'words' that are possibilities that would negate the string I'm looking for. Specifically I'm looking for a phrase (from medical notes) that say "carbapenmase producing" or "carbapenamase confirmed" and at times these phrases can be preceded by "not carbapenemase producing" or "possible carbapenamase producing", and these I don't want. Having learned that negative lookbacks require the qualifier words (if > 1) to be of the same length, I need to create 2 separate regex expressions to capture "not" and "possible", as in:
*!!! Create template to identify key phrases in the comment/note;
retain carba1 carba2 carba3;
if _n_ = 1 then do; /*probable*/
carba1 = prxparse("/(?<!not\s)ca[bepr]\w*?\s*?(conf|posi|prod|\+)/i");
carba2 = prxparse("/(?<!possible|probable\s)ca[bepr]\w*?\s*?
(conf|posi|prod|\+)/i");
carba3 = prxparse("/(?<!not a\s)ca[bepr]\w*?\s*?(conf|posi|prod|\+)/i");
end;
if prxmatch(carba1,as_comments) > 0 or prxmatch(carba2,as_comments) > 0 or
prxmatch(carba3,as_comments) > 0;
Is there a word around for this that would shorten execution time, or am I stuck with this? Any advice/comments are appreciated.
if it has just 4 scenarios and they are straightforward. you can do this simple by using contains and not contains.
data have;
length string $200.;
infile datalines;
input string & $ ;
datalines;
this is cool and carbapenmase producing or wow so nice
this is wow confirmed carbapenamase confirmed hello
now this positive for modified hodge test and later
cool is my name not carbapenemase producing" or "the modified hodge hello
wow and wow previous possible carbapenamase producing hello
Mr cool is hello
;
data want;
set have;
where (string contains "carbapenmase producing" or
string contains "carbapenamase confirmed")
and not (string contains "not carbapenemase producing" or
string contains "possible carbapenamase producing");
run;
I have a table that looks similar to this:
A | B
1234|A1B2C
1124|$1n7
1342|*6675
1189|966
I need to create a column C where it takes the data from column B and replaces all non numeric characters with a "9" and makes each one 5 characters long by adding 0's to the front. It should come out like this:
91929
09197
96675
00966
Any assistance would be much appreciated, Thank you!
Edit: Sorry first time posting on any forum like this and got a bit ahead of myself, I created the table using SQL to pull data from 3 other tables and am a bit more familiar with SQL than SAS, which I have only been using for a few weeks. I have tried using COMPRESS but as I read more about that it seem like it only removes the values, so I tried TRANWRD but from what I was able to figure out I would need to create an entry for each letter and symbol that could appear, ie.
data Work.temp;
str = b;
Alpha=tranwrd(str, "a", "9");
Alpha=tranwrd(str, "b", "9");
put Alpha;
run;
so then I researched some more and found SAS replace character in ALL columns
based on that I used this code:
data temp;
set work.temp;
array vars [*] _character_;
do i = 1 to dim(vars);
vars[i] = compress(tranwrd(vars[i],"a","9"));
end;
drop i;
run;
That just returns:
|Str|B|Alpha|
|---.|-.|.-------|
(sorry about the bad formatting there, spent 30 min trying to figure out how to make the table look right with spaces but kept coming out wrong. Please imagine the -'s are spaces)
again any help would be appreciated, Thank you!
try this.
data test;
input var1 $5.;
datalines;
A1B2C
$1n7
*6675
966
;
run;
data test1;
set test;
length var2 $5.;
regex = prxparse ("s/[^0-9|\s]/9/"); /*holds the regular expression you want to use to substitute the non-number characters*/
var2 = prxchange (regex, -1, var1); /*use this function to substitute all instances of the pattern*/
var3 = put (input (var2, best5.), z5.); /*use input and put to pad the front of the variable with 0s*/
run;
Good luck.
Keeping only the digits is simple. Use the modifiers on the COMPRESS() function.
c=compress(b,,'kd');
Padding on the left with zeros there are a number of ways to do that.
You could convert the digits to a number then write it back to a string use the Z format.
c=put(input(c,??5.),Z5.);
You could add the zeros. Using IF statement:
if length(c) < 5 then c=repeat('0',5-length(c)-1)||c ;
Or using SUBSTRN() function.
c=substrn('00000',1,5-length(c))||c;
Or have some fun with the REVERSE() function.
c=reverse(substr(reverse(cats('00000',c)),1,5));
I am using the notalnum function in SAS. The input is a db field. Now, the function is returning a value that tells me there is a special character at the end of every string.
It is not a space character, because I have used COMPRESS function on the input field.
How can I print the ACII value of the special character at the end of each string?
The $HEX. format is the easiest way to see what they are:
data have;
var="Something With A Special Char"||'0D'x;
run;
data _null_;
set have;
rul=repeat('1 2 3 4 5 6 7 8 9 0 ',3); *so we can easily see what char is what;
put rul=;
put var= $HEX.;
run;
You can also use the c option on compress (var=compress(var,,'c');) to compress out control characters (which are often the ones you're going to run into in these situations).
Finally - 'A0'x is a good one to add to the list, the non-breaking space, if your data comes from the web.
If you want to see the position of the character within the ascii table you can use the rank() function, e.g.:
data _null_;
string = 'abc123';
do i = 1 to length(string);
asc = rank(substr(string,i,1));
put i= asc=;
end;
run;
Gives:
i=1 asc=97
i=2 asc=98
i=3 asc=99
i=4 asc=49
i=5 asc=50
i=6 asc=51
Joe's solution is very elegant, but seeing as my hex->decimal conversion skills are pretty poor I tend to do it this way.
Can anyone help me to resolve this?
I have a very large raw dataset with a character variable that contains text strings along with numbers & dates defined in character format. Now I want to process the dataset and create a new numeric variable and populate values only when the text in the actual variable is either a number or a date value. Otherwise missing
RAWDATA:
ACTUAL_VARIABLE NEW_NUM_VARIABLE(Expected Values)
------------------ ---------------------------------
ODed on pills threw them all up - 2006
Y
1 1
5 5
ODed on pills
6 6
Less than once a week
N
N
2006-11-12 2006-11-12
Many Thanks in Advance
The easy way to do it (if you know the specific date format) is to use the input function.
09:27
If put(input(var,??yymmdd10.),yymmdd10.)=var then its a date!
else if input(var,best.) ne . then its a number.
Otherwiseits a character string.
This isn't as straightforward as it first looks, so I understand why it would be difficult to search for an answer. Just extracting a number is pretty easy, but when dates are included it becomes a bit more complicated (particularly when the format entered could change, e.g. yyyy-mm-dd, dd-mm-yyyy, dd/mm/yy etc).
One thing to note first. If you want to store the new values as a numeric field then you can't show a mix of numbers and dates. Dates are stored as numbers and formatted to show the date, but you can't apply a format at row level. Therefore I would suggest creating 2 new columns, 1 for numbers and 1 for dates.
My preferred approach is to use the anyalpha function to exclude any records with an alphabetic character, followed by the anypunct function to identify if a punctuation character exists (this should identify dates rather than just numbers). The anydtdte informat is then used to extract the date, this is a very useful informat as it reads dates stored in different ways (as per my note above).
There are clearly some caveats with this method.
If any numbers contain decimals then my method would incorrectly treat these as dates, therefore only integers will be assigned correctly.
It won't pick up dates that contain the month as words, e.g. 15-May-2015, as the anyalpha function would exclude them. They will need to contain numbers only, separated by any punctuation character.
Here's my code.
/* create initial dataset */
data have;
input actual_variable $ 50.;
datalines;
ODed on pills threw them all up - 2006
Y
1
5
ODed on pills
6
Less than once a week
N
N
2006-11-12
;
run;
/* extract dates and numbers */
data want;
set have;
if not anyalpha(actual_variable) then do; /* exclude records with an alphabetic character */
if anypunct(actual_variable) then new_date_variable = input(actual_variable,anydtdte10.); /* if a punctuation character exists then read in as a date */
else new_num_variable = input(actual_variable,best12.); /* else read in as a number */
end;
format new_date_variable yymmdd10.; /* show date field in required format */
run;