PROC SUMMARY using multiple fields as a key - sas

Is there a way to replicate the behaviour below using PROC SUMMARY without having to create the dummy variable UniqueKey?
DATA table1;
input record_desc $ record_id $ response;
informat record_desc $4. record_id $1. response best32.;
format record_desc $4. record_id $1. response best32.;
DATALINES;
A001 1 300
A001 1 150
A002 1 200
A003 1 150
A003 1 99
A003 2 250
A003 2 450
A003 2 250
A004 1 900
A005 1 100
;
RUN;
DATA table2;
SET table1;
UniqueKey = record_desc || record_id;
RUN;
PROC SUMMARY data = table2 NWAY MISSING;
class UniqueKey record_desc record_id;
var response;
output out=table3(drop = _FREQ_ _TYPE_) sum=;
RUN;

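You can summarise by record_desc and record_id directly (see the CLASS statement below) without concatenating the two columns. With NWAY, PROC SUMMARY outputs one row per distinct combination of the CLASS variables, so the dummy key adds nothing. PROC COMPARE then checks the result against the original output.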
PROC SUMMARY data = table1 NWAY MISSING;
class record_desc record_id;
var response;
output out=table4(drop = _FREQ_ _TYPE_) sum=;
RUN;
proc compare
base=table3
compare=table4;
run;
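For comparison, the same two-column grouping can be sketched in PROC SQL. This is only an illustration, not part of the original answer; the output name table5 is hypothetical.

/* Sketch: group on both columns at once; table5 is a made-up name */
proc sql;
create table table5 as
select record_desc,
       record_id,
       sum(response) as response
from table1
group by record_desc, record_id;
quit;

Like the CLASS statement, GROUP BY treats each distinct pair of values as one group, so no concatenated key is needed here either.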


Looping around a whole data step in SAS

So I have a following code:
%let macroVar = Var1 Var2;
Data new1;
Set old1 (keep= count &macroVar.);
Run;
Proc means data = new1 nway missing noprint;
Class var1;
Var count;
Output out= out_var1 sum=;
Run;
Proc means data = new1 nway missing noprint;
Class var2;
Var count;
Output out= out_var2 sum=;
Run;
How can I write out the two proc means in one data step, using the macro variable I set up at the beginning?
Many thanks
Why not just do it in one PROC MEANS call?
%let varlist=sex age ;
proc means data=sashelp.class missing noprint;
class &varlist;
ways 1;
var height;
output out=want sum=;
run;
Result:
Obs Sex Age _TYPE_ _FREQ_ Height
1 11 1 2 108.8
2 12 1 5 297.2
3 13 1 3 184.3
4 14 1 4 259.6
5 15 1 4 262.5
6 16 1 1 72.0
7 F . 2 9 545.3
8 M . 2 10 639.1
If you really must have separate output datasets then it will be faster to generate them from the output above.
%macro sum_counts(varlist,data=,var=count);
%local i n;
%let n=%sysfunc(countw(&varlist));
proc means data=&data missing noprint ;
class &varlist;
ways 1;
var &var;
output out=_summary_ sum=;
run;
%do i=1 %to &n;
data new&i;
set _summary_;
where _type_=2**(&n-&i);
keep %scan(&varlist,&i) &var;
run;
%end;
%mend sum_counts;
Example:
263 options mprint;
264 %sum_counts(varlist=sex age,data=sashelp.class,var=height);
MPRINT(SUM_COUNTS): proc means data=sashelp.class missing noprint ;
MPRINT(SUM_COUNTS): class sex age;
MPRINT(SUM_COUNTS): ways 1;
MPRINT(SUM_COUNTS): var height;
MPRINT(SUM_COUNTS): output out=_summary_ sum=;
MPRINT(SUM_COUNTS): run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK._SUMMARY_ has 8 observations and 5 variables.
MPRINT(SUM_COUNTS): data new1;
MPRINT(SUM_COUNTS): set _summary_;
MPRINT(SUM_COUNTS): where _type_=2**(2-1);
MPRINT(SUM_COUNTS): keep sex height;
MPRINT(SUM_COUNTS): run;
NOTE: There were 2 observations read from the data set WORK._SUMMARY_.
WHERE _type_=2;
NOTE: The data set WORK.NEW1 has 2 observations and 2 variables.
MPRINT(SUM_COUNTS): data new2;
MPRINT(SUM_COUNTS): set _summary_;
MPRINT(SUM_COUNTS): where _type_=2**(2-2);
MPRINT(SUM_COUNTS): keep age height;
MPRINT(SUM_COUNTS): run;
NOTE: There were 6 observations read from the data set WORK._SUMMARY_.
WHERE _type_=1;
NOTE: The data set WORK.NEW2 has 6 observations and 2 variables.

SAS Demographic Table

I have been trying to create a demographic table like the one below, but I can't seem to append the different tables. Please advise on where I can make adjustments in the code.
Group A Group B
cohort 1 cohort 2 cohort 3 subtotal cohort 4 cohort 5 cohort 6 subtotal
Age
n
mean
sd
median
min
Gender
n
female
male
Race
n
white
asian
hispanic
black
My Code:
PROC FORMAT;
value content
1=' '
2='Age'
3='Gender'
4='Race';
value sex
1=' n'
2=' female'
3=' male';
value race
1=' n'
2=' white'
3=' asian'
4=' hispanic'
5=' black';
value stat
1=' n'
2=' Mean'
3=' Std. Dev.'
4=' Median'
5=' Minimum';
RUN;
DATA testtest;
SET test.test(keep = id group cohort age gender race);
RUN;
data tottest;
set testtest;
output;
if prxmatch('m/COHORT 1|COHORT 2|COHORT 3/oi', cohort) then do;
cohort='Subtotal';
output;
end;
if prxmatch('m/COHORT 4|COHORT 5|COHORT 6/oi', cohort) then do;
cohort='Subtotal';
output;
end;
run;
data count;
if 0 then set testtest nobs=npats;
call symput('npats',put(npats,1.));
stop;
run;
proc freq data=tottest;
tables cohort /out=patk0 noprint;
tables cohort*gender /out=sex0 noprint;
tables cohort*race /out=race0 noprint;
run;
PROC MEANS DATA = testtest n mean std min median;
class cohort;
VAR age;
RUN;
I know that I will have to transpose the results and put them in a report. But before I do that, how do I get the statistics out of my PROC MEANS, PROC FREQ, etc. and into data sets?
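One possible way to capture the PROC MEANS statistics, sketched here under assumptions: the OUTPUT statement writes them to a data set, much as OUT= already does for the PROC FREQ tables above. The data set name age0 and the age_* variable names are hypothetical.

/* Sketch only: age0 and the age_* names are illustrative */
proc means data=tottest noprint nway;
class cohort;
var age;
output out=age0(drop=_type_ _freq_)
       n=age_n mean=age_mean std=age_sd median=age_median min=age_min;
run;

This gives one row per cohort (including the Subtotal rows from tottest), which can then be transposed alongside the PROC FREQ output data sets.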

Isolate Patients with 2 diagnoses but diagnosis data is on different lines

I have a dataset of patient data with each diagnosis on a different line.
This is an example of what it looks like:
patientID diabetes cancer age gender
1 1 0 65 M
1 0 1 65 M
2 1 1 23 M
2 0 0 23 M
3 0 0 50 F
3 0 0 50 F
I need to isolate the patients who have a diagnosis of both diabetes and cancer; their unique patient identifier is patientID. Sometimes they are both on the same line, sometimes they aren't. I am not sure how to do this because the information is on multiple lines.
How would I go about doing this?
This is what I have so far:
PROC SQL;
create table want as
select patientID
, max(diabetes) as diabetes
, max(cancer) as cancer
, min(DOB) as DOB
from diab_dx
group by patientID;
quit;
data final; set want;
if diabetes GE 1 AND cancer GE 1 THEN both = 1;
else both =0;
run;
proc freq data=final;
tables both;
run;
Is this correct?
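If you want to learn about data steps, look up how this one works. It reads all rows for a patient in a single DO UNTIL loop and keeps the running maximum of each flag.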
data pat;
input patientID diabetes cancer age gender:$1.;
cards;
1 1 0 65 M
1 0 1 65 M
2 1 1 23 M
2 0 0 23 M
3 0 0 50 F
3 0 0 50 F
;;;;
run;
data both;
do until(last.patientid);
set pat; by patientid;
_diabetes = max(diabetes,_diabetes);
_cancer = max(cancer,_cancer);
end;
both = _diabetes and _cancer;
run;
proc print;
run;
Adding a HAVING clause at the end of the SQL query should do it.
PROC SQL;
create table want as
select patientID
, max(diabetes) as diabetes
, max(cancer) as cancer
, min(age) as age
from PAT
group by patientID
having calculated diabetes ge 1 and calculated cancer ge 1;
quit;
You might find some coders, especially those coming from statistical backgrounds, are more likely to use Proc MEANS instead of SQL or DATA step to compute the diagnostic flag maximums.
proc means noprint data=have;
by patientID;
output out=want
max(diabetes) = diabetes
max(cancer) = cancer
min(age) = age
;
run;
Or, when the same aggregation function applies to every variable:
proc means noprint data=have;
by patientID;
var diabetes cancer;
output out=want max= ;
run;
or
proc means noprint data=have;
by patientID;
var diabetes cancer age;
output out=want max= / autoname;
run;
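As a sketch of how the PROC MEANS route could go all the way to the patients with both diagnoses, a WHERE= data set option on the output data set can filter the aggregated flags. The names pat_sorted and both_dx are hypothetical, and the input is assumed to be the pat data set from the earlier answer.

/* Sketch: BY processing needs sorted input; pat_sorted and both_dx are made-up names */
proc sort data=pat out=pat_sorted;
by patientID;
run;
proc means noprint data=pat_sorted;
by patientID;
var diabetes cancer;
/* keep only patients whose maximum flag is 1 for both diagnoses */
output out=both_dx(drop=_type_ _freq_ where=(diabetes=1 and cancer=1)) max=;
run;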

Setting names to idgroup

Follow up to
SAS - transpose multiple variables in rows to columns
I have the following code:
data have;
input CX_ID 1. TYPE $1. COUNT_RATE 1. SUM_RATE 2.;
datalines;
1A110
1B220
2A120
;
run;
proc summary data = have nway;
class cx_id;
output out=want (drop = _:)
idgroup(out[2] (count_rate sum_rate)= count sum);
run;
So this table:
CX_ID TYPE COUNT_RATE SUM_RATE
1 A 1 10
1 B 2 20
2 A 1 20
becomes
CX_ID COUNT_1 COUNT_2 SUM_1 SUM_2
1 1 2 10 20
2 1 . 20 .
Which is perfect, but how do I set the names to be
Count_A Count_B Sum_A Sum_B
Or in general whatever the value in the type field of the have table ?
Thank you
A double PROC TRANSPOSE is dynamic and you can add a data step to customize the names easily.
*sample data;
data have;
input CX_ID 1. TYPE $1. COUNT 1. SUM 2.;
datalines;
1A110
1B220
2A120
;
run;
*transpose to long;
proc transpose data=have out=long;
by cx_id type;
run;
*transpose to wide;
proc transpose data=long out=wide;
by cx_id;
var col1;
id _name_ type;
run;
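To get names closer to the requested Count_A style without an extra data step, the second transpose might use the DELIMITER= option, which places a string between the values of the ID variables. This is a sketch that assumes a SAS release supporting DELIMITER= on PROC TRANSPOSE (9.2 or later); wide2 is a hypothetical output name.

*transpose to wide, joining _name_ and type with an underscore;
proc transpose data=long out=wide2(drop=_name_) delimiter=_;
by cx_id;
var col1;
id _name_ type;
run;

The resulting columns should then be named along the lines of COUNT_A and SUM_A rather than COUNTA and SUMA.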

Count over columns in SAS

I have a data set in SAS containing individuals as rows and a variable for each period as columns. It looks something like this:
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;
run;
What I want is for SAS to count how many times each number occurs in each time period. So I want to get something like this:
data want;
input count t1 t2 t3;
cards;
111 1 3 1
112 3 1 0
123 0 0 3
;
run;
I could do this with proc freq, but outputting this doesn't work very well, when I have a lot of columns.
Thanks
In general, storing data in the metadata is a bad idea, as here, where PERIOD is coded into the Tn variable names when you really want it to be a grouping variable. Having said that, you can still have your cake and eat it too.
PROC SUMMARY can get the counts for each Tn quickly and then you will have smaller data set to fiddle with. Here is one approach that should work well for many time periods.
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;;;;
run;
proc print;
run;
proc summary data=have chartype;
class t:;
ways 1;
output out=want;
run;
proc print;
run;
data want;
set want;
p = findc(_type_,'1');
c = coalesce(of t1-t3);
run;
proc print;
run;
proc summary data=want nway completetypes;
class c p;
freq _freq_;
output out=final;
run;
proc print;
run;
proc transpose data=final out=morefinal(drop=_name_) prefix=t;
by c;
id p;
var _freq_;
run;
proc print;
run;
First restructure the data so that it is in more of a vertical fashion. This will be easier to work with. We also want to create a flag that we will use as a counter later on.
data have2;
set have;
array arr[*] t1-t3;
flag = 1;
do period=lbound(arr) to hbound(arr);
val = arr[period];
output;
end;
keep period val flag;
run;
Summarize the data so we have the number of times that value occurred in each of the periods.
proc sql noprint;
create table smry as
select val,
period,
sum(flag) as count
from have2
group by 1,2
order by 1,2
;
quit;
Transpose the data so we have one line per value and then the counts for each period after that:
proc transpose data=smry out=want(drop=_name_) prefix=t;
by val;
id period;
var count;
run;
Note that when you define the array in the first step you could use this notation which would allow for a dynamic number of periods:
array arr[*] t:;
This assumes every variable beginning with 't' in the dataset should go into the array.
If your computer memory is large enough to hold the entire output, then Hash could be a viable solution:
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;
run;
data _null_;
if _n_=1 then
do;
/*This is to construct a Hash, where count is tracked and t1-t3 is maintained*/
declare hash h(ordered:'a');
h.definekey('count');
h.definedata('count', 't1','t2','t3');
h.definedone();
call missing(count, t1,t2,t3);
end;
set have(rename=(t1-t3=_t1-_t3))
/*rename to avoid conflict between input data and Hash object*/
end=last;
array _t(*) _t:;
array t(*) t:;
/*The key is to set up two arrays, one is for input data,
another is for Hash feed, and maneuver their index variable accordingly*/
do i=1 to dim(_t);
count=_t(i);
rc=h.find(); /*search the Hash and bring back data elements if found*/
/*If there is a match, then corresponding 't' will increase by '1'*/
if rc=0 then
t(i)+1;
else
do;
/*If there is no match, then corresponding 't' will be initialized as '1',
and all of the other 't' reset to '0'*/
do j=1 to dim(t);
t(j)=0;
end;
t(i)=1;
end;
rc=h.replace(); /*Update the Hash*/
end;
if last then
rc=h.output(dataset:'want');
run;
Try this:
%macro freq(dsn);
proc sql;
select name into:name separated by ' ' from dictionary.columns where libname='WORK' and memname=%upcase("&dsn") and name like 't%';
quit;
%let ncol=%sysfunc(countw(&name,%str( )));
%do i=1 %to &ncol;
%let col=%scan(&name,&i);
proc freq data=&dsn noprint;
table &col/out=col_&i(keep=&col count rename=(&col=count count=&col));
run;
%end;
data temp;
merge
%do i=1 %to &ncol;
col_&i
%end;
;
by count;
run;
data want;
set temp;
array vars t:;
do over vars;
if missing(vars) then vars=0;
end;
run;
%mend;
%freq(have)