SAS Chi Square Test - sas

I have the following dataset on which I intend to perform a chi square test (all variables being categorical).
Indicator Area Range1 Range2
0 A 17-25 25-50
0 A 17-25 25-50
0 A 17-25 25-50
0 A 17-25 25-50
0 A 0-17 25-50
1 B 17-25 25-50
1 B 0-17 17-25
1 B 17-25 25-50
The test needs to be performed for every variable, namely Range1, Range2, and Area. One way to do it is with a macro, but I have around 300 variables, and calling the macro 300 times is not efficient. The code that I use for 3 variables is as follows:
options mprint mlogic symbolgen;
%macro chi_test(vars_test);
proc freq data =testdata.AllData;
tables &vars_test*Indicator/ norow nocol nopercent chisq ;
output out=stats_&vars_test &vars_test PCHI;
run;
data all_chi;
set stats_:;
run;
%mend chi_test;
%chi_test(Range1);
%chi_test(Range2);
%chi_test(Area);
Can anyone help?

Why not just transpose the data and use BY-group processing?
First add a unique row identifier so that PROC TRANSPOSE can convert your variables to a single column.
data have_extra;
row+1;
set have;
run;
proc transpose data=have_extra out=tall ;
by row indicator ;
var area range1 range2 ;
run;
Then order the records by the original variable name.
proc sort; by _name_ ; run;
Then you can run your CHI-SQ for each of your original variables.
proc freq data =tall ;
by _name_;
tables col1*Indicator/ norow nocol nopercent chisq ;
output out=all_chi PCHI;
run;

If all your variables are categorical, you can use _all_ in the tables statement, along with ods output to capture the results in a dataset. This creates a single dataset with all the combinations of variables * Indicator.
If you want, you can apply dataset options (where=, keep=, drop=, etc.) to the output dataset.
data have;
input Indicator Area $ Range1 $ Range2 $;
datalines;
0 A 17-25 25-50
0 A 17-25 25-50
0 A 17-25 25-50
0 A 17-25 25-50
0 A 0-17 25-50
1 B 17-25 25-50
1 B 0-17 17-25
1 B 17-25 25-50
;
run;
ods select chisq;
ods output chisq=want;
proc freq data=have;
tables _all_*Indicator/ norow nocol nopercent chisq;
run;
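As a rough illustration of the dataset options mentioned above, the ODS output dataset can be cut down to the Pearson chi-square row for each table. This is only a sketch: it assumes the ChiSq ODS table exposes the columns Table, Statistic, DF, Value and Prob, and that the Pearson row is labelled 'Chi-Square', so check with proc contents on your own output first. The dataset name want_pearson is made up for the example.

```sas
/* Assumed column names (Table, Statistic, DF, Value, Prob): verify  */
/* them with proc contents before relying on this filter.            */
ods output chisq=want_pearson(
    keep=Table Statistic DF Value Prob
    where=(Statistic='Chi-Square'));   /* keep only the Pearson row */
proc freq data=have;
    tables _all_*Indicator / norow nocol nopercent chisq;
run;
```

Note that _all_ also crosses Indicator with itself; listing the variables explicitly, e.g. (Area Range1 Range2)*Indicator, avoids that extra table.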


Is there a way in SAS to print the value of a variable in label using proc sql?

I have a situation where I would like to put the value of a variable in the label in SAS.
Example: Median for Total_Days is 2. I would like to put this value in Days_Median_Split label. The median keeps on changing with varying data, so I would like to automate it.
Phy_Activity Total_Days "Days_Median_Split: Number of Days with Median 2"
No 0 0
No 0 0
Yes 2 1
Yes 3 1
Yes 5 1
Sample Dataset
Thanks so much!
* step 1 create data;
data have;
input Phy_Activity $ Total_Days Days_Median_Split;
datalines;
No 0 0
No 0 0
Yes 2 1
Yes 3 1
Yes 5 1
;
run;
*step 2 sort data on Total_days;
proc sort data = have;
by Total_days;
run;
*step 3 get count of obs;
proc sql noprint;
select count(*) into: cnt
from have;quit;
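* step 4 calculate the position of the median observation;
* for an even count this picks the lower of the two middle rows;
%let median = %sysevalf((&cnt + 1)/2, floor);
* step 5 get median observation, note that monotonic() is undocumented;
proc sql noprint;
select Total_days into :medianValue trimmed
from have
where monotonic()=&median;quit;
* step 6 create label, double quotes let the macro variable resolve;
data have;
set have;
label Days_Median_split = "Days_Median_split: Number of Days with Median &medianValue";
run;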
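An alternative that avoids the manual row lookup is to let PROC MEANS compute the median, which also handles an even number of rows. The following is a minimal sketch, assuming the dataset is WORK.HAVE as above; the names med_out and med are made up for the example.

```sas
/* Compute the median of Total_Days into a one-row dataset. */
proc means data=have noprint;
    var Total_Days;
    output out=med_out median=med;   /* med_out and med are hypothetical names */
run;

/* Copy the value into a macro variable; symputx strips blanks. */
data _null_;
    set med_out;
    call symputx('medianValue', med);
run;

/* Change the label in place without rewriting the data. */
proc datasets library=work nolist;
    modify have;
    label Days_Median_Split = "Days_Median_Split: Number of Days with Median &medianValue";
quit;
```

PROC DATASETS only touches the dataset header, so this is cheap even on a large table.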

Isolate Patients with 2 diagnoses but diagnosis data is on different lines

I have a dataset of patient data with each diagnosis on a different line.
This is an example of what it looks like:
patientID diabetes cancer age gender
1 1 0 65 M
1 0 1 65 M
2 1 1 23 M
2 0 0 23 M
3 0 0 50 F
3 0 0 50 F
I need to isolate the patients who have a diagnosis of both diabetes and cancer; their unique patient identifier is patientID. Sometimes they are both on the same line, sometimes they aren't. I am not sure how to do this because the information is on multiple lines.
How would I go about doing this?
This is what I have so far:
PROC SQL;
create table want as
select patientID
, max(diabetes) as diabetes
, max(cancer) as cancer
, min(DOB) as DOB
from diab_dx
group by patientID;
quit;
data final; set want;
if diabetes GE 1 AND cancer GE 1 THEN both = 1;
else both =0;
run;
proc freq data=final;
tables both;
run;
Is this correct?
If you want to learn about DATA steps, look up how this works.
data pat;
input patientID diabetes cancer age gender:$1.;
cards;
1 1 0 65 M
1 0 1 65 M
2 1 1 23 M
2 0 0 23 M
3 0 0 50 F
3 0 0 50 F
;;;;
run;
data both;
do until(last.patientid);
set pat; by patientid;
_diabetes = max(diabetes,_diabetes);
_cancer = max(cancer,_cancer);
end;
both = _diabetes and _cancer;
run;
proc print;
run;
Adding a HAVING clause at the end of the SQL query should do it.
PROC SQL;
create table want as
select patientID
, max(diabetes) as diabetes
, max(cancer) as cancer
, min(age) as age
from PAT
group by patientID
having calculated diabetes ge 1 and calculated cancer ge 1;
quit;
You might find some coders, especially those coming from statistical backgrounds, are more likely to use Proc MEANS instead of SQL or DATA step to compute the diagnostic flag maximums.
proc means noprint data=have;
by patientID;
output out=want
max(diabetes) = diabetes
max(cancer) = cancer
min(age) = age
;
run;
Or, when every variable uses the same aggregation function:
proc means noprint data=have;
by patientID;
var diabetes cancer;
output out=want max= ;
run;
or
proc means noprint data=have;
by patientID;
var diabetes cancer age;
output out=want max= / autoname;
run;

Setting names to idgroup

Follow up to
SAS - transpose multiple variables in rows to columns
I have the following code:
data have;
input CX_ID 1. TYPE $1. COUNT_RATE 1. SUM_RATE 2.;
datalines;
1A110
1B220
2A120
;
run;
proc summary data = have nway;
class cx_id;
output out=want (drop = _:)
idgroup(out[2] (count_rate sum_rate)= count sum);
run;
So this table:
CX_ID TYPE COUNT_RATE SUM_RATE
1 A 1 10
1 B 2 20
2 A 1 20
becomes
CX_ID COUNT_1 COUNT_2 SUM_1 SUM_2
1 1 2 10 20
2 1 . 20 .
Which is perfect, but how do I set the names to be
Count_A Count_B Sum_A Sum_B
Or in general whatever the value in the type field of the have table ?
Thank you
A double PROC TRANSPOSE is dynamic and you can add a data step to customize the names easily.
*sample data;
data have;
input CX_ID 1. TYPE $1. COUNT 1. SUM 2.;
datalines;
1A110
1B220
2A120
;
run;
*transpose to long;
proc transpose data=have out=long;
by cx_id type;
run;
*transpose to wide;
proc transpose data=long out=wide;
by cx_id;
var col1;
id _name_ type;
run;
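If the goal is just an underscore between the two ID values, a separate data step may not be needed. As a sketch, assuming the long dataset produced above, the DELIMITER= option of PROC TRANSPOSE joins the ID values with the given character. The output name wide_named is made up for the example.

```sas
/* Same second transpose, with the ID values joined by "_".        */
/* With the sample data this should yield names such as COUNT_A,   */
/* COUNT_B, SUM_A and SUM_B (case follows the names in HAVE).      */
proc transpose data=long out=wide_named(drop=_name_) delimiter=_;
    by cx_id;
    var col1;
    id _name_ type;
run;
```

The drop=_name_ option removes the leftover column that would otherwise just hold the text COL1.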

Count number of 0 values

Similar to here, I can count the number of missing observations:
data dataset;
input a b c;
cards;
1 2 3
0 1 0
0 0 0
7 6 .
. 3 0
0 0 .
;
run;
proc means data=dataset NMISS N;
run;
But how can I also count the number of observations that are 0?
If you want to count the number of observations that are 0, you'd want to use proc tabulate or proc freq, and do a frequency count.
If you have a lot of values and you just want "0/not 0", that's easy to do with a format.
data have;
input a b c;
cards;
1 2 3
0 1 0
0 0 0
7 6 .
. 3 0
0 0 .
;
run;
proc format;
value zerof
0='Zero'
.='Missing'
other='Not Zero';
run;
proc freq data=have;
format _numeric_ zerof.;
tables _numeric_/missing;
run;
Something along those lines. Obviously be careful about _numeric_ as that's all numeric variables and could get messy quickly if you have a lot of them...
I add this as an additional answer. It requires you to have PROC IML.
This uses matrix manipulation to do the count.
(ds=0) -- creates a matrix of 0/1 values (false/true) marking where the value = 0
[+,] -- sums down the rows of each column. With 0/1 values, this is the number of values = 0 in each column.
` -- the backtick is the transpose operator.
|| -- horizontal concatenation of matrices: {0} || {1} = {0 1}
Then we just print the values.
proc iml;
use dataset;
read all var _num_ into ds[colname=names];
close dataset;
ds2 = ((ds=0)[+,])`;
n = nrow(ds);
ds2 = ds2 || repeat(n,ncol(ds),1);
cnames = {"N = 0", "Count"};
mattrib ds2 rowname=names colname=cnames;
print ds2;
quit;
Easiest to use PROC SQL. You will have to use a UNION to replicate the MEANS output.
Each subquery in the first FROM clause counts the 0 values for one variable, and UNION stacks them up.
The last section just counts the number of observations in DATASET.
proc sql;
select n0.Variable,
n0.N_0 label="Number 0",
n.count as N
from (
select "A" as Variable,
count(a) as N_0
from dataset
where a=0
UNION
select "B" as Variable,
count(b) as N_0
from dataset
where b=0
UNION
select "C" as Variable,
count(c) as N_0
from dataset
where c=0
) as n0,
(
select count(*) as count
from dataset
) as n;
quit;
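If one row with a column per variable is acceptable, the same counts can be sketched without the UNION. This relies on SAS evaluating a comparison such as a=0 to 1 or 0, with missing values comparing as not equal to 0. The column aliases are made up for the example.

```sas
proc sql;
    /* Each comparison yields 1 when the value is 0, else 0, */
    /* so summing it counts the zeros in that column.        */
    select sum(a=0) as a_zero,
           sum(b=0) as b_zero,
           sum(c=0) as c_zero,
           count(*) as n
    from dataset;
quit;
```

For the sample data this should report 3, 2 and 3 zeros for a, b and c, out of 6 rows.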
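There is also the NLEVELS option in PROC FREQ that you could use; the frequency tables printed alongside it show the count of 0 values for each variable.
proc freq data=dataset nlevels;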
table _numeric_;
run;

SAS - Creating combinations of different independant variables with lags

I'd like to create dynamic entries in a data set (in SAS) formed using the names of variables (e.g. VarA, VarB, VarC) each having lags up to 4.
The input data set HAVE has this information (the column names are Variables and Values):
Variables Values
VarA 0
VarB 0
VarC 0
Lags 4
and the output data set WANT should be something like below (Var1, Var2, and Var3 are dynamic column names i.e. appending 1,2,3 to any string Var)
Var1 Var2 Var3
VarA VarB VarC
VarA1 VarB1 VarC1
..
VarA4 VarB4 VarC4
The intention is to have this work for any number of variables in HAVE data set.
Thanks
The following code returns what you want. Please modify according to your needs.
/*sample input dataset*/
data have;
input Variables $ Values;
datalines;
VarA 0
VarB 0
VarC 0
Lags 4
;
run;
/*get the no. of lags from the input dataset*/
proc sql noprint;
select Values into :num_of_lags from have where upcase(variables)='LAGS';
quit;
/*transpose the input dataset such that the VarA, VarB, VarC are put in columns Var1, Var2, & Var3 respectively*/
/*have_t, the transposed dataset only has 1 row.*/
proc transpose data = have out = have_t(drop = _name_) prefix = var;
where upcase(variables) ne 'LAGS';
var variables;
run;
/*replicate the 1 row in have_t num_of_lags times*/
data pre_want;
set have_t;
array myVars{*} _character_;
do j= 1 to &num_of_lags+1;
do i = 1 to dim(myVars);
myVars[i]=myVars[i];
end;
output;
end;
run;
/*final dataset*/
data want;
set pre_want;
array myVars{*} _character_;
if _N_>1 then do;
do i = 1 to dim(myVars);
myVars[i]=compress(myVars[i]!!_n_-1);
end;
end;
drop i j;
run;
proc print data = want; run;
Output:
var1 var2 var3
VarA VarB VarC
VarA1 VarB1 VarC1
VarA2 VarB2 VarC2
VarA3 VarB3 VarC3
VarA4 VarB4 VarC4