SAS: PROC FREQ combinations automatically? - sas

I have a patient dataset that looks like the below table and I would like to see which diseases run together and ultimately make a heatmap. I used PROC FREQ to make this list table, but it is too laborious to go through like this because it gives me every combination (thousands).
Moya Hypothyroid Hyperthyroid Celiac
1 1 0 0
1 1 0 0
0 0 1 1
0 0 0 0
1 1 0 0
1 0 1 0
1 1 0 0
1 1 0 0
0 0 1 1
0 0 1 1
proc freq data=new;
tables HOHT*HOGD*CroD*Psor*Viti*CelD*UlcC*AddD*SluE*Rhea*PerA/list;
run;
I would ultimately like a bunch of cross tabs as I show below, so I can see how many patients have each combination. Obviously it's possible to copy paste each variable like this manually, but is there any way to see this quickly or automate this?
proc freq data=new;
tables HOHT*HOGD/list;
run;
proc freq data=new;
tables HOHT*CroD/list;
run;
proc freq data=new;
tables HOHT*Psor/list;
run;
Thanks!

One can control the tables generated in PROC FREQ with the TABLES statement. To generate tables that are 2-way contingency tables of all pairs of columns in a data set, one can write a SAS macro that loops through a list of variables, and generates TABLES statements to create all of the correct contingency tables.
For example, using the data from the original post:
data xtabs;
input Moya Hypothyroid Hyperthyroid Celiac;
datalines;
1 1 0 0
1 1 0 0
0 0 1 1
0 0 0 0
1 1 0 0
1 0 1 0
1 1 0 0
1 1 0 0
0 0 1 1
0 0 1 1
;
run;
%macro gentabs(varlist=);
%let word_count = %sysfunc(countw(&varlist));
%do i = 1 %to (&word_count - 1);
tables %scan(&varlist,&i,%str( )) * (
%do j = %eval(&i + 1) %to &word_count;
%scan(&varlist,&j,%str( ))
%end; )
; /* end tables statement */
%end;
%mend;
options mprint;
proc freq data = xtabs;
%gentabs(varlist=Moya Hypothyroid Hyperthyroid Celiac)
run;
The code generated by the SAS macro is:
73 proc freq data = xtabs;
74 %gentabs(varlist=Moya Hypothyroid Hyperthyroid Celiac)
MPRINT(GENTABS): tables Moya * ( Hypothyroid Hyperthyroid Celiac ) ;
MPRINT(GENTABS): tables Hypothyroid * ( Hyperthyroid Celiac ) ;
MPRINT(GENTABS): tables Hyperthyroid * ( Celiac ) ;
75 run;
...and the first few tables from the resulting output looks like:
To add options to the TABLES statement, one would add code before the semicolon on the line commented as /* end tables statement */.

Proc MEANS is one common tool for obtaining a variety of statistics for a combinatoric group with in the data. In your case you want only the count of each combination.
Suppose you had 10,000 patients with 10 binary factors
data patient_factors;
do patient_id = 1 to 10000;
array factor(10);
do _n_ = 1 to dim(factor);
factor(_n_) = ranuni(123) < _n_/(dim(factor)+3);
end;
output;
end;
format factor: 4.;
run;
As you mentioned, Proc FREQ can compute the counts of each 10-level combination.
proc freq noprint data=patient_factors;
table
factor1
* factor2
* factor3
* factor4
* factor5
* factor6
* factor7
* factor8
* factor9
* factor10
/ out = pf_10deep
;
run;
FREQ does not have syntax to support creating output data that contains each pairwise combination involving factor1.
Proc MEANS does have the syntax for such output.
proc means noprint data=patient_factors;
class factor1-factor10;
output out=counts_paired_with_factor1 n=n;
types factor1 * ( factor2 - factor10 );
run;

Related

Using proc freq to cross tabulate within same ID that has 2 occurences

I have a data set where ID's have 2 different occurences on the same day. There are about 10 different occurences. I want to cross tabulate the occurences using proc freq or proc tabulate & find how many times each instance occurs on the same day. I want my table to look something like this
Frequency occ1 occ2 occ3 occ4 occ5 occ6
occ1 2 0 0 1 4 0
occ2 1 0 0 0 0 0
occ3 3 0 0 0 0 0
occ4 0 5 3 0 3 0
occ5 0 2 4 0 5 0
occ6 1 5 4 2 1 2
My data looks something like this
data have;
input id occurrence ;
datalines;
id1 occ3
id1 occ2
id2 occ1
id2 occ6
id3 occ2
id3 occ4
etc...
i tried
proc freq data=have;
tables occurrence*occurence ;
run;
but not having any luck.
I have tried other variations & using by ID but it gives every single ID individually & i have about 200 ID numbers.
Thanks!
Reform data so a tabulation of ordered pairs can be done.
data have;
call streaminit(2022);
do id = 1 to 20;
topic = rand('integer', 10); output;
topic = rand('integer', 10); output;
end;
run;
data stage;
do until (last.id);
set have;
by id;
row = col;
col = topic;
end;
run;
ods html file='pairfreq.html';
title "Ordered pair counts";
proc tabulate data=stage;
class row col;
table row='1st topic in id pair',col='2nd topic in id pair'*n='';
run;
ods html close;

Divide a dataset into subsets based on a column and perform a repeated operation for subsets

I need to perform the same operation on many different periods. In my sample data for two periods: 402 and 403.
I cannot understand the concept of how I can make a loop that will do it for me.
At the end, I'd like to have final1 for period 402, final2 for period 403 etc.
Sample data that I use for testing:
data one;
input period $ a $ b $ c $ d e;
cards;
402 a . a 1 3
402 . b . 2 4
402 a a a . 5
402 . . b 3 5
403 a a a . 6
403 a a a . 7
403 a a a 2 8
;
run;
This is how I manually choose one period of one data:
data new;
set one;
where period='402';
run;
This is how I calculate different things for the given period e.g. number of missing data, non-missing, total:
1 - For numeric variables:
proc iml;
use new;
read all var _NUM_ into x[colname=nNames];
n = countn(x,"col");
nmiss = countmiss(x,"col");
ntotal = n + nmiss;
2 - and similarly for char variables:
read all var _CHAR_ into x[colname=cNames];
close nww;
c = countn(x,"col");
cmiss = countmiss(x,"col");
ctotal = c + cmiss;
Save numeric and char results:
create cnt1Data var {nNames n nmiss ntotal};
append;
close cnt1Data;
create cnt2Data var {cNames c cmiss ctotal};
append;
close cnt2Data;
Rename columns to be the same:
data cnt1Datatemp;
set cnt1Data;
rename nNames = Name n = nonMissing nmiss = missing ntotal = total;
run;
data cnt2Datatemp;
set cnt2Data;
rename cNames = Name c = nonMissing cmiss = missing ctotal = total;
run;
and merge data into the final set:
data final;
set cnt1Datatemp cnt2Datatemp;
run;
Final data for period 402 should look like:
a b c d e
2 2 1 1 0 - missing
2 2 3 3 4 - non-missing
4 4 4 4 4 - total
and respectively for period 403:
a b c d e
0 0 0 2 0 - missing
3 3 3 1 3 - non-missing
3 3 3 3 3 - total
You can make something similar with simple SQL query.
create table miss_count as select period
, sum(missing(A)) as A
, sum(missing(B)) as B
...
from have
group by period
;
Results:
period a b c d e
402 2 2 1 1 0
403 0 0 0 2 0
It you add in
, count(*) as nobs
then you have all the information you need to calculate all of the counts you wanted.
If the number of variables is short enough you can even generate the code into a macro variable (limit of 64K bytes in a macro variable)
proc sql noprint;
select catx(' ','sum(missing(',nliteral(name),')) as',nliteral(name))
into :varlist separated by ','
from dictionary.columns
where libname='WORK' and memname='ONE' and lowcase(name) ne 'period'
;
create table miss_count as select period,count(*) as nobs,&varlist
from one
group by period
;
quit;
Results:
period nobs a b c d e
402 4 2 2 1 1 0
403 3 0 0 0 2 0
It is much easier to find this information in sql;
proc sql;
select sum(a is not missing) as fil_a
, sum(a is missing) as mis_a
, count(*) as tot_a
from one
where period eq 402;
quit;
You can even 0handle all periods at once using group by.
There are a few ways to make this work for all variables in a dataset (except for some group by variables). For instance:
%macro count_missing();
proc sql;
select count(*), name
into :no_var, :var_list separated by ' '
from sasHelp.vcolumn
where libName eq 'WORK' and memName eq 'ONE' and upcase(name) ne 'PERIOD';
create view count_missing as
select count(*) as total
%do var_nr = 1 %to &no_var;
%let var = %scan(&var_list, &var_nr);
, sum(&var is missing) as mis_&var
%end;
from work.one
group by period;
quit;
data report_missing;
set count_missing;
format count_of $32.;
count_of = 'missing';
%do var_nr = 1 %to &no_var;
%let var = %scan(&var_list, &var_nr);
&var = mis_&var;
%end;
output;
count_of = 'non missing';
%do var_nr = 1 %to &no_var;
%let var = %scan(&var_list, &var_nr);
&var = total - mis_&var;
%end;
output;
count_of = 'total';
%do var_nr = 1 %to &no_var;
%let var = %scan(&var_list, &var_nr);
&var = total;
%end;
output;
end;
%mend;
%count_missing();
You don't need iml to summarize data over observations. You can do that with a retain statement too. Moreover, using by processing with first and last, you can process all periods in one go.
data final;
set one;
by period;
if first.period then do;
mis_a = 0;
total = 0;
end;
retain mis_a;
if missing(a) then mis_a +=1; else fil_a += 1;
total += 1;
if last.period;
fil_a = total - mis_a;
end;
This is by far the fastest way to handle a big dataset if the data is sorted by period.
To make it work for a set of variables not known upfront, you can apply the same techniques as in my other solution.

How to recode values of a variable based on the maxmium value in the variable, for hundreds of variables?

I want to recode the max value of a variable as 1 and 0 when it is not. For each variable, there may be multiple observations with the max value. The max value for each value is not fixed, i.e. from cycle to cycle the max value for each variable may change. And there are hundreds of variables, cannot "hard-code" anything.
The final product would have the same dimensions as the original table, i.e. equal number of rows and columns as a matrix of 0s and 1s.
This is within SAS. I attempted to calculate the max of each variable and then append these max as a new observation into the data. Then comparing down the column of each variable against the "max" observation... looking into examples of the following did not help:
SQL
Array in datastep
proc transpose
formatting
Any insight would be much appreciated.
Here is a version done with SQL:
The idea is that we first calculate the maximum. The Latter select. Then we join the data to original and the outer the case-select specifies if the flag is set up or not.
data begin;
input var value;
cards;
1 1
1 2
1 3
1 2.5
1 1.7
1 3
2 34
2 33
2 33
2 33.7
2 34
2 34
; run;
proc sql;
create table result as
select a.var, a.value, case when a.value = b.maximum then 1 else 0 end as is_max from
(select * from begin) a
left join
(select max(value) as maximum, var from begin group by var) b
on a.var = b.var
;
quit;
To avoid "hard-code" you need to use some code generation.
First let's figure out what code you could use to solve the problem. Later we can look into ways to generate that code.
It is probably easiest to do this with PROC SQL code. SAS will allow you to reference the MAX() value of a variable. Also note that SAS evaluates boolean expressions to 1 (TRUE) or 0 (FALSE). So you just want to generate code like:
proc sql;
create table want as
select var1=max(var1) as var1
, var2=max(var2) as var2
from have
;
quit;
To generate the code you need a list of the variables in your source dataset. You can get those with PROC CONTENTS but also with the metadata table (view) DICTIONARY.COLUMNS (also accessible as SASHELP.VCOLUMN from outside PROC SQL).
If the list of variables is small then you could generate the code into a single macro variable.
proc sql noprint;
select catx(' ',cats(name,'=max(',name,')'),'as',name)
into :varlist separated by ','
from dictionary.columns
where libname='WORK' and memname='HAVE'
order by varnum
;
create table want as
select &varlist
from have
;
quit;
The maximum number of characters that will fit into a macro variable is 64K. So long enough for about 2,000 variables with names of 8 characters each.
Here is little more complex way that uses PROC SUMMARY and a data step with a temporary array. It does not really need any code generation.
%let dsin=sashelp.class(obs=10);
%let dsout=want;
%let varlist=_numeric_;
proc summary data=&dsin nway ;
var &varlist;
output out=summary(drop=_type_ _freq_) max= ;
run;
data &dsout;
if 0 then set &dsin;
array vars &varlist;
array max [10000] _temporary_;
if _n_=1 then do;
set summary ;
do _n_=1 to dim(vars);
max[_n_]=vars[_n_];
end;
end;
set &dsin;
do _n_=1 to dim(vars);
vars[_n_]=vars[_n_]=max[_n_];
end;
run;
Results:
Obs Name Sex Age Height Weight
1 Alfred M 0 1 1
2 Alice F 0 0 0
3 Barbara F 0 0 0
4 Carol F 0 0 0
5 Henry M 0 0 0
6 James M 0 0 0
7 Jane F 0 0 0
8 Janet F 1 0 1
9 Jeffrey M 0 0 0
10 John M 0 0 0

SAS make summary statistic not available in proc mean

I have a table with very many columns but for the in order to explain my
problem I will use this simple table.
data test;
input a b c;
datalines;
0 0 0
1 1 1
. 4 2
;
run;
I need to calculate the common summary statistic as min, max and number of missing. But I also need to calculate some special numbers as number of values above a certain level( in this example >0 and >1.
I can use proc mean but it only give me results for normal things like min, max etc.
What I want is result on the following format:
var minval maxval nmiss n_above1 n_above2
a 0 1 1 1 0
b 0 4 0 2 1
c 0 2 0 2 1
I have been able to make this informat for one variable with this rather
stupid code:
data result;
set test(keep =b) end=last;
variable = 'b';
retain minval maxval;
if _n_ = 1 then do;
minval = 1e50;
maxval = -1e50;
end;
if minval > b then minval = b;
if maxval < b then maxval = b;
if b=. then nmiss+1;
if b>0 then n_above1+1;
if b>2 then n_above2+1;
if last then do;
output;
end;
drop b;
run;
This produce the following table:
variable minval maxval nmiss n_above1 n_above2
b 0 4 0 2 1
I know there has to be better way do this. I am used to Python and Pandas. There I will only loop through each variable, calculate the different summary statistick and append the result to a new dataframe for each variable.
I can probably also use proc sql. The next example
proc sql;
create table res as
select count(case when a > 0 then 1 end) as n_above1_a,
count(case when b > 0 then 1 end) as n_above1_b,
count(case when c > 0 then 1 end) as n_above1_c
from test;
quit;
This gives me:
n_above1_a n_above1_b n_above1_c
1 2 2
But this do not solve my problem.
If you add an unique identifier to each row then you can just use PROC TRANSPOSE and PROC SQL to get your result.
data test;
input a b c;
id+1;
datalines;
0 0 0
1 1 1
. 4 2
;
proc transpose data=test out=tall ;
by id ;
run;
proc sql noprint ;
create table want as
select _name_
, min(col1) as minval
, max(col1) as maxval
, sum(missing(col1)) as nmiss
, sum(col1>1) as n_above1
, sum(col1>2) as n_above2
from tall
group by _name_
;
quit;
Result
Obs _NAME_ minval maxval nmiss n_above1 n_above2
1 a 0 1 1 0 0
2 b 0 4 0 1 1
3 c 0 2 0 1 0

Count number of 0 values

Similar to here, I can count the number of missing observations:
data dataset;
input a b c;
cards;
1 2 3
0 1 0
0 0 0
7 6 .
. 3 0
0 0 .
;
run;
proc means data=dataset NMISS N;
run;
But how can I also count the number of observations that are 0?
If you want to count the number of observations that are 0, you'd want to use proc tabulate or proc freq, and do a frequency count.
If you have a lot of values and you just want "0/not 0", that's easy to do with a format.
data have;
input a b c;
cards;
1 2 3
0 1 0
0 0 0
7 6 .
. 3 0
0 0 .
;
run;
proc format;
value zerof
0='Zero'
.='Missing'
other='Not Zero';
quit;
proc freq data=have;
format _numeric_ zerof.;
tables _numeric_/missing;
run;
Something along those lines. Obviously be careful about _numeric_ as that's all numeric variables and could get messy quickly if you have a lot of them...
I add this as an additional answer. It requires you to have PROC IML.
This uses matrix manipulation to do the count.
(ds=0) -- creates a matrix of 0/1 values (false/true) of values = 0
[+,] -- sums the rows for all columns. If we have 0/1 values, then this is the number of value=0 for each column.
' -- operator is transpose.
|| -- merge matrices {0} || {1} = {0 1}
Then we just print the values.
proc iml;
use dataset;
read all var _num_ into ds[colname=names];
close dataset;
ds2 = ((ds=0)[+,])`;
n = nrow(ds);
ds2 = ds2 || repeat(n,ncol(ds),1);
cnames = {"N = 0", "Count"};
mattrib ds2 rowname=names colname=cnames;
print ds2;
quit;
Easiest to use PROC SQL. You will have to use a UNION to replicate the MEANS output;
Each section of the first FROM counts the 0 values for each variable and UNION stacks them up.
The last section just counts the number of observations in DATASET.
proc sql;
select n0.Variable,
n0.N_0 label="Number 0",
n.count as N
from (
select "A" as Variable,
count(a) as N_0
from dataset
where a=0
UNION
select "B" as Variable,
count(b) as N_0
from dataset
where b=0
UNION
select "C" as Variable,
count(c) as N_0
from dataset
where c=0
) as n0,
(
select count(*) as count
from dataset
) as n;
quit;
there is levels options in proc freq you could use.
proc freq data=dataset levels;
table _numeric_;
run;