I am trying to obtain risk ratio estimates from multiply imputed, cluster-correlated data using log-binomial regression in SAS Proc Genmod. I've been able to calculate risk ratio estimates for the raw (non-MI) data, but the program hits a snag when generating an output dataset for me to read into Proc Mianalyze.
I am including a "repeated subject" statement so that SAS will use robust variance estimation. Without the "repeated subject" statement, the ODS Output statement seems to work just fine; however, once I include it, I receive a warning message that my output dataset was not generated.
I am open to other approaches and suggestions to generate risk ratio estimates using this data if the genmod/mianalyze combination is not appropriate, but would like to see if I can get this to work! I would prefer SAS, if possible, due to license access issues to other programs, like Stata and SUDAAN. My code is below, where "seroP" is my binomial outcome, "int" is the binomial independent variable of interest (intervention received vs not received), "tf5" is a binomial covariate, age is a continuous covariate, and village specifies the cluster:
Proc GenMod data=sc.wide_mip descending ; by _Imputation_;
Class int (ref='0') tf5 (ref='0') village /param=ref ;
weight weight;
Model seroP= int tf5 age /
dist=bin Link=logit ;
repeated subject=village/ type=unstr;
estimate 'Beta' int 1 -1/exp;
ods output ParameterEstimates=sc.seroP;
Run;
proc mianalyze parms =sc.seroP;
class int tf5 ;
modeleffects int tf5 age village ;
run;
Thank you for your help!
The short answer is to add an option “PRINTMLE” at the end of the “Repeated” statement. But the code you posted here may not produce what you actually want. So, following is a longer answer:
1. The program below is based on SAS 9.3 (or later) for Windows. If you are using an older version, the syntax might differ.
2. PROC MIANALYZE needs three ODS tables from PROC GENMOD rather than one: 1) the parameter estimates table (~_est); 2) the covariance table (~_covb); and 3) the parameter index table (parminfo). The PROC MIANALYZE statement should look like:
PROC MIANALYZE parms = ~_est covb = ~_covb parminfo=parminfo;
where ~_est refers to an ODS parameter estimates table, and ~_covb refers to an ODS covariance table.
There are different types of ODS parameter estimate and covariance tables. The sign “~” is a placeholder for the prefix of a specific set of ODS tables, which are discussed in the following part.
3. From PROC GENMOD, three different sets of ODS parameter and covariance tables can be generated.
3a) The first set of tables is from a non-repeated model (i.e., without the “repeated” statement). In your case, it looks like:
Proc GenMod data=sc.wide_mip descending ; by _Imputation_;
…
MODEL seroP= int tf5 age/dist=bin Link=logit COVB; /*adding the option COVB*/
/*repeated subject=village/ type=unstr;*/
/*Note that the above line has been changed to comments*/
…
ODS OUTPUT
/*the estimates from a non-repeated model*/
ParameterEstimates=norepeat_est
/*the covariance from a non-repeated model*/
Covb = nonrepeat_covb
/*the indices of the parameters*/
ParmInfo=parminfo;
Run;
Of note: 1) the option COVB is added to the MODEL statement so as to obtain the ODS covariance table; 2) the “Repeated” statement is commented out; 3) the “~_est” table is named “norepeat_est” and, similarly, the “~_covb” table is named “nonrepeat_covb”.
3b) The second set of tables contains model-based estimates from a repeated model. In your case, it looks like:
…
MODEL seroP= int tf5 age/dist=bin Link=logit;
REPEATED subject=village/ type=un MODELSE MCOVB;/*options added*/
…
ODS OUTPUT
/*the model-based estimates from a repeated model*/
GEEModPEst=mod_est
/*the model-based covariance from a repeated model*/
GEENCov= mod_covb
/*the indices of the parameters*/
parminfo=parminfo;
Run;
In the “REPEATED” statement, the option MODELSE is to generate model-based parameter estimates, and MCOVB is to generate the model based covariance. Without these options, the corresponding ODS tables (i.e., GEEModPEst and GEENCov) will not be generated. Note that the ODS table names are different from the previous case. In this case, the tables are GEEModPEst and GEENCov. In the previous case (a non-repeated model), the tables were ParameterEstimates and COVB. Here, the ~_est table is named “mod_est”, standing for the model-based estimates. Similarly, the ~_covb table is named “mod_covb”. The ParmInfo table is the same as in the previous model.
3c) The third set contains empirical estimates, also from a repeated model. The empirical estimates are also called ROBUST estimates. It sounds like these are the results you want. It looks like:
…
MODEL seroP= int tf5 age/dist=bin Link=logit;
REPEATED subject=village/ type=un ECOVB;/*option changed*/
…
ODS OUTPUT
/*the empirical(ROBUST) estimates from a repeated model*/
GEEEmpPEst=emp_est
/*the empirical(ROBUST) covariance from a repeated model*/
GEERCov= emp_covb
/*the indices of the parameters*/
parminfo=parminfo;
Run;
As you may have noticed, in the “Repeated” statement, the option is changed to ECOVB. That way, the empirical covariance table will be generated. Nothing is required to generate the empirical parameter estimates, as they are always produced by the procedure. The ParmInfo table is the same as in the previous cases.
4. Putting it together, you can actually generate all three sets of tables at the same time. The only requirement is that the option “PRINTMLE” be added, so that the estimates from the non-repeated model are still produced when the “Repeated” statement is in place. The combined program looks like the following:
Proc GenMod data=sc.wide_mip descending ; by _Imputation_;
Class int (ref='0') tf5 (ref='0') village /param=ref ;
weight weight;
Model seroP= int tf5 age /
dist=bin Link=logit COVB; /*COVB to have non-repeated model covariance*/
repeated subject=village/ type=UN MODELSE PRINTMLE MCOVB ECOVB;/*all options*/
estimate 'Beta' int 1 -1/exp;
ODS OUTPUT
/*the estimates from a non-repeated model*/
ParameterEstimates=norepeat_est
/*the covariance from a non-repeated model*/
Covb = nonrepeat_covb
/*the indices of the parameters*/
ParmInfo=parminfo
/*the model-based estimates from a repeated model*/
GEEModPEst=mod_est
/*the model-based covariance from a repeated model*/
GEENCov= mod_covb
/*the empirical(ROBUST) estimates from a repeated model*/
GEEEmpPEst=emp_est
/*the empirical(ROBUST) covariance from a repeated model*/
GEERCov= emp_covb
;
Run;
/*Analyzing non-repeated results*/
PROC MIANALYZE parms = norepeat_est covb = nonrepeat_covb parminfo=parminfo;
class int tf5 ;
modeleffects int tf5 age ; /*village defines the clusters; it is not a model effect*/
run;
/*Analyzing model-based results*/
PROC MIANALYZE parms = mod_est covb = mod_covb parminfo=parminfo;
class int tf5 ;
modeleffects int tf5 age ; /*village defines the clusters; it is not a model effect*/
run;
/*Analyzing empirical(ROBUST) results*/
PROC MIANALYZE parms = emp_est covb = emp_covb parminfo=parminfo;
class int tf5 ;
modeleffects int tf5 age ; /*village defines the clusters; it is not a model effect*/
run;
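Since the goal is a ratio estimate, the pooled coefficients still need to be exponentiated after PROC MIANALYZE. The following is only a sketch under stated assumptions: the dataset names pooled_emp and pooled_ratio are made up for illustration, and the column names Estimate, LCLMean and UCLMean are assumed to be those of the MIANALYZE ParameterEstimates ODS table, so check them in your own output. Note also that the posted model uses LINK=LOGIT, so the exponentiated values are odds ratios; they would only be risk ratios if the model were refit with LINK=LOG (log-binomial).

```sas
/* Capture the pooled estimates from the empirical (robust) analysis.      */
/* pooled_emp is a hypothetical dataset name.                              */
ods output ParameterEstimates=pooled_emp;
proc mianalyze parms=emp_est covb=emp_covb parminfo=parminfo;
  class int tf5;
  modeleffects int tf5 age;
run;

/* Back-transform the pooled coefficients and their confidence limits.     */
/* Assumes the columns are named Estimate, LCLMean and UCLMean.            */
data pooled_ratio;
  set pooled_emp;
  Ratio     = exp(Estimate);  /* odds ratio with LINK=LOGIT, risk ratio with LINK=LOG */
  Ratio_LCL = exp(LCLMean);
  Ratio_UCL = exp(UCLMean);
run;

proc print data=pooled_ratio;
run;
```

Exponentiating after pooling, rather than pooling already exponentiated ratios, keeps the combination on the log scale, which is the scale the estimates and their variances come from.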
Hopefully it helps. For further reading:
SAS proc genmod with clustered, multiply imputed data
http://www.ats.ucla.edu/stat/sas/v8/mianalyzev802.pdf
http://analytics.ncsu.edu/sesug/2006/ST12_06.PDF
Allison, Paul D. Logistic Regression Using SAS®: Theory and Application, Second Edition (page 226-234). Copyright © 2012, SAS Institute Inc.,Cary, North Carolina, USA.
I want to know the difference between two approaches: fitting the variables in proc reg and then forecasting the residuals with VARMAX,
versus
putting the significant variables from proc reg directly into the VARMAX model.
In code:
Proc reg data=x printall;
/* the MODEL keyword is required; the options follow the slash */
model dependent_variable=exogenous_variable1 exogenous_variable2 ... exogenous_variablep /
white vif tol dw dwprob selection=b slentry=0.05;
output out=xforecast Rstudent=Rstudent student=student covratio=covratio h=h predicted=pred residual=residual;
run;
quit;
/* model the residuals saved by PROC REG */
Proc varmax data=xforecast;
model residual/
p=7 q=2
method=ls dftest minic=(type=aic) print=(roots);
nloptions;
garch p=1 q=1 form=ccc;
output out=forecast1 lead=21;
run;
Or with only VARMAX:
proc varmax data=xforecast printall;
/* the MODEL keyword is required */
model dependent_variable=exogenous_variable1 exogenous_variable2 ... exogenous_variablep/
p=7 q=2 method=ls dftest minic=(type=aic) print=(roots);
nloptions;
garch p=1 q=1 form=ccc;
output out=forecast1 lead=21;
run;
When running only VARMAX, it is also not possible to generate a forecast, which I don't understand. The log shows this warning:
WARNING: The value of LEAD=21 in OUTPUT statement. There are only 0 future independent observations. The value of LEAD will take
the minimum of two values.
I want to compute multiple sums on the same column based on some criteria. Here is a small example using the sashelp.cars dataset.
The code below somewhat achieves what I want to do in three (3) different ways, but there is always a small problem.
proc report data=sashelp.cars out=test2;
column make type,invoice type,msrp;
define make / group;
define type / across;
define invoice / analysis sum;
define msrp / analysis sum;
title "Report";
run;
proc print data=test2;
title "Out table for the report";
run;
proc summary data=sashelp.cars nway missing;
class make type;
var invoice msrp;
output out=sumTest(drop= _Freq_ _TYPE_) sum=;
run;
proc transpose data=sumTest out=test3;
by make;
var invoice msrp;
id type;
run;
proc print data=test3;
title "Table using proc summary followed by proc transpose";
run;
proc sql undo_policy=none;
create table test4 as select
make,
sum(case when type='Sedan' then invoice else 0 end) as SedanInvoice,
sum(case when type='Wagon' then invoice else 0 end) as WagonInvoice,
sum(case when type='SUV' then invoice else 0 end) as SUVInvoice,
sum(case when type='Sedan' then msrp else 0 end) as Sedanmsrp,
sum(case when type='Wagon' then msrp else 0 end) as Wagonmsrp,
sum(case when type='SUV' then msrp else 0 end) as SUVmsrp
from sashelp.cars
group by make;
quit;
run;
proc print data=test4;
title "Table using SQL queries and CASE/WHEN to compute new columns";
run;
Here is the result I get when I run the presented code.
The first two tables represent the result and the out table of the report procedure. The problem I have with this approach is the column names produced by proc report. I would love to be able to define them myself, but I don't see how I can do this. It is important for further referencing.
The third table represents the result of the proc summary/proc transpose portion of the code. The problem I have with this approach is that Invoice and MSRP appear as rows in the table, instead of columns. For that reason, I think the proc report is better.
The last table represents the use of an SQL query. The result is exactly what I want, but the code is heavy. I have to do a lot of similar computation on my dataset and I believe this approach is cumbersome.
Could you help improve one of these methods?
You can just use two PROC TRANSPOSE steps:
proc summary data=sashelp.cars nway missing;
where make=:'V';
class make type;
var invoice msrp;
output out=step1(drop= _Freq_ _TYPE_) sum=;
run;
proc transpose data=step1 out=step2;
by make type ;
var invoice msrp;
run;
proc transpose data=step2 out=step3(drop=_name_);
by make;
id type _name_ ;
var col1 ;
run;
proc print data=step3;
title "Table using proc summary followed by 2 proc transpose steps";
run;
Results:
                                              Sedan       Sedan      Wagon      Wagon
Obs    Make          SUVInvoice    SUVMSRP    Invoice      MSRP     Invoice      MSRP
  1    Volkswagen       $32,243    $35,515   $335,813   $364,020    $77,184    $84,195
  2    Volvo            $38,851    $41,250   $313,990   $333,240    $57,753    $61,280
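If the SQL layout from the question is preferred, its repeated CASE expressions could be generated from the data rather than typed by hand. This is a minimal sketch, not a definitive solution: the macro variable names invoice_cols and msrp_cols and the output table test5 are made up for illustration, and it assumes every value of type is a valid SAS name (true for sashelp.cars).

```sas
proc sql noprint;
  /* Build one SUM(CASE ...) expression per distinct value of type. */
  select distinct
         cats("sum(case when type='", type, "' then invoice else 0 end) as ", type, "Invoice"),
         cats("sum(case when type='", type, "' then msrp else 0 end) as ", type, "MSRP")
    into :invoice_cols separated by ', ',
         :msrp_cols    separated by ', '
    from sashelp.cars;

  /* Use the generated expressions in the same query as before. */
  create table test5 as
  select make, &invoice_cols, &msrp_cols
    from sashelp.cars
   group by make;
quit;

proc print data=test5;
  title "Table using generated CASE/WHEN expressions";
run;
```

The column names remain fully under your control (here the type value followed by Invoice or MSRP), and new values of type are picked up automatically without editing the query.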
Use Proc TABULATE. It offers very succinct expressions for specifying the row and column dimensions, which are defined by the desired hierarchy of class variables.
The intersection of these dimensions is a cell; it represents the combination of class values that selects the observations whose statistic is displayed in that cell.
In your case the SUM is a sum of dollars, which might not make sense when the cell has more than one contributing value.
For example: does it make sense to show that the invoice sum for 11 Volkswagen Sedans is $335,813?
Also note the 'inverted' hierarchy used to show the number of contributing values.
Example:
proc tabulate data=sashelp.cars;
class make type;
var invoice msrp;
table
make=''
,
type * invoice * sum=''*f=dollar9.
type * msrp * sum=''*f=dollar9. /* this is an adjacent dimension */
(invoice msrp) * type * n='' /* specify another adjacent dimension, with inverted hierarchy */
/
box = 'Make'
;
where make =: 'V';
run;
Output
I've fit a linear regression to a set of training data using both Proc Reg and Proc GLM. When I score the testing dataset, I can only create the confidence intervals using Proc PLM on the saved Proc GLM model; the Proc Reg model results in blanks (despite being the same model).
This is just a question of whether Proc Reg is incompatible with Proc PLM for generating confidence intervals on test data.
The code below is runnable on any machine (it generates dummy data to regress on):
/* the original data; fit model to these values */
data A;
input x y @@;
datalines;
1 4 2 9 3 20 4 25 5 1 6 5 7 -4 8 12
;
/* the scoring data; evaluate model on these values */
%let NumPts = 200;
data ScoreX(keep=x);
min=1; max=8;
do i = 0 to &NumPts-1;
x = min + i*(max-min)/(&NumPts-1); /* evenly spaced values */
output; /* no Y variable; only X */
end;
run;
proc reg data=A outest=RegOut tableout;
model y = x; /* name of model is used by PROC SCORE */
store work.proc_reg_model;
quit;
ods output ParameterEstimates=Pi_Parameters FitStatistics=Pi_Summary;
proc glm data=A;
model y = x;
store work.proc_glm_model; /* store the model */
quit;
proc plm restore=work.proc_glm_model;
score data=ScoreX out=Pred predicted=yhat lcl=lower_pred_int lclm=lower_confidence_int ucl=upper_pred_int uclm=upper_confidence_int; /* evaluate the model on new data */
run;
proc plm restore=work.proc_reg_model;
score data=ScoreX out=Pred_lin_reg predicted=yhat lcl=lower_pred_int lclm=lower_confidence_int ucl=upper_pred_int uclm=upper_confidence_int; /* evaluate the model on new data */
run;
I expect identical output datasets from PROC PLM for both models. Instead, PROC PLM on the proc reg model results in blank data for the confidence and prediction intervals. As can be seen, the final 2 datasets of interest are:
Pred_lin_reg (from proc reg; blank values for confidence and prediction intervals)
Pred (from proc glm; populated values for confidence and prediction intervals)
I think your issue may be related to this NOTE: from PROC REG STORE statement documentation:
Note: The information stored by the STORE statement in PROC REG is a subset of what is usually stored by other procedures that implement this statement.
In particular, PROC REG stores only the estimated parameters of the model, so that you can later use the CODE statement in PROC PLM to write SAS DATA step
code for prediction to a file or catalog entry.
With only this subset of information, many other postprocessing features of PROC PLM are not available for item stores that are created by PROC REG.
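One workaround that stays within PROC REG is the missing-response approach: append the scoring rows (which have no y) to the training data, so they are excluded from the fit but still receive predictions and interval limits from the OUTPUT statement. The following is a sketch that assumes the datasets A and ScoreX from the question; the names Combined, is_score and Pred_reg_out are made up for illustration.

```sas
/* Stack training and scoring rows; scoring rows have y missing. */
data Combined;
  set A ScoreX(in=from_score);
  is_score = from_score;          /* 1 for scoring rows, 0 for training rows */
run;

/* Rows with missing y are not used in the fit, but PROC REG still */
/* computes predicted values and interval limits for them.         */
proc reg data=Combined plots=none;
  model y = x;
  output out=Pred_reg_out(where=(is_score=1))
         predicted=yhat
         lcl=lower_pred_int        ucl=upper_pred_int
         lclm=lower_confidence_int uclm=upper_confidence_int;
run;
quit;

proc print data=Pred_reg_out(obs=5);
run;
```

The variable names in the OUTPUT statement match those used in the PROC PLM SCORE statements above, so the result can be compared directly with the Pred dataset produced from the PROC GLM item store.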
When I run a proc glimmix in SAS, sometimes it drops observations.
How do I get the set of dropped/excluded observations or maybe the set of included observations so that I can identify the dropped set?
My current Proc GLIMMIX code is as follows:
%LET EST=inputf.aarefestimates;
%LET MODEL_VAR3 = age Male Yearc2010 HOSPST
Hx_CTSURG Cardiogenic_Shock COPD MCANCER DIABETES;
data work.refmodel;
set inputf.readmref;
Yearc2010 = YEAR - 2010;
run;
PROC GLIMMIX DATA = work.refmodel NOCLPRINT MAXLMMUPDATE=100;
CLASS hospid HOSPST(ref="xx");
ODS OUTPUT PARAMETERESTIMATES = &est (KEEP=EFFECT ESTIMATE STDERR);
MODEL RADM30 = &MODEL_VAR3 /Dist=b LINK=LOGIT SOLUTION;
XBETA=_XBETA_;
LINP=_LINP_;
RANDOM INTERCEPT/SUBJECT= hospid SOLUTION;
OUTPUT OUT = inputf.aar
PRED(BLUP ILINK)=PREDPROB PRED(NOBLUP ILINK)=EXPPROB;
ID XBETA LINP hospst hospid Visitlink Key RADM30;
NLOPTIONS TECH=NRRIDG;
run;
Thank you in advance!
It drops records with a missing value in any variable used by the model, that is, any variable named in a CLASS, BY, MODEL, or RANDOM statement. So you can check for missing values among those variables to see which records are affected. Usually the output data set will also indicate this by not having predictions for the records that are not used.
You can run the code below.
*create example data from sashelp.heart;
data heart; set sashelp.heart; run;
*Logistic Regression model, ageCHDdiag has many missing values;
proc logistic data=heart;
class sex / param=ref;
model status(event='Dead') = ageCHDdiag height weight diastolic;
*generate output data;
output out=want p=pred;
run;
*explicitly flag records as included;
data included;
set want;
if missing(pred) then include='N'; else include='Y';
run;
*check that Y equals total obs included above;
proc freq data=included;
table include;
run;
The output will show:
The LOGISTIC Procedure
Model Information
Data Set WORK.HEART
Response Variable Status
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Number of Observations Read 5209
Number of Observations Used 1446
And then the PROC FREQ will show:
The FREQ Procedure
                                     Cumulative    Cumulative
include    Frequency     Percent      Frequency       Percent
-------------------------------------------------------------
N               3763       72.24           3763         72.24
Y               1446       27.76           5209        100.00
And 1,446 records are included in both of the data sets.
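Applied to the GLIMMIX model in the question, the missing-value check described above could be sketched as follows. This assumes the dataset work.refmodel and the macro variable &MODEL_VAR3 from the question; the names work.flagged, n_missing and dropped_candidate are made up for illustration. It only covers exclusion due to missing values; observations can also be excluded for other reasons (for example, an invalid response value), which this check would not catch.

```sas
/* Flag rows with a missing value in any variable the model uses. */
data work.flagged;
  set work.refmodel;
  /* CMISS counts missing values across numeric and character arguments. */
  n_missing = cmiss(of RADM30 &MODEL_VAR3 hospid);
  dropped_candidate = (n_missing > 0);
run;

/* The count of flagged rows should match observations read minus used. */
proc freq data=work.flagged;
  tables dropped_candidate;
run;

/* Inspect a few of the flagged rows. */
proc print data=work.flagged(obs=20);
  where dropped_candidate = 1;
run;
```

If the flagged count does not match the difference between "Number of Observations Read" and "Number of Observations Used" in the procedure output, some other exclusion rule is in play.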
I think I answered my question.
The code line -
OUTPUT OUT = inputf.aar
gives the output of the model. This table includes all the observations used by the procedure, so I can match the data in this table to my input table and find the observations that get dropped.
@Reeza - I had already looked for missing values in all the columns of the data, but I was not able to identify the records that are getting dropped just by counting the records with missing values. Thanks for the suggestion though.
I'm running this code in SAS:
%let control = A;
%let test = B C D E F;
ods output ParameterEstimates = parms;
proc reg data=reg_data outest=work.model tableout;
model &control = &test / selection= rsquare adjrsq;
run;
proc sql;
create table max_r_square as
select *
from work.model
order by _ADJRSQ_ desc, _RSQ_ desc;
quit;
It effectively goes through all of the combinations of the test variable and then drops the information including R-Squared into a data set. From there I can choose the model that has the highest R-Squared.
My problem is, I can't find a way for the table to include R-Squared and P-Values at the same time while going through all combinations of the test variable.
Taking out the rsquare and adjrsq options gets the p-values in a table, but it keeps SAS from running code on all of the combinations.
I've been looking through the proc reg arguments and options and haven't found anything that works so far.
Is there a way to have SAS run a regression on all combinations of input variables and output R-Squared and P-Values into the same table?