SAS selecting top logit models by AIC - sas

I have a problem with SAS proc logistic.
I was using the following procedures when I had OLS regression and everything worked OK:
proc reg data = input_data outest = output_data;
model y = x1-x25 / selection = cp aic stop = 10;
run;
quit;
Here I wanted SAS to estimate all possible regressions using combinations of 25 regressors (x1-x25) including no more than 10 regressors in model.
Basically, I want to do the same thing (estimate all possible models having 25 regressors with no more than 10 included in a model and output top-models in a dataset with corresponding AIC) but with logistic regression.
I also know that I can use selection = score in Proc Logistic, but I'm not sure how to use outest= then and whether Score Chi-square is really a reliable alternative to cp and AIC in proc reg
So far, I know how to do stepwise/backward/forward logistic regressions, but these methods do not suit me well and btw they display in the output dataset only the top-1 model, while I want at least top-100.
Any help or advice will be highly appreciated!

Related

Output the dropped/excluded observation in Proc GLIMMIX - SAS

When I run a proc glimmix in SAS, sometimes it drops observations.
How do I get the set of dropped/excluded observations or maybe the set of included observations so that I can identify the dropped set?
My current Proc GLIMMX code is as follows-
%LET EST=inputf.aarefestimates;
%LET MODEL_VAR3 = age Male Yearc2010 HOSPST
Hx_CTSURG Cardiogenic_Shock COPD MCANCER DIABETES;
data work.refmodel;
set inputf.readmref;
Yearc2010 = YEAR - 2010;
run;
PROC GLIMMIX DATA = work.refmodel NOCLPRINT MAXLMMUPDATE=100;
CLASS hospid HOSPST(ref="xx");
ODS OUTPUT PARAMETERESTIMATES = &est (KEEP=EFFECT ESTIMATE STDERR);
MODEL RADM30 = &MODEL_VAR3 /Dist=b LINK=LOGIT SOLUTION;
XBETA=_XBETA_;
LINP=_LINP_;
RANDOM INTERCEPT/SUBJECT= hospid SOLUTION;
OUTPUT OUT = inputf.aar
PRED(BLUP ILINK)=PREDPROB PRED(NOBLUP ILINK)=EXPPROB;
ID XBETA LINP hospst hospid Visitlink Key RADM30;
NLOPTIONS TECH=NRRIDG;
run;
Thank you in advance!
It drops records with missing values in any variable you're using in the model, in a CLASS, BY, MODEL, RANDOM statement. So you can check for missing among those variables to see what you get. Usually the output data set will also indicate this by not having predictions for the records that are not used.
You can run the code below.
*create fake data;
data heart;set sashelp.heart; ;run;
*Logistic Regression model, ageCHDdiag is missing ;
proc logistic data=heart;
class sex / param=ref;
model status(event='Dead') = ageCHDdiag height weight diastolic;
*generate output data;
output out=want p=pred;
run;
*explicitly flag records as included;
data included;
set want;
if missing(pred) then include='N'; else include='Y';
run;
*check that Y equals total obs included above;
proc freq data=included;
table include;
run;
The output will show:
The LOGISTIC Procedure
Model Information
Data Set WORK.HEART
Response Variable Status
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Number of Observations Read 5209
Number of Observations Used 1446
And then the PROC FREQ will show:
The FREQ Procedure
Cumulative Cumulative
include Frequency Percent Frequency Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
N 3763 72.24 3763 72.24
Y 1446 27.76 5209 100.00
And 1,446 records are included in both of the data sets.
I think I answered my question.
The code line -
OUTPUT OUT = inputf.aar
gives the output of the model. This table includes all the observations used in the proc statement. So I can match the data in this table to my input table and find the observations that get dropped.
#REEZA - I already looked for missing values for all the columns in the data. Was not able to identify the records there are getting dropped by only identifying the no. of records with missing values. Thanks for the suggestion though.

how can create a prediction interval based on a linear model in SAS

I am trying to create a prediction interval in SAS. My SAS code is
Data M;
input y x;
datalines;
100 20
120 40
125 32
..
;
proc reg;
model y = x / clb clm alpha =0.05;
Output out=want p=Ypredicted;
run;
data want;
set want;
y1= Ypredicted;
proc reg data= want;
model y1 = x / clm cli;
run;
but when I run the code I could find the new Y1 how can I predict the new Y?
What you're trying to do is score your model, which takes the results from the regression and uses them to estimate new values.
The most common way to do this in SAS is simply to use PROC SCORE. This allows you to take the output of PROC REG and apply it to your data.
To use PROC SCORE, you need the OUTEST= option (think 'output estimates') on your PROC REG statement. The dataset that you assign there will be the input to PROC SCORE, along with the new data you want to score.
As Reeza notes in comments, this is covered, along with a bunch of other ways to do this that might work better for you, in Rick Wicklin's blog post, Scoring a regression model in SAS.

SAS: Different Odds Ratio from PROC FREQ & PROC LOGISTIC

I'm working on a project and have run into an expected issue. After running PROC LOGISTIC on my data, I noticed that a few of the odds ratios and regression coefficients seemed to be the inverse of what they should be. After some investigation using PROC FREQ to run the odds ratios, I believe there is some form of error with the odds ratios from PROC LOGISTIC.
The example below is of the response variable "MonthStay" and one of the variables in question "KennelCough". MonthStay = Y and the event of interest is KennelCough = N.
I don't know how to remedy this suspected error. Am I missing something in my code to get the correct calculations? Or am I totally misunderstanding what's going on? Thanks!
Here is the PROC FREQ code and result:
proc freq data = capstone.adopts_dog order = freq;
tables KennelCough*MonthStay / relrisk;
run;
Here is the PROC LOGISTIC CODE and results:
proc logistic data = capstone.adopts_dog plots(only)=(roc(id=prob) effect);
class Breed(ref='Chihuahua') Gender(ref='Female')
Color(ref='Black') Source(ref='Stray') EvalCat(ref='TR') SNAtIn(ref='No')
FoodAggro(ref='Y') AnimalAggro(ref='Y') KennelCough(ref='Y') Dental(ref='Y')
Fearful(ref='Y') Handling(ref='Y') UnderAge(ref='Y') InJuris(ref='Alameda County')
InRegion(ref='East Bay SPCA - Dublin') OutRegion(ref='East Bay SPCA - Dublin')
/ param=ref;
model MonthStay(event='Y') = Age Gender Breed Weight Color Source EvalCat SNatIn
NumBehvCond NumMedCond FoodAggro AnimalAggro KennelCough Dental Fearful
Handling UnderAge Injuris InRegion OutRegion
/ lackfit aggregate scale = none selection = backward rsquare;
output out = probdogs4 PREDPROBS=I reschi = pearson h = leverage;
run;
Class Level Info
Odds Ratios Estimates
In Proc Freq, you are calculating unadjusted odds ratio while in proc logistics, all odds ratio were adjusted for covariates included in the logistic regression model

Can SAS Score a Data Set to an ARIMA Model?

Is it possible to score a data set with a model created by PROC ARIMA in SAS?
This is the code I have that is not working:
proc arima data=work.data;
identify var=x crosscorr=(y(7) y(30));
estimate outest=work.arima;
run;
proc score data=work.data score=work.arima type=parms predict out=pred;
var x;
run;
When I run this code I get an error from the PROC SCORE portion that says "ERROR: Variable x not found." The x column is in the data set work.data.
proc score does not support autocorrelated variables. The simplest way to get an out-of-sample score is to combine both proc arima and a data step. Here's an example using sashelp.air.
Step 1: Generate historical data
We leave out the year 1960 as our score dataset.
data have;
set sashelp.air;
where year(date) < 1960;
run;
Step 2: Generate a model and forecast
The nooutall option tells proc arima to only produce the 12 future forecasts.
proc arima data=have;
identify var=air(12);
estimate p=1 q=(2) method=ml;
forecast lead=12 id=date interval=month out=forecast nooutall;
run;
Step 3: Score
Merge together your forecast and full historical dataset to see how well the model did. I personally like the update statement because it will not replace anything with missing values.
data want;
update forecast(in=fcst)
sashelp.air(in=historical);
by Date;
/* Generate fit statistics */
Error = Forecast-Air;
PctError = Error/Air;
AbsPctError = abs(PctError);
/* Helpful for bookkeeping */
if(fcst) then Type = 'Score';
else if(historical) then Type = 'Est';
format PctError AbsPctError percent8.2;
run;
You can take this code and convert it into a generalized macro for yourself. That way in the future, if you wanted to score something, you could simply call a macro program to get what you need.

SAS proc reg, need to set maximum number of regressors that enter model

I need help with proc reg in sas. Currently I'm using the following code:
proc reg data=input outest=data_output;
model y = x1-x25 / selection = cp;
run;
quit;
I wonder how to set maximum limit of number of regressors that enter my model. Now as you can see I want SAS to test 25 variables, but also I want it to select no more than 7 variables in my model.
And one more questions, does anybody now why SAS outputs only 601 model combinations when I use the procedure above? Why doesn't it show all possible models that it can create with this 25 regressors?
Any comments and help will be appreciated!
Use the STOP= option in the model statement.
proc reg data=input outest=data_output;
model y = x1-x25 / selection = cp stop=7;
run;
quit;