I have used the following statement to calculate predicted values of a logistic model
proc logistic data = dev descending outest =model;
class cat_vars;
Model dep = cont_var cat_var / selection = stepwise slentry=0.1 slstay=0.1
stb lackfit;
output out = tmp p= probofdefault;
Score data=dev out = Logit_File;
run;
I want to know how to interpret the values I get in Logit_File. Are they odds ratios (exp(y)), or are they probabilities (odds ratio / (1 + odds ratio))?
Probabilities cannot be odds ratios: a probability lies between 0 and 1, whereas an odds ratio has no upper bound. The values that SCORE writes are predicted probabilities.
This makes sense if you consider why the SCORE statement exists in the first place: it is designed to score new data sets using a previously fitted model. It applies that model's parameter estimates to the new data and returns a predicted probability for each observation.
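For reference, the relationship between the linear predictor, the odds, and the predicted probability can be written as below. The symbols are notation introduced here for illustration (they do not come from the original post): \eta is the linear predictor built from the fitted coefficients, and p is the probability of the modeled event.

```latex
\[
\eta = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \qquad
\text{odds} = \frac{p}{1-p} = e^{\eta}, \qquad
p = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}
\]
```

So exp(y) in the question's notation is the odds, and the scored value is the odds divided by one plus the odds.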
I have a linear model that predicts the stat "buy" that can only have 0 or 1 as an outcome.
All my predictions are fractional numbers such as 0.2. My idea is to turn every prediction of 0.5 or above into a 1 and everything else into a 0. How can I do that?
I tried an if-then-else statement, but written like this it doesn't work:
proc GLMSELECT data=tasks.data plots=all;
by descending gender;
model buy = affiliate age;
if buy >=0.5 then buy = 1 else 0;
run;
How do I need to code it so it works?
You don't need a linear regression model, you need a logistic regression model. Linear regression is designed for continuous outcomes, not binary events. I highly recommend reading up about the differences between categorical and continuous models.
proc logistic data=sashelp.heart plots=all;
class BP_Status Chol_Status Smoking_Status ;
model status(event='Dead') = Height Weight BP_Status Chol_Status Smoking_Status
/ selection=stepwise sle=0.1 sls=0.05
;
run;
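As for the 0.5 cutoff asked about in the question, a minimal sketch is to save the predicted probabilities with an OUTPUT statement and apply the threshold in a separate DATA step. The data set names work.pred and work.pred_class and the variables p_dead and pred_dead are made up for this illustration, and the model is fitted without stepwise selection to keep the example short.

```sas
/* Fit the model and save the predicted probability of status='Dead' */
proc logistic data=sashelp.heart noprint;
   class BP_Status Chol_Status Smoking_Status;
   model status(event='Dead') = Height Weight BP_Status Chol_Status Smoking_Status;
   output out=work.pred p=p_dead;
run;

/* Turn the probability into a 0/1 prediction using a 0.5 cutoff */
data work.pred_class;
   set work.pred;
   if missing(p_dead) then pred_dead = .;   /* rows with missing predictors get no prediction */
   else pred_dead = (p_dead >= 0.5);        /* a Boolean expression evaluates to 1 or 0 */
run;

/* Cross-tabulate the observed outcome against the thresholded prediction */
proc freq data=work.pred_class;
   tables status*pred_dead / missing;
run;
```

The IF statement in the question fails because IF-THEN-ELSE logic belongs in a DATA step, not inside a modeling procedure, and because the ELSE branch needs a full assignment (else buy = 0;).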
I'm working on a project and have run into an unexpected issue. After running PROC LOGISTIC on my data, I noticed that a few of the odds ratios and regression coefficients seemed to be the inverse of what they should be. After checking the odds ratios with PROC FREQ, I believe there is some form of error in the odds ratios from PROC LOGISTIC.
The example below is of the response variable "MonthStay" and one of the variables in question "KennelCough". MonthStay = Y and the event of interest is KennelCough = N.
I don't know how to remedy this suspected error. Am I missing something in my code to get the correct calculations? Or am I totally misunderstanding what's going on? Thanks!
Here is the PROC FREQ code and result:
proc freq data = capstone.adopts_dog order = freq;
tables KennelCough*MonthStay / relrisk;
run;
Here is the PROC LOGISTIC CODE and results:
proc logistic data = capstone.adopts_dog plots(only)=(roc(id=prob) effect);
class Breed(ref='Chihuahua') Gender(ref='Female')
Color(ref='Black') Source(ref='Stray') EvalCat(ref='TR') SNAtIn(ref='No')
FoodAggro(ref='Y') AnimalAggro(ref='Y') KennelCough(ref='Y') Dental(ref='Y')
Fearful(ref='Y') Handling(ref='Y') UnderAge(ref='Y') InJuris(ref='Alameda County')
InRegion(ref='East Bay SPCA - Dublin') OutRegion(ref='East Bay SPCA - Dublin')
/ param=ref;
model MonthStay(event='Y') = Age Gender Breed Weight Color Source EvalCat SNatIn
NumBehvCond NumMedCond FoodAggro AnimalAggro KennelCough Dental Fearful
Handling UnderAge Injuris InRegion OutRegion
/ lackfit aggregate scale = none selection = backward rsquare;
output out = probdogs4 PREDPROBS=I reschi = pearson h = leverage;
run;
(Output tables not reproduced: Class Level Information, Odds Ratio Estimates.)
In PROC FREQ you are calculating an unadjusted odds ratio, while in PROC LOGISTIC every odds ratio is adjusted for the other covariates included in the logistic regression model.
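One way to check this, sketched under the assumption that the data set and variable names are exactly those in the question, is to fit a logistic model with KennelCough as the only predictor. Its odds ratio is unadjusted and should be comparable to the PROC FREQ result.

```sas
/* Unadjusted model: KennelCough is the only predictor.           */
/* capstone.adopts_dog is the asker's data set (not provided here). */
proc logistic data=capstone.adopts_dog;
   class KennelCough(ref='Y') / param=ref;
   model MonthStay(event='Y') = KennelCough;
run;
```

With ref='Y', this odds ratio compares KennelCough = N against KennelCough = Y for the event MonthStay = Y. The PROC FREQ odds ratio depends on how the rows and columns of the table are ordered (here by ORDER=FREQ), so it may come out as the reciprocal of this value even when the two procedures agree.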
I have a problem with SAS proc logistic.
I was using the following procedures when I had OLS regression and everything worked OK:
proc reg data = input_data outest = output_data;
model y = x1-x25 / selection = cp aic stop = 10;
run;
quit;
Here I wanted SAS to estimate all possible regressions using combinations of 25 regressors (x1-x25) including no more than 10 regressors in model.
Basically, I want to do the same thing (estimate all possible models having 25 regressors with no more than 10 included in a model and output top-models in a dataset with corresponding AIC) but with logistic regression.
I also know that I can use selection = score in PROC LOGISTIC, but I'm not sure how to use outest= with it, or whether the score chi-square is really a reliable alternative to Cp and AIC in PROC REG.
So far, I know how to do stepwise/backward/forward logistic regressions, but these methods do not suit me well and btw they display in the output dataset only the top-1 model, while I want at least top-100.
Any help or advice will be highly appreciated!
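A possible starting point, offered as a sketch rather than a complete answer: with selection=score, the BEST= option controls how many models are listed for each model size, and the listing can be captured with ODS OUTPUT. The simulated data set, the coding of y as 0/1, and the output name top_models are assumptions made for this example.

```sas
/* Hypothetical data: binary y and 25 regressors x1-x25 */
data input_data;
   call streaminit(42);
   array x[25];                     /* creates x1-x25 */
   do id = 1 to 500;
      do j = 1 to 25;
         x[j] = rand('normal');
      end;
      y = rand('bernoulli', 1 / (1 + exp(-(0.8*x1 - 0.5*x2))));
      output;
   end;
   drop j;
run;

/* Capture the best-subsets listing in a data set */
ods output BestSubsets=top_models;

/* Up to 100 best models (by score chi-square) for each size from 1 to 10 */
proc logistic data=input_data;
   model y(event='1') = x1-x25 / selection=score start=1 stop=10 best=100;
run;
```

The captured table ranks models by score chi-square only. To obtain an AIC for each candidate, the listed models would have to be refitted individually, for example in a macro loop.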
I need to calculate the following for my dataset. I could calculate the individual PPV (95% CI) and NPV (95% CI), but I got a tad confused about how to calculate this:
PPV+NPV-1 (95% CI)
How do I do this calculation?
This page on SAS support gives code as follows:
title 'Sensitivity';
proc freq data=FatComp;
where Response=1;
weight Count;
tables Test / binomial(level="1");
exact binomial;
run;
title 'Specificity';
proc freq data=FatComp;
where Response=0;
weight Count;
tables Test / binomial(level="0");
exact binomial;
run;
title 'Positive predictive value';
proc freq data=FatComp;
where Test=1;
weight Count;
tables Response / binomial(level="1");
exact binomial;
run;
title 'Negative predictive value';
proc freq data=FatComp;
where Test=0;
weight Count;
tables Response / binomial(level="0");
exact binomial;
run;
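For the combined quantity itself, one possible approach is to note that PPV = P(Response=1 | Test=1) and NPV = P(Response=0 | Test=0), so PPV + NPV - 1 = P(Response=0 | Test=0) - P(Response=0 | Test=1). That is a difference of two proportions from disjoint groups, which the RISKDIFF option of PROC FREQ estimates with confidence limits. The counts below are illustrative values invented for this sketch, not data from the original post.

```sas
/* Illustrative 2x2 counts (made up for this example) */
data FatComp;
   input Test Response Count;
   datalines;
0 0 6
0 1 2
1 0 4
1 1 11
;
run;

/* With the default ordering, row 1 is Test=0 and column 1 is Response=0, */
/* so the column 1 risk difference (row 1 - row 2) is                     */
/* P(Response=0 | Test=0) - P(Response=0 | Test=1) = NPV + PPV - 1.       */
proc freq data=FatComp;
   weight Count;
   tables Test*Response / riskdiff;
run;
```

With these counts NPV = 6/8 and PPV = 11/15, so the column 1 difference is 0.75 - 4/15, about 0.483, and the confidence limits reported for that difference apply to PPV + NPV - 1.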
I doubt that this is a useful measure. In general you should present sensitivity, specificity, positive and negative predictive values. If you want a global measure of accuracy you should go for the proportion of correctly classified subjects.
If you go to the webpage already suggested by Peter Flom, you can scroll down to a piece of code for overall accuracy. The accuracy can be computed by creating a binary variable indicating whether test and response agree in each observation:
data acc;
set FatComp;
if (test and response) or
(not test and not response) then acc=1;
else acc=0;
run;
proc freq;
weight count;
tables acc / binomial(level="1");
exact binomial;
run;
Hope it helps
I have data on exam results for 2 years for a number of students. I have a column with the year, the student's name and the mark. Some students don't appear in year 2 because they don't sit any exams in the second year. I want to show whether the performance of students persists or whether there's any pattern in their subsequent performance. I can split the data into two halves of equal size to account for the 'first-half' and 'second-half' marks. I can also split the first half into quintiles according to the exam results using 'proc rank'.
I know the output I want is essentially a 5 x 5 table that has the original 5 quintiles on one axis and the 5 subsequent quintiles on the other, plus a 'dropped out' category, so a 5 x 6 matrix. There will obviously be around 20% of the total number of students in each quintile in the first exam, and if there's no relationship there should be 16.67% in each of the 6 subsequent categories. But I don't know how to proceed to show whether this is the case or not with this data.
How can I go about doing this in SAS, please? Could someone point me towards a good tutorial that would show how to set this up? I've been searching for terms like 'performance persistence', but to no avail.
I've been proceeding like this to set up my dataset. I've added a column with 0 or 1 for the first or second half of the data using the first procedure below. I've also added a column with the quintile rank in terms of marks for all the students. But I think I've gone about this the wrong way. Shouldn't I be dividing the data into quintiles in each half, rather than across the whole two periods?
Proc rank groups=2;
var yearquarter;
ranks ExamRank;
run;
Proc rank groups=5;
var percentageResult;
ranks PerformanceRank;
run;
Thanks in advance.
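On the last point in the question, quintiles can be computed separately within each period by adding a BY statement to PROC RANK. This is only a sketch: the data set exams, the student identifier, and the simulated marks are hypothetical, while percentageResult and PerformanceRank follow the names used in the question.

```sas
/* Hypothetical input: one row per student per year */
data exams;
   call streaminit(1);
   do student = 1 to 200;
      do year = 1 to 2;
         percentageResult = round(100 * rand('uniform'), 0.1);
         output;
      end;
   end;
run;

/* BY-group processing needs the data sorted by the BY variable */
proc sort data=exams;
   by year;
run;

/* Quintiles are now computed within each year, not across both years */
proc rank data=exams groups=5 out=exams_ranked;
   by year;
   var percentageResult;
   ranks PerformanceRank;   /* 0 = lowest fifth, 4 = highest fifth */
run;
```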
Why are you dividing the data into quintiles?
I would leave the scores as they are, then make a scatterplot with
PROC SGPLOT data = dataset;
scatter x = year1 y = year2;
loess x = year1 y = year2 / nomarkers;
run;
Here's a fairly basic example of the simple tabulation. I transpose your quintile data and then make a table. Here there is basically no relationship, except that I only allow a 5% DNF so you have more like 19% 19% 19% 19% 19% 5%.
data have;
do i = 1 to 10000;
do year = 1 to 2;
if year=2 and ranuni(7) < 0.05 then call missing(quintile);
else quintile = ceil(5*ranuni(7));
output;
end;
end;
run;
proc transpose data=have prefix=year out=have_t;
by i;
var quintile;
id year;
run;
proc tabulate data=have_t missing;
class year1 year2;
tables year1,year2*rowpctn;
run;
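To go one step beyond the percentages and ask whether the year 2 category depends on the year 1 quintile, one option is a chi-square test of independence on the same cross-tabulation. The sketch below rebuilds a small wide data set directly so that it runs on its own. The simulated values and the 5% dropout rate are assumptions carried over from the example above, and a missing year2 stands for 'dropped out'.

```sas
/* Simulated wide data: one row per student, no built-in relationship */
data have_t;
   call streaminit(7);
   do i = 1 to 10000;
      year1 = ceil(5 * rand('uniform'));
      if rand('uniform') < 0.05 then year2 = .;     /* dropped out in year 2 */
      else year2 = ceil(5 * rand('uniform'));
      output;
   end;
run;

/* MISSING makes the dropout category a level of year2,  */
/* so the chi-square test covers the full 5 x 6 table.   */
proc freq data=have_t;
   tables year1*year2 / chisq missing nocol nopercent;
run;
```

A small p-value for the chi-square statistic would suggest that the year 2 outcome is related to the year 1 quintile. With data simulated as above, no relationship is built in.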
PROC CORRESP might be helpful for the analysis, though it doesn't look like it exactly does what you want.
proc corresp data=have_t outc=want outf=want2 missing;
tables year1,year2;
run;