I am running Logit Regression in Stata.
How can I know the explanatory power of the regression (in OLS, I look at R^2)?
Is there a meaningful approach in expanding the regression with other independent variables (in OLS, I manually keep on adding the independent variables and look for adjusted R^2; my guess is Stata should have simplified this manual process)?
The concept of R^2 is meaningless in logit regression and you should disregard the McFadden Pseudo R2 in the Stata output altogether.
Hosmer and Lemeshow recommend: 'to assess the significance of an independent variable we compare the value of D with and without the independent variable in the equation', using the likelihood ratio test (G), where D is the model deviance: G = D(model without the variables [B]) - D(model with the variables [A]).
The Likelihood ratio test (G):
H0: coefficients for eliminated variables are all equal to 0
Ha: at least one coefficient is not equal to 0
When the LR test gives p > .05, do not reject H0; statistically speaking, there is then no advantage to including the additional IVs in the model.
Example Stata syntax to do this is:
logit DV IV1 IV2
estimates store A
logit DV IV1
estimates store B
lrtest A B // LR test of the restricted model B (nested in A) against the full model A
Note, however, that many more aspects have to be checked and tested before we can conclude whether or not a logit model is 'acceptable'. For more details, I recommend visiting:
http://www.ats.ucla.edu/stat/stata/topics/logistic_regression.html
and consult:
Applied logistic regression, David W. Hosmer and Stanley Lemeshow , ISBN-13: 978-0471356325
I'm worried that you are getting the fundamentals of modelling wrong here:
The explanatory power of a regression model is theoretically determined by your interpretation of the coefficients, not by the R-squared. The R^2 represents the amount of variance that your linear model predicts, which might be an appropriate benchmark to your model, or not.
Likewise, the presence or absence of an independent variable in your model requires substantive justification. If you want to see how the R-squared changes when adding or removing parts of your model, see help nestreg for nested regression.
To summarize: the explanatory power of your model and its variable composition cannot be determined just by crunching the numbers. You first need an adequate theory to build your model onto.
Now, if you are running logit:
Read Long and Freese (Ch. 3) to understand how log likelihood converges (or not) in your model.
Do not expect to find something as straightforward as the R-squared for logit.
Use logit diagnostics on your model, just as you should after running OLS.
You might also want to read up on the likelihood ratio chi-squared test, or run additional lrtest commands as explained by Eric.
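The nestreg suggestion above can be sketched as follows. This is a minimal illustration on Stata's bundled auto dataset, which is an assumption made for the example and not the asker's data. Each parenthesised block of regressors is added in turn.

```stata
* Illustrative data only: the auto dataset ships with Stata
sysuse auto, clear

* OLS: add blocks of regressors in sequence and report the change in R-squared
nestreg: regress price (mpg) (weight length)

* Logit: same idea, with likelihood-ratio tests between successive blocks
nestreg, lrtable: logit foreign (price) (mpg weight)
```

The order and grouping of the blocks should still come from theory, as argued above; nestreg only saves the manual refitting.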
I certainly agree with the above posters that almost any measure of R^2 for a binary model like logit or probit shouldn't be considered very important. There are, however, ways to see how well your model predicts. For example, check out the following commands:
lroc
estat class
Also, here's a good article for further reading:
http://www.statisticalhorizons.com/r2logistic
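As a minimal sketch of how those two commands are used, here they are after a logit fit on Stata's bundled auto dataset (the dataset and variables are assumptions chosen for illustration):

```stata
* Illustrative data only: the auto dataset ships with Stata
sysuse auto, clear
logit foreign price mpg

lroc                    // ROC curve and the area under it
estat classification    // classification table at the default 0.5 cutoff
```

Both are postestimation commands, so they must be run directly after the logit (or logistic) fit they are meant to evaluate.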
My question concerns the use of the VIF test for multicollinearity diagnostics when the model suffers from heteroskedasticity.
I want to use a HAC correction to account for heteroskedasticity in my model. However, VIF gives me starkly different results depending on whether I run it after estimating the model by simple OLS without error correction or after estimating the regression with HAC applied. I use EViews.
This surprised me, as the VIF test statistic is just 1/(1-R^2), where R^2 is calculated for a model in which a given variable x_i is regressed on the rest of the X variables. This implies that the result should not depend on the standard errors of the estimated parameters in the original regression of y on X, and thus should not depend on whether I use robust errors or not.
However, in EViews VIF is calculated differently, and estimates of the standard errors of the parameters are used (tutorial, p. 198 of the pdf). While it is suggested that both approaches are equivalent, this is clearly not the case in my example.
In short, in which order should I proceed: first test for multicollinearity with the simple OLS model and then move on to the model with HAC, or the other way round, estimating the model with HAC and then running VIF?
Thanks for all your help!
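The textbook definition cited in the question can be checked by hand. The sketch below is in Stata rather than EViews and uses Stata's bundled auto dataset, both of which are assumptions made purely for illustration. It computes the VIF for one regressor from its auxiliary regression and compares it with the built-in report.

```stata
* Illustrative data only: the auto dataset ships with Stata
sysuse auto, clear

* Built-in VIFs after the main regression
regress price mpg weight length
estat vif

* By hand for mpg: auxiliary regression of mpg on the other regressors
regress mpg weight length
display "VIF for mpg: " %6.3f 1/(1 - e(r2))
```

Because the auxiliary regression involves only the regressors, a VIF computed this way cannot change with the choice of error covariance estimator in the main equation.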
I am generating alerts by reading a dataset of KPIs (key performance indicators). My algorithm looks at historical data and, based on that, can detect a sudden spike in the data. But I am generating false alarms. For example, KPI1 is historically at .5 but reaches a value of 12, which is a kind of spike.
In the same way, KPI2 also goes from .5 to 12. But I know that KPI1 going from .5 to 12 is not a big deal and I need not capture it, whereas KPI2 going from .5 to 12 is a big deal and I need to capture it.
I want to train my program to understand what is a high, low, or normal value for each KPI.
Could you experts tell me which ML algorithm is best for this, and which Python packages I should explore?
This is a classification problem. You can use the classic logistic regression algorithm, in its multinomial form, to classify any given sample as a high, low, or normal value.
Quoting from Wikipedia:
In statistics, multinomial logistic regression is a classification
method that generalizes logistic regression to multiclass problems,
i.e. with more than two possible discrete outcomes. That is, it is
a model that is used to predict the probabilities of the different
possible outcomes of a categorically distributed dependent variable,
given a set of independent variables (which may be real-valued,
binary-valued, categorical-valued, etc.)
To perform multi-class classification in Python, the sklearn library can be useful:
http://scikit-learn.org/stable/modules/multiclass.html
(cross-posted at http://www.statalist.org/forums/forum/general-stata-discussion/general/1370770-margins-plot-of-treatment-effect-rather-than-y-for-values-of-a-covariate)
I'm running a multiple regression (the outcome variable is continuous; it happens to be GPA). The covariate of interest is a dummy variable for treatment status; another of the covariates is a pre-score. We want to look at how the treatment effect differs at various values of the pre-score. The structure of the model is not complicated:
regress GPA treatment pre_score X3 X4 X5...
What I want is a graph that shows what the treatment effect is (values of Beta1) at various values of pre-score (X2). It's straightforward to get a graph with values of the OUTCOME at various values of X2:
margins, at(pre_score= (1(0.25)5)) post
marginsplot
I have consulted an array of resources and tried alternatives using marginscontplot, coefplot with recast, the dy/dx option, and so forth. I remain unsuccessful. But this seems like something that there must be a way to do; wanting to know if a treatment effect varies for values of a control (say, income) must be common.
Can anyone direct me to the right command, or options for Margins, to output values of Beta1 (coefficient on treatment dummy), rather than of Y (GPA), at values of the pre_score?
The question was resolved at Statalist. It turns out that margins alone can't do what I was trying to do; the model needs to be run with an interaction term between treatment and pre_score. Then it's simple.
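The interaction-term approach described above might look like the following. This is a sketch, not the exact code from the Statalist thread: it assumes treatment is coded 0/1, uses factor-variable notation for the interaction, and keeps X3 X4 X5 as the placeholder covariates from the question.

```stata
* Let the treatment effect vary linearly with pre_score
regress GPA i.treatment##c.pre_score X3 X4 X5

* Treatment effect (discrete change from 0 to 1) at each pre_score value
margins, dydx(treatment) at(pre_score=(1(0.25)5))

* Plot the treatment effect, with confidence intervals, against pre_score
marginsplot
```

Without the interaction, the model forces the treatment effect to be the same constant at every pre_score, so margins has nothing to vary.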
I am estimating a model where the dependent variable is a fraction (between 0 and 1). I used the following commands in Stata 14.1:
glm y x, link(logit) family(binomial) robust nolog
as well as
fracreg logit y x, vce(robust)
Both commands deliver the same results.
Now I want to evaluate the outcome, ideally with McFadden's adjusted R^2. Yet neither fitstat nor estat gof seems to work after I run the regressions. I get the error messages "fitstat does not work with the last model estimated" and "not available after fracreg", r(321).
Does anyone know an alternative command for McFadden's adjusted R^2?
Or do I have to use a different evaluation method?
It appears that the pseudo-R-squared that appears in the fracreg output is McFadden's pseudo R squared. I'm not sure if this is the same as the McFadden's adjusted r^2 that you mention.
You can see it is McFadden's pseudo-R-squared from investigating the maximize command as suggested by #nick-cox's post on Stata.com. In the reference manual for maximize, page 1478 (Stata 14) it says:
Let L1 be the log likelihood of the full model (that is, the log-likelihood value shown on the output), and let L0 be the log likelihood of the “constant-only” model. ... The pseudo-R2 (McFadden 1974) is defined as 1 - L1 / L0. This is simply the log likelihood on a scale where 0 corresponds to the “constant-only” model and 1 corresponds to perfect prediction for a discrete model (in which case the overall log likelihood is 0).
If this is what you are looking for, this value may be pulled out using
fracreg logit y x, vce(robust)
scalar myRsquared = e(r2_p)
To adjust McFadden's R^2, subtract the number of estimated parameters K (e(k) in Stata) from the full-model log likelihood in the numerator of the fraction: adjusted R^2 = 1 - (L1 - K) / L0. Note that you may get negative values.
Here's how you might do that:
set more off
webuse set http://fmwww.bc.edu/repec/bocode/w
webuse wedderburn, clear
/* (1) Fracreg Way */
fracreg logit yield i.site i.variety, nolog
di "Fracreg McFadden's Adj. R^2:" %-9.3f 1-(e(ll)-e(k))/(e(ll_0))
/* (2) GLM Way */
glm yield, link(logit) family(binomial) robust nolog // intercept only model
local ll_0 = e(ll)
glm yield i.site i.variety, link(logit) family(binomial) robust nolog // full model
di "McFadden's Adj. R^2: " %-9.3f 1-(e(ll)-e(k))/`ll_0'
The GLM R^2 will be slightly different because the maximization algorithm is different and so the likelihoods will be different as well. I am not sure how to tweak the ML options so they match exactly.
You can verify that we did things correctly with a command where fitstat works:
sysuse auto, clear
logit foreign price mpg
fitstat
di "McFadden's Adj. R^2: " %-9.3f 1-(e(ll)-e(k))/(e(ll_0))
I'm using a Heckman selection model, which consists of two equations. I'm using a probit as the selection equation and a multiple regression as the outcome equation.
How can I put dummy variables into those equations?
Do we have to put the variables into logarithmic form?
How can I create logarithmic variables in Stata?
Thank you.
Here's an example of how you might do what you ask. The example looks at the effect of being a union member on log wages:
webuse union3
gen log_wage = ln(wage)
etregress log_wage age grade i.smsa i.black tenure, treat(union = i.south i.black tenure) twostep
etregress estimates an average treatment effect of an endogenous binary-treatment variable. In plain English, that means the "first-stage" is a probit. Estimation is by either full maximum likelihood or a two-step consistent estimator, as above.
The dummies are created on the fly by putting an i. in front of the covariates. This is called factor variable notation, and it also makes interactions a breeze. You can also do tab race, gen(d_) to create d_1, d_2, and d_3 (3 race dummies, one of which you can drop).
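Since the question asks about a Heckman selection model specifically, the same ideas (a logged outcome and factor-variable dummies) can be sketched with the heckman command. The womenwk example dataset and the variable choices below are assumptions for illustration, not the asker's data.

```stata
* Illustrative data only: wage is observed only for women who work
webuse womenwk, clear
gen log_wage = ln(wage)     // stays missing where wage is unobserved

* Outcome equation: log wage; selection equation: probit for wage being observed
heckman log_wage educ age, select(i.married children educ age) twostep
```

Whether to log a variable is a modelling choice, not a requirement of the estimator. Dummies enter either equation in the same way, through the i. prefix.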