Evaluating the Fractional Logit Model - McFadden's Adjusted R^2 - Stata

I am estimating a model where the dependent variable is a fraction (between 0 and 1). In Stata 14.1 I used the command
glm y x, link(logit) family(binomial) robust nolog
as well as
fracreg logit y x, vce(robust)
Both commands deliver the same results.
Now I want to evaluate the model, ideally with McFadden's adjusted R^2. Yet neither fitstat nor estat gof seems to work after I run the regressions. I get the error messages "fitstat does not work with the last model estimated" and "not available after fracreg, r(321)".
Do any of you know an alternative command for McFadden's adjusted R^2?
Or do I have to use a different evaluation method?

The pseudo-R-squared reported in the fracreg output appears to be McFadden's pseudo-R-squared. I'm not sure whether this is the same as the McFadden's adjusted R^2 that you mention.
You can see that it is McFadden's pseudo-R-squared by investigating the maximize entry, as suggested by Nick Cox's post on Stata.com. The reference manual entry for maximize, page 1478 (Stata 14), says:
Let L1 be the log likelihood of the full model (that is, the log-likelihood value shown on the output), and let L0 be the log likelihood of the “constant-only” model. ... The pseudo-R2 (McFadden 1974) is defined as 1 - L1 / L0. This is simply the log likelihood on a scale where 0 corresponds to the “constant-only” model and 1 corresponds to perfect prediction for a discrete model (in which case the overall log likelihood is 0).
If this is what you are looking for, this value may be pulled out using
fracreg logit y x, vce(robust)
scalar myRsquared = e(r2_p)

To get McFadden's adjusted R^2, you just subtract the number of estimated parameters (K, stored in e(k)) from the full-model log likelihood in the numerator, i.e., adjusted R^2 = 1 - (L1 - K)/L0. Note that you may get negative values.
Here's how you might do that:
set more off
webuse set http://fmwww.bc.edu/repec/bocode/w
webuse wedderburn, clear
/* (1) Fracreg Way */
fracreg logit yield i.site i.variety, nolog
di "Fracreg McFadden's Adj. R^2:" %-9.3f 1-(e(ll)-e(k))/(e(ll_0))
/* (2) GLM Way */
glm yield, link(logit) family(binomial) robust nolog // intercept only model
local ll_0 = e(ll)
glm yield i.site i.variety, link(logit) family(binomial) robust nolog // full model
di "McFadden's Adj. R^2: " %-9.3f 1-(e(ll)-e(k))/`ll_0'
The GLM R^2 will be slightly different because the maximization algorithm is different and so the likelihoods will be different as well. I am not sure how to tweak the ML options so they match exactly.
You can verify that we did things correctly with a command where fitstat works:
sysuse auto, clear
logit foreign price mpg
fitstat
di "McFadden's Adj. R^2: " %-9.3f 1-(e(ll)-e(k))/(e(ll_0))


Value of coefficient (Beta1) at different values of other covariate (X2), hopefully graphed

(cross-posted at http://www.statalist.org/forums/forum/general-stata-discussion/general/1370770-margins-plot-of-treatment-effect-rather-than-y-for-values-of-a-covariate)
I'm running a multiple regression (the outcome variable is continuous; it happens to be GPA). The covariate of interest is a dummy variable for treatment status; another of the covariates is a pre-score. We want to look at how the treatment effect differs across values of the pre-score. The structure of the model is not complicated:
regress GPA treatment pre_score X3 X4 X5...
What I want is a graph that shows what the treatment effect is (values of Beta1) at various values of pre-score (X2). It's straightforward to get a graph with values of the OUTCOME at various values of X2:
margins, at(pre_score= (1(0.25)5)) post
marginsplot
I have consulted an array of resources and tried alternatives using marginscontplot, coefplot with recast, the dy/dx option, and so forth. I remain unsuccessful. But there must be a way to do this; wanting to know whether a treatment effect varies across values of a control (say, income) must be a common need.
Can anyone direct me to the right command, or options for margins, to output values of Beta1 (the coefficient on the treatment dummy), rather than values of Y (GPA), at values of pre_score?
The question was resolved at Statalist. It turns out that margins alone can't do what I was trying to do; the model needs to be run with an interaction term. Then it's simple.
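For the record, here is a minimal sketch of that approach, assuming the variable names from the question (treatment is the 0/1 dummy, pre_score is continuous):
* interact treatment with the pre-score, then ask margins for the treatment effect (dy/dx)
regress GPA i.treatment##c.pre_score X3 X4 X5
margins, dydx(treatment) at(pre_score=(1(0.25)5))
marginsplot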

Calculating p-value by hand from Stata table

I want to ask how to compute the p-value without a t-stat table, just by looking at the regression table, like the one on the first page of the PDF at the following link: http://faculty.arts.ubc.ca/dwhistler/326UBC/stataHILL.pdf . For example, if I don't know the value 0.062, how can I work it out from the other information in the table?
You need to use the ttail() function, which returns the reverse cumulative Student's t distribution, aka the probability T > t:
display ttail(38,abs(_b[_cons]/_se[_cons]))*2
The first argument, 38, is the degrees of freedom (sample size less the number of parameters), while the second, abs(_b[_cons]/_se[_cons]) (which works out to 1.92 here), is the absolute value of the coefficient of interest divided by its standard error, i.e., the t-stat. The factor of two comes from the fact that Stata is doing a two-tailed test. You can also use the stored DoF with
display ttail(e(df_r),abs(_b[_cons]/_se[_cons]))*2
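A self-contained illustration of the same mechanics (using the built-in auto dataset rather than the data in the linked PDF, so the p-value itself will differ):
sysuse auto, clear
regress price mpg
* reproduce the two-tailed p-value reported for the mpg coefficient
display ttail(e(df_r), abs(_b[mpg]/_se[mpg]))*2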
You can also do the integration of the t density by "hand" using Adrian Mander's integrate:
ssc install integrate
integrate, f(tden(38,x)) l(-1.92) u(1.92)
This gives you 0.93761229, but you want Pr(T>|t|), which is 1-0.93761229=0.06238771.
If you look at many statistics textbooks, you will find a table called the Z-table, which gives you the probability that Z lies beyond your test statistic. The table is really just a tabulation of the cumulative distribution function of the standard normal curve.
When people went to school with 4-function calculators, one or more of the questions on the statistics test would include a copy of this Z-table, and the dear students would have to interpolate columns of numbers to find the p-value. In your example, you would find that the tail probability for your test statistic falls between .06 and .07, and those fingers would tap out a linear interpolation to come up with .062.
Today, the p-value is something that Stata or SAS will calculate for you.
Here is another SO question that may be of interest: How do I calculate a p-value if I have the t-statistic and d.f. (in Perl)?
Here is a basic page on how to determine p-value "by hand": http://www.dummies.com/how-to/content/how-to-determine-a-pvalue-when-testing-a-null-hypo.html
Here is how you can determine p-value using Excel: http://ms-office.wonderhowto.com/how-to/find-p-value-with-excel-346366/
===EDIT===
My Stata text ("Microeconometrics using Stata", Revised Ed, Cameron & Trivedi) says the following on p. 402.
* p-values for t(30), F(1,30), Z, and chi(1) at y=2
. scalar y=2
. scalar p_t30 = 2 * ttail(30,y)
. scalar p_f1and30 = Ftail(1,30,y^2)
. scalar p_z = 2 * (1 - normal(y))
. scalar p_chi1 = chi2tail(1,y^2)
. display "p-values" " t(30)=" %7.4f p_t30
p-values t(30) = 0.0546

Stata seems to be ignoring my starting values in maximum likelihood estimation

I am trying to estimate a maximum likelihood model and it is running into convergence problems in Stata. The actual model is quite complicated, but it converges with no troubles in R when it is supplied with appropriate starting values. I however cannot seem to get Stata to accept the starting values I provide.
I have included a simple example below estimating the mean of a poisson distribution. This is not the actual model I am trying to estimate, but it demonstrates my problem. I set the trace variable, which allows you to see the parameters as Stata searches the likelihood surface.
Although I use init to set a starting value of 0.5, the first iteration still shows that Stata is trying a coefficient of 4.
Why is this? How can I force the estimation procedure to use my starting values?
Thanks!
clear
set obs 1000 // any sample size will do for this illustration
generate y = rpoisson(4)
capture program drop mypoisson
program define mypoisson
args lnf mu
quietly replace `lnf' = $ML_y1*ln(`mu') - `mu' - lnfactorial($ML_y1)
end
ml model lf mypoisson (mean:y=)
ml init 0.5, copy
ml maximize, iterate(2) trace
Output:
Iteration 0:
Parameter vector:
             mean:
             _cons
r1               4
Added: Stata doesn't ignore the initial value. If you look at the output of the ml maximize command, the first line in the listing will be titled
initial: log likelihood =
Following the equal sign is the value of the likelihood for the parameter value set in the init statement.
I don't know how the search(off) or search(norescale) solutions affect the subsequent likelihood calculations, so these solutions might still be worthwhile.
Original "solutions":
To force a start at your initial value, add the search(off) option to ml maximize:
ml maximize, iterate(2) trace search(off)
You can also force a use of the initial value with search(norescale). See Jeff Pitblado's post at http://www.stata.com/statalist/archive/2006-07/msg00499.html.
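Putting the pieces together with the Poisson example above (reusing the mypoisson program already defined), a minimal sketch that starts the maximization exactly at 0.5 would be:
ml model lf mypoisson (mean: y =)
ml init 0.5, copy
ml maximize, trace search(off) // or search(norescale)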

Equivalent R^2 for Logit Regression in Stata

I am running Logit Regression in Stata.
How can I know the explanatory power of the regression (in OLS, I look at R^2)?
Is there a meaningful approach to expanding the regression with other independent variables (in OLS, I manually keep adding independent variables and look at the adjusted R^2; my guess is that Stata has a simpler way of doing this)?
The concept of R^2 is meaningless in logit regression and you should disregard the McFadden Pseudo R2 in the Stata output altogether.
Hosmer and Lemeshow recommend 'to assess the significance of an independent variable we compare the value of D with and without the independent variable in the equation', where D is the deviance, using the likelihood ratio test statistic G = D(model without the variables [B]) - D(model with the variables [A]).
The Likelihood ratio test (G):
H0: coefficients for eliminated variables are all equal to 0
Ha: at least one coefficient is not equal to 0
When the LR test's p > .05, do not reject H0, which implies that, statistically speaking, there is no advantage to including the additional IVs in the model.
Example Stata syntax to do this is:
logit DV IV1 IV2
estimates store A
logit DV IV1
estimates store B
lrtest A B // LR test of the restricted model B against the full model A (B is nested in A)
Note, however, that many more aspects have to be checked and tested before we can conclude whether or not a logit model is 'acceptable'. For more details, I recommend visiting:
http://www.ats.ucla.edu/stat/stata/topics/logistic_regression.html
and consult:
Applied logistic regression, David W. Hosmer and Stanley Lemeshow , ISBN-13: 978-0471356325
I'm worried that you are getting the fundamentals of modelling wrong here:
The explanatory power of a regression model is theoretically determined by your interpretation of the coefficients, not by the R-squared. The R^2 represents the proportion of variance that your linear model explains, which may or may not be an appropriate benchmark for your model.
Likewise, the presence or absence of an independent variable in your model requires substantive justification. If you want to have a look at how the R-squared changes when adding or subtracting parts of your model, see help nestreg for help on nested regression.
To summarize: the explanatory power of your model and its variable composition cannot be determined just by crunching the numbers. You first need an adequate theory to build your model onto.
Now, if you are running logit:
Read Long and Freese (Ch. 3) to understand how log likelihood converges (or not) in your model.
Do not expect to find something as straightforward as the R-squared for logit.
Use logit diagnostics on your model, just as you would after running OLS.
You might also want to read about the likelihood-ratio chi-squared test or run additional lrtest commands as explained by Eric.
I certainly agree with the above posters that almost any measure of R^2 for a binary model like logit or probit shouldn't be considered very important. There are, however, ways to see how good a job your model does at predicting. For example, check out the following commands:
lroc
estat class
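For instance, a quick illustration on the built-in auto dataset (not your model, just to show the mechanics):
sysuse auto, clear
logit foreign price mpg
lroc // area under the ROC curve
estat classification // classification table (estat class for short)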
Also, here's a good article for further reading:
http://www.statisticalhorizons.com/r2logistic

Add depvar control mean to regression table output

This is a 2-arm randomized controlled trial. In my regression output I want to evaluate the relative reduction in risk of disease for those in the treatment group. To make this evaluation easier I would like to add the control-arm mean of the dependent variable to the foot of the regression table output. I am currently using estadd with estout. Below is my code, which displays the mean of the dependent variable; however, I cannot find any options for estadd, estpost, etc. that allow me to restrict the depvar mean calculation to only one arm of the study (i.e., the control arm).
eststo, title(" "): xi: quietly reg X `covariates' if survid==1, vce(cluster id1)
estadd ysumm
estout using $outdir\results.txt, replace ///
cells("b(fmt(3) label (Coeff.)) se(fmt(3) star label (s.e.))") ///
drop(_Itt* _cons) ///
starlevels(+ 0.10 * 0.05) ///
stats(N ymean, labels ("N. of obs." "Control Mean")) ///
label legend
You are being spoiled by the marvelous functionality offered by estadd, eststo, etc. :). How about this:
xi: quietly reg X `covariates' if survid==1, vce(cluster id1)
// two prefixes in the same command is like a sentence with three subordinate clauses
// that just rolls from one line to the next without giving you a chance to
// catch a breath and really understand what is going on in this sentence,
// which is a sign of poor writing skills, unless you are Dostoevsky, which I am not.
// it is also a good idea to be specific about what it is that you want to output.
// Thus I have the -estimates store- on a separate line with a specific name for your results.
// note: -estadd- has to run before -estimates store-, otherwise the stored copy won't carry ymean.
summarize X if survid==1
estadd scalar ymean = r(mean)
estimates store CtrlArm
estout CtrlArm using $outdir\results.txt, ...
estadd and estout are unavoidable here. Your initial eststo with an empty title just takes up space, though, and does not add anything.
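For completeness, here is a minimal sketch that strings the pieces together with the stats() option from your original estout call (variable and macro names are taken from your post; put whatever condition identifies the control arm in the if clauses):
quietly reg X `covariates' if survid==1, vce(cluster id1)
summarize X if survid==1, meanonly
estadd scalar ymean = r(mean)
estimates store CtrlArm
estout CtrlArm using $outdir\results.txt, replace ///
    cells("b(fmt(3) label(Coeff.)) se(fmt(3) star label(s.e.))") ///
    starlevels(+ 0.10 * 0.05) ///
    stats(N ymean, labels("N. of obs." "Control Mean")) ///
    label legend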