I am trying to understand how Stata calculates both the probability that an alternative is selected and the marginal effect evaluated at the mean when I run estat mfx after estimating a McFadden / conditional logit model using asclogit.
For example:
asclogit H t, case(ID) alternatives(AQ) casevars(Medicaid) ///
basealternative(1) vce(cluster Medicaid)
estat mfx, varlist(Medicaid)
My goal is to re-create the results by estimating the same model using clogit and manually calculating the equivalent marginal effects. I am able to reproduce the conditional logit estimates generated by asclogit using clogit, but I get stuck reproducing the post-estimation calculations.
I have not been able to reproduce the computed probability of each alternative being selected, which, from reading the documentation for estat mfx, I learned is evaluated at the value labeled X in the table output.
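For reference (my notation, not taken verbatim from the Stata documentation), the quantity being reproduced is the standard conditional logit choice probability, evaluated at the X values that estat mfx reports:

$\Pr(y_i = j) = \exp(x_{ij}'\beta + z_i'\alpha_j) / \sum_{k=1}^{J} \exp(x_{ik}'\beta + z_i'\alpha_k)$

where $x_{ij}$ holds the alternative-specific variables (here t), $z_i$ the case-specific variables (here Medicaid), and $\alpha_1$ is normalized to zero for the base alternative.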
Here are the estat mfx probability figures, captured from its stored results:
. matrix baseline = r(pr_1)\r(pr_2)\r(pr_3)\r(pr_4)\r(pr_5)
. matrix list baseline
baseline[5,1]
c1
r1 .04077232
r2 .15206384
r3 .01232535
r4 .10465885
r5 .69017964
Keep in mind that the variables beginning with Valpha and VMedicaid are case-specific variables I created for the clogit command. They are, respectively, the intercept and an indicator for Medicaid coverage.
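For concreteness, here is a minimal sketch of how I would expect such variables to be constructed, assuming AQ takes values 1 through 5 with alternative 1 as the base (the code is my reconstruction, not the poster's):

// hypothetical reconstruction of the case-specific regressors for clogit
forvalues j = 2/5 {
    gen Valpha_`j' = (AQ == `j')                // alternative-specific intercept dummy
    gen VMedicaid_`j' = (AQ == `j') * Medicaid  // case variable interacted with the dummy
}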
Here is what I have:
clogit H t Valpha* i.VMedicaid_2 i.VMedicaid_3 i.VMedicaid_4 i.VMedicaid_5 , ///
group(ID) vce(cluster Medicaid)
* Reproducing probability an alternative selected calculated by estat mfx
// calculate covariate means to plug into probability calculations
local V t Valpha* VMedicaid_* Medicaid
foreach var of varlist `V' {
summarize `var'
scalar `var'_MN = r(mean)
}
// alternative specific ZB
scalar zb = _b[t]*t_MN
// numerators attempt 1
foreach j of numlist 2/5 {
scalar XB`j' = exp(zb + (_b[Valpha_`j']*Valpha_`j'_MN) + ///
(_b[1.VMedicaid_`j']*VMedicaid_`j'_MN))
di "this is `j': " XB`j'
}
// numerators attempt 2: the documentation for estat mfx says the probability is
// evaluated with Medicaid = 0.68, which is the Medicaid coverage rate across cases,
// rather than the means of the various VMedicaid_ variables used to estimate
// clogit. Replaced the intercept mean with 1.
foreach j of numlist 2/5 {
scalar XB`j' = exp(zb + (_b[Valpha_`j']) + (_b[1.VMedicaid_`j']*Medicaid_MN))
di "this is `j': " XB`j'
}
scalar XB1 = exp(zb)
// denominator
scalar DNM = XB1 + XB2 + XB3 + XB4 + XB5
// Baseline
foreach j of numlist 1/5 {
scalar PRB`j' = XB`j'/DNM
di "The probability of choosing hospital `j' is: " PRB`j'
}
The results I get are the following:
The probability of choosing hospital 1 is: .14799075
The probability of choosing hospital 2 is: .21019437
The probability of choosing hospital 3 is: .09046377
The probability of choosing hospital 4 is: .18383085
The probability of choosing hospital 5 is: .36752026
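As a cross-check (a sketch of mine; it assumes the baseline matrix saved from estat mfx earlier is still in memory), the manual probabilities can be stacked and differenced against it:

// stack the manual probabilities and compare with the estat mfx baseline
matrix manual = (PRB1 \ PRB2 \ PRB3 \ PRB4 \ PRB5)
matrix diff = manual - baseline
matrix list diff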
I am very new to Stata, so I am struggling a bit with using fixed effects. The data here is made up, but bear with me. I am running a regression with a set of dummy variables. My dependent variable is a dummy that is 1 if a customer bought something and 0 if not. My fixed effect is whether or not there was a yellow sign out front (a dummy variable again). My independent variable is whether the store manager said hi (also a dummy variable).
Basically, I want my output to look like this (with standard errors, obviously):

                    Yellow sign    No yellow sign
Manager said hi     estimate       estimate
You can use the ## operator in a regression to get a saturated model with fixed effects:
First, input data such that you have a binary outcome (bought), an independent variable (saidhi), and a fixed-effects variable (sign). saidhi should be correlated with your outcome (so that part of the variation in saidhi is related to bought and part is not), and your FE variable should be correlated with both bought and saidhi (otherwise there is no point having it in your regression if you are only interested in the effect of saidhi).
clear
set obs 100
set seed 45
gen bought = runiform() > 0.5 // Binary y, 50/50 probability
gen saidhi = runiform() + runiform()^2*bought
gen sign = runiform() + runiform()*saidhi + runiform()*bought > 0.66666 // Binary FE, correlated with both x and y
replace saidhi = saidhi > 0.5
Now, run your regression:
* y = x + FE + x*FE + cons
reg bought saidhi##sign, r
exit
Your output should be:
Linear regression Number of obs = 100
F(3, 96) = 13.34
Prob > F = 0.0000
R-squared = 0.1703
Root MSE = .46447
------------------------------------------------------------------------------
| Robust
bought | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.saidhi | .3571429 .2034162 1.76 0.082 -.0466351 .7609209
1.sign | .3869048 .1253409 3.09 0.003 .138105 .6357046
|
saidhi#sign |
1 1 | -.1427489 .2373253 -0.60 0.549 -.6138359 .3283381
|
_cons | .0714286 .0702496 1.02 0.312 -.0680158 .210873
------------------------------------------------------------------------------
1.saidhi is the effect of saidhi when sign == 0. 1.sign is the effect of the sign alone, i.e. when saidhi == 0. The row under saidhi#sign describes the interaction between the two variables (i.e. the additional effect of both being 1 at the same time; keep in mind that the total effect of both being 1 also includes the previous two terms). The constant represents the average value of bought when both are 0 (i.e. the same as you would get from sum bought if saidhi == 0 & sign == 0).
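Incidentally, to lay the estimates out in the 2x2 format described in the question, the official margins command can be used after this regression (margins is standard Stata post-estimation; the particular layout below is my suggestion):

* mean of bought in each saidhi/sign cell, with standard errors
margins saidhi#sign

* effect of saidhi separately within sign == 0 and sign == 1
margins sign, dydx(saidhi)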
I am working with the community-contributed command coefplot in Stata.
I have a large number of estimated coefficients, which I would like to plot on the same graph.
As such, I would like to reduce the spacing between coefficients.
Consider the following toy example using Stata's auto dataset:
quietly sysuse auto, clear
quietly regress price mpg trunk length turn
coefplot, drop(_cons) xline(0)
How can the spacing between Mileage (mpg) and Trunk space (cu. ft.) be decreased?
Some white space around the graph is an unavoidable limitation of how Stata's graphics system works. That said, an alternative workaround (one which does not tinker with the aspect ratio of the graph) is to increase the range of the y-axis.
For example:
forvalues i = 1 / 4 {
coefplot, drop(_cons) xline(0) yscale(range(-`i' `=6+`i''))
}
A different but related approach is to turn off the y-axis labels entirely and use marker labels instead:
forvalues i = 1 / 4 {
coefplot, drop(_cons) ///
xline(0) ///
yscale(range(-`i' `=6+`i'')) ///
yscale(off) ///
mlabels(mpg = 12 "Mileage" ///
trunk = 12 "Trunk space (cu. ft.)" ///
length = 12 "Length (in.)" ///
turn = 12 "Turn Circle (ft.)")
}
In both approaches, the starting and ending positions (i.e. the amount of space above and below the labels) can be set by tweaking the values specified within the range() suboption.
Note that the grid lines can be turned off by using the option grid(none).
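For example, combining it with the command from the question:

coefplot, drop(_cons) xline(0) grid(none)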
In addition, by combining the at(matrix()) option and yscale(range()) one can allow for unequal reductions in the distance of the coefficients:
matrix A = (0.2,0.21,0.22,0.225,0.255)
coefplot, drop(_cons) ///
xline(0) ///
yscale(range(0.18 0.26)) ///
yscale(off) ///
mlabels(mpg = 12 "Mileage" ///
trunk = 12 "Trunk space (cu. ft.)" ///
length = 12 "Length (in.)" ///
turn = 12 "Turn Circle (ft.)") ///
at(matrix(A)) ///
horizontal
I have a covariance matrix of variables like this. The values mirror each other across the diagonal, so the lower or upper triangle can be made null for convenience. This is the variance-covariance matrix of the coefficients of a linear regression.
Obs Intercept length diameter height weight_w weight_s
1 0.15510 -0.29969 -0.05904 -0.20594 0.07497 -0.00168
2 -0.29969 3.46991 -3.50836 -0.01703 -0.04841 -0.14048
3 -0.05904 -3.50836 5.08407 -0.82108 -0.13027 0.10732
4 -0.20594 -0.01703 -0.82108 4.89589 -0.29959 0.30447
5 0.07497 -0.04841 -0.13027 -0.29959 0.13787 -0.18763
6 -0.00168 -0.14048 0.10732 0.30447 -0.18763 0.40414
Obs 1 to 6 represent the intercept and the five variables length, diameter, height, weight_w, and weight_s. The diagonal values are variances; the rest are covariances between the variables, and between the variables and the intercept.
I want to create a formula where the number of variables can change and the user can input them as parameters. Based on the number of variables, the formula should dynamically expand or contract and calculate the result. For five variables the formula looks like this, where C1, C2, ..., C5 are constants that come with the five variables from outside; they are the beta coefficients of a linear regression and will vary with the number of variables.
0.15 + (C1)^2*3.46 + (C2)^2*5.08 + (C3)^2*4.89 + (C4)^2*0.13 + (C5)^2*0.40  -- covers all variances
+ 2*C1*-0.29 + 2*C2*-0.05 + 2*C3*-0.20 + 2*C4*0.07 + 2*C5*-0.001  -- covers covariances of all variables with the intercept
+ 2*C1*C2*-3.50 + 2*C1*C3*-0.01 + 2*C1*C4*-0.04 + 2*C1*C5*-0.14  -- covers covariances of "length" with the remaining (non-intercept) variables
+ 2*C2*C3*-0.82 + 2*C2*C4*-0.13 + 2*C2*C5*0.10  -- covers covariances of "diameter" with the remaining variables
+ 2*C3*C4*-0.29 + 2*C3*C5*0.30  -- covers covariances of "height" with the remaining variables
+ 2*C4*C5*-0.18  -- covers the covariance of weight_w and weight_s
Those five constants, matching the five variables, are supplied from outside; the rest come from the covariance matrix. In the covariance matrix, the diagonal values are the variances of those variables and the off-diagonal values are covariances. In the formula, you can see that wherever there is a variance I square the corresponding constant (C), and wherever there is a covariance the constants of the two variables involved multiply. The intercept is another term that comes with these variables, but the intercept does not have a C of its own.
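In general (my restatement of the pattern above), the whole expansion is the quadratic form $c'Vc$ with $c = (1, C_1, \dots, C_k)'$, where $V$ is the variance-covariance matrix with the intercept first: the leading 1 picks up the intercept variance and its covariances with the variables, and the form expands or contracts automatically with the number of variables $k$.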
For two variables, the covariance matrix looks like this (the intercept is there too):
Obs   Intercept       GRE            GPA
1      1.15582       -0.000281894   -0.28256
2     -0.000281894    0.000001118   -0.000114482
3     -0.28256       -0.000114482    0.10213
Formula for calculation:
1.15582 + (C1)^2*0.000001118 + (C2)^2*0.10213  -- covers all variances on the diagonal
+ 2*C1*-0.000281894 + 2*C2*-0.28256  -- covers covariances of both variables with the intercept
+ 2*C1*C2*-0.000114482  -- covers the covariance between GRE and GPA
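Computed as a quadratic form in Stata, the two-variable example above becomes the following (a sketch; C1 = 2 and C2 = 3 are hypothetical values standing in for the externally supplied constants):

// variance-covariance matrix from the two-variable example, intercept first
matrix V = (1.15582, -0.000281894, -0.28256 \ ///
            -0.000281894, 0.000001118, -0.000114482 \ ///
            -0.28256, -0.000114482, 0.10213)
matrix c = (1, 2, 3)    // (1, C1, C2): the leading 1 belongs to the intercept
matrix q = c * V * c'   // expands or contracts automatically with the number of variables
display "result = " q[1,1]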
I have the following model: $\ln(MPG_i) = \beta_0 + \beta_1 WEIGHT_i + \beta_2 FOREIGN_i + \beta_3 FOREIGN_i \times WEIGHT_i + \varepsilon_i$
I want to use the test command to test whether the coefficient $\beta_3 > 0.5$ in Stata.
I have used the following code and obtain this result:
test 1.foreign#c.weight = 0.5001
( 1) 1.foreign#c.weight = .5001
F( 1, 70) = 4.9e+07
Prob > F = 0.0000
So we reject our null, since the p-value is very small.
But the problem is that this is for a two-tailed test.
My goal is to get the t statistic for this left-tailed test, store it, and then use it to compute the p-value.
After computing the p-value, I will decide whether or not to reject the null. I am fairly certain that I would reject the null and that the p-value would be quite small; I just need some help figuring out how to code it the right way.
EDIT: I have tried using these commands:
lincom _b[forweight] - 0.5
display invttail(71, 0.5)
The last command spits out a value of 0. Now is this the p-value of the left-sided t-test?
This is covered in this FAQ. The key fact is that the F statistic from a test of a single linear restriction equals the square of the corresponding t statistic, so sign(coef)*sqrt(r(F)) recovers the signed t value, which ttail() then converts into one-sided p-values.
Here's the relevant code for testing that a coefficient is 0.5:
sysuse auto, clear
gen ln_mpg = ln(mpg)
regress ln_mpg i.foreign##c.weight
test _b[1.foreign#c.weight]=.5
local sign_wgt = sign(_b[1.foreign#c.weight])
display "Ho: coef <= 0.5 p-value = " ttail(r(df_r),`sign_wgt'*sqrt(r(F)))
display "Ho: coef >= 0.5 p-value = " 1-ttail(r(df_r),`sign_wgt'*sqrt(r(F)))
I am trying to calculate z-scores by creating a variable D from 3 other variables, namely A, B, and C. I am trying to generate D as D = (A - B)/C, but for some reason it produces very large numbers. When I did just (A - B), I did not get what I calculated by hand: instead of -2, I got -105.66.
Variable A is 'long' and variable B is 'float'; I am not sure if this is the reason? My Stata syntax is:
gen zscore= (height-avheight)/meansd
This did not work.
You are confusing scalars and variables. Here's a solution (chop off the first four lines and replace x by height to fit the calculation into your code):
// example data
clear
set obs 50
gen x = runiform()
// summarize
qui su x
// store scalars
sca de mu = r(mean)
sca de sd = r(sd)
// z-score
gen zx = (x - mu) / sd
su zx
x and its z-score zx are variables that take many values, whereas mu and sd are constants. In Stata you can represent constants using scalars or macros.
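For completeness, the same computation with local macros, plus the one-line built-in alternative egen std() (both are standard Stata; variable names follow the example above):

// using local macros instead of scalars
qui su x
local mu = r(mean)
local sd = r(sd)
gen zx_macro = (x - `mu') / `sd'

// or let egen standardize in one line
egen zx_egen = std(x)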
I am not sure what you are trying to get, but I will use the auto data shipped with Stata to explain. This is basic stuff in Stata. Say I want the z-score for a price value of 3:
sysuse auto
sum price
return list  // optional: shows the stored results
scalar myz = (3 - r(mean))/r(sd)  // r(mean) and r(sd) hold the mean and sd of price; if you already know them, you can simply enter the values
dis myz
-2.0892576
So the z-value is -2.09 here.