Fixed effects in Stata

Very new to Stata, so struggling a bit with using fixed effects. The data here is made up, but bear with me. I am running a regression with a set of dummy variables. My dependent variable is a dummy that is 1 if a customer bought something and 0 if not. My fixed effect is whether or not there was a yellow sign out front (another dummy). My independent variable is whether the store manager said hi (also a dummy).
Basically, I want my output to look like this (with standard errors, obviously):

                    Yellow sign    No yellow sign
Manager said hi     estimate       estimate

You can use the ## operator in a regression to get a saturated model with fixed effects:
First, input data such that you have a binary outcome (bought), an independent variable (saidhi), and a fixed-effects variable (sign). saidhi should be correlated with your outcome (so there is a portion of saidhi that is correlated with bought and a portion that is not), and your FE variable should be correlated with both bought and saidhi (otherwise there is no point including it in your regression if you are only interested in the effect of saidhi).
clear
set obs 100
set seed 45
gen bought = runiform() > 0.5 // Binary y, 50/50 probability
gen saidhi = runiform() + runiform()^2*bought // Continuous for now, correlated with bought
gen sign = runiform() + runiform()*saidhi + runiform()*bought > 0.66666 // Binary FE, correlated with both x and y
replace saidhi = saidhi > 0.5 // Dichotomize saidhi
Now, run your regression:
* y = x + FE + x*FE + cons
reg bought saidhi##sign, r
exit
Your output should be:
Linear regression                               Number of obs     =        100
                                                F(3, 96)          =      13.34
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1703
                                                Root MSE          =     .46447

------------------------------------------------------------------------------
             |               Robust
      bought |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.saidhi |   .3571429   .2034162     1.76   0.082    -.0466351    .7609209
      1.sign |   .3869048   .1253409     3.09   0.003      .138105    .6357046
             |
 saidhi#sign |
        1 1  |  -.1427489   .2373253    -0.60   0.549    -.6138359    .3283381
             |
       _cons |   .0714286   .0702496     1.02   0.312    -.0680158     .210873
------------------------------------------------------------------------------
1.saidhi is the effect of saidhi when sign == 0. 1.sign is the effect of the sign alone, i.e. when saidhi == 0. The row under saidhi#sign describes the interaction between the two variables (i.e. the additional effect of both being 1 at the same time; keep in mind the total effect of both being 1 also includes the previous two terms). The constant represents the average value of bought when both are 0 (i.e. the same as you would get from sum bought if saidhi == 0 & sign == 0).
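If you want the results arranged as the two-by-two table in the question, margins (run after the regression above) gives the predicted mean of bought in each cell, with standard errors, and lincom recovers the combined effect of both dummies being 1. A minimal sketch:
* Predicted mean of bought in each saidhi/sign cell
margins saidhi#sign
* Total effect of saidhi == 1 and sign == 1 together
lincom 1.saidhi + 1.sign + 1.saidhi#1.sign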

GMM program evaluator does not like my temporary variable

As far as I can tell, the two programs in the code below are identical. In the first program, I just assign the parameter to a scalar. In the second program, I store this scalar for each observation in a temporary variable.
Mathematically, that should be the same, yet the second program produces "numerical derivatives are approximate" and "flat or discontinuous region encountered".
Why can't the derivatives be computed properly in the second approach?
clear
set obs 10000
set seed 42
gen x = runiform() * 10
gen eps = rnormal()
gen y = 2 + .3 * x + eps
capture program drop testScalar
program testScalar
    syntax varlist [if], at(name)
    scalar b0 = `at'[1,1]
    scalar b1 = `at'[1,2]
    replace `varlist' = y - b0 - b1*x
end

capture program drop testTempvar
program testTempvar
    syntax varlist [if], at(name)
    tempvar tmp
    scalar b0 = `at'[1,1]
    scalar b1 = `at'[1,2]
    gen `tmp' = b1
    replace `varlist' = y - b0 - `tmp'*x
end
gmm testScalar, nequations(1) nparameters(2) instr(x) winitial(identity) onestep
gmm testTempvar, nequations(1) nparameters(2) instr(x) winitial(identity) onestep
Output:
. gmm testScalar, nequations(1) nparameters(2) instr(x) winitial(identity) onestep
(10,000 real changes made)

Step 1
Iteration 0:   GMM criterion Q(b) =  417.93313
Iteration 1:   GMM criterion Q(b) =  1.690e-23
Iteration 2:   GMM criterion Q(b) =  3.568e-30

note: model is exactly identified

GMM estimation

Number of parameters =   2
Number of moments    =   2
Initial weight matrix: Identity                 Number of obs    =      10,000

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         /b1 |   2.022865   .0200156   101.06   0.000     1.983635    2.062095
         /b2 |   .2981147    .003465    86.04   0.000     .2913235    .3049059
------------------------------------------------------------------------------
Instruments for equation 1: x _cons

. gmm testTempvar, nequations(1) nparameters(2) instr(x) winitial(identity) onestep
(10,000 real changes made)

Step 1
Iteration 0:   GMM criterion Q(b) =  417.93313
numerical derivatives are approximate
flat or discontinuous region encountered
Iteration 1:   GMM criterion Q(b) =  8.073e-17
numerical derivatives are approximate
flat or discontinuous region encountered
Iteration 2:   GMM criterion Q(b) =  8.073e-17  (backed up)

note: model is exactly identified

GMM estimation

Number of parameters =   2
Number of moments    =   2
Initial weight matrix: Identity                 Number of obs    =      10,000

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         /b1 |   2.022865   .0201346   100.47   0.000     1.983402    2.062328
         /b2 |   .2981147   .0034933    85.34   0.000      .291268    .3049613
------------------------------------------------------------------------------
Instruments for equation 1: x _cons
In the program testTempvar you need to generate the temporary variable tmp as type double:
generate double `tmp' = b1
In other words, this is a precision problem. By default, generate creates a float variable, so b1 is rounded to float precision when stored in `tmp'. gmm computes numerical derivatives by perturbing each parameter by a tiny amount; a float cannot represent those tiny perturbations, so the moment condition does not change and the criterion appears flat.
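For completeness, here is a corrected version of the program, identical to the original except for the storage type:
capture program drop testTempvar
program testTempvar
    syntax varlist [if], at(name)
    tempvar tmp
    scalar b0 = `at'[1,1]
    scalar b1 = `at'[1,2]
    generate double `tmp' = b1   // double, so the parameter perturbations survive
    replace `varlist' = y - b0 - `tmp'*x
end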

Marginal effects from estat mfx after asclogit

I am trying to understand how Stata calculates both the probability that an alternative is selected and the marginal effect at the mean when I run estat mfx after estimating a McFadden / conditional logit model using asclogit.
For example:
asclogit H t, case(ID) alternatives(AQ) casevars(Medicaid) ///
basealternative(1) vce(cluster Medicaid)
estat mfx, varlist(Medicaid)
My goal is to re-create the results by estimating the same model with clogit and manually calculating the equivalent marginal effects. I am able to reproduce the conditional logit estimates generated by asclogit using clogit, but I get stuck reproducing the post-estimation calculations.
I have not been able to reproduce the computed probability of each alternative being selected, which, from reading the documentation for estat mfx, I learned is evaluated at the covariate values labeled X in the table output.
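For reference, the quantity being reproduced is the standard conditional logit choice probability, evaluated at those covariate values:
$\Pr(y = j) = \exp(x_j'\hat\beta) \,/\, \sum_{k} \exp(x_k'\hat\beta)$
where the sum runs over the alternatives within a case.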
Here are the probabilities computed by estat mfx:
. matrix baseline = r(pr_1)\r(pr_2)\r(pr_3)\r(pr_4)\r(pr_5)
. matrix list baseline
baseline[5,1]
c1
r1 .04077232
r2 .15206384
r3 .01232535
r4 .10465885
r5 .69017964
Keep in mind that the variables beginning with Valpha and VMedicaid are case-specific variables I created for the clogit command. They are, respectively, the intercept and an indicator for Medicaid coverage.
Here is what I have:
clogit H t Valpha* i.VMedicaid_2 i.VMedicaid_3 i.VMedicaid_4 i.VMedicaid_5 , ///
group(ID) vce(cluster Medicaid)
* Reproducing probability an alternative selected calculated by estat mfx
// calculate covariate means to plug into probability calculations
local V t Valpha* VMedicaid_* Medicaid
foreach var of varlist `V' {
summarize `var'
scalar `var'_MN = r(mean)
}
// alternative specific ZB
scalar zb = _b[t]*t_MN
// numerators attempt 1
foreach j of numlist 2/5 {
scalar XB`j' = exp(zb + (_b[Valpha_`j']*Valpha_`j'_MN) + ///
(_b[1.VMedicaid_`j']*VMedicaid_`j'_MN))
di "this is `j': " XB`j'
}
// numerators attempt 2: the documentation for estat mfx says the probability is
// evaluated with Medicaid = 0.68, the Medicaid coverage rate across cases,
// rather than the means of the various VMedicaid_ variables used to estimate
// clogit. Replaced the intercept mean with 1
foreach j of numlist 2/5 {
scalar XB`j' = exp(zb + (_b[Valpha_`j']) + (_b[1.VMedicaid_`j']*Medicaid_MN))
di "this is `j': " XB`j'
}
scalar XB1 = exp(zb)
// denominator
scalar DNM = XB1+ XB2+ XB3+ XB4 + XB5
// Baseline
foreach j of numlist 1/5 {
scalar PRB`j' = XB`j'/DNM
di "The probability of choosing hospital `j' is: " PRB`j'
}
The results I get are the following:
The probability of choosing hospital 1 is: .14799075
The probability of choosing hospital 2 is: .21019437
The probability of choosing hospital 3 is: .09046377
The probability of choosing hospital 4 is: .18383085
The probability of choosing hospital 5 is: .36752026

How to simulate pairs from a joint distribution

I have two normally distributed variables X and Y with given variances and a given covariance between them. I want to simulate pairs of points (say, 200) from their joint distribution, but I can't seem to find a command to do this. I eventually want to plot these points in a scatter plot.
So far, I have:
set obs 100
set seed 1
gen y = 64*rnormal(1, 5/64)
gen x = 64*rnormal(1, 5/64)
matrix D = (1, .5 \ .5, 1)
drawnorm x, y, cov(D)
but this makes an error saying that x and y already exist.
Also, once I have a sample, how would I plot the drawnorm output as a scatter?
A related approach for generating correlated data is to use the corr2data command:
clear
set obs 100
set seed 1
matrix D = (1, .5 \ .5, 1)
drawnorm x1 y1, cov(D)
corr2data x2 y2, cov(D)
. summarize x*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          x1 |        100    .0630304    1.036762  -2.808194   2.280756
          x2 |        100    1.83e-09           1  -2.332422   2.238905

. summarize y*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          y1 |        100   -.0767662    .9529448  -2.046532   2.726873
          y2 |        100    3.40e-09           1  -2.492884   2.797518
It is important to note that, unlike drawnorm, corr2data does not generate a sample from an underlying population: it constructs a dataset whose sample moments match the specified ones exactly, which is why x2 and y2 above have means of (numerically) zero and standard deviations of exactly 1.
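You can see the difference by comparing the realized covariance matrices (a quick check):
correlate x1 y1, covariance // close to D, with sampling noise
correlate x2 y2, covariance // matches D exactly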
You can then create a scatter plot as follows:
scatter x1 y1
Or to compare the two approaches in a single graph:
twoway scatter x1 y1 || scatter x2 y2
EDIT:
For specific means and variances you need to specify the mean vector μ and covariance matrix Σ in drawnorm. For example, to draw two random variables that are jointly normally distributed with means of 8 and 12, and variances 5 and 8 respectively, you type:
matrix mu = (8, 12)
scalar cov = 0.4 * sqrt(5 * 8) // assuming a correlation of 0.4
matrix sigma = (5, cov \ cov, 8)
drawnorm x y, double means(mu) cov(sigma)
The means() and cov() options of drawnorm are both documented in the help file.
Here is an almost minimal example:
. clear
. set obs 100
number of observations (_N) was 0, now 100
. set seed 1
. matrix D = (1, .5 \ .5, 1)
. drawnorm x y, cov(D)
As the help for drawnorm explains, you must supply new variable names. As x and y already exist, drawnorm threw an error. You also had a superfluous comma between x and y that would have triggered a syntax error.
help scatter tells you about scatter plots.

Regressing state-level coefficient on state-level law

I have the following.
county   state   employment   county_shock   state_law
     1   NY              70              3          10
     2   NY              80              4          10
     4   IL             100              2           5
     7   IL              60              9           5
     3   TX              90              8           2
I ran the regression for all counties:
regress employment county_shock
But now I am curious about the effect of the state-level law on the degree to which county_shock affects employment.
I am not sure whether adding an interaction term achieves this.
What I am trying to do is the following:
Run
regress employment county_shock
for "each state"
Then I will have a coefficient on county_shock for each state, which I could retrieve with _b.
Then, regress those coefficients on the state-level law.
How should I do this?
This is what I thought at first:
The initial model is y = b0 + b1*x + u, where x is county_shock for short. You mean that b1 is a function of z, which is state_law. For this you can regress y on x, z, and x*z. (Here you want to let the intercept differ across states as well; the model without z looks strange because it would force the intercept to be the same for all states.)
Now, let the model with the interaction be y = c0 + c1*x + c2*z + c3*x*z + u. This can be written as y = (c0 + c2*z) + (c1 + c3*z)*x + u. Thus, c3 is the coefficient you want: it measures how the effect of x varies with z.
In Stata,
reg employment c.county_shock##c.state_law
After reading the second part of your question, I am not sure if this is what you want.
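If you literally want the two-step procedure (a separate regression per state, then a regression of the slopes on state_law), statsby automates the first step. A minimal sketch, assuming the data are in memory and named as in the question (b_shock and the tempfile name are mine):
preserve
* Step 1: one regression per state; keep the county_shock slope
statsby b_shock=_b[county_shock], by(state) clear: ///
    regress employment county_shock
tempfile coefs
save `coefs'
restore
* Step 2: regress the state-level slopes on the state-level law
collapse (first) state_law, by(state)
merge 1:1 state using `coefs', nogenerate
regress b_shock state_law
Note that with only a handful of counties per state the first-stage slopes will be very noisy, and a state with a single county (TX here) cannot support its own regression at all, so its slope will be missing.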

How do I calculate the p-value of a one-tailed test in Stata?

I have the following model: $\ln(MPG_i) = \beta_0 + \beta_1 WEIGHT_i + \beta_2 FOREIGN_i + \beta_3 FOREIGN_i \times WEIGHT_i + \varepsilon_i$
I want to use the test command to test whether the coefficient $\beta_3 > 0.5$ in Stata.
I have used the following code and obtain this result:
test 1.foreign#c.weight = 0.5001
( 1) 1.foreign#c.weight = .5001
F( 1, 70) = 4.9e+07
Prob > F = 0.0000
So we reject our null, since the p-value is very small.
But the problem is that this is for a two-tailed test.
My goal is to get the t statistic for this left-tailed test, store it, and then use it to compute the p-value.
After computing the p-value, I can decide whether or not to reject the null. I am fairly certain that I would reject it and that the p-value would be quite small; I just need help figuring out how to code this the right way.
EDIT: I have tried using these commands:
lincom _b[forweight] - 0.5
display invttail(71, 0.5)
The last command spits out a value of 0. Now is this the p-value of the left-sided t-test?
This is covered in this FAQ.
Here's the relevant code for testing that a coefficient is 0.5:
sysuse auto, clear
gen ln_mpg = ln(mpg)
regress ln_mpg i.foreign##c.weight
test _b[1.foreign#c.weight]=.5
* sign() of the deviation from 0.5 gives the direction of the t statistic
local sign_wgt = sign(_b[1.foreign#c.weight] - .5)
* One-sided p-values, using t = sign * sqrt(F) from the two-sided test
display "Ho: coef <= 0.5 p-value = " ttail(r(df_r),`sign_wgt'*sqrt(r(F)))
display "Ho: coef >= 0.5 p-value = " 1-ttail(r(df_r),`sign_wgt'*sqrt(r(F)))