Calculating standard deviations in Stata to approximate beta distributions - stata

My question relates to calculating the standard deviation (SD) of transition probabilities derived from coefficients estimated through Weibull regression in Stata.
The transition probabilities are being used to model disease progression of leukemia patients over 40 cycles of 90 days (about 10 years). I need the SDs of the probabilities (which change over the run of the Markov model) to create beta distributions whose parameters can be approximated using the corresponding Markov cycle probability and its SD. These distributions are then used for probabilistic sensitivity analysis, i.e., they are substituted for the simple probabilities (one for each cycle), and random draws from them are used to evaluate the robustness of the model’s cost-effectiveness results.
Anyway, using time to event survival data, I’ve used regression analysis to estimate coefficients that can be plugged into an equation to generate transition probabilities. For example...
. streg, nohr dist(weibull)
failure _d: event
analysis time _t: time
Fitting constant-only model:
Iteration 0: log likelihood = -171.82384
Iteration 1: log likelihood = -158.78902
Iteration 2: log likelihood = -158.64499
Iteration 3: log likelihood = -158.64497
Iteration 4: log likelihood = -158.64497
Fitting full model:
Iteration 0: log likelihood = -158.64497
Weibull regression -- log relative-hazard form
No. of subjects = 93 Number of obs = 93
No. of failures = 62
Time at risk = 60250
LR chi2(0) = -0.00
Log likelihood = -158.64497 Prob > chi2 = .
------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |  -4.307123   .4483219    -9.61   0.000    -5.185818   -3.428429
-------------+----------------------------------------------------------------
       /ln_p |  -.4638212   .1020754    -4.54   0.000    -.6638854    -.263757
-------------+----------------------------------------------------------------
           p |    .628876   .0641928                      .5148471    .7681602
         1/p |   1.590139   .1623141                      1.301812    1.942324
We then create the probabilities with an equation () that uses p and _cons as well as t for time (i.e., Markov cycle number) and u for cycle length (usually a year, mine is 90 days since I’m working with leukemia patients who are very likely to have an event, i.e., relapse or die).
So where lambda = p, gamma = (exp(_cons))
gen result = (exp((lambda*((t-u)^ (gamma)))-(lambda*(t^(gamma)))))
gen transitions = 1-result
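For reference, the two gen statements above correspond to the following formula. This is a sketch that assumes Stata's log relative-hazard (PH) Weibull parameterization, in which the survivor function is S(t) = exp(-exp(_cons) t^p). Under that assumption the scale is lambda = exp(_cons) and the shape is gamma = p, which is the reverse of the assignment stated above, so the mapping is worth checking against the book's definitions.

```latex
S(t) = \exp\!\left(-\lambda t^{\gamma}\right),
\qquad
tp(t) = 1 - \frac{S(t)}{S(t-u)}
      = 1 - \exp\!\left\{\lambda (t-u)^{\gamma} - \lambda t^{\gamma}\right\}
```

Here tp(t) is the probability of an event during the cycle ending at time t, conditional on being event-free at time t - u, with t and u measured in the same units as the analysis time.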
Turning to the variability, I first calculate the standard errors for the coefficients
. nlcom (exp(_b[_cons])) (exp(_b[/ln_p]))
_nl_1: exp(_b[_cons])
_nl_2: exp(_b[/ln_p])
------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _nl_1 |   .0116539   .0044932     2.59   0.009     .0028474    .0204604
       _nl_2 |   .6153864    .054186    11.36   0.000     .5091838     .721589
But what I’m really after is the standard errors on the transitions values, e.g.,
nlcom (_b[transitions])
But this doesn’t work and the book I'm using doesn't give hints on getting at this extra info. Any feedback on how to get closer would be much appreciated.

Second Answer: 2014-03-23
Update: 2014-03-26 I fixed the negative probabilities: I'd made an error in transcribing Emily's code. I also show the nlcom approach suggested on Statalist by Austin Nichols (http://www.stata.com/statalist/archive/2014-03/msg00002.html). I made one correction to Austin's code.
Bootstrapping is still the key to the solution. The target quantities are probabilities calculated by a formula that is a nonlinear combination of the parameters estimated by streg. Because the generated transitions variable is not part of the matrix e(b) returned after streg, nlcom (_b[transitions]) cannot estimate its standard error. This is an ideal situation for bootstrapping. The standard approach is adopted: create a program, myprog, that estimates the parameters and returns the probabilities; then bootstrap that program.
In the example, transition probabilities pt for a range of t values are estimated. The user must set the minimum and maximum of the t range as well as a scalar u. Of interest, perhaps, is that, since the number of returned estimates is variable, a forvalues loop is required inside myprog. Also, bootstrap requires an argument that consists of a list of estimates returned by myprog. This list is also constructed in a forvalues loop.
/* set u and minimum and maximum times here */
scalar u = 1
local tmin = 1
local tmax = 3
set linesize 80
capture program drop _all
program define myprog, rclass
    /* arguments: first and last value of t */
    syntax anything
    streg, nohr dist(weibull)
    /* lambda and gamma as defined in the question */
    scalar lambda = exp(_b[ln_p:_cons])
    scalar gamma = exp(_b[_t:_cons])
    forvalues t = `1'/`2' {
        scalar p`t' = 1 - ///
            (exp((lambda*((`t'-u)^(gamma)))-(lambda*(`t'^(gamma)))))
        return scalar p`t' = p`t'
    }
end
webuse cancer, clear
stset studytime, fail(died)
set seed 450811
/* set up list of returned probabilities for bootstrap */
forvalues t = `tmin'/`tmax' {
    local p`t' = "p" + string(`t')
    local rp`t' = "`p`t''" + "=" + "(" + "r(" + "`p`t''" + "))"
    local rlist = `"`rlist' `rp`t''"'
}
bootstrap `rlist', nodots: myprog `tmin' `tmax'
/* delta-method standard errors from nlcom, for comparison */
forvalues t = `tmin'/`tmax' {
    qui streg, nohr dist(weibull)
    nlcom 1 - ///
        (exp((exp(_b[ln_p:_cons])*((`t'-u)^(exp(_b[_t:_cons]))))- ///
        (exp(_b[ln_p:_cons])*(`t'^(exp(_b[_t:_cons]))))))
}
Results:
Bootstrap results Number of obs = 48
Replications = 50
command: myprog 1 3
p1: r(p1)
p2: r(p2)
p3: r(p3)
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p1 | .7009447 .0503893 13.91 0.000 .6021834 .7997059
p2 | .0187127 .007727 2.42 0.015 .0035681 .0338573
p3 | .0111243 .0047095 2.36 0.018 .0018939 .0203548
------------------------------------------------------------------------------
/* results of nlcom */
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_nl_1 | .7009447 .0543671 12.89 0.000 .594387 .8075023
-------------+----------------------------------------------------------------
_nl_1 | .0187127 .0082077 2.28 0.023 .0026259 .0347995
-------------+----------------------------------------------------------------
_nl_1 | .0111243 .0049765 2.24 0.025 .0013706 .0208781
------------------------------------------------------------------------------
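Since the goal in the question is a beta distribution for each cycle probability, one common way to turn an estimate and its standard error into beta parameters is the method of moments. The following is a minimal sketch, not part of the original answer. The scalar names (mu, se, k, a_mm, b_mm), the variable name p1_draw, the seed, and the number of draws are illustrative assumptions; the input values are p1 and its bootstrap standard error from the bootstrap table above.

```stata
* Method-of-moments beta parameters from a probability and its standard error
* (hypothetical names; inputs are p1 and its bootstrap SE from the table above)
scalar mu = .7009447
scalar se = .0503893
* Common factor mu*(1-mu)/se^2 - 1; it must be positive for a valid beta
scalar k = mu*(1 - mu)/se^2 - 1
assert scalar(k) > 0
scalar a_mm = scalar(mu)*scalar(k)
scalar b_mm = (1 - scalar(mu))*scalar(k)
display "alpha = " scalar(a_mm) "   beta = " scalar(b_mm)
* Random draws for a probabilistic sensitivity analysis
clear
set seed 12345
set obs 1000
generate double p1_draw = rbeta(scalar(a_mm), scalar(b_mm))
summarize p1_draw
```

The mean of the resulting beta distribution, alpha/(alpha + beta), equals mu by construction, and its variance equals se^2. The same steps would be repeated for each cycle's probability and standard error.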

This answer is incorrect, but I keep it because it generated some useful comments
sysuse auto, clear
set seed 1234
gen u = 90 + rnormal()
capture program drop _all
program define myprog, rclass
    tempvar result
    reg turn disp /* Here substitute your -streg- statement */
    gen `result' = _b[disp]*u
    sum `result'
    return scalar sd = r(sd)
end
bootstrap sdr = r(sd): myprog
estat bootstrap, bc percentile
Of note: in the bootstrapped program, the new variable (your result) must be defined as temporary; otherwise the gen statement will lead to an error because the variable is created anew for each bootstrap replicate.

Related

Stata identification of beta distribution alpha and beta

When data are loaded into Stata, the variable names appear, and I can make a histogram to look at the distribution of a variable:
histogram *variablename*
But if the distribution appears to be a beta distribution, how are alpha and beta found?
betafit from SSC does this. Typing in Stata
. search beta
would point to this (and indeed much else).
Example:
. clear
. set seed 2803
. set obs 100
number of observations (_N) was 0, now 100
. gen y = rbeta(2, 2)
. ssc install betafit
checking betafit consistency and verifying not already installed...
installing into !!!
installation complete.
. betafit y
Iteration 0: log likelihood = 12.898998
Iteration 1: log likelihood = 13.093709
Iteration 2: log likelihood = 13.094329
Iteration 3: log likelihood = 13.094329
ML fit of beta (alpha, beta) Number of obs = 100
Wald chi2(0) = .
Log likelihood = 13.094329 Prob > chi2 = .
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
alpha |
_cons | 1.945302 .2606829 7.46 0.000 1.434373 2.456232
-------------+----------------------------------------------------------------
beta |
_cons | 2.11004 .2854789 7.39 0.000 1.550512 2.669568
------------------------------------------------------------------------------

Interaction Variables when Factor Variables are not allowed (Stata)

I'm running a regression in Stata for which I would like to use cluster2 (http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/se_programming.htm).
I encounter the following problem: Stata reports "factor variables and time-series operators not allowed". I am using a large vector of controls and rely heavily on Stata's factor-variable notation for interactions.
For example: state##c.wind_speed##L.c.relative_humidity. cluster2 and some other Stata packages do not allow such expressions as independent variables. Is there a practical way to create such a long vector of interaction variables myself?
I believe that one can trick ivreg2 by Baum-Shaffer-Stillman into running OLS with two-way clustering and interactions thusly:
. webuse nlswork
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. ivreg2 ln_w grade c.age##c.ttl_exp tenure, cluster(idcode year)
OLS estimation
--------------
Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity and clustering on idcode and year
Number of clusters (idcode) = 4697 Number of obs = 28099
Number of clusters (year) = 15 F( 5, 14) = 674.29
Prob > F = 0.0000
Total (centered) SS = 6414.823933 Centered R2 = 0.3206
Total (uncentered) SS = 85448.21266 Uncentered R2 = 0.9490
Residual SS = 4357.997339 Root MSE = .3938
---------------------------------------------------------------------------------
| Robust
ln_wage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------+----------------------------------------------------------------
grade | .0734785 .002644 27.79 0.000 .0682964 .0786606
age | -.0005405 .002259 -0.24 0.811 -.0049681 .0038871
ttl_exp | .0656393 .0068499 9.58 0.000 .0522138 .0790648
|
c.age#c.ttl_exp | -.0010539 .0002217 -4.75 0.000 -.0014885 -.0006194
|
tenure | .0197137 .0029555 6.67 0.000 .013921 .0255064
_cons | .5165052 .0529343 9.76 0.000 .4127559 .6202544
---------------------------------------------------------------------------------
Included instruments: grade age ttl_exp c.age#c.ttl_exp tenure
------------------------------------------------------------------------------
Just to be sure compare that to OLS coefficients:
. reg ln_w grade c.age##c.ttl_exp tenure
Source | SS df MS Number of obs = 28,099
-------------+---------------------------------- F(5, 28093) = 2651.79
Model | 2056.82659 5 411.365319 Prob > F = 0.0000
Residual | 4357.99734 28,093 .155127517 R-squared = 0.3206
-------------+---------------------------------- Adj R-squared = 0.3205
Total | 6414.82393 28,098 .228301798 Root MSE = .39386
---------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
grade | .0734785 .0010414 70.55 0.000 .0714373 .0755198
age | -.0005405 .000663 -0.82 0.415 -.0018401 .0007591
ttl_exp | .0656393 .0030809 21.31 0.000 .0596007 .0716779
|
c.age#c.ttl_exp | -.0010539 .0000856 -12.32 0.000 -.0012216 -.0008862
|
tenure | .0197137 .0008568 23.01 0.000 .0180344 .021393
_cons | .5165052 .0206744 24.98 0.000 .4759823 .557028
---------------------------------------------------------------------------------
You don't include a verifiable example. See https://stackoverflow.com/help/mcve for key advice.
At first sight, however, the problem is that cluster2 is an oldish program written in 2006/2007 whose syntax statement just doesn't allow factor variables.
You could try hacking a clone of the program to fix that; I have no idea whether that would be sufficient.
No specific comment is possible on the "other Stata packages" you imply to have the same problem except that it may well arise for the same reason. Factor variables were introduced in Stata 11 (see here for documentation) in 2009 and older programs won't allow them without modification.
In general, I would ask questions like this on Statalist. It's quite likely that this program has been superseded by some different program.
If you find a Stata program on the internet without a help file, as appears to be the case here, it is usually an indicator that the program was written ad hoc and is not being maintained. In this case, it is evident also that the program has not been updated in the 6 years since Stata 11.
You could also, as you imply, just create the interaction variables yourself. I don't think anyone has written a really general tool to automate that: there would be no point (since 2009) in a complicated alternative to factor variable notation.
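As a rough illustration of creating the interaction variables yourself, here is a minimal sketch using the nlswork data from the other answer. The new variable names (age_x_exp, race_, lag_tenure and the derived products) are hypothetical, and the choice of race and tenure is only for illustration. Once the products exist as ordinary variables, they can be passed to a command that rejects factor-variable and time-series notation.

```stata
webuse nlswork, clear
xtset idcode year
* Continuous-by-continuous interaction, equivalent to c.age#c.ttl_exp
generate double age_x_exp = age*ttl_exp
* Time-series operator applied once, stored as an ordinary variable
generate double lag_tenure = L.tenure
* Factor-by-continuous: one dummy per level, then multiply (first level is the base)
tabulate race, generate(race_)
foreach v of varlist race_2 race_3 {
    generate double `v'_x_lagten = `v'*lag_tenure
}
* Check the hand-made product against the factor-variable regression above
regress ln_wage grade age ttl_exp age_x_exp tenure
```

The final regress uses only plain variable names, and the coefficient on age_x_exp is expected to match the one on c.age#c.ttl_exp in the reg output above, since the two models are the same.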

Marginal effect of interaction variable in probit regression using Stata

I am running a probit regression with an interaction between one continuous and one dummy variable. The coefficient is displayed in the regression output but when I look at the marginal effects the interaction is missing.
How can I get the marginal effect of the interaction variable?
probit move_right c.real_income_change_percent##i.gender
Iteration 0: log likelihood = -345.57292
Iteration 1: log likelihood = -339.10962
Iteration 2: log likelihood = -339.10565
Iteration 3: log likelihood = -339.10565
Probit regression Number of obs = 958
LR chi2(3) = 12.93
Prob > chi2 = 0.0048
Log likelihood = -339.10565 Pseudo R2 = 0.0187
-----------------------------------------------------------------------------------------------------
move_right | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------------------------------+----------------------------------------------------------------
real_income_change_percent | .0034604 .0010125 3.42 0.001 .001476 .0054448
|
gender |
Female | .0695646 .1139538 0.61 0.542 -.1537807 .2929099
|
gender#c.real_income_change_percent |
Female | -.0039908 .0015254 -2.62 0.009 -.0069805 -.0010011
|
_cons | -1.263463 .0798439 -15.82 0.000 -1.419954 -1.106972
-----------------------------------------------------------------------------------------------------
margins, dydx(*) post
Average marginal effects Number of obs = 958
Model VCE : OIM
Expression : Pr(move_right), predict()
dy/dx w.r.t. : real_income_change_percent 1.gender
--------------------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
---------------------------+----------------------------------------------------------------
real_income_change_percent | .0002846 .0001454 1.96 0.050 -4.15e-07 .0005697
|
gender |
Female | -.0102626 .0207666 -0.49 0.621 -.0509643 .0304392
--------------------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
Your question seems strange to me. You asked about a dummy-dummy interaction, but your example involves continuous-dummy interaction.
Here's how to do either one:
webuse union, clear
/* dummy-dummy interaction */
probit union i.south##i.black grade, nolog
margins r.south#r.black
/* continuous-dummy interaction */
probit union i.south##c.grade
margins r.south, dydx(grade)
You should try to reproduce these by "hand" (using differences of predicts) to understand what the margins command is doing behind the scenes.
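One way to do such a check by hand is sketched below for the simplest case, the average discrete change in Pr(union) when south switches from 0 to 1. This is an illustrative sketch rather than the exact contrasts requested above; the names m, p0, p1, and diff are hypothetical.

```stata
webuse union, clear
probit union i.south##c.grade, nolog
* Average discrete change for south, as computed by margins
margins, dydx(south)
matrix m = r(b)
* By hand: predict for every observation with south set to 0, then to 1
preserve
replace south = 0
predict double p0 if e(sample), pr
replace south = 1
predict double p1 if e(sample), pr
generate double diff = p1 - p0
summarize diff if e(sample)
display "margins: " el(m, 1, colsof(m)) "   by hand: " r(mean)
restore
```

Because the interaction with grade is part of the model, predict picks up the change in both the south term and the south-by-grade term when south is replaced, which is what margins does behind the scenes.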
This sounds like a question specific to a piece of software (Stata), hence the close votes, but there is a statistical question lurking here: What would a marginal effect of an interaction effect look like?
Such marginal effects are not trivial, and tend to depend strongly on the values of the other covariates, see this article. Often this marginal effect is so variable that it makes no sense to try to summarize it with one number. In my opinion, this is a major weakness, to the extent that I generally prefer to use logistic regression and interpret the interaction term as a ratio of odds ratios, see this article.

include panel-specific trends in a first-difference regression

I was wondering if there's a way to include panel-specific or just varying trends in a first-difference regression when clustering on the panel id and the time variable.
Here's an example with Stata:
. webuse nlswork
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. ivreg2 S1.(ln_wage tenure) , cluster(idcode year)
OLS estimation
--------------
Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity and clustering on idcode and year
Number of clusters (idcode) = 3660 Number of obs = 10528
Number of clusters (year) = 8 F( 1, 7) = 2.81
Prob > F = 0.1378
Total (centered) SS = 1004.098948 Centered R2 = 0.0007
Total (uncentered) SS = 1035.845686 Uncentered R2 = 0.0314
Residual SS = 1003.36326 Root MSE = .3087
------------------------------------------------------------------------------
| Robust
S.ln_wage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
tenure |
S1. | .0076418 .0042666 1.79 0.073 -.0007206 .0160043
|
_cons | .0501738 .0070986 7.07 0.000 .0362608 .0640868
------------------------------------------------------------------------------
Included instruments: S.tenure
------------------------------------------------------------------------------
. ivreg2 S1.(ln_wage tenure i.c_city), cluster(idcode year)
factor variables not allowed
r(101);
In the specification above, the constant corresponds to a common time trend. Putting the factor variable outside the seasonal difference operator errors as well.
I understand that the differencing operator does not play well with factor variables or interactions, but I feel there must be some hack to get around that.
The ivreg2 is a bit of a red herring. I am not doing IV estimation, I just want to use two-way clustering.
You get the same solution as @Metrics if you run xi: ivreg2 S1.(ln_wage tenure) i.ind_code , cluster(idcode year)

What command in Stata 12 do I use to interpret the coefficients of the Limited Dependent Variable model?

I am running the following code:
oprobit var1 var2 var3 var4 var5 var2##var3 var4##var5 var6 var7 etc.
Without the interaction terms I could have used the following code to interpret the coefficients:
mfx compute, predict(outcome(2))
[for outcome equaling 2 (in total I have 4 outcomes)]
But since mfx does not work with the interaction terms, I get an error.
I tried to use the margins command, but it did not work either:
margins var2 var3 var4 var5 var2##var3 var4##var5 var6 var7 etc... , post
margins works ONLY for the interaction terms: (margins var2 var3 var4 var5, post)
What command do I use to be able to interpret BOTH interaction and regular variables?
Finally, to use simple language, my question is: given the regression model above, what command can I use to interpret the coefficients?
mfx is an old command that has been replaced with margins. That is why it does not work with factor variable notation that you used to define the interactions. I am not clear what you actually intended to calculate with the margins command.
Here's an example of how you can get the average marginal effects on the probability of outcome 2:
. webuse fullauto
(Automobile Models)
. oprobit rep77 i.foreign c.weight c.length##c.mpg
Iteration 0: log likelihood = -89.895098
Iteration 1: log likelihood = -76.800575
Iteration 2: log likelihood = -76.709641
Iteration 3: log likelihood = -76.709553
Iteration 4: log likelihood = -76.709553
Ordered probit regression Number of obs = 66
LR chi2(5) = 26.37
Prob > chi2 = 0.0001
Log likelihood = -76.709553 Pseudo R2 = 0.1467
--------------------------------------------------------------------------------
rep77 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
1.foreign | 1.514739 .4497962 3.37 0.001 .633155 2.396324
weight | -.0005104 .0005861 -0.87 0.384 -.0016593 .0006384
length | .0969601 .0348506 2.78 0.005 .0286542 .165266
mpg | .4747249 .2241349 2.12 0.034 .0354286 .9140211
|
c.length#c.mpg | -.0020602 .0013145 -1.57 0.117 -.0046366 .0005161
---------------+----------------------------------------------------------------
/cut1 | 17.21885 5.386033 6.662419 27.77528
/cut2 | 18.29469 5.416843 7.677877 28.91151
/cut3 | 19.66512 5.463523 8.956814 30.37343
/cut4 | 21.12134 5.515901 10.31038 31.93231
--------------------------------------------------------------------------------
. margins, dydx(*) predict(outcome(2))
Average marginal effects Number of obs = 66
Model VCE : OIM
Expression : Pr(rep77==2), predict(outcome(2))
dy/dx w.r.t. : 1.foreign weight length mpg
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.foreign | -.2002434 .0576487 -3.47 0.001 -.3132327 -.087254
weight | .0000828 .0000961 0.86 0.389 -.0001055 .0002711
length | -.0088956 .003643 -2.44 0.015 -.0160356 -.0017555
mpg | -.012849 .0085546 -1.50 0.133 -.0296157 .0039178
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
If you want the prediction, rather than the marginal effect, try
margins, predict(outcome(2))
The marginal effect of just the interaction term is harder to calculate in a non-linear model. Details here.
The marginal effects for positive outcomes, Pr(depvar1=1, depvar2=1), are
. mfx compute, predict(p11)
The marginal effects for Pr(depvar1=1, depvar2=0) are
. mfx compute, predict(p10)
The marginal effects for Pr(depvar1=0, depvar2=1) are
. mfx compute, predict(p01)