Is there any Stata code for calculating the c-index (also known as the c-statistic or concordance statistic) for a Fine-Gray model?
I found an R package that does this but I have no idea how to do it in Stata.
I think the following is what you are looking for:
webuse lbw, clear
logit low age lwt i.race smoke ptl ht ui
lroc
Logistic model for low
number of observations = 189
area under ROC curve = 0.7462
The last line reports the c-statistic.
Type help lroc from Stata's command prompt for more information.
EDIT:
I just noticed that you mention "Fine-Gray", so I assume that you are referring to survival analysis regression. The community-contributed command somersd does what you want in Stata.
You can get its latest version from here:
net describe somersd, from(http://www.rogernewsonresources.org.uk/stata12)
After you install the package, type help somersd for additional details and syntax.
An example from the author's help file illustrates its use:
use http://www.stata-press.com/data/r9/drugtr, clear
generate youth = 100 - age
generate byte censind = 1 - died
somersd studytime drug youth, tr(c) cenind(censind)
Somers' D with variable: studytime
Transformation: Harrell's c
Valid observations: 48
Symmetric 95% CI for Harrell's c
------------------------------------------------------------------------------
| Jackknife
studytime | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
drug | .7275986 .0367931 19.78 0.000 .6554855 .7997117
youth | .6415771 .0528314 12.14 0.000 .5380295 .7451246
------------------------------------------------------------------------------
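For the Fine-Gray case specifically, the same idea can be sketched as follows. This is only a rough illustration and not something taken from the somersd help file: it fits stcrreg on Stata's hypoxia example data, takes the linear predictor, and passes it to somersd. Note the assumption that subjects with a competing event are treated as censored, which is one simple convention for a concordance measure under competing risks and may not be the one you want. The variable names xb, neg_xb and censind are made up for the example.

```stata
* Assumes somersd is installed, e.g. via: ssc install somersd
webuse hypoxia, clear
stset dftime, failure(failtype == 1)

* Fine-Gray competing-risks regression; failtype == 2 is the competing event
stcrreg ifp tumsize pelnode, compete(failtype == 2)

* Linear predictor: larger values mean a higher subhazard
predict double xb, xb

* Negate it so that larger values should go with longer times to the event
generate double neg_xb = -xb

* Censoring indicator: 1 = event of interest not observed
* (assumption: competing events are counted as censored here)
generate byte censind = 1 - _d

somersd _t neg_xb if e(sample), tr(c) cenind(censind)
```

The reported coefficient is Harrell's c for the Fine-Gray linear predictor under that censoring convention, with a jackknife confidence interval.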
I have panel data and my regression is of the form:
s_roa1 = s_roa + c_roa
I am new to Stata and I am trying to use the xtoverid command for a robust Hausman test to help me choose between a fixed-effects and a random-effects model:
xtoverid s_roa1 s_roa c_roa, fe i (year)
However, I get the following error:
varlist not allowed
Can anyone help me understand what this error means?
First of all, xtoverid is a community-contributed command, something which you fail to make clear in your question. It is customary and useful to provide this information right from the start, so others know that you do not refer to an official, built-in command.
Second, this is a post-estimation command, which means you run it directly after you estimate your model using xtreg, xtivreg, xtivreg2 or xthtaylor.
The help file provided by the authors offers an enlightening example:
. webuse nlswork
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. tsset idcode year
panel variable: idcode (unbalanced)
time variable: year, 68 to 88, but with gaps
delta: 1 unit
.
. gen age2=age^2
(24 missing values generated)
. gen black=(race==2)
.
. xtivreg ln_wage age (tenure = union south), fe i(idcode)
Fixed-effects (within) IV regression Number of obs = 19,007
Group variable: idcode Number of groups = 4,134
R-sq: Obs per group:
within = . min = 1
between = 0.1261 avg = 4.6
overall = 0.0869 max = 12
Wald chi2(2) = 142054.65
corr(u_i, Xb) = -0.6875 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
tenure | .2450528 .0382041 6.41 0.000 .1701741 .3199314
age | -.0650873 .0126167 -5.16 0.000 -.0898156 -.040359
_cons | 2.826672 .2451883 11.53 0.000 2.346112 3.307232
-------------+----------------------------------------------------------------
sigma_u | .71990151
sigma_e | .64315554
rho | .55612637 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(4133,14871) = 1.53 Prob > F = 0.0000
------------------------------------------------------------------------------
Instrumented: tenure
Instruments: age union south
------------------------------------------------------------------------------
.
. xtoverid
Test of overidentifying restrictions:
Cross-section time-series model: xtivreg fe
Sargan-Hansen statistic 0.965 Chi-sq(1) P-value = 0.3259
. xtoverid, robust
Test of overidentifying restrictions:
Cross-section time-series model: xtivreg fe robust
Sargan-Hansen statistic 0.960 Chi-sq(1) P-value = 0.3271
. xtoverid, cluster(idcode)
Test of overidentifying restrictions:
Cross-section time-series model: xtivreg fe robust cluster(idcode)
Sargan-Hansen statistic 0.495 Chi-sq(1) P-value = 0.4818
From Stata's command prompt, type help xtoverid for more details.
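For the fixed-versus-random-effects choice in the question, a minimal sketch of the same pattern is shown below. It uses the nlswork data rather than the asker's variables, and it assumes xtoverid (together with ivreg2 and ranktest, on which it depends) has been installed from SSC. The point is only the order of the commands: estimate the random-effects model first, then call xtoverid with no variable list.

```stata
* Assumes: ssc install xtoverid, ssc install ivreg2, ssc install ranktest
webuse nlswork, clear
xtset idcode year

* Random-effects model with cluster-robust standard errors
xtreg ln_wage age tenure, re vce(cluster idcode)

* Post-estimation test; no varlist is given here
xtoverid
```

After xtreg, re the reported Sargan-Hansen statistic is a test of the extra orthogonality conditions that random effects imposes, so a rejection points towards fixed effects.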
I'm running a regression in Stata for which I would like to use cluster2 (http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/se_programming.htm).
I encounter the following problem: Stata reports "factor variables and time-series operators not allowed". I am using a large vector of controls and rely heavily on Stata's factor-variable notation for interactions.
For example: state##c.wind_speed##L.c.relative_humidity. cluster2, like some other Stata packages, does not allow such expressions as independent variables. Is there a practical way to create such a long vector of interaction variables myself?
I believe that one can trick ivreg2 by Baum-Shaffer-Stillman into running OLS with two-way clustering and interactions thusly:
. webuse nlswork
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. ivreg2 ln_w grade c.age##c.ttl_exp tenure, cluster(idcode year)
OLS estimation
--------------
Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity and clustering on idcode and year
Number of clusters (idcode) = 4697 Number of obs = 28099
Number of clusters (year) = 15 F( 5, 14) = 674.29
Prob > F = 0.0000
Total (centered) SS = 6414.823933 Centered R2 = 0.3206
Total (uncentered) SS = 85448.21266 Uncentered R2 = 0.9490
Residual SS = 4357.997339 Root MSE = .3938
---------------------------------------------------------------------------------
| Robust
ln_wage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------+----------------------------------------------------------------
grade | .0734785 .002644 27.79 0.000 .0682964 .0786606
age | -.0005405 .002259 -0.24 0.811 -.0049681 .0038871
ttl_exp | .0656393 .0068499 9.58 0.000 .0522138 .0790648
|
c.age#c.ttl_exp | -.0010539 .0002217 -4.75 0.000 -.0014885 -.0006194
|
tenure | .0197137 .0029555 6.67 0.000 .013921 .0255064
_cons | .5165052 .0529343 9.76 0.000 .4127559 .6202544
---------------------------------------------------------------------------------
Included instruments: grade age ttl_exp c.age#c.ttl_exp tenure
------------------------------------------------------------------------------
Just to be sure, compare that to the OLS coefficients:
. reg ln_w grade c.age##c.ttl_exp tenure
Source | SS df MS Number of obs = 28,099
-------------+---------------------------------- F(5, 28093) = 2651.79
Model | 2056.82659 5 411.365319 Prob > F = 0.0000
Residual | 4357.99734 28,093 .155127517 R-squared = 0.3206
-------------+---------------------------------- Adj R-squared = 0.3205
Total | 6414.82393 28,098 .228301798 Root MSE = .39386
---------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
grade | .0734785 .0010414 70.55 0.000 .0714373 .0755198
age | -.0005405 .000663 -0.82 0.415 -.0018401 .0007591
ttl_exp | .0656393 .0030809 21.31 0.000 .0596007 .0716779
|
c.age#c.ttl_exp | -.0010539 .0000856 -12.32 0.000 -.0012216 -.0008862
|
tenure | .0197137 .0008568 23.01 0.000 .0180344 .021393
_cons | .5165052 .0206744 24.98 0.000 .4759823 .557028
---------------------------------------------------------------------------------
You don't include a verifiable example. See https://stackoverflow.com/help/mcve for key advice.
At first sight, however, the problem is that cluster2 is an oldish program written in 2006/2007 whose syntax statement just doesn't allow factor variables.
You could try hacking a clone of the program to fix that; I have no idea whether that would be sufficient.
No specific comment is possible on the "other Stata packages" you imply to have the same problem except that it may well arise for the same reason. Factor variables were introduced in Stata 11 (see here for documentation) in 2009 and older programs won't allow them without modification.
In general, I would ask questions like this on Statalist. It's quite likely that this program has been superseded by some different program.
If you find a Stata program on the internet without a help file, as appears to be the case here, it is usually an indicator that the program was written ad hoc and is not being maintained. In this case, it is evident also that the program has not been updated in the 6 years since Stata 11.
You could also, as you imply, just create the interaction variables yourself. I don't think anyone has written a really general tool to automate that: there would be no point (since 2009) in a complicated alternative to factor variable notation.
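A minimal sketch of building such variables by hand, using the nlswork data as a stand-in for the asker's data (all new variable names here are made up for illustration):

```stata
webuse nlswork, clear
xtset idcode year

* Lagged continuous variable stored as an ordinary variable
generate double L_tenure = L.tenure

* Continuous-by-continuous interaction
generate double age_x_Ltenure = age * L_tenure

* Indicator variables for each level of a categorical variable
tabulate race, generate(race_)

* Dummy-by-continuous interactions, leaving race_1 out as the base level
foreach v of varlist race_2 race_3 {
    generate double `v'_x_age = `v' * age
}
```

The resulting plain variables (race_2, race_3, race_2_x_age, and so on) can then be passed to an older command whose syntax statement does not accept factor variables or time-series operators.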
I am running a probit regression with an interaction between one continuous and one dummy variable. The coefficient is displayed in the regression output but when I look at the marginal effects the interaction is missing.
How can I get the marginal effect of the interaction variable?
probit move_right c.real_income_change_percent##i.gender
Iteration 0: log likelihood = -345.57292
Iteration 1: log likelihood = -339.10962
Iteration 2: log likelihood = -339.10565
Iteration 3: log likelihood = -339.10565
Probit regression Number of obs = 958
LR chi2(3) = 12.93
Prob > chi2 = 0.0048
Log likelihood = -339.10565 Pseudo R2 = 0.0187
-----------------------------------------------------------------------------------------------------
move_right | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------------------------------+----------------------------------------------------------------
real_income_change_percent | .0034604 .0010125 3.42 0.001 .001476 .0054448
|
gender |
Female | .0695646 .1139538 0.61 0.542 -.1537807 .2929099
|
gender#c.real_income_change_percent |
Female | -.0039908 .0015254 -2.62 0.009 -.0069805 -.0010011
|
_cons | -1.263463 .0798439 -15.82 0.000 -1.419954 -1.106972
-----------------------------------------------------------------------------------------------------
margins, dydx(*) post
Average marginal effects Number of obs = 958
Model VCE : OIM
Expression : Pr(move_right), predict()
dy/dx w.r.t. : real_income_change_percent 1.gender
--------------------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
---------------------------+----------------------------------------------------------------
real_income_change_percent | .0002846 .0001454 1.96 0.050 -4.15e-07 .0005697
|
gender |
Female | -.0102626 .0207666 -0.49 0.621 -.0509643 .0304392
--------------------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
Your question seems strange to me. You asked about a dummy-dummy interaction, but your example involves continuous-dummy interaction.
Here's how to do either one:
webuse union, clear
/* dummy-dummy interaction */
probit union i.south##i.black grade, nolog
margins r.south#r.black
/* continuous-dummy interaction */
probit union i.south##c.grade
margins r.south, dydx(grade)
You should try to reproduce these by "hand" (using differences of predicts) to understand what the margins command is doing behind the scenes.
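A rough sketch of the "by hand" calculation for the continuous-dummy case, under the assumption that the contrast reported by margins is the difference between the average marginal effect of grade with south set to 1 for everyone and the same quantity with south set to 0 for everyone. The variable names xb0, xb1, me0, me1 and me_diff are made up for the example.

```stata
webuse union, clear
probit union i.south##c.grade, nolog
margins r.south, dydx(grade)

* Linear predictor with south set to 0 and to 1 for every observation
generate double xb0 = _b[_cons] + _b[grade]*grade
generate double xb1 = _b[_cons] + _b[1.south] ///
    + (_b[grade] + _b[1.south#c.grade])*grade

* Marginal effect of grade on Pr(union) in each scenario
generate double me0 = normalden(xb0) * _b[grade]
generate double me1 = normalden(xb1) * (_b[grade] + _b[1.south#c.grade])

* The mean of the difference should line up with the margins contrast
generate double me_diff = me1 - me0
summarize me_diff if e(sample)
```

The mean shown by summarize is the point estimate only; the delta-method standard error still comes from margins.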
This sounds like a question specific to a piece of software (Stata), hence the close votes, but there is a statistical question lurking here: What would a marginal effect of an interaction effect look like?
Such marginal effects are not trivial, and tend to depend strongly on the values of the other covariates, see this article. Often this marginal effect is so variable that it makes no sense to try to summarize it with one number. In my opinion, this is a major weakness, to the extent that in general I prefer to use logistic regression and interpret the interaction term as a ratio of odds ratios, see this article.
I was wondering if there's a way to include panel-specific or just varying trends in a first-difference regression when clustering on the panel id and the time variable.
Here's an example with Stata:
. webuse nlswork
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. ivreg2 S1.(ln_wage tenure) , cluster(idcode year)
OLS estimation
--------------
Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity and clustering on idcode and year
Number of clusters (idcode) = 3660 Number of obs = 10528
Number of clusters (year) = 8 F( 1, 7) = 2.81
Prob > F = 0.1378
Total (centered) SS = 1004.098948 Centered R2 = 0.0007
Total (uncentered) SS = 1035.845686 Uncentered R2 = 0.0314
Residual SS = 1003.36326 Root MSE = .3087
------------------------------------------------------------------------------
| Robust
S.ln_wage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
tenure |
S1. | .0076418 .0042666 1.79 0.073 -.0007206 .0160043
|
_cons | .0501738 .0070986 7.07 0.000 .0362608 .0640868
------------------------------------------------------------------------------
Included instruments: S.tenure
------------------------------------------------------------------------------
. ivreg2 S1.(ln_wage tenure i.c_city), cluster(idcode year)
factor variables not allowed
r(101);
In the specification above, the constant corresponds to a common time trend. Putting the factor variable outside the seasonal difference operator errors as well.
I understand that the differencing operator does not play well with factor variables or interactions, but I feel there must be some hack to get around that.
The ivreg2 is a bit of a red herring. I am not doing IV estimation, I just want to use two-way clustering.
You get the same solution as #Metrics if you do xi: ivreg2 S1.(ln_wage tenure) i.ind_code , cluster(idcode year)
My question relates to calculating the standard deviation (SD) of transition probabilities derived from coefficients estimated through Weibull regression in Stata.
The transition probabilities are being used to model disease progression of leukemia patients over 40 cycles of 90 days (about 10 years). I need the SDs of the probabilities (which change over the run of the Markov model) to create beta distributions whose parameters can be approximated using the corresponding Markov cycle probability and its SD. These distributions are then used to do Probabilistic sensitivity analysis, i.e., they are substituted for the simple probabilities (one for each cycle) and random draws from them can evaluate the robustness of the model’s cost-effectiveness results.
Anyway, using time to event survival data, I’ve used regression analysis to estimate coefficients that can be plugged into an equation to generate transition probabilities. For example...
. streg, nohr dist(weibull)
failure _d: event
analysis time _t: time
Fitting constant-only model:
Iteration 0: log likelihood = -171.82384
Iteration 1: log likelihood = -158.78902
Iteration 2: log likelihood = -158.64499
Iteration 3: log likelihood = -158.64497
Iteration 4: log likelihood = -158.64497
Fitting full model:
Iteration 0: log likelihood = -158.64497
Weibull regression -- log relative-hazard form
No. of subjects = 93 Number of obs = 93
No. of failures = 62
Time at risk = 60250
LR chi2(0) = -0.00
Log likelihood = -158.64497 Prob > chi2 = .
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | -4.307123 .4483219 -9.61 0.000 -5.185818 -3.428429
-------------+----------------------------------------------------------------
/ln_p | -.4638212 .1020754 -4.54 0.000 -.6638854 -.263757
-------------+----------------------------------------------------------------
p | .628876 .0641928 .5148471 .7681602
1/p | 1.590139 .1623141 1.301812 1.942324
We then create the probabilities with an equation () that uses p and _cons as well as t for time (i.e., Markov cycle number) and u for cycle length (usually a year, mine is 90 days since I’m working with leukemia patients who are very likely to have an event, i.e., relapse or die).
So where lambda = p, gamma = (exp(_cons))
gen result = (exp((lambda*((t-u)^ (gamma)))-(lambda*(t^(gamma)))))
gen transitions = 1-result
Turning to the variability, I first calculate the standard errors for the coefficients
. nlcom (exp(_b[_cons])) (exp(_b[/ln_p]))
_nl_1: exp(_b[_cons])
_nl_2: exp(_b[/ln_p])
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_nl_1 | .0116539 .0044932 2.59 0.009 .0028474 .0204604
_nl_2 | .6153864 .054186 11.36 0.000 .5091838 .721589
But what I’m really after is the standard errors on the transitions values, e.g.,
nlcom (_b[transitions])
But this doesn’t work and the book I'm using doesn't give hints on getting at this extra info. Any feedback on how to get closer would be much appreciated.
Second Answer: 2014-03-23
Update: 2014-03-26 I fixed the negative probabilities: I'd made an error in transcribing Emily's code. I also show the nlcom approach suggested on Statalist by Austin Nichols (http://www.stata.com/statalist/archive/2014-03/msg00002.html), with one correction to Austin's code.
Bootstrapping is still the key to the solution. The target quantities are probabilities calculated by a formula that is a nonlinear combination of the parameters estimated by streg. Because these probabilities are not themselves stored in the matrix e(b) returned after streg, nlcom cannot refer to them by name, which is why nlcom (_b[transitions]) fails; it only works if the whole formula is written out in terms of the stored coefficients. This is an ideal situation for bootstrapping. The standard approach is adopted: create a program myprog to estimate the parameters; then bootstrap that program.
In the example, transition probabilities pt for a range of t values are estimated. The user must set the minimum and maximum of the t range as well as a scalar u. Of interest, perhaps, is that, since the number of estimated parameters is variable, a forvalues loop is required inside myprog. Also, bootstrap requires an argument that consists of a list of estimates returned by myprog. This list is also constructed in a forvalues loop.
/* set u and minimum and maximum times here */
scalar u = 1
local tmin = 1
local tmax = 3
set linesize 80
capture program drop _all
program define myprog , rclass
syntax anything
streg , nohr dist(weibull)
scalar lambda = exp(_b[ln_p:_cons])
scalar gamma =exp(_b[_t:_cons])
forvalues t = `1'/`2'{
scalar p`t'= 1 - ///
(exp((lambda*((`t'-u)^(gamma)))-(lambda*(`t'^(gamma)))))
return scalar p`t' = p`t'
}
end
webuse cancer, clear
stset studytime, fail(died)
set seed 450811
/* set up list of returned probabilities for bootstrap */
forvalues t = `tmin'/`tmax' {
local p`t' = "p" + string(`t')
local rp`t'= "`p`t''" + "=" + "("+ "r(" + "`p`t''" +"))"
local rlist = `"`rlist' `rp`t''"'
}
bootstrap `rlist', nodots: myprog `tmin' `tmax'
forvalues t = `tmin'/`tmax' {
qui streg, nohr dist(weibull)
nlcom 1 - ///
(exp((exp(_b[ln_p:_cons])*((`t'-u)^(exp(_b[_t:_cons]))))- ///
(exp(_b[ln_p:_cons])*(`t'^(exp(_b[_t:_cons]))))))
}
Results:
Bootstrap results Number of obs = 48
Replications = 50
command: myprog 1 3
p1: r(p1)
p2: r(p2)
p3: r(p3)
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p1 | .7009447 .0503893 13.91 0.000 .6021834 .7997059
p2 | .0187127 .007727 2.42 0.015 .0035681 .0338573
p3 | .0111243 .0047095 2.36 0.018 .0018939 .0203548
------------------------------------------------------------------------------
/* results of nlcom */
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_nl_1 | .7009447 .0543671 12.89 0.000 .594387 .8075023
-------------+----------------------------------------------------------------
_nl_1 | .0187127 .0082077 2.28 0.023 .0026259 .0347995
-------------+----------------------------------------------------------------
_nl_1 | .0111243 .0049765 2.24 0.025 .0013706 .0208781
------------------------------------------------------------------------------
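To connect this back to the beta distributions mentioned in the question, here is a minimal sketch of the usual method-of-moments conversion from a probability and its standard error to beta parameters. It plugs in the p1 row of the bootstrap table above; the scalar names are made up for the example, and the conversion is the standard one for a beta distribution rather than anything specific to this model.

```stata
* p1 and its bootstrap standard error, copied from the table above
scalar pbar = .7009447
scalar pse  = .0503893

* Method of moments: mean = a/(a+b), variance = mean*(1-mean)/(a+b+1)
scalar k_mm     = pbar*(1 - pbar)/pse^2 - 1
scalar alpha_mm = pbar*k_mm
scalar beta_mm  = (1 - pbar)*k_mm
display "alpha = " alpha_mm "   beta = " beta_mm

* One random draw, as would be used in a probabilistic sensitivity analysis
set seed 12345
display rbeta(alpha_mm, beta_mm)
```

The same three lines would be repeated for each cycle's probability and standard error.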
This answer is incorrect, but I keep it because it generated some useful comments
sysuse auto, clear
set seed 1234
gen u = 90 + rnormal()
capture program drop _all
program define myprog , rclass
tempvar result
reg turn disp /* Here substitute your -streg- statement */
gen `result' = _b[disp]*u
sum `result'
return scalar sd = r(sd)
end
bootstrap sdr = r(sd): myprog
estat bootstrap, bc percentile
Of note: in the bootstrapped program, the new variable (your result) must be defined as temporary; otherwise the gen statement will lead to an error because the variable is created anew for each bootstrap replicate.