Export combined tables when using oprobit - Stata

I am running an ordered probit with four levels (A lot, Somewhat, Little, Not at all) on a female variable and some controls:
* Baseline only
eststo, title("OProbit1"): /*quietly*/ oprobit retincome_worry i.female $control_socio, vce(robust)
estimates store OProbit1
* Baseline + Health Controls
eststo, title("OProbit3"): oprobit retincome_worry i.female $control_socio $control_health, vce(robust)
estimates store OProbit3
I then compute marginal effects of the female variable for each outcome level:
* TABLE BASELINE
estimates restore OProbit1
margins, dydx(i.female) predict(outcome(1)) atmeans post
outreg using results\Reg_margins\Reg2.tex, noautosumm replace rtitle(A lot) ctitle(Social Controls) title(Worry about Retirement Income)
estimates restore OProbit1
margins, dydx(i.female) predict(outcome(2)) atmeans post
outreg using results\Reg_margins\Reg2.tex, noautosumm append rtitle(Somewhat)
estimates restore OProbit1
margins, dydx(i.female) predict(outcome(3)) atmeans post
outreg using results\Reg_margins\Reg2.tex, noautosumm append rtitle(Little)
estimates restore OProbit1
margins, dydx(i.female) predict(outcome(4)) atmeans post
outreg using results\Reg_margins\Reg2.tex, noautosumm append rtitle(Not at all) tex
* TABLE BASELINE + HEALTH
estimates restore OProbit3
margins, dydx(i.female) predict(outcome(1)) atmeans post
outreg using results\Reg_margins\Reg3.tex, noautosumm replace rtitle(A lot) ctitle(Baseline and Health) title(Worry about Retirement Income)
estimates restore OProbit3
margins, dydx(i.female) predict(outcome(2)) atmeans post
outreg using results\Reg_margins\Reg3.tex, append noautosumm rtitle(Somewhat)
estimates restore OProbit3
margins, dydx(i.female) predict(outcome(3)) atmeans post
outreg using results\Reg_margins\Reg3.tex, append noautosumm rtitle(Little)
estimates restore OProbit3
margins, dydx(i.female) predict(outcome(4)) atmeans post
outreg using results\Reg_margins\Reg3.tex, append noautosumm rtitle(Not at all) tex
I currently have four tables (the code above produces two of them), each with a column header naming the controls included in the model and four rows, one per outcome level.
How can I get all of this in a single table, keeping the four rows and adding more columns?

You can get the desired output using the community-contributed command esttab.
First, define the program appendmodels (obtained from here):
capt prog drop appendmodels
*! version 1.0.0  14aug2007  Ben Jann
program appendmodels, eclass
    // using first equation of model
    version 8
    syntax namelist
    tempname b V tmp
    foreach name of local namelist {
        qui est restore `name'
        mat `tmp' = e(b)
        local eq1: coleq `tmp'
        gettoken eq1 : eq1
        mat `tmp' = `tmp'[1,"`eq1':"]
        local cons = colnumb(`tmp',"_cons")
        if `cons'<. & `cons'>1 {
            mat `tmp' = `tmp'[1,1..`cons'-1]
        }
        mat `b' = nullmat(`b') , `tmp'
        mat `tmp' = e(V)
        mat `tmp' = `tmp'["`eq1':","`eq1':"]
        if `cons'<. & `cons'>1 {
            mat `tmp' = `tmp'[1..`cons'-1,1..`cons'-1]
        }
        capt confirm matrix `V'
        if _rc {
            mat `V' = `tmp'
        }
        else {
            mat `V' = ///
                ( `V' , J(rowsof(`V'),colsof(`tmp'),0) ) \ ///
                ( J(rowsof(`tmp'),colsof(`V'),0) , `tmp' )
        }
    }
    local names: colfullnames `b'
    mat coln `V' = `names'
    mat rown `V' = `names'
    eret post `b' `V'
    eret local cmd "whatever"
end
Next, run the following (here I use Stata's fullauto toy dataset for illustration):
webuse fullauto, clear
estimates clear
forvalues i = 1/4 {
    oprobit rep77 i.foreign
    margins, dydx(foreign) predict(outcome(`i')) atmeans post
    estimates store OProbit1`i'
}
appendmodels OProbit11 OProbit12 OProbit13 OProbit14
estimates store result1

forvalues i = 1/4 {
    oprobit rep77 i.foreign length mpg
    margins, dydx(foreign) predict(outcome(`i')) atmeans post
    estimates store OProbit2`i'
}
appendmodels OProbit21 OProbit22 OProbit23 OProbit24
estimates store result2

forvalues i = 1/4 {
    oprobit rep77 i.foreign trunk weight
    margins, dydx(foreign) predict(outcome(`i')) atmeans post
    estimates store OProbit3`i'
}
appendmodels OProbit31 OProbit32 OProbit33 OProbit34
estimates store result3

forvalues i = 1/4 {
    oprobit rep77 i.foreign price displ
    margins, dydx(foreign) predict(outcome(`i')) atmeans post
    estimates store OProbit4`i'
}
appendmodels OProbit41 OProbit42 OProbit43 OProbit44
estimates store result4
Finally, see the results:
esttab result1 result2 result3 result4, keep(1.foreign) varlab(1.foreign " ") ///
labcol2("A lot" "Somewhat" "A little" "Not at all") gaps noobs nomtitles
----------------------------------------------------------------------------
                      (1)             (2)             (3)             (4)
----------------------------------------------------------------------------
A lot             -0.0572         -0.0677         -0.0728         -0.0690
                  (-1.83)         (-1.67)         (-1.81)         (-1.67)

Somewhat           -0.144**        -0.247***       -0.188**        -0.175*
                  (-2.73)         (-3.54)         (-2.86)         (-2.47)

A little           -0.124          -0.290**        -0.290**        -0.163
                  (-1.86)         (-3.07)         (-3.07)         (-1.74)

Not at all          0.198**         0.351***        0.252**         0.237*
                   (2.64)          (3.82)          (2.95)          (2.55)
----------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
You can install esttab by typing the following in Stata's command prompt:
ssc install estout
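Since your original models were exported to .tex files, note that the combined esttab table can be written straight to LaTeX as well; the file name below is hypothetical:
* export the combined table to LaTeX (file name is hypothetical)
esttab result1 result2 result3 result4 using Reg_combined.tex, replace booktabs ///
    keep(1.foreign) varlab(1.foreign " ") ///
    labcol2("A lot" "Somewhat" "A little" "Not at all") gaps noobs nomtitles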

Related

k-fold cross validation: how to filter data based on a randomly generated integer variable in Stata

The following seems obvious, yet it does not behave as I would expect. I want to do k-fold cross-validation without using SSC packages, and thought I could just filter my data and run my own regressions on the subsets.
First I generate a variable with a random integer between 1 and 5 (5-fold cross validation), then I loop over each fold number. I want to filter the data by the fold number, but using a boolean filter fails to filter anything. Why?
Bonus: what would be the best way to capture all of the test MSEs and average them? In Python I would just make a list or a numpy array and take the average.
gen randint = floor((6-1)*runiform()+1)
recast int randint
forval b = 1(1)5 {
    xtreg c.DepVar ///      // training set
        c.IndVar1 ///
        c.IndVar2 ///
        if randint != `b' ///
        , fe vce(cluster uuid)
    xtreg c.DepVar ///      // test set, needs to be performed with the model
        c.IndVar1 ///       // above, not a new model...
        c.IndVar2 ///
        if randint == `b' ///
        , fe vce(cluster uuid)
}
EDIT: Test set needs to be performed with model fit to training set. I changed my comment in the code to reflect this.
Ultimately, the solution to the filtering issue was that I was using a scalar in quotes to define the bounds. I had
replace randint = floor((`varscalar'-1)*runiform()+1)
instead of just
replace randint = floor((varscalar-1)*runiform()+1)
When and where to use quotes in Stata is confusing to me. I cannot just use varscalar in a loop; I have to use `=varscalar'. Yet for some reason I can use varscalar - 1 and get the expected result. Interestingly, I cannot use
replace randint = floor((`varscalar')*runiform()+1)
I suppose I should just use
replace randint = floor((`=varscalar')*runiform()+1)
So why is it OK to use the version with the minus one and without the equals sign?
The answer below is still extremely helpful and I learned much from it.
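A minimal sketch of what is going on with the quotes, assuming a scalar named varscalar (local macros and scalars live in different namespaces, and `name' always refers to the macro):
scalar varscalar = 6

* `varscalar' expands a LOCAL MACRO named varscalar; none is defined here,
* so the line below expands to: display floor((-1)*runiform()+1) -- legal, but wrong
display floor((`varscalar'-1)*runiform()+1)

* without the -1, the expansion is floor(()*runiform()+1), a syntax error,
* which is why that version breaks while the -1 version "works"

* `=varscalar' evaluates the expression varscalar at expansion time:
display floor((`=varscalar'-1)*runiform()+1)

* a bare varscalar inside an expression is also resolved as the scalar:
display floor((varscalar-1)*runiform()+1)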
In fact, two different things are going on here that are not necessarily related: 1) how to filter data with a randomly generated integer value, and 2) the k-fold cross-validation procedure.
For the first one, I will leave an example below that could help you work things out in Stata, with some tools that transfer easily to other problems (such as matrix generation and manipulation to store the metrics). However, I would call neither your sketch of code nor my example "k-fold cross-validation", mainly because both fit the model on the testing as well as the training data. Strictly speaking, the model should be trained on the training data only, and those parameters then used to assess its performance on the testing data.
For further reference on the procedure, scikit-learn has done brilliant work explaining it, with several visualizations included.
That being said, here is something that could be helpful.
clear all
set seed 4
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
gen byte randint = runiformint(1, 5)
tab randint
/*
randint | Freq. Percent Cum.
------------+-----------------------------------
1 | 17 17.00 17.00
2 | 18 18.00 35.00
3 | 21 21.00 56.00
4 | 19 19.00 75.00
5 | 25 25.00 100.00
------------+-----------------------------------
Total | 100 100.00
*/
// create a matrix to store results
matrix res = J(5,4,.)
matrix colnames res = "R2_fold" "MSE_fold" "R2_hold" "MSE_hold"
matrix rownames res ="1" "2" "3" "4" "5"
// show the formatted empty matrix
matrix li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 . . . .
2 . . . .
3 . . . .
4 . . . .
5 . . . .
*/
// loop over different samples
forvalues b = 1/5 {
    // run the model using fold == `b'
    qui reg y x1 x2 if randint == `b'
    // save R-squared for the fold
    matrix res[`b', 1] = e(r2)
    // save RMSE for the fold
    matrix res[`b', 2] = e(rmse)
    // run the model using fold != `b'
    qui reg y x1 x2 if randint != `b'
    // save R-squared for the hold-out sample
    matrix res[`b', 3] = e(r2)
    // save RMSE for the hold-out sample
    matrix res[`b', 4] = e(rmse)
}
// Show matrix with stored metrics
mat li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 .50949187 1.2877728 .74155365 1.0070531
2 .89942838 .71776458 .66401888 1.089422
3 .75542004 1.0870525 .68884359 1.0517139
4 .68140328 1.1103964 .71990589 1.0329239
5 .68816084 1.0017175 .71229925 1.0596865
*/
// some matrix algebra workout to obtain the mean of the metrics
mat U = J(rowsof(res),1,1)
mat sum = U'*res
/* create vector of column (variable) means */
mat mean_res = sum/rowsof(res)
// show the average of the metrics across the folds
mat li mean_res
/*
mean_res[1,4]
R2_fold MSE_fold R2_hold MSE_hold
c1 .70678088 1.0409408 .70532425 1.0481599
*/
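As noted above, a strict cross-validation step would fit only on the training folds and then score the held-out fold with those coefficients. A sketch of how the loop above could be adapted to do this (the tempvar names are arbitrary):
forvalues b = 1/5 {
    // fit on the training folds only
    qui reg y x1 x2 if randint != `b'
    // score the held-out fold using the training-fold coefficients
    tempvar yhat sqerr
    qui predict `yhat' if randint == `b', xb
    qui gen `sqerr' = (y - `yhat')^2 if randint == `b'
    qui sum `sqerr'
    matrix res[`b', 4] = r(mean)   // test MSE for fold `b'
}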

Add mean and sd column in correlation matrix in Stata

I'm trying to create correlation matrix that also includes means and sd's of each variable.
** Set variables used in Summary and Correlation
local variables relationship commission anxiety enjoyment negotiation_efficacy similarity_values similarity_behaviors SPT_confidence own_SPT_effort
** Descriptive statistics
estpost summarize `variables'
matrix table = ( e(mean) \ e(sd) )
matrix rownames table = mean sd
matrix list table
** Correlation matrix
correlate `variables'
matrix C = r(C)
local k = colsof(C)
matrix C = C[1..`=`k'-1',.]
local corr : rownames C
matrix table = ( table \ C )
matrix list table
estadd matrix table = table
local cells table[count](fmt(0) label(Count)) table[mean](fmt(2) label(Mean)) table[sd](fmt(2) label(Standard Deviation))
local drop
foreach row of local corr {
    local drop `drop' `row'
    local cells `cells' table[`row'](fmt(4) drop(`drop'))
}
display "`cells'"
esttab using Report.rtf, ///
    replace ///
    noobs ///
    nonumbers ///
    compress ///
    cells("`cells'")
If it helps, this is what the correlation code looks like:
asdoc corr relationship commission anxiety enjoyment negotiation_efficacy similarity_values similarity_behaviors SPT_confidence own_SPT_effort ranger_SPT_effort cooperative_motivation competitive_motivation, nonum
This correlation matrix looks exactly how it should, but I'm essentially hoping to add means and sd's to the beginning.
This question is cross-posted here: https://www.statalist.org/forums/forum/general-stata-discussion/general/1549809-add-mean-and-sd-column-in-correlation-matrix-in-stata
It's not clear to me whether you want the table to include significance stars or not. If not, you can just use corr and a loop to obtain the sd and mean, then use frmttable. That seems shorter than your current approach. Here's an example:
bcuse wage2
global variables "wage hours educ exper"
corr $variables
matrix corr_t = r(C)
local rows = rowsof(corr_t)
di "`rows'"
matrix add = J(`rows',2,.)
matrix list add
local n = 1
foreach x of global variables {
    sum `x'
    mat add[`n',1] = r(sd)
    mat add[`n',2] = r(mean)
    local n = `n' + 1
}
matrix final = corr_t,add
matrix list final
frmttable, statmat(final) sdec(2) ctitle("","wage", "hours", "educ", "exper","sd","mean") rtitle("wage"\ "hours"\ "educ" \ "exper")
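As an aside, the summary loop can be replaced by a single tabstat call: with the save option, tabstat stores the statistics in r(StatTotal) (statistics in rows, variables in columns), so one transpose yields the same add matrix as above:
* alternative to the foreach loop: one tabstat call
tabstat $variables, stat(sd mean) save
matrix add = r(StatTotal)'    // transpose: variables in rows, (sd, mean) in columns
matrix final = corr_t, add
matrix list final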

How to display r-squared for multiple models using outreg2

I run two regressions for which I would like to show the r-squared:
logit y c.x1 c.x2
quietly est store e1
local r1 = e(r2_p)
logit y c.x1 c.x2 c.x3
quietly est store e2
local r2 = e(r2_p)
I tried to create a matrix and fill it, but was not successful:
mat t1 = J(1,2,0) // defining an empty matrix with 2 columns, 1 row
local rsq `r*' // trying to store r1 and r2 as numeric
local a = 1
forval i = 1/2 {
    mat t1[`i'+1,`a'] = `r*' // filling each row, one at a time; this fails
    loc ++a
}
mat li t1
Ultimately, I would like to export the results with the community-contributed Stata command outreg2:
outreg2 [e*] using "myfile", excel replace addstat(Adj. R^2:, `rsq')
The following works for me:
webuse lbw, clear
logit low age smoke
outreg2 using "myfile.txt", replace addstat(Adj. R^2, e(r2_p))
logit low age smoke ptl ht ui
outreg2 using "myfile.txt", addstat(Adj. R^2, e(r2_p)) append
type myfile.txt
                              (1)             (2)
VARIABLES                     low             low

age                       -0.0498         -0.0541
                         (0.0320)        (0.0339)
smoke                     0.692**           0.557
                          (0.322)         (0.339)
ptl                                       0.679**
                                          (0.344)
ht                                        1.408**
                                          (0.624)
ui                                         0.817*
                                          (0.451)
Constant                   0.0609          -0.168
                          (0.757)         (0.806)

Observations                  189             189
Adj. R^2                   0.0315          0.0882
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1
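For completeness, if you do still want the pseudo R-squareds collected in a matrix as in your attempt, index the stored locals by name with a nested macro rather than `r*' (this assumes the locals r1 and r2 defined earlier in your code):
mat t1 = J(1, 2, 0)
forvalues i = 1/2 {
    mat t1[1, `i'] = `r`i''   // nested expansion: `r1', then `r2'
}
mat li t1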

test with missing standard errors

How can I conduct a hypothesis test in Stata when my predictor perfectly predicts my dependent variable?
I would like to run the same regression over many subsets of my data. For each regression, I would then like to test the hypothesis that beta_1 = 1/2. However, for some subsets, I have perfect collinearity, and Stata is not able to calculate standard errors.
For example, in the below case,
sysuse auto, clear
gen value = 2*foreign*(price<6165)
gen value2 = 2*foreign*(price>6165)
gen id = 1 + (price<6165)
I get the output
. reg foreign value value2 weight length, noconstant

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  4,    70) =       .
       Model |          22     4         5.5           Prob > F      =       .
    Residual |           0    70           0           R-squared     =  1.0000
-------------+------------------------------           Adj R-squared =  1.0000
       Total |          22    74  .297297297           Root MSE      =       0

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       value |         .5          .        .       .            .           .
      value2 |         .5          .        .       .            .           .
      weight |   3.54e-19          .        .       .            .           .
      length |  -6.31e-18          .        .       .            .           .
------------------------------------------------------------------------------
and
. test value = .5

 ( 1)  value = .5

       F(  1,    70) =        .
            Prob > F =        .
In the actual data, there is usually more variation, so I can identify the cases where the predictor does a very good job of predicting the DV, but I miss those cases where prediction is perfect. Is there a way to conduct a hypothesis test that catches these cases?
EDIT:
The end goal would be to classify observations within subsets based on the hypothesis test. If I cannot reject the hypothesis at the 95% confidence level, I classify the observation as type 1. Below, both groups would be classified as type 1, though I only want the second group.
gen type = .
forvalues i = 1/2 {
    quietly: reg foreign value value2 weight length if id == `i', noconstant
    test value = .5
    replace type = 1 if r(p) > .05
}
There is no way to do this out of the box that I'm aware of. Of course you could program it yourself to get an approximation of the p-value in these cases. The standard error is missing here because the relationship between x and y is perfectly collinear. There is no noise in the model, nothing deviates.
Interestingly enough though, the standard error of the estimate is useless in this case anyway. test performs a Wald test for beta_i = exp against beta_i != exp, not a t-test.
The Wald test uses the variance-covariance matrix from the regression. To see this yourself, refer to the Methods and formulas section here and run the following code:
(also, if you remove the -1 from gen mpg2 = and run, you will see the issue)
sysuse auto, clear
gen mpg2 = mpg * 2.5 - 1
qui reg mpg2 mpg, nocons
* collect matrices to calculate the Wald statistic
mat b = e(b)   // vector of coefficients
mat V = e(V)   // variance-covariance matrix
mat R = (1)    // for use in Rb-r; this is not [0,1] because of
               // the noconstant option in regress
mat r = (2.5)  // value you want to test for equality
mat W = (R*b-r)'*inv(R*V*R')*(R*b-r)
// This is where it breaks for you, because with perfect collinearity, V == 0
reg mpg2 mpg, nocons
test mpg = 2.5
sca F = r(F)
sca list F
mat list W
Now, as @Brendan Cox suggested, you might be able to simply use the missing value returned in r(p) to condition your replace command, depending on exactly how you are using it. A word of caution on this, however: when the relationship between some x and y is such that y = 2x, and you compare test x = 5 with test x = 2, you will want to be very careful about the interpretation of missing p-values. In both cases the group is classified as type == 1, where the test x = 2 command should not result in that outcome.
Another work-around would be to simply set p = 0 in these cases, since the variance estimate will asymptotically approach 0 as the linear relationship becomes near perfect, and thus the Wald statistic will approach infinity (driving p down, all else equal).
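A sketch of that simpler work-around, using the question's own example variables, where a missing r(p) is mapped to p = 0 before classifying:
* treat a missing p-value from -test- as p = 0 (perfect-fit case)
gen type = .
forvalues i = 1/2 {
    quietly reg foreign value value2 weight length if id == `i', noconstant
    capture test value = .5
    local p = cond(missing(r(p)), 0, r(p))
    replace type = 1 if `p' > .05 & id == `i'
}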
A final yet more complicated work-around in this case could be to calculate the F-statistic manually using the formula in the manual, and setting V to some arbitrary, yet infinitesimally small number. I've included code to do this below, but it is quite a bit more involved than simply issuing the test command, and in truth only an approximation of the actual p-value from the F distribution.
clear *
sysuse auto
gen i = ceil(_n/5)
qui sum i
gen mpg2 = mpg * 2 if i <= 5 // Get different estimation results
replace mpg2 = mpg * 10 if i > 5 // over different subsets of data
gen type = .
local N = _N // use for d.f. calculation later
local iMax = r(max) // use to iterate loop
forvalues i = 1/`iMax' {
    qui reg mpg2 mpg if i == `i', nocons
    mat b`i' = e(b)             // collect returned results for the Wald stat
    mat V`i' = e(V)
    sca cov`i' = V`i'[1,1]
    mat R`i' = (1)
    mat r`i' = (2)              // value you wish to test against
    if (cov`i' == 0) {          // set V to be very small if the variance is 0
        mat V`i' = 1.0e-14
    }
    mat W`i' = (R`i'*b`i'-r`i')'*inv(R`i'*V`i'*R`i'')*(R`i'*b`i'-r`i')
    sca W`i' = W`i'[1,1]        // collect the Wald statistic into a scalar
    sca p`i' = Ftail(1, `N'-2, W`i')   // pull the p-value from the F dist
    if p`i' > .05 {
        replace type = 1 if i == `i'
    }
}
Also note that this workaround will become slightly more involved if you want to test multiple coefficients.
I would not advise these approaches without a word of caution: you are in a very real sense "making up" variance estimates. On the other hand, without a variance estimate you won't be able to test the coefficients at all.

Plot confidence interval efficiently

I want to plot confidence intervals for some estimates after running a regression model.
As I'm working with a very big dataset, I need an efficient solution: in particular, a solution that does not require me to sort or save the dataset. In the following example, I plot estimates for b1 to b6:
reg y b1 b2 b3 b4 b5 b6
foreach i of numlist 1/6 {
    local mean `mean' `=_b[b`i']' `i'
    local ci `ci' ///
        (scatteri ///
        `=_b[b`i'] + 1.96 * _se[b`i']' `i' ///
        `=_b[b`i'] - 1.96 * _se[b`i']' `i' ///
        , lpattern(shortdash) lcolor(navy))
}
twoway `ci' (scatteri `mean', mcolor(navy)), legend(off) yline(0)
While scatteri efficiently plots the estimates, I can't get boundaries for the confidence interval similar to rcap.
Is there a better way to do this?
Here's token code for what you seem to want. The example is ridiculous. It's my personal view that refining this would be pointless given the very accomplished previous work behind coefplot. The multiplier of 1.96 only applies in very large samples.
sysuse auto, clear
set scheme s1color
reg mpg weight length displ
gen coeff = .
gen upper = .
gen lower = .
gen which = .
local i = 0
quietly foreach v in weight length displ {
    local ++i
    replace coeff = _b[`v'] in `i'
    replace upper = _b[`v'] + 1.96 * _se[`v'] in `i'
    replace lower = _b[`v'] - 1.96 * _se[`v'] in `i'
    replace which = `i' in `i'
    label def which `i' "`v'", modify
}
label val which which
twoway scatter coeff which, mcolor(navy) xsc(r(0.5, `i'.5)) xla(1/`i', val) ///
|| rcap upper lower which, lcolor(navy) xtitle("") legend(off)
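For comparison, the coefplot route mentioned at the start produces essentially the same picture in one line after the regression (install it with ssc install coefplot first):
sysuse auto, clear
reg mpg weight length displ
* drop the constant, plot coefficients vertically with rcap-style CIs
coefplot, drop(_cons) vertical yline(0) ciopts(recast(rcap))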