Prediction after boxcox model in Stata

I am trying to reproduce the results of predict after boxcox in Stata 13 with my own code, following the steps described in the Stata manual (page 5).
Following is the sample code I used:
sysuse auto,clear
local indepvar weight foreign length
qui boxcox price `indepvar' ,model(lhsonly)lrtest
qui predict yhat1
qui predict resid1, residuals
//yhat2 and resid2 computed using the procedure described in Stata manual
set more off
set type double
mat coef=e(b)
local nosvar=colsof(coef)-2
qui gen constant=1
local varname weight foreign length constant
local coefname weight foreign length _cons
//step 1: compute residuals first
forvalues k = 1/`nosvar'{
local varname1 : word `k' of `varname'
local coefname1 : word `k' of `coefname'
qui gen xb`varname1'=`varname1'*_b[`coefname1']
}
qui egen xb=rowtotal(xb*)
qui gen resid=(price^(_b[theta:_cons]))-xb
//step 2: compute predicted value
qui gen yhat2=.
local noobs=_N
local theta=_b[theta:_cons]
forvalues j=1/`noobs'{
qui gen temp`j'=.
forvalues i=1/`noobs'{
qui replace temp`j'=((`theta'*(xb[`j']+resid[`i']))+1)^(1/`theta') if _n==`i'
}
qui sum temp`j'
local tempmean`j'=r(mean)
qui replace yhat2=`tempmean`j'' if _n==`j'
drop temp`j'
}
drop resid
qui gen double resid2=price-yhat2
sum yhat* resid*
    Variable |       Obs        Mean    Std. Dev.         Min         Max
-------------+-----------------------------------------------------------
       yhat1 |        74    6254.224    2705.175    3428.361    21982.45
       yhat2 |        74    1.000035    8.13e-06    1.000015    1.000054
      resid1 |        74   -88.96723    2094.162   -10485.45    6980.013
      resid2 |        74    6164.257    2949.496        3290       15905
Note: yhat1 and resid1 are based on Stata predict, while yhat2 and resid2 are based on my sample code. The comparison is needed to make sure the marginal effect I computed is correct (margins doesn't compute the marginal effect after boxcox).

Your definition of the first residual is wrong, because you missed the definition of y^(\lambda) on page 3 of the Manual. See also the Methods and Formulas section in the Manual entry for boxcox itself.
Translated to your problem, in the line
qui gen resid=(price^(_b[theta:_cons]))-xb
the term
price^(_b[theta:_cons])
should be:
(price^(_b[theta:_cons])-1)/_b[theta:_cons]
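For reference, the quantities involved can be written out as follows. This is a sketch based on the correction above and on the back-transformation coded in the question's step 2; the notation ($\theta$ for the left-hand-side parameter, $x_i\hat\beta$ for the linear prediction, $\hat e_j$ for the residuals) is an assumption chosen to match the variable names in the code, not a quotation from the manual.

```latex
% Box-Cox transform of the dependent variable (lhsonly model)
y^{(\theta)} =
\begin{cases}
  \dfrac{y^{\theta} - 1}{\theta}, & \theta \neq 0 \\[1.5ex]
  \ln y, & \theta = 0
\end{cases}

% Residual on the transformed scale (what `resid` should hold)
\hat e_j = y_j^{(\hat\theta)} - x_j\hat\beta

% Predicted value on the original scale, averaging over all N residuals
% (this is what the double loop in step 2 of the question computes)
\hat y_i = \frac{1}{N} \sum_{j=1}^{N}
  \left[ \hat\theta \left( x_i\hat\beta + \hat e_j \right) + 1 \right]^{1/\hat\theta}
```

With the original line, the residual was built from $y^{\theta}$ instead of $y^{(\theta)}$, so the terms inside the brackets were on the wrong scale.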


How to add ttest results to esttab

I am constructing a table of means and p-values from ttest. How can I get all of this in the same esttab table? Here is an MWE:
Load the sample, save it as a temporary file, create a local with the variables to consider, and create a second local holding the number of variables in the first:
sysuse auto2, clear
*create two groups: 0 and 1
gen group = _n<37
tempfile a
save `a'
local vars "price headroom trunk weight"
local vars_n: word count `vars'
ssc install estout
eststo clear
Calculate the means of group 0 (column 1) and group 1 (column 2):
*group 0 means
use `a', clear
keep if group==0
eststo: estpost sum `vars'
*group 1 means
use `a', clear
keep if group==1
eststo: estpost sum `vars'
Conduct t-tests for each variable (is there an easier way to do this?):
*t-test
*create blank matrix
matrix pval = J(`vars_n',1,.)
use `a', clear
forvalues i=1/`vars_n' {
local var `: word `i' of `vars''
ttest `var', by(group)
*add the two-sided p-value to matrix
matrix pval[`i',1]=r(p)
}
The previous block of code saves the p-values (column 3) into a matrix.
Use esttab to output the results:
esttab, cells(mean(fmt(2))) collabels(none) nodepvars nonumber replace label
esttab matrix(pval, fmt(2 0))
My issue is that I need to have the p-values in the same esttab as the means, but I currently have them in a matrix. How can I use something like eststo: estpost to get them so that I can use esttab (as opposed to esttab matrix)? Or is there a better way to do all of this? My goal is to run esttab, cells(mean(fmt(2))) collabels(none) nodepvars nonumber replace label and have it create a table with the first two columns being the means and the third column being the p-values.
All the information you need is in estpost ttest, so an easy solution would be this:
sysuse auto2, clear
gen group = _n<37
local vars price headroom trunk weight
estpost ttest `vars', by(group)
esttab ., cells("mu_1 mu_2 p") nonumber label
-----------------------------------------------------------
                             mu_1         mu_2            p
-----------------------------------------------------------
Price                    5847.526     6500.639     .3445597
Headroom (in.)           2.828947     3.166667     .0861206
Trunk space (.. ft.)     12.39474     15.19444     .0041618
Weight (lbs.)            2654.474     3404.722     .0000115
-----------------------------------------------------------
Observations                   74
-----------------------------------------------------------
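If the raw display formats are too wide, the same call can be tidied up. The following is a minimal sketch, assuming the estout package is installed; the column labels "Group 0", "Group 1" and "p-value" are illustrative choices, not part of the original answer.

```stata
* Assumes estout is installed (ssc install estout)
sysuse auto2, clear
gen byte group = _n < 37

* estpost ttest stores the group means and p-values as e() matrices
estpost ttest price headroom trunk weight, by(group)

* fmt() sets the display format of each statistic;
* collabels() renames the columns (labels here are illustrative);
* noobs drops the Observations row
esttab ., cells("mu_1(fmt(2)) mu_2(fmt(2)) p(fmt(3))") ///
    collabels("Group 0" "Group 1" "p-value") nonumber noobs label
```

Here mu_1 is the mean for the first level of the by() variable (group == 0) and mu_2 the mean for the second (group == 1).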

Simulating AR(1) in Stata from last observation to the first

I want to simulate an AR(1) process, but start from the end. But my code does not work as expected:
clear
set obs 100
gen et=rnormal(0,1)
quietly gen yt= et in L
quietly replace yt=0.5*yt[_n+1]+et in 1/L-1
Your help is really appreciated.
Just do it the normal way and then reverse order:
clear
set obs 100
gen obs = -_n
gen et=rnormal(0,1)
quietly gen yt = et in 1
quietly replace yt = 0.5*yt[_n-1] + et in 2/L
sort obs
The key is that Stata processes observations in order, so the replace cascades forward: the value for observation 2 depends on observation 1, observation 3 on observation 2, and so forth.
You won't get a cascade going in the other direction.
Also, set the seed for reproducibility.
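Putting these points together, a minimal sketch of the full simulation might look like this. The seed value is arbitrary, and double precision is my own addition rather than something the answer requires.

```stata
clear
set seed 12345                       // arbitrary seed, for reproducibility
set obs 100

gen obs = -_n                        // sort key that will reverse the order
gen double et = rnormal(0, 1)

* forward recursion: observation 2 uses observation 1, 3 uses 2, ...
quietly gen double yt = et in 1
quietly replace yt = 0.5*yt[_n-1] + et in 2/L

* reverse the data so the process starts at the last observation
sort obs
```

After the sort, the starting value sits in the last observation and each earlier observation satisfies yt = 0.5*yt[_n+1] + et, which is what the question asked for.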

Fixed effects regression with a loop is not working

I run the following fixed-effects regression in a loop, and I always get the error "option if is not allowed":
levelsof Sic, local(Sic)
xtset Year
foreach i of local Sic {
xtreg y mq r d, fe if Sic == `i'
eststo
}
If I run the same regression as a plain OLS regression, it works without any problems. Why?
The if qualifier should come before the comma, options afterwards.
levelsof Sic, local(Sic)
foreach i of local Sic {
eststo: xtreg y mq r d if Sic == `i', fe
}
I assume it worked with OLS because you did not have to specify any options.
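The same ordering can be tried on a shipped dataset. This is only an illustrative sketch: using rep78 as the panel variable and foreign as the looping variable is an assumption made so the example runs on auto.dta, and has nothing to do with the asker's data.

```stata
sysuse auto, clear
xtset rep78                          // panel variable chosen only for illustration

levelsof foreign, local(groups)
foreach g of local groups {
    * general pattern: command varlist [if] [in] [, options]
    * so the if qualifier goes before the comma and fe after it
    xtreg price weight length if foreign == `g', fe
}
```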

Stata: DFL Decomposition and Bootstrapping with Complex Survey design

Hello I am trying to do a DFL style reweighting with bootstrap weights and SEs. I have a 2 stage stratified sample over 5 rounds (repeated cross section).
The idea is to create counterfactual weights for the reference population and then find the difference in mean outcomes for the two groups. This difference can be divided into three parts
Total difference (group 1 - group 2 , both using survey weights)
Explained difference (group 2 using counterfactual weights- group 2 using survey weights)
Unexplained difference (group 1 using survey weights- group 2 using counterfactual weights)
I have written the following program for the same
Code:
///to make sure there is no singleton strata
egen cluster_id= group(sector state region strat fsu)
egen stratum_id= group(sector state region strat)
foreach r in 1 2 3 4 5 {
preserve
qui keep if round==`r'
qui svyset cluster_id [pw=hhwt] , strata (stratum_id)
qui unique cluster_id, by (stratum_id) gen (dup)
qui by stratum_id, sort: egen temp= total(dup)
count if temp==1
drop if temp==1
drop temp dup
save "C:\Users\Round 2 Data\bs_round`r'", replace
restore
}
Code:
///final data we will use
use "C:\Users\Round 2 Data\bs_round1"
foreach r in 2 3 4 5 {
qui merge m:m round using "C:\Users\Round 2 Data\bs_round`r'"
drop _merge
sort round
tab round
}
save "C:\Users\Round 2 Data\bs_all"
Code:
///constructing bootstrap weights
egen pooled_cid= group (cluster_id round)
egen pooled_sid= group (stratum_id round)
svyset pooled_cid [pw=hhwt], strata( pooled_sid)
bsweights bsw, reps(100) n(-1)
svyset pooled_cid [pw=hhwt], strata( pooled_sid) bsrweight(bsw*) vce(bootstrap)
Code:
///writing the program
#delimit ;
capture program drop mydfl;
program define mydfl, eclass properties (svyb);
version 13;
args wgtname xvars outcome;
gen groupref=(group==1);
egen countg1=sum(group==1);
egen countg2=sum(group==2);
logit groupref `xvars';
predict phatref;
gen `wgtname'2=(phatref/(1-phatref))*(countg2/countg1) if group==2;
replace `wgtname'2=1 if group==1;
gen `wgtname'1=((1-phatref)/phatref)*(countg1/countg2) if group==1;
replace `wgtname'1=1 if group==2;
drop phatref groupref countg*;
forvalues i=1/2 {;
sum `wgtname'`i' if group==`i';
replace `wgtname'`i' = `wgtname'`i' / r(mean) if group==`i';
};
mean `outcome' if group==1 ;
mat diff_1=e(b) ;
mean `outcome' if group==2 ;
mat diff_2=e(b) ;
mean `outcome' if group==2 [pw=`wgtname'2] ;
mat diff_3=e(b) ;
mat dd_t = diff_1-diff_2 ;
mat dd_e= diff_3-diff_2 ;
mat dd_u= diff_1-diff_3 ;
ereturn scalar dd_tot=el(dd_t,1,1) ;
ereturn scalar dd_exp=el(dd_e,1,1) ;
ereturn scalar dd_unex=el(dd_u,1,1) ;
end;
Code:
///running the program
local xvars age i.state fhead yrs_ed marital rural
local outcome wage
svy bootstrap e(dd_tot) e(dd_exp) e(dd_unex): mydfl wtid "`xvars'" `outcome'
I want to find the standard error for the mean gap, mean explained gap and mean unexplained gap in outcome-in this case wage of the two groups.
I keep getting the following error (after the program creates wtid1 and wtid2)
Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 50
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 100
insufficient observations to compute bootstrap standard errors
no results will be saved
What am I doing wrong?
Also posted on http://www.statalist.org/forums/forum/general-stata-discussion/general/1309830-dfl-decomposition-and-bootstrapping-with-complex-survey-design
A certain cause of bootstrap failure is that the program creates permanent variables.
Here is the first generate statement:
gen groupref=(group==1);
bootstrap first runs the program on the entire dataset, which adds the variable groupref. Next, the first bootstrap replicate is drawn and the program is run on it. The generate statement now fails because the variable already exists. The entire program therefore fails, and the only indication is the "x" in the replication output.
The solution is to designate all variables created by generate, egen, or predict as temporary variables. These will be dropped after each replicate is analyzed. Here is the usage:
tempvar groupref;
gen `groupref' = (group==1);
tempvar creates local macros that hold temporary variable names, and it can take a list of names. The related commands tempname and tempfile do the same for scalar or matrix names and for files.
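To see the pattern in isolation, here is a deliberately simplified sketch. The program mygap is hypothetical: it is an rclass stand-in run with plain bootstrap on auto.dta, not the asker's eclass mydfl program or the survey bootstrap. The point is only that a program which creates variables through tempvar can be called repeatedly.

```stata
capture program drop mygap
program define mygap, rclass
    version 13
    args outcome
    tempvar dom                      // temporary name; the variable is dropped when the program ends
    gen byte `dom' = (foreign == 0)
    summarize `outcome' if `dom' == 1, meanonly
    local m1 = r(mean)
    summarize `outcome' if `dom' == 0, meanonly
    local m0 = r(mean)
    return scalar gap = `m1' - `m0'  // difference in group means
end

sysuse auto, clear
set seed 12345                       // arbitrary seed
bootstrap gap = r(gap), reps(50) nodots: mygap price
```

Replacing the tempvar with a plain gen dom = ... would make the second and later calls fail, because dom would already exist in the data.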

Use esttab to generate summary statistics by group with columns for mean difference and significance

I would like to use esttab (ssc install estout) to generate summary statistics by group with columns for the mean difference and significance. It is easy enough to generate these as two separate tables with estpost, summarize, and ttest, and combine manually, but I would like to automate the whole process.
The following code generates the two components of the desired table.
sysuse auto, clear
* summary statistics by group
eststo clear
by foreign: eststo: quietly estpost summarize ///
price mpg weight headroom trunk
esttab, cells("mean sd") label nodepvar
* difference in means
eststo: estpost ttest price mpg weight headroom trunk, ///
by(foreign) unequal
esttab ., wide label
And I can print the two tables and cut-and-paste them into one table.
* can generate similar tables and append horizontally
esttab, cells("mean sd") label
esttab, wide label
* manual, cut-and-paste solution
-----------------------------------------------------------------------------------------------------
                              (1)                       (2)                       (3)
                             mean           sd         mean           sd
-----------------------------------------------------------------------------------------------------
Price                    6072.423     3097.104     6384.682     2621.915       -312.3         (-0.44)
Mileage (mpg)            19.82692     4.743297     24.77273     6.611187       -4.946**       (-3.18)
Weight (lbs.)            3317.115     695.3637     2315.909     433.0035       1001.2***       (7.50)
Headroom (in.)           3.153846     .9157578     2.613636     .4862837        0.540**        (3.30)
Trunk space (.. ft.)        14.75     4.306288     11.40909     3.216906        3.341***       (3.67)
-----------------------------------------------------------------------------------------------------
Observations                   52                        22                        74
-----------------------------------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
It seems that I should be able to get the desired table with one esttab call and without cutting-and-pasting, but I can't figure it out. Is there a way to generate the desired table without manually cutting-and-pasting?
I would prefer to output a LaTeX table, but anything that eliminates the cutting-and-pasting is a big step, even passing through a delimited text file.
If you still want to use esttab, you can play around using cells and pattern. The table in the original post can be replicated with the following code:
sysuse auto, clear
eststo domestic: quietly estpost summarize ///
price mpg weight headroom trunk if foreign == 0
eststo foreign: quietly estpost summarize ///
price mpg weight headroom trunk if foreign == 1
eststo diff: quietly estpost ttest ///
price mpg weight headroom trunk, by(foreign) unequal
esttab domestic foreign diff, ///
cells("mean(pattern(1 1 0) fmt(2)) sd(pattern(1 1 0)) b(star pattern(0 0 1) fmt(2)) t(pattern(0 0 1) par fmt(2))") ///
label
which yields
-----------------------------------------------------------------------------------------------------
                              (1)                       (2)                       (3)
                             mean           sd         mean           sd            b               t
-----------------------------------------------------------------------------------------------------
Price                     6072.42      3097.10      6384.68      2621.92      -312.26         (-0.44)
Mileage (mpg)               19.83         4.74        24.77         6.61        -4.95**       (-3.18)
Weight (lbs.)             3317.12       695.36      2315.91       433.00      1001.21***       (7.50)
Headroom (in.)               3.15         0.92         2.61         0.49         0.54**        (3.30)
Trunk space (.. ft.)        14.75         4.31        11.41         3.22         3.34***       (3.67)
-----------------------------------------------------------------------------------------------------
Observations                   52                        22                        74
-----------------------------------------------------------------------------------------------------
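Since the question asks for LaTeX output, the same call can write to a file. This is a hedged sketch assuming estout is installed; the file name table.tex and the column titles are placeholders of my own, and the booktabs option requires the LaTeX booktabs package when the file is compiled.

```stata
* Assumes estout is installed (ssc install estout)
sysuse auto, clear
eststo clear
eststo domestic: quietly estpost summarize ///
    price mpg weight headroom trunk if foreign == 0
eststo foreign: quietly estpost summarize ///
    price mpg weight headroom trunk if foreign == 1
eststo diff: quietly estpost ttest ///
    price mpg weight headroom trunk, by(foreign) unequal

* "table.tex" is a placeholder file name; mtitles() labels are illustrative
esttab domestic foreign diff using "table.tex", ///
    cells("mean(pattern(1 1 0) fmt(2)) sd(pattern(1 1 0) fmt(2)) b(star pattern(0 0 1) fmt(2)) t(pattern(0 0 1) par fmt(2))") ///
    mtitles("Domestic" "Foreign" "Difference") ///
    label booktabs replace
```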
I don't think there's a way to do this with esttab (estout package from ssc), but I have a solution with listtab (also ssc) and postfile. The table here is a little different than the one I propose above, but the approach is general enough that you can modify it to fit your needs.
This solution also uses LaTeX's booktabs package.
/* data and variables */
sysuse auto, clear
local vars price mpg weight headroom trunk
/* means */
tempname postMeans
tempfile means
postfile `postMeans' ///
str100 varname domesticMeans foreignMeans pMeans using "`means'", replace
foreach v of local vars {
local name: variable label `v'
ttest `v', by(foreign)
post `postMeans' ("`name'") (r(mu_1)) (r(mu_2)) (r(p))
}
postclose `postMeans'
/* medians */
tempname postMedians
tempfile medians
postfile `postMedians' ///
domesticMedians foreignMedians pMedians using `medians', replace
foreach v of local vars {
summarize `v' if !foreign, detail
local med1 = r(p50)
summarize `v' if foreign, detail
local med2 = r(p50)
ranksum `v', by(foreign)
local pval = 2 * (1 - normal(abs(r(z))))
post `postMedians' (`med1') (`med2') (`pval')
}
postclose `postMedians'
/* combine */
use `means'
merge 1:1 _n using `medians', nogenerate
format *Means *Medians %9.3gc
list
/* make latex table */
/* requires LaTeX package `booktabs` */
listtab * using "Table.tex", ///
rstyle(tabular) replace ///
head("\begin{tabular}{lcccccc}" ///
"\toprule" ///
"& \multicolumn{3}{c}{Means} & \multicolumn{3}{c}{Medians} \\" ///
"\cmidrule(lr){2-4} \cmidrule(lr){5-7}" ///
"& Domestic & Foreign & \emph{p} & Domestic & Foreign & \emph{p}\\" ///
"\midrule") ///
foot("\bottomrule" "\end{tabular}")
The chosen answer is nice but a bit redundant. You can achieve the same result with only estpost ttest.
sysuse auto, clear
estpost ttest price mpg weight headroom trunk, by(foreign)
esttab, cells("mu_1 mu_2 b(star)")
The output looks like this:
mu_1 mu_2 b
c_score 43.33858 42.034 1.30458***
nc_a4_17 4.007524 3.924623 .0829008*