I am attempting to demonstrate the characteristics of various tests on small samples of data. I would like to compare the performance of the t-test, the t-test with bootstrap estimation, and the rank-sum test. I am interested in obtaining the p-value for each test on multiple sets of data using simulate. However, I cannot obtain t-test estimates using the bootstrap prefix with the ttest command.
The data is generated by:
clear
set obs 60
gen level = abs(rnormal(0,1))
gen group = "A"
replace group = "B" if [_n] >30
bootstrap, reps(100): ttest level, by(group)
bootstrap _b, reps(100): ttest level, by(group)
bootstrap boot_p = e(p), reps(100): ttest level, by(group)
The errors for each of the procedures in order are:
expression list required
invalid expression: _b
'e(p)' evaluated to missing in full sample
These results are not consistent with the documentation for the bootstrap prefix. Is there some problem with the specification of e-class or r-class objects and ttest?
Edit:
Understanding now that r-class is the correct group of scalars, I still do not generate a variable 'p' given the code provided in the solution. Additionally:
clear
set more off
set obs 60
gen level = abs(rnormal(0,1))
gen group = "A"
replace group = "B" if [_n] >30
bootstrap p=r(p), reps(100): ttest level, by(group)
display r(p)
does not return the p-value.
ttest is an r-class command and stores its results in r(). You seem to expect it to save results in e(), like an e-class command. As a rule, e-class commands are those that fit models; ttest is not in this category.
The two-sided p-value is stored in r(p), as indicated in help ttest:
clear
set more off
set obs 60
gen level = abs(rnormal(0,1))
gen group = "A"
replace group = "B" if [_n] >30
bootstrap p=r(p), reps(100): ttest level, by(group)
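On the follow-up in the edit: bootstrap is itself an e-class command, so its own results are posted to e(), and no variable p is added to the data in memory. The replicate values are only kept if you ask for them with saving(). A minimal sketch of where the numbers end up, assuming the same simulated data (the seed and the file name boot_p are arbitrary choices for illustration, and _n > 30 is written in place of [_n] >30):

```stata
clear
set seed 12345
set obs 60
gen level = abs(rnormal(0,1))
gen group = "A"
replace group = "B" if _n > 30

* ttest is r-class: r(p) is available right after the command itself
ttest level, by(group)
display r(p)

* bootstrap is e-class: the observed statistic is posted to e(b) under the name p
bootstrap p=r(p), reps(100) saving(boot_p, replace): ttest level, by(group)
matrix list e(b)
display _b[p]

* the 100 replicate values are in the saved dataset, as a variable named p
use boot_p, clear
summarize p
```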
Related
I have run a CEM matching process on my data using Stata, and now I would like to know how to run a t-test on the variables of the matched data.
/* Simple example of my code; first I run the CEM */
cem age gender education, treatment(treat)
/* Then I want to have a look at the summary statistics of the entire population and the matched data (this code works fine) */
summarize age gender education
summarize age gender education [iweight=cem_weights]
/* But if I want to do a t test only on the matched data, I get an error with the weights */
ttest age, by(treat) /* works fine */
ttest age [iweight=cem_weights], by(treat) /* error saying that weights are not allowed */
How can I run t tests only on the matched data? An option could also be to export the matched data, so how could I do that?
Something like this should do it:
sysuse auto, clear
cem mpg weight rep78, treatment(foreign)
mean price [iweight = cem_weights], over(foreign) //coeflegend
lincom _b[c.price#1.foreign] - _b[c.price#0.foreign]
Personally, I would just use regress with heteroskedasticity-robust standard errors (which gives similar results here, and will get even closer with a bigger sample):
regress price i.foreign [iweight = cem_weights], robust
So I am running a 2SLS model by interview year, and I have many interview years and different models. I want to present the first-stage results first and then, after reassuring the reader that they are solid, move on to the interesting results.
Example of Table A (first stage):
Year  DV  Coef   SE     F      N
1     A    0.5   0.1   100   1000
2     A    0.8   0.2    10   1500
3     B   -0.6   0.4   800    800
Table B with the main results would look the same just without the F-Stat.
I searched the web for how to create those tables automatically in Stata, but despite finding many questions I didn't find an answer that worked for me. From those different posts and help files I built something that is nearly there.
It creates the table I want for the main results with the F-Stat together by some variable (Step A in the code). However, when I move on to do the same for the first stage it only saves the last wave as I restore the estimates. I understand why Stata does it like that, but I cannot think of a way of convincing it to do what I want.
clear all
*Install user-written commands
ssc install outreg2, replace
ssc install ivreg210, replace
*load data
sysuse auto, clear
*run example model (obviously the model itself is bogus)
********************************************************
*Step A: creates the IV results by foreign plus the F-Statistic
bys foreign: ///
outreg2 using output1-IV-F, label excel stats(coef se) dec(2) adds(F-Test, e(widstat)) nocons nor2 keep(mpg) replace: ///
ivreg210 price headroom trunk (mpg=rep78 ), savefirst first
*Step B: creates the first stage results in a separate table
bys foreign: ///
ivreg210 price headroom trunk (mpg=rep78 ), savefirst first
est restore _ivreg210_mpg
outreg2 using output1_1st-stage, replace keep(rep78)
cap erase output1-IV-F
cap erase output1_1st-stage
So ideally I would only run the model once and have the F-Stat in the first-stage table, but I can fix that manually. The biggest issue I have is how to store the estimates when using bysort. If anyone has any suggestions about that, I would greatly appreciate it.
Thanks!
Install estout:
ssc install estout
Then you can store whatever results you want for later use, even after a bysort.
eststo clear
sysuse auto, clear
bysort foreign: eststo: reg price weight mpg
esttab, label nodepvar nonumber
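The same idea can be stretched to cover both stages. The following is only a sketch under stated assumptions: it swaps in official ivregress 2sls for ivreg210, runs the first stage as its own regress, loops over the two values of foreign instead of using bysort, and uses made-up names (first_0, iv_0, F_excl, and so on). The F statistic added here is a plain test of the excluded instrument in the first-stage regression, not e(widstat).

```stata
sysuse auto, clear
eststo clear
foreach g in 0 1 {
    * first stage, run as its own regression for this group
    quietly regress mpg rep78 headroom trunk if foreign == `g'
    * F statistic for the excluded instrument, added before the results are stored
    quietly test rep78
    estadd scalar F_excl = r(F)
    eststo first_`g'

    * second stage for the same group
    quietly ivregress 2sls price headroom trunk (mpg = rep78) if foreign == `g'
    eststo iv_`g'
}
* Table A: first stage with the F statistic; Table B: main results
esttab first_0 first_1, keep(rep78) se scalars(F_excl) mtitles("Domestic" "Foreign")
esttab iv_0 iv_1, keep(mpg) se mtitles("Domestic" "Foreign")
```

Because each set of results is stored under its own name, nothing is overwritten when the next group is estimated.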
This is a roundabout solution. It works, but it really isn't the proper solution I was (and still am) looking for. The "trick" is to run the first stage as a separate model.
clear all
*Install user-written commands
ssc install outreg2, replace
ssc install ivreg210, replace
*load data
sysuse auto, clear
*run example model (obviously the model itself is bogus)
********************************************************
*Step A: creates the IV results by foreign plus the F-Statistic
bys foreign: ///
outreg2 using output1-IV-F, label excel stats(coef se) dec(2) adds(F-Test, e(widstat)) nocons nor2 keep(mpg) replace: ///
ivreg210 price headroom trunk (mpg=rep78 ), savefirst first
*Step B: creates the first stage results in a separate table
bys foreign: ///
ivreg210 price headroom trunk (mpg=rep78 ), savefirst first
est restore _ivreg210_mpg
outreg2 using output1_1st-stage1, replace keep(rep78)
*************
/* NEW BIT */
*************
*Step C: creates the first stage results in a separate table
bys foreign: ///
outreg2 using output1_1st_NEW, label excel stats(coef se) dec(2) nocons nor2 keep(rep78) replace: ///
reg mpg headroom trunk rep78
cap erase output1-IV-F
cap erase output1_1st-stage1
cap erase output1_1st_NEW
In what follows I plot the mean of an outcome of interest (price) by a grouping variable (foreign) for each possible value taken by the fake variable time:
sysuse auto, clear
gen time = rep78 - 3
bysort foreign time: egen avg_p = mean(price)
scatter avg_p time if (foreign==0 & time>=0) || ///
scatter avg_p time if (foreign==1 & time>=0), ///
legend(order(1 "Domestic" 2 "Foreign")) ///
ytitle("Average price") xlab(#3)
What I would like to do is to plot the difference in the two group means over time, not the two separate means.
I am surely missing something, but to me it looks complicated because the information about the averages is stored "vertically" (in avg_p).
Arguably the easiest way to do this is to use linear regression to estimate the differences:
/* Regression Way */
drop if time < 0 | missing(time)
reg price i.foreign##i.time
margins, dydx(foreign) at(time =(0(1)2))
marginsplot, noci title("Foreign vs Domestic Difference in Price")
If regression is hard to wrap your mind around, the other way involves mangling the data with a reshape:
/* Transform the Data */
keep price time foreign
collapse (mean) price, by(time foreign)
reshape wide price, i(time) j(foreign)
gen diff = price1-price0
tw connected diff time
Here is another approach. graph dot will happily plot means.
sysuse auto, clear
set scheme s1color
collapse price if inrange(rep78, 3, 5), by(foreign rep78)
reshape wide price, i(rep78) j(foreign)
rename price0 Domestic
label var Domestic
rename price1 Foreign
label var Foreign
graph dot (asis) Domestic Foreign, over(rep78) vertical ///
marker(1, ms(Oh)) marker(2, ms(+))
Hello, I am trying to do a DFL-style reweighting with bootstrap weights and SEs. I have a two-stage stratified sample over 5 rounds (repeated cross section).
The idea is to create counterfactual weights for the reference population and then find the difference in mean outcomes for the two groups. This difference can be divided into three parts:
Total difference (group 1 - group 2 , both using survey weights)
Explained difference (group 2 using counterfactual weights- group 2 using survey weights)
Unexplained difference (group 1 using survey weights- group 2 using counterfactual weights)
I have written the following program for this purpose:
Code:
///to make sure there is no singleton strata
egen cluster_id= group(sector state region strat fsu)
egen stratum_id= group(sector state region strat)
foreach r in 1 2 3 4 5 {
preserve
qui keep if round==`r'
qui svyset cluster_id [pw=hhwt] , strata (stratum_id)
qui unique cluster_id, by (stratum_id) gen (dup)
qui by stratum_id, sort: egen temp= total(dup)
count if temp==1
drop if temp==1
drop temp dup
save "C:\Users\Round 2 Data\bs_round`r'", replace
restore
}
Code:
///final data we will use
use "C:\Users\Round 2 Data\bs_round1"
foreach r in 2 3 4 5 {
qui merge m:m round using "C:\Users\Round 2 Data\bs_round`r'"
drop _merge
sort round
tab round
}
save "C:\Users\Round 2 Data\bs_all"
Code:
///constructing bootstrap weights
egen pooled_cid= group (cluster_id round)
egen pooled_sid= group (stratum_id round)
svyset pooled_cid [pw=hhwt], strata( pooled_sid)
bsweights bsw, reps(100) n(-1)
svyset pooled_cid [pw=hhwt], strata( pooled_sid) bsrweight(bsw*) vce(bootstrap)
Code:
///writing the program
#delimit ;
capture program drop mydfl;
program define mydfl, eclass properties (svyb);
version 13;
args wgtname xvars outcome;
gen groupref=(group==1);
egen countg1=sum(group==1);
egen countg2=sum(group==2);
logit groupref `xvars';
predict phatref;
gen `wgtname'2=(phatref/(1-phatref))*(countg2/countg1) if group==2;
replace `wgtname'2=1 if group==1;
gen `wgtname'1=((1-phatref)/phatref)*(countg1/countg2) if group==1;
replace `wgtname'1=1 if group==2;
drop phatref groupref countg*;
forvalues i=1/2 {;
sum `wgtname'`i' if group==`i';
replace `wgtname'`i' = `wgtname'`i' / r(mean) if group==`i';
};
mean `outcome' if group==1 ;
mat diff_1=e(b) ;
mean `outcome' if group==2 ;
mat diff_2=e(b) ;
mean `outcome' if group==2 [pw=`wgtname'2] ;
mat diff_3=e(b) ;
mat dd_t = diff_1-diff_2 ;
mat dd_e= diff_3-diff_2 ;
mat dd_u= diff_1-diff_3 ;
ereturn scalar dd_tot=el(dd_t,1,1) ;
ereturn scalar dd_exp=el(dd_e,1,1) ;
ereturn scalar dd_unex=el(dd_u,1,1) ;
end;
Code:
///running the program
local xvars age i.state fhead yrs_ed marital rural
local outcome wage
svy bootstrap e(dd_tot) e(dd_exp) e(dd_unex): mydfl wtid "`xvars'" `outcome'
I want to find the standard errors for the mean gap, the mean explained gap and the mean unexplained gap in the outcome (in this case, wage) between the two groups.
I keep getting the following error (after the program creates wtid1 and wtid2)
Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 50
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 100
insufficient observations to compute bootstrap standard errors
no results will be saved
What am I doing wrong?
Also posted on http://www.statalist.org/forums/forum/general-stata-discussion/general/1309830-dfl-decomposition-and-bootstrapping-with-complex-survey-design
A certain cause of bootstrap failure is that the program creates permanent variables.
Here is the first generate statement:
gen groupref=(group==1);
bootstrap first runs the program on the entire data set, and the variable groupref is added. Next, the first bootstrap replicate is drawn, and the program is run on that replicate. The generate statement now fails because the variable already exists. The entire program therefore fails on every replicate, and the only indication is the "x" in the replication log.
The solution is to designate all variables created by generate, egen, or predict as temporary variables. These will be dropped after each replicate is analyzed. Here is the usage:
tempvar groupref;
gen `groupref' = (group==1);
tempvar stores the temporary variable names in local macros and can take a list of names. Similar commands are tempname and tempfile.
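To see the pattern in isolation, here is a small sketch on the auto data rather than the survey data from the question. The program name meandiff, the seed and the statistic are all made up for illustration; the only point is that a program which builds its working variable with tempvar can be called repeatedly by bootstrap.

```stata
capture program drop meandiff
program define meandiff, rclass
    version 13
    * temporary variable: dropped automatically when the program exits
    tempvar isforeign
    gen byte `isforeign' = (foreign == 1)
    quietly summarize price if `isforeign' == 1, meanonly
    local m1 = r(mean)
    quietly summarize price if `isforeign' == 0, meanonly
    * difference in mean price, foreign minus domestic
    return scalar diff = `m1' - r(mean)
end

sysuse auto, clear
bootstrap diff = r(diff), reps(100) seed(12345): meandiff
```

Replacing the tempvar with a plain gen isforeign = ... would let the full-sample run succeed and every replication fail, which is the pattern of x's shown in the question.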
For regression output, I usually use a combination of eststo to store estimations, estadd to add the R2 and additional tests, then esttab to output the lot.
I need to do the same with the table command. I need the mean, median and N for a variable across three by variables and would like to add stars for the result of a ttest==1 on the mean and signtest==1 on the median. I have three by variables, so I've been using table to collate the mean, median and N, which I'm calling like the following pseudo-code:
sysuse auto,clear
table foreign rep78 , ///
contents(mean price median price n price) format(%9.2f)
ttest price==1, by(foreign rep78)
signtest price=1, by(foreign rep78)
I've tried esttab and estpost to no avail. I've also looked at tabstat, tablemat and summarize as alternatives to table, but they don't allow three by variables.
How can I create this table, add the stars for the ttest and signtest p-values and output the full table?
The main point in your question seems to be producing a LaTeX table. However, you show "pseudo-code" that looks pretty much like Stata code, with the caveat that it is illegal.
In particular, for the ttest you can only have one variable in the by() option. But notice that ttest allows also the by: prefix (you can use both, in fact). Their reasons-to-be are different. On the other hand, signtest does not allow a by() option but it does allow the by: prefix. So you should probably clarify what you want to do before creating the table.
If you are trying to use the by: prefix in both cases and afterwards produce a table, you can create a grouping variable, and put the commands in a loop. In this way, you can try tabulating the saved results for each group using the ESTOUT module (by Ben Jann in SSC). Something like:
*clear all
set more off
sysuse auto
keep price foreign rep78
* create group variable
egen grou = group(foreign rep78)
* tests by group
forvalues i = 1/8 {
ttest price == 1 if grou == `i'
signtest price = 1 if grou == `i'
*<complete with estout syntax>
}
See help by, help egen (the group function), help estout and help saved results.
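As one possible way to fill in the loop, the results that ttest leaves in r() can be collected into a matrix, one row per group. This is only a sketch: the matrix name T and its column names are invented here, only the ttest part is shown, and turning the p-values into stars is left to whatever table command you use afterwards.

```stata
sysuse auto, clear
keep price foreign rep78

* one group per combination of the by variables (missing rep78 is left out)
egen grou = group(foreign rep78)
quietly levelsof grou, local(groups)
local k : word count `groups'

* one row per group: N, mean and the two-sided ttest p-value
matrix T = J(`k', 3, .)
matrix colnames T = N mean p_ttest
local row = 0
foreach g of local groups {
    local ++row
    quietly ttest price == 1 if grou == `g'
    matrix T[`row', 1] = r(N_1)
    matrix T[`row', 2] = r(mu_1)
    matrix T[`row', 3] = r(p)
}
matrix list T
```

Using levelsof instead of a hard-coded 1/8 keeps the loop correct if the number of groups changes.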