Add depvar control mean to regression table output

This is a 2-arm randomized controlled trial. In my regression output I want to evaluate the relative reduction in risk of disease for those in the treatment group. To make this evaluation easier, I would like to add the control-group mean of the dependent variable to the foot of the regression table. I am currently using estadd with estout. My code below displays the mean of the dependent variable, but I cannot find any option for estadd, estpost, etc. that restricts the depvar mean calculation to one arm of the study (i.e. the control arm).
eststo, title(" "): xi: quietly reg X `covariates' if survid==1, vce(cluster id1)
estadd ysumm
estout using $outdir\results.txt, replace ///
cells("b(fmt(3) label (Coeff.)) se(fmt(3) star label (s.e.))") ///
drop(_Itt* _cons) ///
starlevels(+ 0.10 * 0.05) ///
stats(N ymean, labels ("N. of obs." "Control Mean")) ///
label legend

You are being spoiled by the marvelous functionality offered by estadd, eststo, etc. :). How about this:
xi: quietly reg X `covariates' if survid==1, vce(cluster id1)
// two prefixes in the same command is like a sentence with three subordinate clauses
// that just rolls from one line to the next without giving you a chance to
// catch a breath and really understand what is going on in this sentence,
// which is a sign of poor writing skills, unless you are Dostoevsky, which I am not.
estimates store CtrlArm
// it is also a good idea to be specific about what it is that you want to output.
// Thus I have the -estimates store- on a separate line with a specific name for your results.
summarize X if survid==1
estadd scalar ymean = r(mean)
estout CtrlArm using $outdir\results.txt, ...
estadd and estout are unavoidable. Your initial eststo with an empty title just takes space, though, and does not help anything.
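If survid==1 selects the whole estimation sample rather than the control arm, the summarize line needs a second condition. Here is a minimal sketch, assuming the treatment indicator is a numeric variable called tt with 0 marking the control arm (a guess based on the _Itt* terms in your drop() option), and using the illustrative names Main and ctrl_mean:

```stata
xi: quietly reg X `covariates' if survid==1, vce(cluster id1)
// e(sample) restricts to the observations actually used in the regression;
// tt == 0 as the control arm is an assumption, adjust to your own coding
summarize X if e(sample) & tt == 0
// summarize is r-class, so e() from the regression is still intact here
estadd scalar ctrl_mean = r(mean)
estimates store Main
estout Main, stats(N ctrl_mean, labels("N. of obs." "Control Mean"))
```

Adding the scalar before estimates store means the stored copy already contains ctrl_mean when estout picks it up.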

Related

Remove an entity in recursive regressions with a for loop

I am running a panel data regression. I want to run a for loop to obtain regression results where every time one of the entities has been dropped, so that I can see, by comparing the different coefficients, whether my results are driven by a particular entity or they are consistent across the sample. I am currently using this for loop
forvalues i = 1/19{
use "sample_seven.dta", clear
drop if countryid == i
xtscc ln_gdp tech population inflation tradebalance i.year, fe}
However, when I run the above code what I get is 19 regressions where only two observations have been dropped in each of them.

use "sample_seven.dta", clear
forvalues i = 1/19 {
di _n "Omitting `i'"
xtscc ln_gdp tech population inflation tradebalance i.year if countryid != `i', fe
}
The error in your code was omitting the quotation marks around the local macro: it must be written `i', not i. Your code still ran because the bare i was interpreted as a reference to a variable, possibly as an abbreviation of inflation.
Beyond that, there is no need to read in the dataset repeatedly: an if qualifier excludes the country without dropping anything. It also helps to display which country is being omitted.
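If the full output of 19 regressions is too much to compare by eye, one option is to run each regression quietly and print a single coefficient per omitted country. This is only a sketch: picking tech as the coefficient of interest is my assumption, so substitute whichever regressor matters to you.

```stata
use "sample_seven.dta", clear
forvalues i = 1/19 {
    // same leave-one-out regression as above, output suppressed
    quietly xtscc ln_gdp tech population inflation tradebalance i.year if countryid != `i', fe
    // _b[] and _se[] read the coefficient and standard error from the last estimation
    display "Omitting `i': b[tech] = " %9.4f _b[tech] "   se = " %9.4f _se[tech]
}
```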

Sort by variable in twoway scatter. X-axis stays alphabetical and sort produces gibberish: why?

I have two variables:
ie_ctotal
cntry2
Note: cntry2 is an encoded version of a string variable cntry: I don't know if this may be affecting things.
I want a twoway scatter of ie_ctotal and cntry2, and I want to SORT this scatter by another variable gdppc,
twoway || scatter ie_ctotal cntry2, c(1) xlabel(,valuelabel)
The above without sort works fine. Once I introduce sort, however,
twoway || scatter ie_ctotal cntry2, c(1) sort(gdppc) xlabel(,valuelabel)
The graph turns gibberish, or rather it connects according to the sort, but the x axis remains alphabetical, making the connections seem scribbled.
Any ideas as to what I am doing wrong?
Note: I don't want to sort the original data, because I was advised in previous questions that this is a bad idea. So I want to sort the data only for this one graph.

There is no reproducible example here, and not even a graph, but it is possible to guess the problem.
You are typing above
c(1)
which is ill-advised, although Stata does the right thing. It would be better to type
c(l)
which instructs Stata to join data points on your graph in a line. (Nod to @Dimitriy V. Masterov on this detail.)
In your first example, the values of cntry2 define the x axis.
As you say, the effect of sort(gdppc) is to connect points in order of their values from lowest gdppc to highest. The result is clearly not what you want.
Here is a dopey reproducible example that makes the point.
. sysuse auto, clear
(1978 Automobile Data)
. scatter mpg weight, sort(price) c(l)
You want to sort the countries into gdppc order. This is like sorting make in Stata's auto data according to mpg, but then plotting weight. Here I do this just for foreign cars. It's not a very good graph, but it sounds close in spirit to what you want. This solution requires installation of labmask, for which search labmask and then download from the Stata Journal website.
sysuse auto, clear
keep if foreign
sort mpg
gen obsno = _n
labmask obsno, values(make)
scatter weight obsno, xla(1/22, valuelabel ang(v) noticks) xtitle("")
In a nutshell: the sort() option here defines a connection order; it doesn't map the x axis variable to a reshuffled version. That you need to do before the graphics.
UPDATE In fact, you can get essentially the same graph without any prior manipulation:
graph dot (asis) weight if foreign, over(make, sort(mpg) label(ang(v))) vertical linetype(line) lines(lc(none))
This goes along with the OP's interest in putting labelled categories on the x axis. A graph that is easier to read would put them on the y axis: then the text can be read left to right. To get that, omit the vertical option above; the horizontal layout is the default for graph dot. Although the command above omits guide lines by setting their colour to none, very thin guide lines in a light colour can help.

This uses the trick of encoding using the order of another variable to get the sorting right:
sysuse auto, clear
keep if foreign==1
sencode make, gen(encoded_make) gsort(-weight)
levelsof encoded_make, local(labels)
tw scatter price encoded_make, mlabel(weight) c(l) xlabel(`labels', value angle(45)) sort(weight)
You will need to install sencode from SSC.

create a set of continuous variables from a factor variable and a continuous variable

In Stata, I have a factor variable with 50 levels (state) and an integer-valued variable (year). I want to create 50 new variables: 50 interactions of state indicators with the year variable. Is there a way to do this without writing 50 lines of code?
I can produce the 50 state dummies with tabulate state, generate (state), but I don't know how to get further than that without writing a line to create each individual state-year variable.
I want to use the new state-year variables in a regression. Stata's factor notation makes it easy to include the state-year variables as regressors without creating them beforehand (e.g., with a command like regress y i.state#c.year), but some add-on functions don't support factor notation.

You can try using xi, both as a stand-alone command to create indicator and interaction terms, and as a command prefix. A nonsensical example:
clear all
set more off
sysuse auto
* stand-alone
xi i.rep78*mpg
* as prefix
xi: regress price i.rep78*mpg
Run help xi for all the details.
Edit
To make this a bit clearer, suppose the regress command did not admit the use of either factor variable notation or the xi: prefix. Then using the xi stand-alone syntax you could create the indicator and interaction terms (which answers your original question) and then use those terms with the regress command:
sysuse auto, clear
xi i.rep78*mpg
regress price mpg _Irep78* _IrepXmpg*
(Remember to use Stata's help capabilities. Running search interactions, for example, leads you to xi......Interaction expansion.)
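For the state and year variables in the question, the same result can also be had with a short loop over the dummies that tabulate creates. This is a minimal sketch; the stubs st_ and styr_ are names I made up, and it assumes state has exactly 50 levels:

```stata
* one indicator per state: st_1, st_2, ..., st_50
tabulate state, generate(st_)
* multiply each indicator by year to get the 50 state-year variables
forvalues j = 1/50 {
    generate styr_`j' = st_`j' * year
}
```

The new variables can then be passed to an add-on command as the ordinary varlist styr_*.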

Block bootstrap with indicator variable for each block

I want to run block bootstrap, where the blocks are countries, and include country indicator variables. I thought the following would work.
regress mvalue kstock i.country, vce(bootstrap, cluster(country))
But I get the following error.
. regress mvalue kstock i.country, vce(bootstrap, cluster(country))
(running regress on estimation sample)
Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxx 50
insufficient observations to compute bootstrap standard errors
no results will be saved
r(2000);
It seems that this should work. If the block bootstrap picks the same country for every block, then it seems it should just drop the intercept.
Is my error coding or conceptual? Here is some code using the grunfeld data.
webuse grunfeld, clear
xtset, clear
generate country = int((company - 1) / 2) + 1
regress mvalue kstock i.country, vce(bootstrap, cluster(country))

The problem here is not with your coding, but is conceptual. The problem is that you cannot identify each coefficient in each regression in each bootstrap sample. Not all "countries" are included in the dataset for each bootstrap repetition. You can diagnose what is going on with the vce( , noisily) sub-option:
. regress mvalue kstock i.country, vce(bootstrap, cluster(country) noisily)
Errors are generated because some coefficients are missing when the regression runs with particular bootstrap samples. In each regression you can see that some country dummies are being omitted due to collinearity. This should be expected and makes a lot of sense: a country dummy equals 0 for all observations in the bootstrap sample if that country was not drawn!
If you are really trying to estimate the coefficients on the country dummies, you are going to have to find another approach than bootstrapping with K clusters if K is the number of countries. If you don't care about the coefficients on the dummies, you could use another command that simply absorbs the fixed effects and only reports the coefficients on the other independent variables (e.g., areg or xtreg). One way to think about what is going on is that it is analogous to this:
. bootstrap, cluster(country) idcluster(bscountry) noisily: regress mvalue kstock i.bscountry
With the idcluster() option, each country that is drawn in a bootstrap sample is given its own ID number. If a country is drawn twice then there are two dummies. (The coefficients for the two dummies naturally turn out to be identical or near-identical.) However, the coefficients in this output are completely meaningless because bscountry "2" will be a different country in different bootstrap iterations. Since you would ignore any output on the dummies, you might as well use a model like areg or xtreg since they run more quickly.
Although there are many applications where bootstrapping with clusters would work fine, the problem here is the inclusion of cluster dummies in the regression. This all raises the question of whether this exercise makes any sense at all. If you are trying to estimate the coefficients for the country dummies, then certainly not. Otherwise, the solutions above might be OK, but it is hard to say without knowing your research question.
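For the case where the dummy coefficients are not of interest, here is a sketch of the areg route using the grunfeld setup from the question. The reps(50) value just mirrors the default shown in your output, and I have not checked whether this fits your actual research design:

```stata
webuse grunfeld, clear
xtset, clear
generate country = int((company - 1) / 2) + 1
* absorb() sweeps out the country effects instead of estimating a dummy for each,
* so a country missing from a bootstrap sample leaves no coefficient unidentified
areg mvalue kstock, absorb(country) vce(bootstrap, cluster(country) reps(50))
```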

Correlation with one variable and a lot of others

In Stata, is there a quick way to show the correlation between a variable and a bunch of dummies? In my data I have a variable, goals_scored in a game, and a bunch of dummies for the stadium played in. How can I show the correlation between goals_scored and i.stadium in one table, without getting the correlations between stadiums, which I do not care about?

Here's one way:
#delimit ;
quietly tab stadium, gen(D); // create dummies
foreach var of varlist D* {;
quietly corr goals_scored `var';
di as text "`: variable label `var'': " as result r(rho);
};
drop D*; // get rid of dummies
#delimit cr

cpcorr from SSC (install with ssc inst cpcorr) supports minimal correlation tables, i.e. only the correlations between one set and another set, without the others. But it's an old program (2001) and doesn't support factor variables directly. The indicator variables (a.k.a. dummy variables) would have to exist first.

If you store all of the stadium variables in a local, you would probably loop through them to pull the correlations.

1. If all stadium variables are placed next to each other in the dataset:
foreach s of varlist stadium1-stadium150 {
// do whatever
}
2a. If the stadium variables are not next to each other, use order to get there.
2b. If the variable names follow a pattern, there might be another workaround.
3. I would not use correlation for this. Depending on the distribution of goals, I would consider something else.
