Stata: Difference in mean with Survey data - stata

I am fairly new to Stata so some of my questions may be pretty basic, but I appreciate any help I can get. An example of my data is below
wage surveyweights Constant Sex
32 14 .56 1
56 45 .96 1
77 88 .25 0
I have survey data and I am trying to bootstrap the difference in mean wages for women ONLY. I want to find out the difference in mean outcome with [survey weights] and with [survey weights * constant] that is Manual calculation
mean x [pw=wt] if sex==1
mat x1=e(b)
mean x [pw=wt*constant] if sex==1
mat x2= e(b)
mat dd=x1-x2
**my outcome of interest is the point estimate and bootstrapped SE of dd
I have written the following program, but I am not getting the results I want. In particular, my SE column turns up blank and my point estimates for the means and the difference in means do not match with what I get with manual calculation.
program define meandiff, eclass properties (svyb)
args vars
mean `vars'
mat x1=e(b)
mean `vars' [pw=constant]
mat x2=e(b)
mat dd = x1-x2
ereturn scalar dd=e1(dd,1,1)
end
local vars wage
svy bootstrap e(dd), subpop(sex): means `vars'
I have already svyset my data using bootstrap weights. My questions are as follows:
When I type svy: mean 'vars' the program seems to start running bootstrap replications, and when I type svy bootstrap: means `vars' also the program seems to start replications. What is the difference between the two commands?
When I do mean x with regular survey weights do I need to do [pw=wt] or will svy command automatically apply the survey weights?
If I do have to write [pw=wt] in the first mean then do I need to create a variable called, say, gen wtxcons = wt * constant to do [pw=wtxcons] when I calculate the second mean?
How do I calculate the bootstrap SE and point estimates for my outcome of interest, which is the difference in means. Why are my point estimates not matching my manual calculation?

Related

Calculating p-value by hand from Stata table

I want to ask a question on how to compute the p-value without a t-stat table, just by looking at the table, like on the first page of the pdf in the following link http://faculty.arts.ubc.ca/dwhistler/326UBC/stataHILL.pdf . Like if I don't know the value 0.062, how can I know it is 0.062 by looking at other information from the table?
You need to use the ttail() function, which returns the reverse cumulative Student's t distribution, aka the probability T > t:
display ttail(38,abs(_b[_cons]/_se[_cons]))*2
The first argument, 38, is the degrees of freedom (sample size less number of parameters), while the second, 1.92, is the absolute value of the coefficient of interest divided by its standard error, or the t-stat. The factor of two comes from the fact that Stata is doing a two-tailed test. You can also use the stored DoF with
display ttail(e(df_r),abs(_b[_cons]/_se[_cons]))*2
You can also do the integration of the t density by "hand" using Adrian Mander's integrate:
ssc install integrate
integrate, f(tden(38,x)) l(-1.92) u(1.92)
This gives you 0.93761229, but you want Pr(T>|t|), which is 1-0.93761229=0.06238771.
If you look at many statistics textbooks, you will find a table called the Z-table which will give you the probability that Z is beyond your test statistic. The table is actually a cumulative distribution function of the normal curve.
When people went to school with 4-function calculators, one or more of the questions on the statistics test would include a copy of this Z-table, and the dear students would have to interpolate columns of numbers to find the p-value. In your example, you would see the test statistic between .06 and .07 and those fingers would tap out that it was closer to .06 and do a linear interpolation to come up with .062.
Today, the p-value is something that Stata or SAS will calculate for you.
Here is another SO question that may be of interest: How do I calculate a p-value if I have the t-statistic and d.f. (in Perl)?
Here is a basic page on how to determine p-value "by hand": http://www.dummies.com/how-to/content/how-to-determine-a-pvalue-when-testing-a-null-hypo.html
Here is how you can determine p-value using Excel: http://ms-office.wonderhowto.com/how-to/find-p-value-with-excel-346366/
===EDIT===
My Stata text ("Microeconometrics using Stata", Revised Ed, Cameron & Trivedi) says the following on p. 402.
* p-values for t(30), F(1,30), Z, and chi(1) at y=2
. scalar y=2
. scalar p_t30 = 2 * ttail(30,y)
. scalar p_f1and30 = Ftail(1,30,y^2)
. scalar p_z = 2 * (1 - normal(y))
. scalar p_chi1 = chi2tail(1,y^2)
. display "p-values" " t(30)=" %7.4f p_t30
p-values t(30) = 0.0546

Stata: estimating monthly weighted mean for portfolio

I have been struggling to write optimal code to estimate monthly, weighted mean for portfolio returns.
I have following variables:
firm stock returns (ret)
month1, year1 and date
portfolio (port1): this defines portfolio of the firm stock returns
market capitalisation (mcap): to estimate weights (by month1 year1 port1)
I want to calculate weighted returns for each month and portfolio weighted by market cap. (mcap) of each firm.
I have written following code which works without fail but takes ages and is highly inefficient:
foreach x in 11 12 13 21 22 23 {
display `x'
forvalues y = 1980/2010 {
display `y'
forvalues m = 1/12 {
display `m'
tempvar tmp_wt tmp_tm tmp_p
egen `tmp_tm' = total(mcap) if month1==`m' & year1==`y' & port1 ==`x'
gen `tmp_wt' = mcap/`tmp_tm' if month1==`m' & year1==`y' & port1 ==`x'
gen `tmp_p' = ret*`tmp_wt' if month1==`m' & year1==`y' & port1 ==`x'
gen port_ret_`m'_`y'_`x' = `tmp_p'
}
}
}
Data looks as shown in the image:![Data for value weighted portfolio return][1]
This does appear to be a casebook example of how to do things as slowly as possible, except that naturally you are not doing that on purpose. All it lacks is a loop over observations to calculate totals. So, the good news is that you should indeed be able to speed this up.
It seems to boil down to
gen double wanted = .
bysort port1 year month : replace wanted = sum(mcap)
by port1 year month : replace wanted = (mcap * ret) / wanted[_N]
Principle. To get a sum in a single scalar, use summarize, meanonly rather than using egen, total() to put that scalar into a variable repeatedly, but use sum() with by: to get group sums into a variable when that is what you need, as here. sum() returns cumulative sums, so you want the last value of the cumulative sum.
Principle. Loops (here using foreach) are not needed when a groupwise calculation can be done under the aegis of by:. That is a powerful construct which Stata programmers need to learn.
Principle. Creating lots of temporary variables, here 6 * 31 * 12 * 3 = 6696 of them, is going to slow things down and use more memory than is needed. Each time you execute tempvar and follow with generate commands, there are three more temporary variables, all the size of a column in a dataset (that's what a variable is in Stata), but once they are used they are just left in memory and never looked at again. It's a subtlety with temporary variables that a tempvar assigns a new name every time, but it should be clear that generate creates a new variable every time; generate will never overwrite an existing variable. The temporary variables would all be dropped at the end of a program, but by the end of that program, you are holding a lot of stuff unnecessarily, possibly the size of the dataset multiplied by about one thousand. If that temporarily expanded dataset could not all fit in memory, you flip Stata into a crawl.
Principle. Using if obliges Stata to check each observation in turn; in this case most are irrelevant to the particular intersection of loops being executed and you make Stata check almost all of the data set (a fraction of 2231/2232, almost 1) irrelevantly while doing each particular calculation for 1/2232 of the dataset. If you have more years, or more portfolios, the fraction looked at irrelevantly is even higher.
In essence, Stata will obey your instructions (and not try any kind of optimization -- your code is interpreted utterly literally) but by: would give the cross-combinations much more rapidly.
Note. I don't know how big or how close to zero these numbers will get, so I gave you a double. For all I know, a float would work fine for you.
Comment. I guess you are being influenced by coding experience in other languages where creating variables means something akin to x = 42 to hold a constant. You could do that in Stata too, with scalars or local or global macros, not to mention Mata. Remember that a new variable in Stata is an entire new column in the dataset, regardless of whether it is holding a constant or different values in each observation. You will get what you ask for, but it is more like getting an array every time. Again, it seems that you want as an end result just one new variable, and you do not in fact need to create any others temporarily at all.

How do I store regression results from loops in Stata?

I have built a model which basically does the following:
run regressions on single time period
organise stocks into quantiles based on coefficient from linear regression
statsby to calculate portfolio returns for stocks based on quantile (averaging all quantile x returns)
store quantile 1 portolio and quantile 10 return for the last period
The pair of variables are just the final entries in the timeframe. However, I intend to extend the single time period to rolling through a large timeframe, in essence:
for i in timeperiod {
organise stocks into quantiles based on coefficient from linear regression
statsby to calculate portfolio returns for stocks based on quantile (averaging all quantile x returns)
store quantile 1 portolio and quantile 10 return for the last period
}
The data I'm after is the portfolio 1 and 10 returns for the final day of each timeframe (built using the previous 3 years of data). This should result in a time series (of my total data 60 -3 years to build first result, so 57 years) of returns which I can then regress against eachother.
regress portfolio 1 against portfolio 10
I am coming from an R background, where storing a variable in a vector is very simple, but I'm not quite sure how to go about this in Stata.
In the end I want a 2xn matrix (a separate dataset) of numbers, each pair being results of one run of a rolling regression. Sorry for the very vague description, but it's better than explaining what my model is about. Any pointers (even if it's to the right manual entry) will be much appreciated. Thank you.
EDIT: The actual data I want to store is just a variable. I made it confusing by adding regressions. I've changed the code to more represent what I want.
Sounds like a case for either rolling or statsby, depending on what you exactly want to do. These are prefix commands, that you prefix to your regression model. rolling or statsby will take care of both the looping and storing of results for you.
If you want maximum control, you can do the loop yourself with forvalues or foreach and store the results in a separate file using post. In fact, if you look inside rolling and statsby (using viewsource) you will see that this is what these commands do internally.
Unlike R, Stata operates with only one major rectangular object in memory, called (ta-da!) the data set. (It has a multitude of other stuff, of course, but that stuff can rarely be addressed as easily as the data set that was brought into memory with use). Since your ultimate goal is to run a regression, you will either need to create an additional data set, or awkwardly add the data to the existing data set. Given that your problem is sufficiently custom, you seem to need a custom solution.
Solution 1: create a separate data set using post (see help).
use my_data, clear
postfile topost int(time_period) str40(portfolio) double(return_q1 return_q10) ///
using my_derived_data, replace
* 1. topost is a placeholder name
* 2. I have no clue what you mean by "storing the portfolio", so you'd have to fill in
* 3. This will create the file my_derived_data.dta,
* which of course you can name as you wish
* 4. The triple slash is a continuation comment: the code is coninued on next line
levelsof time_period, local( allyears )
* 5. This will create a local macro allyears
* that contains all the values of time_period
foreach t of local allyears {
regress outcome x1 x2 x3 if time_period == `t', robust
* 6. the opening and closing single quotes are references to Stata local macros
* Here, I am referring to the cycle index t
organise_stocks_into_quantiles_based_on_coefficient_from_linear_regression
* this isn't making huge sense for me, so you'll have to put your code here
* don't forget inserting if time_period == `t' as needed
* something like this:
predict yhat`t' if time_period == `t', xb
xtile decile`t' = yhat`t' if time_period == `t', n(10)
calculate_portfolio_returns_for_stocks_based_on_quantile
forvalues q=1/10 {
* do whatever if time_period == `t' & decile`t' == `q'
}
* store quantile 1 portolio and quantile 10 return for the last period
* again I am not sure what you mean and how to do that exactly
* so I'll pretend it is something like
ratio change / price if time_period == `t' , over( decile`t' )
post topost (`t') ("whatever text describes the time `t' portfolio") ///
(_b[_ratio_1:1]) (_b[_ratio_1:10])
* the last two sets of parentheses may contain whatever numeric answer you are producing
}
postclose topost
* 7. close the file you are creating
use my_derived_data, clear
tsset time_period, year
newey return_q10 return_q1, lag(3)
* 8. just in case the business cycles have about 3 years of effect
exit
* 9. you always end your do-files with exit
Solution 2: keep things within your current data set. If the above code looks awkward, you can instead create a weird centaur of a data set with both your original stocks and the summaries in it.
use my_data, clear
gen int collapsed_time = .
gen double collapsed_return_q1 = .
gen double collapsed_return_q10 = .
* 1. set up placeholders for your results
levelsof time_period, local( allyears )
* 2. This will create a local macro allyears
* that contains all the values of time_period
local T : word count `allyears'
* 3. I now use the local macro allyears as is
* and count how many distinct values there are of time_period variable
forvalues n=1/`T' {
* 4. my cycle now only runs for the numbers from 1 to `T'
local t : word `n' of `allyears'
* 5. I pull the `n'-th value of time_period
** computations as in the previous solution
replace collapsed_time_period = `t' in `n'
replace collapsed_return_q1 = (compute) in `n'
replace collapsed_return_q10 = (compute) in `n'
* 6. I am filling the pre-arranged variables with the relevant values
}
tsset collapsed_time_period, year
* 7. this will likely complain about missing values, so you may have to fix it
newey collapsed_return_q10 collapsed_return_q1, lag(3)
* 8. just in case the business cycles have about 3 years of effect
exit
* 9. you always end your do-files with exit
I avoided statsby as it overwrites the data set in memory. Remember that unlike R, Stata can only remember one data set at a time, so my preference is to avoid excessive I/O operations as they may well be the slowest part of the whole thing if you have a data set of 50+ Mbytes.
I think you're looking for the estout command to store the results of the regressions.

Combining a list of observations into one single variable in Stata

I ran 500 simulations in Stata, i.e. I draw 500 samples, and each sample contains 10 observations. I want to generate a mean for each sample and combine all the 500 means into one variable, because I need to plot a histogram of the means. Currently I have 500 samples, named X1, X2, ... X500, where each X has 10 elements in it. I want to get a mean for each X and plot a histogram of the means. Can someone please show me how to do that? I tried to generate a new variable for the mean, i.e. X1mean = mean(X1), but this wouldn't work, because all 10 empty elements would be filled with the mean.
"Please tell me the code" questions are widely considered off-topic here. See https://stackoverflow.com/help/on-topic : "Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results."
There are various ways to do this. One is to collapse and then xpose or reshape long. In fact, you could have produced a combined sample of 500 x 10 in the first place.
Another is to loop over variables like this
set obs 500
gen mean = .
quietly forval j = 1/500 {
su X`j', meanonly
replace mean = r(mean) in `j'
}
histogram mean
What you are presumably alluding to is code such as
egen X1mean = mean(X1)
That would be no use, but not for the reason you mention, as identical values can always be ignored: it would be no use because similar code would just produce 500 more variables. Note that mean() would not work with generate as mean() is an egen function.
The terminology you seek is observations, not elements.

Block bootstrap with indicator variable for each block

I want to run block bootstrap, where the blocks are countries, and include country indicator variables. I thought the following would work.
regress mvalue kstock i.country, vce(bootstrap, cluster(country))
But I get the following error.
. regress mvalue kstock i.country, vce(bootstrap, cluster(country))
(running regress on estimation sample)
Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxx 50
insufficient observations to compute bootstrap standard errors
no results will be saved
r(2000);
It seems that this should work. If the block bootstrap picks the same country for every block, then it seems it should just drop the intercept.
Is my error coding or conceptual? Here is some code using the grunfeld data.
webuse grunfeld, clear
xtset, clear
generate country = int((company - 1) / 2) + 1
regress mvalue kstock i.country, vce(bootstrap, cluster(country))
The problem here is not with your coding, but is conceptual. The problem is that you cannot identify each coefficient in each regression in each bootstrap sample. Not all "countries" are included in the dataset for each bootstrap repetition. You can diagnose what is going on with the vce( , noisily) sub-option:
. regress mvalue kstock i.bscountry, vce(bootstrap, cluster(country) noisily)
Errors are generated because some coefficients are missing when the regression runs with particular bootstrap samples. In each regression you can see that some countries dummies are being omitted due to collinearity. This should be expected and makes a lot of sense -- the country dummies could =0 for all observations in the bootstrap sample if the country was not drawn!
If you are really trying to estimate the coefficients on the country dummies, you are going to have to find another approach than bootstrapping with K clusters if K is the number of countries. If you don't care about the coefficient dummies you could use another command that simply absorbs the fixed effects and only reports the coefficients on the other independent variables (e.g., areg or xtreg). One way think about what is going on is that it is analogous to this:
.bootstrap, cluster(country) idcluster(bscountry) noisily: regress mvalue kstock i.bscountry
With the idcluster() option, each country that is drawn in a bootstrap sample is given its own ID number. If a country is drawn twice then there are two dummies. (The coefficients for the two dummies naturally turn out to be identical or near-identical.) However, the coefficients in this output are are completely meaningless because bscountry "2" will be different countries in different bootstrap iterations. Since you would ignore any output on the dummies, you might as well use a model like areg or xtreg since they run more quickly.
Although there are many applications where bootstrapping with clusters would work fine, the problem here is the inclusion of cluster dummies in the regression. This all begs the question of whether this exercise makes any sense at all. If you are trying to estimate the coefficients for the country dummies, then certainly not. Otherwise, the solutions above might be OK, but it is hard to say without knowing your research question.