Stepwise and time & individual/country fixed-effects panels in Stata with appropriate s.e. adjustment

I want to run stepwise on a linear probability model with time and individual fixed effects in a panel dataset, but stepwise does not support panel estimators out of the box. The workaround is to run xtdata y x, fe followed by reg y x, r. However, the resulting standard errors are too small. One attempt to solve this problem can be found here: http://www.stata.com/statalist/archive/2006-07/msg00629.html, but my panel is highly unbalanced (I have a different number of observations for different variables). I also don't understand how stepwise would incorporate this information in its iterations with different variable lists. Since stepwise bases its decision rules on the p-values, this is crucial.
Reproducible Example:
webuse nlswork, clear /*unbalancing a bit:*/
replace tenure =. if year <73
replace hours =. if hours==24 | hours==38
replace tenure =. if idcode==3 | idcode == 12 | idcode == 19
xtreg ln_wage tenure hours union i.year, fe vce(robust)
eststo xtregit /*this is what I want to reproduce without xtreg to use it in stepwise */
xi i.year /* year dummies to keep */
xtdata ln_wage tenure hours union _Iyear*, fe clear
reg ln_wage tenure hours union _Iyear*, vce(hc3) /*regression on transformed data */
eststo regit
esttab xtregit regit
As you can see, the point estimates match, but I need to adjust the standard errors. I also need to do that in a way that stepwise takes into account across its iterations, for example when the number of variables in the model changes. Any help on how to proceed?
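For reference, the usual within-estimator degrees-of-freedom rescaling, which I am not sure carries over to vce(hc3), to a heavily unbalanced panel, or to stepwise's internal p-values, would look something like this when run right after the xtreg call above:
* run immediately after the xtreg above, before xtdata transforms the data
local G = e(N_g)                          // number of panels absorbed by the within transformation
xi i.year
xtdata ln_wage tenure hours union _Iyear*, fe clear
quietly regress ln_wage tenure hours union _Iyear*, vce(hc3)
* regress reports N-k-1 residual df; the FE estimator should use N-k-G
display "inflate reported s.e. by " sqrt(e(df_r) / (e(df_r) - `G' + 1))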

Moving average using forvalues - Stata

I am struggling with a question in Cameron and Trivedi's "Microeconometrics using Stata". The question concerns a cross-sectional dataset with two key variables, log of annual earnings (lnearns) and annual hours worked (hours).
I am struggling with part 2 of the question, but I'll type the whole thing for context.
A moving average of y after data are sorted by x is a simple case of nonparametric regression of y on x.
Sort the data by hours.
Create a centered 25-period moving average of lnearns, with ith observation yma_i = (1/25) * sum_{j=-12}^{12} y_{i+j}. This is easiest using the forvalues command.
Plot this moving average against hours using the twoway connected graph command.
I'm unsure what command(s) to use for a moving average of cross-sectional data, and I don't really understand what a moving average shows when the data cover only a single period.
Any help would be great and please say if more information is needed.
Thanks!
Edit1:
You should be able to download the dataset from here: https://www.dropbox.com/s/5d8qg5i8xdozv3j/mus02psid92m.dta?dl=0. It is a small extract from the 1992 individual-level data of the Panel Study of Income Dynamics, used in the textbook.
Still getting used to the syntax, but here is my attempt at it
sort hours
gen yma=0
forvalues j = -12/12 {
    quietly replace yma = yma + (1/25) * lnearns[_n + `j']
}
// note: this leaves yma missing near the ends and wherever lnearns is missing
There are other ways to do this, but my approach is to create a variable for each lead and lag, take the row sum of these variables together with the original, and then divide by the number of nonmissing terms, along the lines of the equation you provided:
sort hours
// generate variables for the 12 leads and lags
forvalues i = 1/12 {
gen lnearns_plus`i' = lnearns[_n+`i']
gen lnearns_minus`i' = lnearns[_n-`i']
}
// get the sum of the lnearns variables
egen yma = rowtotal(lnearns_* lnearns)
// get the number of nonmissing lnearns variables
egen count = rownonmiss(lnearns_* lnearns)
// get the average
replace yma = yma/count
// clean up
drop lnearns_* count
This gives you the variable you are looking for (the moving average), and it divides by the number of nonmissing values rather than simply by 25, because you have many missing observations.
As to your question of what this shows, my interpretation is that it shows the local average of lnearns at each level of hours. If you graph lnearns on the y axis and hours on the x axis, the result looks chaotic because there is a lot of variation, but if you plot the moving average instead, the trend is much clearer.
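For the plotting part of the exercise, something like this (cosmetic options are arbitrary) overlays the moving average on the raw data:
* raw observations plus the moving average computed above
twoway (scatter lnearns hours, ms(p) mc(gs12)) ///
       (connected yma hours, msize(tiny)), legend(off)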
In fact this dataset can be downloaded to a suitable directory and loaded with
net from http://www.stata-press.com/data/musr
net install musr
net get musr
u mus02psid92m, clear
This smoothing method is problematic in that sort hours does not produce a unique ordering, so the smoothed values depend on how ties in hours happen to be sorted. But an implementation in a similar spirit is possible with rangestat (from SSC).
sort hours
gen counter = _n
rangestat (mean) mean=lnearns (count) n=lnearns, interval(counter -12 12)
There are many other ways to smooth. One is
gen binhours = round(hours, 50)
egen binmean = mean(lnearns), by(binhours)
scatter lnearns hours, ms(Oh) mc(gs8) || scatter binmean binhours , ms(+) mc(red)
Even better would be to use lpoly.
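For example, a minimal lpoly sketch; the degree and bandwidth here are arbitrary choices, and lpoly picks a bandwidth itself if you omit the option:
* local polynomial smooth of lnearns on hours, plotted over the raw data
lpoly lnearns hours, degree(1) bwidth(200)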

Stata fixed effects out of sample predictions

I am running a fixed-effects model using Stata and then performing out-of-sample predictions. But it seems that xtreg followed by predict yhat, xbu does not predict out of sample along with the fixed effects. Is there a way to use xtreg for out-of-sample prediction that includes the fixed effect? Illustration:
webuse nlswork
xtset idcode year
regress ln_wage age if year <= 80
predict temp1
xtreg ln_wage age if year <= 80, fe
predict temp2, xbu
For my case, I need to predict values for year = 81, and temp2 is missing for years > 80. Reading the manuals for xtreg, and also for areg, it seems that out-of-sample predictions are not possible, especially with xbu, which includes the fixed-effect predictions. I understand that this cannot work with year fixed effects, but if I only use idcode fixed effects it should be possible, shouldn't it? Any advice on how to get this to work would be deeply appreciated.
It seems to generate predictions only for the years in the estimation sample; that is, I am able to generate predictions only in sample.
You can extend the FE out of sample since it is time invariant and then add it to the rest of the prediction, which is available out of sample:
capture ssc install carryforward          // user-written command from SSC
xtreg ln_wage age if year <= 80, fe
predict xbhat, xb                         // linear prediction; available out of sample
predict fe, u                             // estimated fixed effect; in sample only
* carry each individual's fixed effect forward to the out-of-sample years
bysort idcode (year): carryforward fe, replace
gen yhat2 = xbhat + fe
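If you would rather avoid a user-written command, the same idea works with egen, since the fixed effect is constant within idcode (fe_all and yhat2_alt are just illustrative names for variables built from the snippet above):
* spread each individual's (time-invariant) fixed effect to all of that
* individual's rows, then rebuild the prediction
egen fe_all = mean(fe), by(idcode)
gen yhat2_alt = xbhat + fe_all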

Using margins with vce(unconditional) option after xtreg

I am using Stata 13 and I have a balanced panel dataset (t=Year and i=Individual denoted by Year and IndvID respectively) and the following econometric model
Y = b1*var1 + b2*var2 + b3*var1*var2 + b4*var3 + fe + epsilon
I am estimating the following fixed-effects regression with year dummies and a linear time trend:
xi: xtreg Y var1 var2 c.var1#c.var2 var3 i.Year i.IndvID|Year, fe vce(cluster IndvID)
(all variables are continuous except for dummies being created by i.Year and i.IndvID|Year)
I want Stata to derive/report the overall marginal effect of var1 and var2 on the outcome Y:
dY/dvar1 = b1 + b3*var2
dY/dvar2 = b2 + b3*var1
Because I estimate the fixed-effects regression with clustered (robust) standard errors, I want to make sure the marginal effects are computed taking into account the same heterogeneity that the clustered standard errors correct for. My understanding is that this can be achieved using the vce(unconditional) option of the margins command. However, after running the above regression, when I run the command
margins, dydx(var1) vce(unconditional)
I get the following error:
xtreg is not supported by margins with the vce(unconditional) option
Am I missing something obvious here, or am I not going about this correctly? How can I get clustered standard errors for marginal effects computed in Stata, rather than relying on the delta-method default, which does not correct for this?
Thanks in advance,
-Mark
The marginal effects of var1 and var2 are functions (of var2 and var1, respectively). If you want the marginal effect of var1 at the mean level of var2, for example, you can use the lincom command after the regression:
sum var2
local m1 = r(mean)
lincom var1 + `m1' * c.var1#c.var2
This calculates the point estimate of the marginal effect at the mean of var2, as well as its standard error derived from the cluster-robust covariance matrix estimated by xtreg.
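It may also help to note that, as far as I understand, the default delta-method standard errors from margins are already based on e(V), which after your xtreg call is the cluster-robust matrix; vce(unconditional) additionally treats the covariates as sampled rather than fixed. If the delta method is acceptable, something like this (the var2 values are placeholders) reports the marginal effect at chosen points:
* marginal effect of var1 at a few illustrative values of var2;
* standard errors come from the clustered e(V) via the delta method
margins, dydx(var1) at(var2 = (0 1 2))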

Stata estpost esttab: Generate table with mean of variable split by year and group

I want to create a table in Stata with the estout package to show the mean of a variable split by 2 groups (year and binary indicator) in an efficient way.
I found a solution, which is to split the main variable cash_at into 2 groups by hand through the generation of new variables, e.g. cash_at1 and cash_at2. Then, I can generate summary statistics with tabstat and get output with esttab.
estpost tabstat cash_at1 cash_at2, stat(mean) by(year)
esttab, cells("cash_at1 cash_at2")
Link to current result: http://imgur.com/2QytUz0
However, I'd prefer a horizontal table (e.g. year on the x axis) and a way to do it without splitting the groups by hand - is there a way to do so?
My preference in these cases is for year to be in rows and the statistic (e.g. mean) in the columns, but if you want to do it the other way around, there should be no problem.
For a table like the one you want it suffices to have the binary variable you already mention (which I name flag) and appropriate labeling. You can use the built-in table command:
clear all
set more off
* Create example data
set seed 8642
set obs 40
egen year = seq(), from(1985) to(2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())
list, sepby(year)
* Define labels
label define lflag 0 "cash0" 1 "cash1"
label values flag lflag
* Table
table flag year, contents(mean cash)
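If you prefer year down the rows, as I mentioned above, just swap the row and column variables:
* same means, but with year in the rows and the flag groups in the columns
table year flag, contents(mean cash)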
In general, for tables, apart from the estout module you may want to consider also the user-written command tabout. Run ssc describe tabout for more information.
On the other hand, it's not clear what you mean by "splitting groups by hand". You show no code for this operation, but as long as it's general enough for your purposes (and practical) I think you should allow for it. The code might not be as elegant as you wish but if it's doing what it's supposed to, I think it's alright. For example:
clear all
set more off
set seed 8642
set obs 40
* Create example data
egen year = seq(), from(1985) to(2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())
* Data management
gen cash0 = cash if flag == 0
gen cash1 = cash if flag == 1
* Table
estpost tabstat cash*, stat(mean) by(year)
esttab, cells("cash0 cash1")
can be used for a table like the one you show in your original post. It's true you have two extra lines and two extra variables, but they may be harmless. I agree that, in general, efficiency is something you worry about once your program is behaving correctly; unless, of course, the lack of it prevents you from reaching that state.

Emulate Stata 8 clustered bootstrapped regression

I'm trying to store a series of scalars along the coefficients of a bootstrapped regression model. The code below looks like the example from the Stata [P]rogramming manual for postfile, which is apparently intended for use with such procedures.
The problem is with the // commented lines, which fail to work. More specifically, the problem seems to be that the syntax below worked in Stata 8 but fails to work in Stata 9+ after some change in the bootstrap procedure.
cap pr drop bsreg
pr de bsreg
reg mpg weight gear_ratio
predict yhat
qui sum yhat
// sca mu = r(mean)
// post sim (mu)
end
sysuse auto, clear
postfile sim mu using results , replace
bootstrap, cluster(foreign) reps(5) seed(6112): bsreg
postclose sim
use results, clear
Adding version 8 to the code did not solve the issue. Would anyone know what is wrong with this procedure, and how to fix it for execution in Stata 9+? The problem has been raised in the past and more recently, but without finding an answer.
Sorry for the long description, it's a long problem.
I've presented the issue as if it's a programming one because I'm using this code to replicate some health inequalities research. It's necessary to bootstrap the entire procedure, not just the reg model. I have some quibbles with the methodology, but nothing that would stop me from replicating the analysis.
Adding noisily to the bootstrap call showed a problem with the predict command: yhat already exists once the program has run once, so predict yhat fails on every subsequent replication. Here's a fix using a tempvar, which supplies a fresh temporary variable name on each call.
cap pr drop bsreg
pr de bsreg
reg mpg weight gear_ratio
tempvar yhat              // temporary name, so repeated calls don't clash
predict `yhat'
qui sum `yhat'
sca mu = r(mean)
post sim (mu)             // write the scalar to the open postfile
end
sysuse auto, clear
postfile sim mu using results , replace
bootstrap, cluster(foreign) reps(5) seed(6112): bsreg
postclose sim
use results, clear
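For completeness, a possible alternative that sidesteps postfile entirely is to make the program r-class and let bootstrap collect the statistic itself; this is a sketch, not tested against the full replication:
* make the program return the statistic, then let bootstrap collect it
cap pr drop bsreg2
pr de bsreg2, rclass
    reg mpg weight gear_ratio
    tempvar yhat
    predict `yhat'
    qui sum `yhat'
    return scalar mu = r(mean)
end

sysuse auto, clear
bootstrap mu = r(mu), cluster(foreign) reps(5) seed(6112): bsreg2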