Simulating AR(1) in Stata from last observation to the first

I want to simulate an AR(1) process, but start from the end. But my code does not work as expected:
clear
set obs 100
gen et=rnormal(0,1)
quietly gen yt= et in L
quietly replace yt=0.5*yt[_n+1]+et in 1/L-1
Your help is really appreciated.

Just do it the normal way and then reverse order:
clear
set obs 100
gen obs = -_n
gen et=rnormal(0,1)
quietly gen yt = et in 1
quietly replace yt = 0.5*yt[_n-1] + et in 2/L
sort obs
The key is that Stata works through the observations in order. So this code cascades as you want: the value for observation 2 depends on observation 1, observation 3 on observation 2, and so forth.
A single replace statement won't give you a cascade in the other direction, because yt[_n+1] is still missing when each observation is processed.
Also, set seed for reproducibility.
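If you would rather fill the series in place, from the last observation up to the first, an explicit loop over observation numbers is one alternative. This is a minimal sketch and not part of the answer above: the seed value is arbitrary, and storing the variables as double is an assumption made only to keep rounding out of the check.

```stata
clear
set seed 12345                      // arbitrary seed, for reproducibility
set obs 100
gen double et = rnormal(0,1)
gen double yt = et in L             // start the recursion at the last observation
* walk from the second-to-last observation up to the first
forvalues i = `=_N-1'(-1)1 {
    quietly replace yt = 0.5*yt[`i'+1] + et in `i'
}
```

The loop issues one replace per observation, so it is slower than the sort-based approach, but the data never leave their original order.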


Is there a way to flip the order of observations in Stata?

Is it possible to create a backwards counting variable in Stata (like the command _n, just numbering observations backwards)? Or a command to flip the data set, so that the observation with the most recent date is the first one? I would like to make a scatter plot with AfD on the y-axis and the date (row_id) on the x-axis. When I make the plot however, the weeks are ordered backwards. How can I change the order?
This is the code:
generate row_id=_n
twoway scatter AfD row_id || lfit AfD row_id
Here are the data set and the plot:
Your date variable is a string variable, which is unlikely to get you the desired result if you sort on that variable.
You can create a Stata internal form date variable from your string variable:
gen date_num = daily(date, "MDY")
format date_num %td
The values of this new variable will represent the number of days since 1 Jan 1960.
If you create a scatter plot with this date variable on the x-axis, by default it will be sorted from min to max. To let it run from max to min you can specify option xscale(reverse).
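If you still want to create an id variable yourself, choose one of these two options (ascending or descending), not both, since id can only be generated once:
* ascending: the earliest date gets id 1
sort date_num
gen id = _n
* descending: the latest date gets id 1
gsort -date_num
gen id = _n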
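As a rough illustration of the date conversion combined with xscale(reverse), here is a small sketch. The three rows are made-up values standing in for the poster's data, and the variable names date and AfD are assumed to match the question.

```stata
clear
* made-up example rows; the real data are not reproduced here
input str10 date AfD
"01/05/2020" 14
"01/12/2020" 13.5
"01/19/2020" 14.5
end
gen date_num = daily(date, "MDY")   // string date -> Stata daily date
format date_num %td
* x-axis runs from the latest date to the earliest
twoway scatter AfD date_num || lfit AfD date_num, xscale(reverse)
```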
For your problem, plotting in terms of a daily date variable and, if for some reason that is a good idea, using xscale(reverse) are likely to be what you need, as well explained by @Wouter.
In general something like
gen long newid = _N - _n + 1
sort newid
will reverse a dataset.
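A quick way to convince yourself that this reverses the order is to record the original position first and compare afterwards. The sketch below uses the auto dataset shipped with Stata purely as an example; oldid is a helper variable introduced here for the check.

```stata
sysuse auto, clear
gen long oldid = _n                 // original position, kept only for the check
gen long newid = _N - _n + 1
sort newid
* the observation now in position k was originally in position _N - k + 1
assert oldid == _N - _n + 1
```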

Looping with distance matching

I want to match treated firms to control firms by industry and year considering firms that are the closest in terms of profitability (roa). I want a 1:1 match. I am using a distance measure (mahalanobis).
I have 530,000 firm-year observations in my sample, namely 267,000 treated observations and 263,000 control observations approximatively. Here is my code:
gen neighbor1 = .
gen idobs = .
levelsof industry
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
foreach j in `b'{
capture noisily psmatch2 treat if industry == `i' & year == `j', mahalanobis(roa)
capture noisily replace neighbor1 = _n1 if industry == `i' & year == `j'
capture noisily replace idobs = _id if industry == `i' & year == `j'
drop _treated _support _weight _id _n1 _nn
}
}
Treat is my treatment variable. It takes the value of 1 for treated observations and 0 for non-treated observations.
The command psmatch2 creates the variable _n1 and _id among others. _n1 is the id number of the matched observation (closest neighbor) and _id is an id number (1 - 530,000) that is unique to each observation.
The code 'works', i.e. I get no error message. My variable neighbor1 has 290,724 non-missing observations.
However, these 290,724 observations vary between 1 and 933 which is odd. The variable neighbor1 should provide me the observation id number of the matched observation, which can vary between 1 and 530,000.
It seems that the code erases or ignores the result of the matching process in different subgroups. What am I doing wrong?
Edit:
I found a public dataset and adapted my previous code so that you can run my code with this dataset and see more clearly what the problem could be.
I am using the Vella and Verbeek (1998) panel data on 545 men who worked every year from 1980 to 1987, from this website: https://www.stata.com/texts/eacsap/
Let's say that I want to match treated observations, i.e. people, to control observations by marriage status (married) and year considering people that worked a similar number of hours (hours), i.e. the shortest distance.
I create a random treatment variable (treat) for the sake of this example.
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta
gen treat = round(runiform())
gen neighbor1 = .
gen idobs = .
levelsof married
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
foreach j in `b'{
capture noisily psmatch2 treat if married == `i' & year == `j', mahalanobis(hours)
capture noisily replace neighbor1 = _n1 if married == `i' & year == `j'
capture noisily replace idobs = _id if married == `i' & year == `j'
drop _treated _support _weight _id _n1 _nn
}
}
What this code should do is to look at each subgroup of observations: 444 observations in 1980 that are not married, 101 observations in 1980 that are married, ..., and 335 observations in 1987 that are married. In each of these subgroups, I would like to match a treated observation to a control observation considering the shortest distance in the number of hours worked.
There are two problems that I see after running the code.
First, the variable idobs should take a unique number between 1 and 4360 because there are 4360 observations in this dataset. It is just an ID number. It is not the case. A few observations can have an ID number 1, 2 and so on.
Second, neighbor1 varies between 1 and 204 meaning that the matched observations have only ID numbers varying from 1 to 204.
What is the problem with my code?
Here is a solution using the command iematch, which is installed as part of the ietoolkit package (ssc install ietoolkit). For disclosure, I wrote this command. psmatch2 is great if you want the ATT. But if all you want is to match observations across two groups using nearest neighbor, then iematch is cleaner.
In both commands you need to make each industry-year match in a subset, then combine that information. In both commands the matched group ID will restart from 1 in each subset.
Using your example data, this creates one matchID var for each subset, then you will have to find a way to combine these to a single matchID without conflicts across the data set.
* use data set and keep only vars required for simplicity
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta, clear
keep year married hour
* Set seed for replicability. NEVER use the 123456 seed in production, randomize a new seed
set seed 123456
*Generate mock treatment
gen treat = round(runiform())
*generate vars to store results
gen matchResult = .
gen matchDiff = .
gen matchCount = .
*Create locals for loops
levelsof married
local married_statuses = r(levels)
levelsof year
local years = r(levels)
*Loop over each subgroup
foreach year of local years {
foreach married_status of local married_statuses {
*This command is similar to psmatch2, but a simplified version for
* when you are not looking for the ATT.
* This command is only about matching.
iematch if married == `married_status' & year == `year', grp(treat) match(hour) seedok m1 maxmatch(1)
*These variables list meta info about the match. See helpfile for docs,
*but this copies info from each subset in this loop to single vars for
*the full data set. Then the loop-specific vars are dropped
replace matchResult = _matchResult if married == `married_status' & year == `year'
replace matchDiff = _matchDiff if married == `married_status' & year == `year'
replace matchCount = _matchCount if married == `married_status' & year == `year'
drop _matchResult _matchDiff _matchCount
*For each loop you will get a match ID restarting at 1 for each loop.
*Therefore we need to save them in one var for each loop and combine afterwards.
rename _matchID matchID_`married_status'_`year'
}
}
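One possible way to build a single, conflict-free match ID is egen's group() function applied to the subgroup variables together with the within-subgroup ID. The sketch below is only an illustration under assumptions: it uses a tiny made-up data set, and matchID_sub is a hypothetical variable that already holds the within-subgroup ID in one column (for example, filled with replace inside the loop rather than kept as one renamed variable per subgroup).

```stata
clear
* made-up rows: the within-subgroup ID restarts at 1 in every married-year cell
input married year matchID_sub
0 1980 1
0 1980 1
0 1980 2
1 1980 1
1 1980 1
0 1981 1
0 1981 .
end
* one ID per distinct (married, year, matchID_sub) combination;
* observations with a missing matchID_sub (unmatched) stay missing
egen long matchID = group(married year matchID_sub)
list, sepby(married year)
```

Because group() numbers the distinct combinations, two pairs from different subgroups can no longer share an ID.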

Find _n of Observation(s) that have a certain value

I want to find the observation numbers that correspond to the observations that have a particular value, say 29. I would then like to save these observation numbers in a macro.
Is there a better way to do so than the following clunky and inefficient forvalues loop?
sysuse auto, clear
local n
forvalues i=1/`=_N' {
if mpg[`i']==29 local n `n' `i'
}
display "`n'"
gen long obsno = _n
levelsof obsno if mpg == 29
is less typing for you. Why do you want this?
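To get the observation numbers into a macro, as the question asks, levelsof can store its result directly through its local() option. A minimal sketch, reusing the macro name n from the question:

```stata
sysuse auto, clear
gen long obsno = _n
* store the matching observation numbers in local macro n
quietly levelsof obsno if mpg == 29, local(n)
display "`n'"
```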

Stata: DFL Decomposition and Bootstrapping with Complex Survey design

Hello I am trying to do a DFL style reweighting with bootstrap weights and SEs. I have a 2 stage stratified sample over 5 rounds (repeated cross section).
The idea is to create counterfactual weights for the reference population and then find the difference in mean outcomes for the two groups. This difference can be divided into three parts
Total difference (group 1 - group 2 , both using survey weights)
Explained difference (group 2 using counterfactual weights- group 2 using survey weights)
Unexplained difference (group 1 using survey weights- group 2 using counterfactual weights)
I have written the following program for the same
Code:
///to make sure there is no singleton strata
egen cluster_id= group(sector state region strat fsu)
egen stratum_id= group(sector state region strat)
foreach r in 1 2 3 4 5 {
preserve
qui keep if round==`r'
qui svyset cluster_id [pw=hhwt], strata(stratum_id)
qui unique cluster_id, by(stratum_id) gen(dup)
qui by stratum_id, sort: egen temp= total(dup)
count if temp==1
drop if temp==1
drop temp dup
save "C:\Users\Round 2 Data\bs_round`r'", replace
restore
}
Code:
///final data we will use
use "C:\Users\Round 2 Data\bs_round1"
foreach r in 2 3 4 5 {
qui merge m:m round using "C:\Users\Round 2 Data\bs_round`r'"
drop _merge
sort round
tab round
}
save "C:\Users\Round 2 Data\bs_all"
Code:
///constructing bootstrap weights
egen pooled_cid= group (cluster_id round)
egen pooled_sid= group (stratum_id round)
svyset pooled_cid [pw=hhwt], strata( pooled_sid)
bsweights bsw, reps(100) n(-1)
svyset pooled_cid [pw=hhwt], strata( pooled_sid) bsrweight(bsw*) vce(bootstrap)
Code:
///writing the program
#delimit ;
capture program drop mydfl;
program define mydfl, eclass properties (svyb);
version 13;
args wgtname xvars outcome;
gen groupref=(group==1);
egen countg1=sum(group==1);
egen countg2=sum(group==2);
logit groupref `xvars';
predict phatref;
gen `wgtname'2=(phatref/(1-phatref))*(countg2/countg1) if group==2;
replace `wgtname'2=1 if group==1;
gen `wgtname'1=((1-phatref)/phatref)*(countg1/countg2) if group==1;
replace `wgtname'1=1 if group==2;
drop phatref groupref countg*;
forvalues i=1/2 {;
sum `wgtname'`i' if group==`i';
replace `wgtname'`i' = `wgtname'`i' / r(mean) if group==`i';
};
mean `outcome' if group==1 ;
mat diff_1=e(b) ;
mean `outcome' if group==2 ;
mat diff_2=e(b) ;
mean `outcome' if group==2 [pw=`wgtname'2] ;
mat diff_3=e(b) ;
mat dd_t = diff_1-diff_2 ;
mat dd_e= diff_3-diff_2 ;
mat dd_u= diff_1-diff_3 ;
ereturn scalar dd_tot=el(dd_t,1,1) ;
ereturn scalar dd_exp=el(dd_e,1,1) ;
ereturn scalar dd_unex=el(dd_u,1,1) ;
end;
Code:
///running the program
local xvars age i.state fhead yrs_ed marital rural
local outcome wage
svy bootstrap e(dd_tot) e(dd_exp) e(dd_unex): mydfl wtid "`xvars'" `outcome'
I want to find the standard error for the mean gap, mean explained gap and mean unexplained gap in outcome-in this case wage of the two groups.
I keep getting the following error (after the program creates wtid1 and wtid2)
Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 50
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 100
insufficient observations to compute bootstrap standard errors
no results will be saved
What am I doing wrong?
Also posted on http://www.statalist.org/forums/forum/general-stata-discussion/general/1309830-dfl-decomposition-and-bootstrapping-with-complex-survey-design
A certain cause of bootstrap failure is that the program creates permanent variables.
Here is the first generate statement:
gen groupref=(group==1);
bootstrap first runs the program on the entire data set, and the variable groupref is added. Next, the first bootstrap replicate is drawn, and the program is run on that replicate. The generate statement now fails because the variable already exists. bootstrap traps the error, so the entire program fails and the only indication is the "x" in the replication log.
The solution is to designate all variables created by generate, egen, or predict as temporary variables. These will be dropped after each replicate is analyzed. Here is the usage:
tempvar groupref;
gen `groupref' = (group==1);
tempvar is a command that creates local macros holding temporary variable names, and it can take a list of names. Similar commands are tempname and tempfile.
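To see the effect in isolation, here is a small self-contained sketch. It is not the DFL program from the question: mymeandiff is a hypothetical toy program on the auto dataset, and the cutoff of 20 mpg, the seed, and the number of replications are arbitrary. The point is only that a program using tempvar can be called repeatedly, which is what bootstrap needs.

```stata
sysuse auto, clear
capture program drop mymeandiff
program define mymeandiff, rclass
    tempvar highmpg                      // temporary: dropped when the program ends
    gen byte `highmpg' = (mpg > 20)
    summarize price if `highmpg', meanonly
    local m1 = r(mean)
    summarize price if !`highmpg', meanonly
    return scalar diff = `m1' - r(mean)
end
mymeandiff
mymeandiff                               // a second call works: nothing was left behind
set seed 12345
bootstrap diff = r(diff), reps(50): mymeandiff
```

With a plain gen highmpg = ... instead of the tempvar, the second call would stop with a "variable already defined" error, which is what happens inside each replication in the question.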

How to efficiently create lag variable using Stata

I have panel data (time: date, name: ticker). I want to create 10 lags for variables x and y. Now I create each lag variable one by one using the following code:
by ticker: gen lag1 = x[_n-1]
However, this looks messy.
Can anyone tell me how can I create lag variables more efficiently, please?
Shall I use a loop or does Stata have a more efficient way of handling this kind of problem?
@Robert has shown you the streamlined way of doing it. For completeness, here is the "traditional", boring way:
clear
set more off
*----- example data -----
set obs 2
gen id = _n
expand 20
bysort id: gen time = _n
tsset id time
set seed 12345
gen x = runiform()
gen y = 10 * runiform()
list, sepby(id)
*----- what you want -----
// "traditional" loop
forvalues i = 1/10 {
gen x_`i' = L`i'.x
gen y_`i' = L`i'.y
}
list, sepby(id)
And a combination:
// a combination
foreach v in x y {
tsrevar L(1/10).`v'
rename (`r(varlist)') `v'_#, addnumber
}
If the purpose is to create lagged variables to use them in some estimation, know you can use time-series operators within many estimation commands, directly; that is, no need to create the lagged variables in the first place. See help tsvarlist.
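For instance, a regression on the first three lags of x needs no generated variables at all. A minimal sketch, reusing the same mock panel as above (regress and the choice of three lags are just for illustration):

```stata
clear
set obs 2
gen id = _n
expand 20
bysort id: gen time = _n
tsset id time
set seed 12345
gen x = runiform()
gen y = 10 * runiform()
* lags 1 to 3 of x are built on the fly, within each panel
regress y L(1/3).x
```

The first three periods of each panel drop out of the estimation sample because their lags are missing.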
You can loop to do this but you can also take advantage of tsrevar to generate temporary lagged variables. If you need permanent variables, you can use rename group to rename them.
clear
set obs 2
gen id = _n
expand 20
bysort id: gen time = _n
tsset id time
set seed 12345
gen x = runiform()
gen y = 10 * runiform()
tsrevar L(1/10).x
rename (`r(varlist)') x_#, addnumber
tsrevar L(1/10).y
rename (`r(varlist)') y_#, addnumber
Note that if you are doing this to calculate a statistic on a rolling window, check out tsegen (from SSC).