Looping with distance matching - Stata

I want to match treated firms to control firms by industry and year, pairing each treated firm with the control firm closest in profitability (roa). I want a 1:1 match, using a distance measure (Mahalanobis).
I have 530,000 firm-year observations in my sample: approximately 267,000 treated and 263,000 control observations. Here is my code:
gen neighbor1 = .
gen idobs = .
levelsof industry
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
    foreach j in `b' {
        capture noisily psmatch2 treat if industry == `i' & year == `j', mahalanobis(roa)
        capture noisily replace neighbor1 = _n1 if industry == `i' & year == `j'
        capture noisily replace idobs = _id if industry == `i' & year == `j'
        drop _treated _support _weight _id _n1 _nn
    }
}
Treat is my treatment variable. It takes the value of 1 for treated observations and 0 for non-treated observations.
The command psmatch2 creates the variables _n1 and _id, among others. _n1 is the id number of the matched observation (closest neighbor) and _id is an id number (1 to 530,000) that is unique to each observation.
The code 'works', i.e. I get no error message. My variable neighbor1 has 290,724 non-missing observations.
However, these 290,724 values vary between 1 and 933, which is odd. The variable neighbor1 should give me the observation id number of the matched observation, which can vary between 1 and 530,000.
It seems that the code erases or ignores the result of the matching process in different subgroups. What am I doing wrong?
Edit:
I found a public dataset and adapted my previous code so that you can run my code with this dataset and see more clearly what the problem could be.
I am using the Vella and Verbeek (1998) panel data on 545 men who worked every year from 1980 to 1987, from this website: https://www.stata.com/texts/eacsap/
Let's say that I want to match treated observations, i.e. people, to control observations by marriage status (married) and year, pairing people who worked a similar number of hours (hours), i.e. the shortest distance.
I create a random treatment variable (treat) for the sake of this example.
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta
gen treat = round(runiform())
gen neighbor1 = .
gen idobs = .
levelsof married
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
    foreach j in `b' {
        capture noisily psmatch2 treat if married == `i' & year == `j', mahalanobis(hours)
        capture noisily replace neighbor1 = _n1 if married == `i' & year == `j'
        capture noisily replace idobs = _id if married == `i' & year == `j'
        drop _treated _support _weight _id _n1 _nn
    }
}
What this code should do is to look at each subgroup of observations: 444 observations in 1980 that are not married, 101 observations in 1980 that are married, ..., and 335 observations in 1987 that are married. In each of these subgroups, I would like to match a treated observation to a control observation considering the shortest distance in the number of hours worked.
There are two problems that I see after running the code.
First, the variable idobs should take a unique number between 1 and 4360, because there are 4360 observations in this dataset. It is just an ID number. That is not the case: several observations share ID numbers 1, 2, and so on.
Second, neighbor1 varies between 1 and 204, meaning that the matched observations only have ID numbers varying from 1 to 204.
What is the problem with my code?

Here is a solution using the command iematch, installed through the package ietoolkit (ssc install ietoolkit). For disclosure, I wrote this command. psmatch2 is great if you want the ATT, but if all you want is to match observations across two groups on the nearest neighbor, then iematch is cleaner.
In both commands you need to run each industry-year match on a subset, then combine that information. In both commands the matched group ID will restart from 1 in each subset.
Using your example data, the code below creates one matchID variable for each subset; you will then have to combine these into a single matchID without conflicts across the data set (one way is sketched after the code).
* use data set and keep only vars required for simplicity
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta, clear
keep year married hours
* Set seed for replicability. NEVER use the 123456 seed in production, randomize a new seed
set seed 123456
*Generate mock treatment
gen treat = round(runiform())
*generate vars to store results
gen matchResult = .
gen matchDiff = .
gen matchCount = .
*Create locals for loops
levelsof married
local married_statuses = r(levels)
levelsof year
local years = r(levels)
*Loop over each subgroup
foreach year of local years {
    foreach married_status of local married_statuses {

        * This command is similar to psmatch2, but is a simplified version
        * for when you are not looking for the ATT.
        * This command is only about matching.
        iematch if married == `married_status' & year == `year', grp(treat) match(hours) seedok m1 maxmatch(1)

        * These variables list meta info about the match. See the helpfile for
        * docs, but this copies info from each subset in this loop to single
        * vars for the full data set. Then the loop-specific vars are dropped.
        replace matchResult = _matchResult if married == `married_status' & year == `year'
        replace matchDiff = _matchDiff if married == `married_status' & year == `year'
        replace matchCount = _matchCount if married == `married_status' & year == `year'
        drop _matchResult _matchDiff _matchCount

        * Each pass produces a match ID restarting at 1, so we save it in a
        * separate var per subset and combine them afterwards.
        rename _matchID matchID_`married_status'_`year'
    }
}
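A minimal sketch of that combination step, assuming (as described above) that each matchID_* variable restarts from 1 within its subset: offset each subset's IDs by the running maximum seen so far, so IDs from different subsets cannot collide.
* combine the subset-specific match IDs into one conflict-free variable
gen matchID = .
local offset = 0
foreach var of varlist matchID_* {
    quietly count if !missing(`var')
    if r(N) > 0 {
        quietly replace matchID = `var' + `offset' if !missing(`var')
        quietly summarize `var', meanonly
        local offset = `offset' + r(max)
    }
    drop `var'
}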

Related

Finding string value associated with max value of record subset in long format

For non-longitudinal analysis using long-formatted data, when subjects have multiple visits or records, I will typically hunt down a record within each subject using bysort id, set a temporary variable to hold the integer or real value that I found, use egen max() to find the max value across all records found, and then set a final value in record _n==1 for that subject. This is so I can have the values I want from different visits percolate to a single record for each subject. Each single record per subject will then be used during analysis (not longitudinal; maybe cross-sectional, regression, ANOVA, etc.).
Let's say I want the highest cholesterol (ldl) value for the 3rd year of a trial, where ldl is measured quarterly (every 3 months) for all subjects, which can be accomplished using the code below:
cap drop ldl3tmp
cap drop ldl3max
cap drop ldl3
bysort id (visitdate): gen ldl3tmp = ldl if trialyear==3
bysort id (visitdate): egen ldl3max = max(ldl3tmp)
bysort id (visitdate): gen ldl3 = ldl3max if _n==1
Suppose there are initials for the lab technician or phlebotomist who did the blood draw. How can I percolate a string value to record _n==1 that is associated with the greatest ldl value among the subset of records for the 3rd year of the trial? egen max() doesn't work on string values, so I am guessing the answer might be to eliminate the records for which ldl is not the greatest value in year 3; then the string will be in the remaining record?
In this case, how can I find out what _n is for the maximum value? If I know that, I could use
bysort id (visitdate): drop if _n!=6 //if _n==6 has the max value of ldl
Here is how to find the record number associated with the greatest ldl value among the 4 quarterly ldl values in year 3 of the trial. The result is a variable called recmax, which is filled in only for the specific record where the greatest value was found (among all records for each subject).
cap drop tmpldl3
cap drop maxldl3
cap drop recmax
cap drop visitdate
gen long visitdate = date(dateofvisit, "MDY") //You have to convert date ("MM/DD/YYYY") to a long integer format - based on #days since Jan 1, 1960
bysort id (visitdate): gen tmpldl3 = ldl if trialyear ==3
bysort id (visitdate): egen maxldl3 = max(tmpldl3)
bysort id (visitdate): gen recmax = _n if tmpldl3==maxldl3 & tmpldl3!=. & maxldl3!=.
You can then analyze all the other data (such as string data) in that record cross-sectionally (ANOVA, correlation, regression) by adding if recmax!=. as a qualifier to any analysis command. If you are careful, you could also drop all other records with extraneous ldl values not of interest, using keep if recmax!=. (equivalently, drop if recmax==.), provided you realize you have dropped data; if you save, save to a filename with "_reduced" or "_dropped" in it.
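To then percolate the string itself, here is a minimal sketch building on recmax above, where tech is a hypothetical string variable holding the technician's initials:
cap drop tech3tmp
cap drop tech3
* copy the initials onto the flagged record only; elsewhere the string is empty
bysort id (visitdate): gen tech3tmp = tech if recmax != .
* empty strings sort before non-empty ones, so within each subject the initials
* land in position _N (ties on the max would pick the last alphabetically)
bysort id (tech3tmp): replace tech3tmp = tech3tmp[_N]
* surface the value on the subject's first record, as in the ldl3 example
bysort id (visitdate): gen tech3 = tech3tmp if _n == 1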

Reordering panels by another variable in twoway, by() graphs

Suppose I make the following chart showing the weight of 9 pigs over time:
webuse pig
tw line weight week if inrange(id,1,9), by(id) subtitle(, nospan)
Is it possible to reorder the panels by another variable while retaining the original label? I can imagine defining another variable that is sorted the right way and then labeling it with the right id, but I am curious whether there is a less clunky way of achieving that.
I think you are right: you need a new ordering variable. On the positive side, you can order on any criterion of your choice. Watch out for ties on the variable used to order, which can always be broken by referring to the original identifier. Here we sort on final weights, by default smallest first. (For largest first, negate the weight variable; see the sketch after the code.)
webuse pig, clear
keep if id <= 9
bysort id (week) : gen last = weight[_N]
egen newid = group(last id)
bysort newid : gen toshow = strofreal(id) + " (" + strofreal(last, "%2.1f") + ")"
* search labmask for download links
labmask newid , values(toshow)
set scheme s1color
line weight week, by(newid, note("")) sort xla(1/9)
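The largest-first variation mentioned above is a small change before grouping; a sketch reusing the variables just created:
* order panels with the heaviest final weight first by negating the sort key
gen neglast = -last
egen newid2 = group(neglast id)
labmask newid2 , values(toshow)
line weight week, by(newid2, note("")) sort xla(1/9)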
Short papers discussing the principles here are already in train for publication in the Stata Journal in 2021.

Count by groups and collapse

I have a dataset in Stata and want to count by group (loc_ID) and year. I used the following two lines of code:
egen count_obsv = tag(loc_ID year)
This adds an indicator (count_obsv) to my dataset that is 1 for the first occurrence of each combination of loc_ID and year and 0 for every other observation with the same combination.
Then I use:
collapse (sum) count_obsv, by(loc_ID year)
According to various Stata forum posts this should result in e.g.:
loc_ID   year   count_obsv
1        2000   342
1        2001   23
2        2008   23
...
But my output is:
loc_ID   year   count_obsv
1        2000   1
1        2001   1
2        2008   1
...
What am I summarizing wrong?
When you use the tag() function of the egen command, you assign the value 1 to just one of any number of observations sharing the same distinct values of the specified variables, and 0 to all the others. When you then ask for the sum of those values within those same groups of observations, each group sum is of one 1 and any number of 0s, and is thus necessarily 1.
Your question is probably abstracted from some other calculations that worked as you expected, but if all you wanted was a dataset with frequencies, then
contract loc_ID year
would do that for you. If you wanted a dataset with summaries of other variables too, you would need something more like
collapse (count) count=foo (mean) mean=foo (sd) sd=foo, by(loc_ID year)
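If you want the frequency variable to carry a particular name, contract's freq() option does that directly; a small variation on the above:
* frequencies per loc_ID-year cell, with the count variable named explicitly
contract loc_ID year, freq(count_obsv)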
I doubt that any Statalist posts state otherwise. (I wrote tag() in 1999, and I am not aware of this as a misunderstanding.) There is a related but so to speak distinct problem where tag() comes in useful, which is counting distinct values (often called unique values).
sysuse auto, clear
egen tag = tag(foreign rep78)
egen distinct = total(tag), by(foreign)
tabdisp foreign, c(distinct)
would be a way to get at the number of distinct values of rep78 within categories of foreign.

Stata: DFL Decomposition and Bootstrapping with Complex Survey design

Hello, I am trying to do a DFL-style reweighting with bootstrap weights and SEs. I have a 2-stage stratified sample over 5 rounds (repeated cross-sections).
The idea is to create counterfactual weights for the reference population and then find the difference in mean outcomes for the two groups. This difference can be divided into three parts:
Total difference (group 1 - group 2, both using survey weights)
Explained difference (group 2 using counterfactual weights - group 2 using survey weights)
Unexplained difference (group 1 using survey weights - group 2 using counterfactual weights)
I have written the following program to do this:
Code:
// to make sure there are no singleton strata
egen cluster_id = group(sector state region strat fsu)
egen stratum_id = group(sector state region strat)
foreach r in 1 2 3 4 5 {
    preserve
    qui keep if round==`r'
    qui svyset cluster_id [pw=hhwt], strata(stratum_id)
    qui unique cluster_id, by(stratum_id) gen(dup)
    qui by stratum_id, sort: egen temp = total(dup)
    count if temp==1
    drop if temp==1
    drop temp dup
    save "C:\Users\Round 2 Data\bs_round`r'", replace
    restore
}
Code:
// final data we will use
use "C:\Users\Round 2 Data\bs_round1"
foreach r in 2 3 4 5 {
    qui merge m:m round using "C:\Users\Round 2 Data\bs_round`r'"
    drop _merge
    sort round
    tab round
}
save "C:\Users\Round 2 Data\bs_all"
Code:
// constructing bootstrap weights
egen pooled_cid = group(cluster_id round)
egen pooled_sid = group(stratum_id round)
svyset pooled_cid [pw=hhwt], strata(pooled_sid)
bsweights bsw, reps(100) n(-1)
svyset pooled_cid [pw=hhwt], strata(pooled_sid) bsrweight(bsw*) vce(bootstrap)
Code:
// writing the program
#delimit ;
capture program drop mydfl;
program define mydfl, eclass properties(svyb);
    version 13;
    args wgtname xvars outcome;
    gen groupref = (group==1);
    egen countg1 = sum(group==1);
    egen countg2 = sum(group==2);
    logit groupref `xvars';
    predict phatref;
    gen `wgtname'2 = (phatref/(1-phatref))*(countg2/countg1) if group==2;
    replace `wgtname'2 = 1 if group==1;
    gen `wgtname'1 = ((1-phatref)/phatref)*(countg1/countg2) if group==1;
    replace `wgtname'1 = 1 if group==2;
    drop phatref groupref countg*;
    forvalues i=1/2 {;
        sum `wgtname'`i' if group==`i';
        replace `wgtname'`i' = `wgtname'`i' / r(mean) if group==`i';
    };
    mean `outcome' if group==1;
    mat diff_1 = e(b);
    mean `outcome' if group==2;
    mat diff_2 = e(b);
    mean `outcome' if group==2 [pw=`wgtname'2];
    mat diff_3 = e(b);
    mat dd_t = diff_1 - diff_2;
    mat dd_e = diff_3 - diff_2;
    mat dd_u = diff_1 - diff_3;
    ereturn scalar dd_tot = el(dd_t,1,1);
    ereturn scalar dd_exp = el(dd_e,1,1);
    ereturn scalar dd_unex = el(dd_u,1,1);
end;
#delimit cr
Code:
///running the program
local xvars age i.state fhead yrs_ed marital rural
local outcome wage
svy bootstrap e(dd_tot) e(dd_exp) e(dd_unex): mydfl wtid "`xvars'" `outcome'
I want to find the standard error for the mean gap, mean explained gap and mean unexplained gap in outcome-in this case wage of the two groups.
I keep getting the following error (after the program creates wtid1 and wtid2)
Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 50
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 100
insufficient observations to compute bootstrap standard errors
no results will be saved
What am I doing wrong?
Also posted on http://www.statalist.org/forums/forum/general-stata-discussion/general/1309830-dfl-decomposition-and-bootstrapping-with-complex-survey-design
A certain cause of bootstrap failure is that the program creates permanent variables.
Here is the first generate statement:
gen groupref=(group==1);
bootstrap first runs the program on the entire data set, and every variable the program generates is added to the data. (Here groupref and its companions are dropped at the end of the program, but the weight variables `wgtname'1 and `wgtname'2 never are, so they persist.) Next, the first bootstrap replicate is drawn, and the program is run on that replicate. A generate statement for a variable that already exists now fails; under bootstrap that error is captured, so the entire program fails and the only indication is the "x" in the Stata results.
The solution is to designate all variables created by generate, egen, or predict as temporary variables. These will be dropped after each replicate is analyzed. Here is the usage:
tempvar groupref;
gen `groupref' = (group==1);
tempvar creates local macros and can take a list of names. Similar commands are tempname and tempfile.
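Putting the pattern together, here is a sketch of the program with every created variable made temporary. Because the weights become internal temporaries, the wgtname argument is no longer needed; this is a sketch of the idea, not a tested drop-in replacement.
#delimit ;
capture program drop mydfl;
program define mydfl, eclass properties(svyb);
    version 13;
    args xvars outcome;
    * all created variables are temporary, so Stata drops them automatically
    * at the end of each bootstrap replication;
    tempvar groupref phatref countg1 countg2 w1 w2;
    gen `groupref' = (group==1);
    egen `countg1' = total(group==1);
    egen `countg2' = total(group==2);
    logit `groupref' `xvars';
    predict `phatref';
    gen `w2' = (`phatref'/(1-`phatref'))*(`countg2'/`countg1') if group==2;
    replace `w2' = 1 if group==1;
    gen `w1' = ((1-`phatref')/`phatref')*(`countg1'/`countg2') if group==1;
    replace `w1' = 1 if group==2;
    forvalues i=1/2 {;
        sum `w`i'' if group==`i';
        replace `w`i'' = `w`i'' / r(mean) if group==`i';
    };
    mean `outcome' if group==1;
    mat diff_1 = e(b);
    mean `outcome' if group==2;
    mat diff_2 = e(b);
    mean `outcome' if group==2 [pw=`w2'];
    mat diff_3 = e(b);
    * matrices are overwritten silently, so they do not break replications,
    * though tempname would be cleaner still;
    mat dd_t = diff_1 - diff_2;
    mat dd_e = diff_3 - diff_2;
    mat dd_u = diff_1 - diff_3;
    ereturn scalar dd_tot = el(dd_t,1,1);
    ereturn scalar dd_exp = el(dd_e,1,1);
    ereturn scalar dd_unex = el(dd_u,1,1);
end;
#delimit cr
* the call then omits the weight-name argument:
* svy bootstrap e(dd_tot) e(dd_exp) e(dd_unex): mydfl "`xvars'" `outcome'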

Count number of living firms efficiently

I have a list of companies with start and end dates for each. I want to count the number of companies alive over time. I have the following code but it runs slowly on my large dataset. Is there a more efficient way to do this in Stata?
forvalues y = 1982/2012 {
    forvalues m = 1/12 {
        *display date("01-`m'-`y'","DMY")
        count if start_dt <= date("01-`m'-`y'","DMY") & date("01-`m'-`y'","DMY") <= end_dt
    }
}
One way is to use the inrange() function. In Stata, dates are just integers, so you can easily operate on them.
forvalues y = 1982/2012 {
    forvalues m = 1/12 {
        local d = date("01-`m'-`y'","DMY")
        count if inrange(`d', start_dt, end_dt)
    }
}
This alone will save you a huge amount of time. For 50,000 observations (and made-up data):
. timer list 1
1: 3.40 / 1 = 3.3980
. timer list 2
2: 18.61 / 1 = 18.6130
timer 1 is with inrange(), timer 2 is your original code; results are in seconds. Run help inrange and help timer for details.
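For reference, here is a sketch of how such a timing comparison can be set up, with made-up data (only the variable names start_dt and end_dt come from the question; everything else is illustrative):
* build made-up data: 50,000 random firm lifespans
clear
set obs 50000
gen start_dt = mdy(1,1,1980) + floor(runiform()*5000)
gen end_dt = start_dt + floor(runiform()*5000)
* time the inrange() version; timer 2 would wrap the original loop the same way
timer clear
timer on 1
forvalues y = 1982/2012 {
    forvalues m = 1/12 {
        local d = date("01-`m'-`y'","DMY")
        quietly count if inrange(`d', start_dt, end_dt)
    }
}
timer off 1
timer list 1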
That said, maybe someone can suggest an overall better strategy.
Assuming a firm identifier firmid, this is another way to think about the problem, but with a different data structure. Make sure you have a saved copy of your dataset before you do this.
expand 2
bysort firmid : gen eitherdate = cond(_n == 1, start_dt, end_dt)
by firmid : gen score = cond(_n == 1, 1, -1)
sort eitherdate
gen living = sum(score)
by eitherdate : replace living = living[_N]
So,
We expand each observation to 2 and put both dates in a new variable, the start date in one observation and the end date in the other observation.
We assign a score that is 1 when a firm starts and -1 when it ends.
The number of firms is increased by 1 every time a firm starts and decreased by 1 every time one ends. We just need to sort by date and the number of firms is the cumulative sum of those scores. (EDIT: There is a fix for changes on the same date.)
This new data structure could be useful for other purposes.
There is a write-up at http://www.stata-journal.com/article.html?article=dm0068
EDIT:
Notes in response to @Roberto Ferrer (and anyone else who reads this):
I fixed a bad bug, which made this too difficult to understand. Sorry about that.
The dates used here are just the dates at which firms start and end. There is no evident point in evaluating the number of firms at any other date, as it would just be the same number as at the previous date used. If you needed to interpolate to a grid of dates, however, copying the previous count forward would be sufficient, as sketched below.
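A minimal sketch of that interpolation, assuming the data structure created above (note it keeps only the date and count, and uses a hypothetical monthly grid for 1982-2012):
* keep one count per event date, then carry it forward onto a monthly grid
keep eitherdate living
bysort eitherdate: keep if _n == _N
tempfile events
save `events'
clear
set obs 372                              // 31 years x 12 months
gen eitherdate = mdy(mod(_n-1, 12) + 1, 1, 1982 + floor((_n-1)/12))
append using `events'
* put grid records (missing count) after event records on the same date
sort eitherdate living
* carry the last known count forward; grid dates before the first event stay missing
replace living = living[_n-1] if missing(living)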
It is important not to confuse the Stata function sum(), which returns the cumulative (running) sum, with any egen function. The impression that egen's total() is an alternative here was a side-effect of my bug.