Stata egen combined with if

I have data like this:
year month x y weight
2013 1 1 0 1000
2001 12 0 1 2000
I want to create a variable Z based on the x and y variables, conditional on year. I have two formulas, one for years after 2002 and one for 2002 and earlier. If I use egen with if,
if year > 2002 {
bysort year month :egen Z= total( x*weight)
}
else {
bysort year month : egen Z= total(y*weight*0.5)
}
This code does not work: when year is not above 2002, Stata reports that Z has already been created. Is there any way to achieve this?
I used a very crude, brute-force way to solve this problem. I create two variables, Z and Z_2002, and then replace Z with Z_2002 if the year is 2002 or earlier.

If I understand correctly, this should work.
Compute the products in a first step (conditional on the year) and the sums in a second step.
As other answers already note, there's a difference between the if qualifier and the if programming command. There's a short FAQ on this: http://www.stata.com/support/faqs/programming/if-command-versus-if-qualifier/.
(I use code provided by @NickCox in a comment to another answer.)
clear all
set more off
*----- example data -----
input year month x y weight
2013 1 1 0 1000
2013 1 1 0 800
2013 2 0 1 1200
2013 2 1 0 1400
2001 12 1 0 1500
2001 12 0 1 2000
2001 11 1 1 4000
end
sort year month
list, sepby(year month)
*----- computations -----
gen Z = cond(year > 2002, x * weight, y * weight * 0.5)
bysort year month: egen totZ = total(Z) // already sorted so -by- should be enough
list, sepby(year month)
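To make the distinction between the if command and the if qualifier concrete, here is a minimal sketch using the two observations from the question. The variable name after2002 is only for illustration and does not appear in the original code.

```stata
clear
input year month x y weight
2013 1 1 0 1000
2001 12 0 1 2000
end

* The if command is evaluated once; "year" here means year[1],
* the value in the first observation only
if year > 2002 {
    display "braces executed once, because year[1] = " year[1]
}

* The if qualifier is evaluated separately for every observation
gen byte after2002 = 0
replace after2002 = 1 if year > 2002
list
```

With the data in this order, year[1] is 2013, so the block in braces runs once for the whole dataset. It never runs for some observations and not others, which is why the if/else construction in the question cannot select a formula row by row.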

clear
input year month x y weight
2013 1 1 0 1000
2001 12 0 1 2000
end
preserve
keep if year>2002
bysort year month :egen z= total(x*weight)
tempfile t1
save `t1'
restore
keep if year<=2002
bysort year month : egen z= total(y*weight*0.5)
append using `t1'
list

Related

Stata, make a variable based on the relative position to other observations

I am performing an event study, see reproducible example below. I only include one unit but this is enough for the question I'm asking.
input unit year treatment
1 2000 0
1 2001 0
1 2002 1
1 2003 0
1 2004 0
1 2005 1
1 2006 0
1 2007 0
end
I generate dif_year which should take the difference of years to the treatment:
sort unit year
bysort unit: gen year_nb = _n
bysort unit: gen year_target = year_nb if treatment == 1
by unit: egen target_distance = min(year_target)
drop year_target
gen dif_year = year_nb - target_distance
drop year_nb target_distance
It works well with one treatment by unit, but here I have two. Using the code snippet from above, I get the following result:
unit  year  treatment  dif_year
   1  2000          0        -2
   1  2001          0        -1
   1  2002          1         0
   1  2003          0         1
   1  2004          0         2
   1  2005          1         3
   1  2006          0         4
   1  2007          0         5
You can see that it is anchored to the first treatment (2002) but ignores the second one (2005). How can I adapt dif_year to make it work with multiple treatments (here, a second one in 2005)? The values for 2003 and before are correct, but I would expect to get -1 for 2004, 0 for 2005, 1 for 2006 and 2 for 2007.
This solution uses no loops. Evidently the problem hinges on looking backwards as well as forwards; hence reversing time temporarily is a device that can be used.
clear
input unit year treatment
1 2000 0
1 2001 0
1 2002 1
1 2003 0
1 2004 0
1 2005 1
1 2006 0
1 2007 0
end
bysort unit (year) : gen wanted1 = 0 if treatment
by unit: replace wanted1 = wanted1[_n-1] + 1 if missing(wanted1)
gen negyear = -year
bysort unit (negyear) : gen wanted2 = 0 if treatment
by unit: replace wanted2 = wanted2[_n-1] + 1 if missing(wanted2)
gen wanted = cond(abs(wanted2) < abs(wanted1), - wanted2, wanted1)
sort unit year
list , sep(0)
     +---------------------------------------------------------------+
     | unit   year   treatm~t   wanted1   negyear   wanted2   wanted |
     |---------------------------------------------------------------|
  1. |    1   2000          0         .     -2000         2       -2 |
  2. |    1   2001          0         .     -2001         1       -1 |
  3. |    1   2002          1         0     -2002         0        0 |
  4. |    1   2003          0         1     -2003         2        1 |
  5. |    1   2004          0         2     -2004         1       -1 |
  6. |    1   2005          1         0     -2005         0        0 |
  7. |    1   2006          0         1     -2006         .        1 |
  8. |    1   2007          0         2     -2007         .        2 |
     +---------------------------------------------------------------+
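The carry-forward step in this answer relies on replace working through the observations in order, so each observation sees the value just written to the one before it. A minimal sketch on a single series, without the by-group (the names t, treat and since are made up for this illustration):

```stata
clear
input t treat
1 0
2 1
3 0
4 0
5 1
6 0
end

* Start the counter at 0 in each treatment period
gen since = 0 if treat

* replace runs from the first observation to the last, so since[_n-1]
* already holds the updated value when observation _n is processed
replace since = since[_n-1] + 1 if missing(since)

list, sep(0)
```

Observation 1 stays missing because nothing precedes it; observations 3 and 4 become 1 and 2, and the counter restarts at 0 in observation 5. Running the same step on reversed time fills in the periods before each treatment.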
Here is a solution where the largest number of years does not need to be hardcoded.
clear
input unit year treatment
1 2000 0
1 2001 0
1 2002 1
1 2003 0
1 2004 0
1 2005 1
1 2006 0
1 2007 0
1 2008 0
1 2009 0
1 2010 1
end
sort unit year
*Set all treatment years to 0
gen diff_year = 0 if treatment == 1
*Initialize locals used in the loop
local stop "false"
local diff_distance = 0
while "`stop'" == "false" {
**Replace diff to one more than diff on row above if unit is the same,
* no diff for this row, and diff on row above is the diff distance
* for this iteration of the loop.
replace diff_year = diff_year[_n-1] + 1 if unit == unit[_n-1] & missing(diff_year) & diff_year[_n-1] == `diff_distance'
**Replace diff to one less than diff on row below if unit is the same,
* no diff for this row, and diff on row below is minus the diff distance
* for this iteration of the loop.
replace diff_year = diff_year[_n+1] - 1 if unit == unit[_n+1] & missing(diff_year) & diff_year[_n+1] == `diff_distance' * -1
*Test if there are still missing values, and if not set stop local to true
count if missing(diff_year)
if `r(N)' == 0 local stop "true"
*Increment the diff distance by one for next loop
local diff_distance = `diff_distance' + 1
}
I found a quick fix to my own question.
I generate a variable that is missing when there is no treatment. I then loop over rows, replacing the rows below and above each treatment year by its value, until no missing values remain.
Here, three iterations are enough, but I run the loop up to i = 10 just to show that extra iterations don't change the outcome.
sort unit year
bysort unit: gen year_nb = _n
bysort unit: gen year_target = year_nb if treatment == 1
gen closest_treatment = year_target
forvalues i = 1(1)10 {
bysort unit: replace closest_treatment = closest_treatment[_n-`i'] if(year_target[_n-`i'] != . & closest_treatment[_n] == .)
bysort unit: replace closest_treatment = closest_treatment[_n+`i'] if(year_target[_n+`i'] != . & closest_treatment[_n] == .)
}
replace year_target = closest_treatment if year_target == .
drop closest_treatment
gen dif_year = year_nb - year_target
drop year_nb year_target
Edit: in my example, the number of rows between the two treatments is even. This solution also works for odd numbers, in which case the last row to be filled lies exactly halfway between two treatments. Whether that row is assigned to the previous or the next treatment only matters for the sign, which you presumably care about in an event study (e.g. if the distance to the previous treatment is +3 years, the distance to the next treatment is -3). This code snippet assigns the row to the previous treatment (positive sign). If you want the opposite, just swap the two lines inside the loop.

Computing running sum with moving time-window

My data
I am working on a spell dataset in the following format:
cls
clear all
set more off
input id spellnr str7 bdate_str str7 edate_str employed
1 1 2008m1 2008m9 1
1 2 2008m12 2009m8 0
1 3 2009m11 2010m9 1
1 4 2010m10 2011m9 0
///
2 1 2007m4 2009m12 1
2 2 2010m4 2011m4 1
2 3 2011m6 2011m8 0
end
* translate to Stata monthly dates
gen bdate = monthly(bdate_str,"YM")
gen edate = monthly(edate_str,"YM")
drop *_str
format %tm bdate edate
list, sepby(id)
Corresponding to:
     +---------------------------------------------+
     | id   spellnr   employed     bdate     edate |
     |---------------------------------------------|
  1. |  1         1          1    2008m1    2008m9 |
  2. |  1         2          0   2008m12    2009m8 |
  3. |  1         3          1   2009m11    2010m9 |
  4. |  1         4          0   2010m10    2011m9 |
     |---------------------------------------------|
  5. |  2         1          1    2007m4   2009m12 |
  6. |  2         2          1    2010m4    2011m4 |
  7. |  2         3          0    2011m6    2011m8 |
     +---------------------------------------------+
Here a given person (id) can have multiple spells (spellnr) of two types (employed: 1 for employment; 0 for unemployment). The start and end dates of each spell are defined by bdate and edate, respectively.
Imagine the data was already cleaned, and is such that no spells overlap with each other.
There might be "missing" periods in between any two spells though.
This is captured by the dummy dataset above.
My question:
For each unemployment spell, I need to compute the number of months spent in employment in the last 6, 24, and 48 months.
Note that, importantly, each id can go in and out from employment, and all past employment spells should be taken into account (not just the last one).
In my example, this would lead to the following desired output:
     +--------------------------------------------------------------+
     | id   spellnr   employed     bdate     edate   m6   m24   m48 |
     |--------------------------------------------------------------|
  1. |  1         1          1    2008m1    2008m9    .     .     . |
  2. |  1         2          0   2008m12    2009m8    4     9     9 |
  3. |  1         3          1   2009m11    2010m9    .     .     . |
  4. |  1         4          0   2010m10    2011m9    6    11    20 |
     |--------------------------------------------------------------|
  5. |  2         1          1    2007m4   2009m12    .     .     . |
  6. |  2         2          1    2010m4    2011m4    .     .     . |
  7. |  2         3          0    2011m6    2011m8    5    20    44 |
     +--------------------------------------------------------------+
My (working) attempt:
The following code returns the desired result.
* expand each spell to one observation per time unit (here "months"; works also for days)
expand edate-bdate+1
bysort id spellnr: gen spell_date = bdate + _n - 1
format %tm spell_date
list, sepby(id spellnr)
* fill-in empty months (not covered by spells)
xtset id spell_date, monthly
tsfill
* compute cumulative time spent in employment and lagged values
bysort id (spell_date): gen cum_empl = sum(employed) if employed==1
bysort id (spell_date): replace cum_empl = cum_empl[_n-1] if cum_empl==.
bysort id (spell_date): gen lag_7 = L7.cum_empl if employed==0
bysort id (spell_date): gen lag_24 = L25.cum_empl if employed==0
bysort id (spell_date): gen lag_48 = L49.cum_empl if employed==0
qui replace lag_7=0 if lag_7==. & employed==0 // fix computation for first spell of each "id" (if not enough time to go back with "L.")
qui replace lag_24=0 if lag_24==. & employed==0
qui replace lag_48=0 if lag_48==. & employed==0
* compute time spent in employment in the last 6, 24, 48 months, at the beginning of each unemployment spell
bysort id (spell_date): gen m6 = cum_empl - lag_7 if employed==0
bysort id (spell_date): gen m24 = cum_empl - lag_24 if employed==0
bysort id (spell_date): gen m48 = cum_empl - lag_48 if employed==0
qui drop if (spellnr==.)
qui bysort id spellnr (spell_date): keep if _n == 1
drop spell_date cum_empl lag_*
list
This works fine, but becomes quite inefficient when using (several millions of) daily data. Can you suggest any alternative approach that does not involve expanding the dataset?
In words, what I do above is:
I expand the data to have one row per month;
I fill in the "gaps" between the spells with -tsfill-;
I compute the running time spent in employment, and use lag operators to get the three quantities of interest.
This is in the vein of what was done in a past question that I posted. However, the working example there was unnecessarily complicated and contained some mistakes.
SOLUTIONS PERFORMANCE
I tried different approaches suggested in the accepted answer below (including using joinby as suggested in an earlier version of the answer). In order to create a larger dataset I used:
expand 500000
bysort id spellnr: gen new_id = _n
drop id
rename new_id id
which creates a dataset with 500,000 id's (for a total of 3,500,000 spells).
The first solution largely dominates the ones that use joinby or rangejoin (see also the comments to the accepted answer below).
The code below might save some running time.
bys id (employed): gen tag = _n if !employed
sum tag, meanonly
local maxtag = `r(max)'
foreach i in 6 24 48 {
gen m`i' = .
forval d = 1/`maxtag' {
by id: gen x = 1 + min(bdate[`d'],edate) - max(bdate[`d']-`i',bdate) if employed
egen y = total(x*(x>0)), by(id)
replace m`i' = y if tag == `d'
drop x y
}
}
sort id bdate
The same logic, combined with -rangejoin- (from SSC), also deserves a try. Please provide some feedback after testing with your (large) actual data.
preserve
keep if employed
replace employed = 0
tempfile em
save `em'
restore
foreach i in 6 24 48 {
gen _bd = bdate - `i'
rangejoin edate _bd bdate using `em', by(id employed) p(_)
egen m`i' = total(_edate - max(_bd,_bdate)+1) if !employed, by(id bdate)
bys id bdate: keep if _n==1
drop _*
}
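As a sanity check on the overlap arithmetic in the first code block of this answer, the following sketch works through one case by hand: person 1's unemployment spell starting in 2008m12, the employment spell from 2008m1 to 2008m9, and a 6-month window. The local macro names are illustrative only and are not part of the answer's code.

```stata
* Start of the unemployment spell and length of the look-back window
local b_u = tm(2008m12)
local win = 6

* Employment spell being tested for overlap with the window
local bdate = tm(2008m1)
local edate = tm(2008m9)

* Months of the employment spell that fall inside the window:
* from the later of (window start, spell start)
* to the earlier of (unemployment start, spell end)
local months = 1 + min(`b_u', `edate') - max(`b_u' - `win', `bdate')
display `months'
```

This displays 4, which matches m6 for that spell in the desired output. A zero or negative result means the employment spell lies outside the window, which is why the answer's code only adds up values of x that are positive.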

How to weigh percentages with proc tabulate?

I am creating a bunch of frequency tables using proc tabulate, and I have to weight the percentages according to a set of weights based on the age of each person in my dataset. My problem is that the weights do not seem to have any impact on my results. I know I can do this with proc freq, but my tables are pretty detailed, so I am using proc tabulate.
I have included an example of a dataset, and what I have tried so far:
Data have;
input gender wgt q1 year;
lines;
0 1.5 0 2014
0 1 1 2014
0 1.5 1 2014
0 1 1 2014
0 1.5 0 2014
1 1 1 2014
1 1 1 2014
1 1 1 2014
1 1 0 2014
1 1 1 2014
1 1 1 2014
;
run;
Proc format;
value gender 0="boy"
1="girl";
value q1f 0="No"
1="Yes";
run;
Proc tabulate data=have;
class gender q1 year;
weight wgt;
table gender*pctn<q1>, year*q1;
format gender gender. q1 q1f.;
run;
I know that, with the weights included, approximately 46.2% of the boys should have answered "No" and approximately 53.8% "Yes", but the output from proc tabulate gives me 40% No and 60% Yes among the boys.
What have I done wrong?
The WEIGHT statement affects the values of VAR (analysis) variables, not the N count, and PCTN is a percentage of counts. A FREQ statement does affect the N count, by treating each observation as repeated according to another variable; however, FREQ does not work with fractional values and truncates them to integers.
From the documentation:
FREQ variable;
specifies a numeric variable whose value represents the frequency of the observation. If you use the FREQ statement, then the procedure assumes that each observation represents n observations, where n is the value of variable. If n is not an integer, then SAS truncates it. If n is less than 1 or is missing, then the procedure does not use that observation to calculate statistics.
The sum of the frequency variable represents the total number of observations.
WEIGHT variable;
specifies a numeric variable whose values weight the values of the analysis variables. The values of the variable do not have to be integers. PROC TABULATE responds to weight values in accordance with the following table.
Weight Value: PROC TABULATE Response
0 : Counts the observation in the total number of observations
<0 : Converts the value to zero and counts the observation in the total number of observations
. : Excludes the observation
If you want weighted counts in the style of PCTN, create a unity variable (always 1) to be weighted and use PCTSUM:
Data have;
input gender wgt q1 year;
unity = 1;
lines;
0 1.5 0 2014
0 1 1 2014
0 1.5 1 2014
0 1 1 2014
0 1.5 0 2014
1 1 1 2014
1 1 1 2014
1 1 1 2014
1 1 0 2014
1 1 1 2014
1 1 1 2014
;
run;
Proc tabulate data=have;
title "Unity weighted";
class gender q1 year;
format gender gender. q1 q1f.;
var unity; %* <----------;
weight wgt;
table gender*unity, year*q1; %* <---- debug, the count 'basis' for PCTSUM<q1> ;
table gender*unity*(pctsum<q1>), year*q1; %* <--- weighted unity PCTSUM;
run;

Stata - calculate average of everyone in group except current observation

I want to calculate the average of all members of the group I am in, but not include myself in the average. Suppose that the group variable is called Group and I want to take the average of val1 by Group, excluding myself. The new column I wish to create is avg. The data looks as follows (with the correct values of avg entered so you can see what I mean).
Obs Group val1 avg
1 A 6 8
2 A 8 6
3 B 10 13
4 C 4 4
5 C 2 5
6 C 6 3
7 B 12 12
8 B 14 11
If I wanted to include myself in the calculation it would be straightforward. I would just do:
bysort Group: egen avg = mean(val1)
But how do I implement this with the wrinkle that I don't include myself?
One way is looping through all observations:
clear
set more off
*----- example data -----
input ///
Obs str1 Group val1 avg
1 A 6 8
2 A 8 6
3 B 10 13
4 C 4 4
5 C 2 5
6 C 6 3
7 B 12 12
8 B 14 11
end
list, sepby(Group)
*----- what you want -----
encode Group, gen(group)
gen avg2 = .
forvalues j = 1/`=_N' {
summarize val1 if group == group[`j'] & _n != `j', meanonly
replace avg2 = r(mean) in `j'
}
list, sepby(group)
Another way is using egen functions:
<snip>
*----- what you want -----
encode Group, gen(group)
bysort group: egen totval = total(val1)
by group: egen cval = count(val1)
generate avg2 = (totval - val1) / (cval - 1)
list, sepby(group)
There is a nice article available on the web that covers this topic:
Nicholas J. Cox (2014). Speaking Stata: Self and others. The Stata Journal 14, Number 2, pp. 432–444.
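A quick way to check the egen approach is to compare its result with the hand-computed avg column from the question. This sketch assumes val1 has no missing values; in a group with a single member the denominator is zero and Stata returns missing, which seems a reasonable answer in that case.

```stata
clear
input Obs str1 Group val1 avg
1 A 6 8
2 A 8 6
3 B 10 13
4 C 4 4
5 C 2 5
6 C 6 3
7 B 12 12
8 B 14 11
end

* Group total and group size, then remove the current observation from both
bysort Group: egen totval = total(val1)
by Group: egen cval = count(val1)
generate avg2 = (totval - val1) / (cval - 1)

* Stops with an error if any computed value differs from the expected one
assert avg2 == avg
```

Here the string variable Group is used directly in bysort, so the encode step is not needed for this check.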

Stata: add values onto existing values

year
0
1
6
....
(omit)
....
77
90
....
(omit)
....
The "year" is a numeric variable. I need to add "200" before the 1-digit values, and "19" before the 2-digit values.
year
2000
2001
2006
....
1977
1990
....
How can I do this in Stata?
Be careful: the variable might be byte and that will bite.
This should work:
gen year2 = cond(year < 10, 2000 + year, 1900 + year)
tab year2
If year2 looks good,
drop year
rename year2 year