Stata, make a variable based on the relative position to other observations

I am performing an event study, see reproducible example below. I only include one unit but this is enough for the question I'm asking.
input unit year treatment
1 2000 0
1 2001 0
1 2002 1
1 2003 0
1 2004 0
1 2005 1
1 2006 0
1 2007 0
end
I generate dif_year which should take the difference of years to the treatment:
sort unit year
bysort unit: gen year_nb = _n
bysort unit: gen year_target = year_nb if treatment == 1
by unit: egen target_distance = min(year_target)
drop year_target
gen dif_year = year_nb - target_distance
drop year_nb target_distance
It works well with one treatment by unit, but here I have two. Using the code snippet from above, I get the following result:
unit  year  treatment  dif_year
   1  2000          0        -2
   1  2001          0        -1
   1  2002          1         0
   1  2003          0         1
   1  2004          0         2
   1  2005          1         3
   1  2006          0         4
   1  2007          0         5
You can see that it is anchored to the first treatment (2002) but ignores the second one (2005). How can I adapt dif_year to make it work with multiple treatments (here, the second one in 2005)? The values for 2003 and before are correct, but I would expect to get the value -1 for 2004, 0 for 2005, 1 for 2006 and 2 for 2007.

This solution uses no loops. Evidently the problem hinges on looking backwards as well as forwards; hence reversing time temporarily is a device that can be used.
clear
input unit year treatment
1 2000 0
1 2001 0
1 2002 1
1 2003 0
1 2004 0
1 2005 1
1 2006 0
1 2007 0
end
bysort unit (year) : gen wanted1 = 0 if treatment
by unit: replace wanted1 = wanted1[_n-1] + 1 if missing(wanted1)
gen negyear = -year
bysort unit (negyear) : gen wanted2 = 0 if treatment
by unit: replace wanted2 = wanted2[_n-1] + 1 if missing(wanted2)
gen wanted = cond(abs(wanted2) < abs(wanted1), - wanted2, wanted1)
sort unit year
list , sep(0)
     +---------------------------------------------------------------+
     | unit   year   treatm~t   wanted1   negyear   wanted2   wanted |
     |---------------------------------------------------------------|
  1. |    1   2000          0         .     -2000         2       -2 |
  2. |    1   2001          0         .     -2001         1       -1 |
  3. |    1   2002          1         0     -2002         0        0 |
  4. |    1   2003          0         1     -2003         2        1 |
  5. |    1   2004          0         2     -2004         1       -1 |
  6. |    1   2005          1         0     -2005         0        0 |
  7. |    1   2006          0         1     -2006         .        1 |
  8. |    1   2007          0         2     -2007         .        2 |
     +---------------------------------------------------------------+
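For readers who want to cross-check the logic outside Stata, here is a minimal Python sketch of the same two-pass idea: one pass forwards (the role of wanted1), one pass backwards (the role of wanted2), then keep whichever is closer. The function name signed_distance is hypothetical, and the sketch assumes one row per year for a single unit, already sorted by year, so that row offsets equal year offsets. Ties go to the previous treatment, as in the cond() call above.

```python
def signed_distance(treated):
    """Signed row distance to the nearest treated row (illustrative sketch).

    Negative before a treatment, positive after it, None if the unit
    is never treated. Ties go to the previous treatment (positive).
    """
    n = len(treated)
    since = [None] * n  # rows since the previous treatment (like wanted1)
    until = [None] * n  # rows until the next treatment (like wanted2)

    last = None
    for i in range(n):  # forward pass
        if treated[i]:
            last = i
        if last is not None:
            since[i] = i - last

    nxt = None
    for i in range(n - 1, -1, -1):  # backward pass
        if treated[i]:
            nxt = i
        if nxt is not None:
            until[i] = nxt - i

    out = []
    for s, u in zip(since, until):
        if s is None and u is None:
            out.append(None)
        elif s is None or (u is not None and u < s):
            out.append(-u)  # the next treatment is strictly closer
        else:
            out.append(s)   # the previous treatment is closer, or tied
    return out


treatment = [0, 0, 1, 0, 0, 1, 0, 0]  # years 2000 to 2007
print(signed_distance(treatment))     # [-2, -1, 0, 1, -1, 0, 1, 2]
```

The printed list matches the wanted column in the listing above.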

Here is a solution where the largest number of years does not need to be hardcoded.
clear
input unit year treatment
1 2000 0
1 2001 0
1 2002 1
1 2003 0
1 2004 0
1 2005 1
1 2006 0
1 2007 0
1 2008 0
1 2009 0
1 2010 1
end
sort unit year
* Set all treatment years to 0
gen diff_year = 0 if treatment == 1
* Initialize locals used in the loop
local stop "false"
local diff_distance = 0
while "`stop'" == "false" {
    * Set diff to one more than the diff on the row above if the unit is the same,
    * this row has no diff yet, and the diff on the row above equals the
    * diff distance for this iteration of the loop.
    replace diff_year = diff_year[_n-1] + 1 if unit == unit[_n-1] & missing(diff_year) & diff_year[_n-1] == `diff_distance'
    * Set diff to one less than the diff on the row below if the unit is the same,
    * this row has no diff yet, and the diff on the row below equals minus the
    * diff distance for this iteration of the loop.
    replace diff_year = diff_year[_n+1] - 1 if unit == unit[_n+1] & missing(diff_year) & diff_year[_n+1] == `diff_distance' * -1
    * Test whether any missing values remain; if none do, set the stop local to true.
    * Note: a unit with no treatment at all is never filled in, so the loop would not end.
    count if missing(diff_year)
    if `r(N)' == 0 local stop "true"
    * Increment the diff distance by one for the next iteration
    local diff_distance = `diff_distance' + 1
}

I found a quick fix to my own question.
I generate a variable that takes missing values if there is no treatment. I then loop over rows, replacing the row below and above each treatment year by its value, until no missing values remain.
Here, three iterations are enough but I set the loop until i = 10 just to show that adding more loops doesn't change the outcome.
sort unit year
bysort unit: gen year_nb = _n
bysort unit: gen year_target = year_nb if treatment == 1
gen closest_treatment = year_target
forvalues i = 1(1)10 {
bysort unit: replace closest_treatment = closest_treatment[_n-`i'] if(year_target[_n-`i'] != . & closest_treatment[_n] == .)
bysort unit: replace closest_treatment = closest_treatment[_n+`i'] if(year_target[_n+`i'] != . & closest_treatment[_n] == .)
}
replace year_target = closest_treatment if year_target == .
drop closest_treatment
gen dif_year = year_nb - year_target
drop year_nb year_target
Edit: in my example, the number of rows between the two treatments is even. But this solution also works for odd values, as the last row to be iterated over would be exactly in between two treatments. It doesn't matter whether we assign the distance to the previous or next treatment, unless you are interested in the sign of the number, which I assume you want to take into consideration while doing event studies (e.g. if the distance to previous treatment would be +3 years, the distance to the next treatment would be -3). This code snippet assigns value to the previous treatment (positive sign). If you want the opposite, just swap the two lines inside the loop.
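The tie-breaking point can be made concrete with a small Python sketch. It is not the Stata code above but a brute-force illustration under stated assumptions: the function name nearest_event_offset and the prefer_previous flag are hypothetical, and the event years (2002 and 2006) are chosen so that 2004 sits exactly halfway between two treatments.

```python
def nearest_event_offset(years, event_years, prefer_previous=True):
    """Signed offset from each year to its closest event year (sketch).

    Positive offsets mean the event is in the past. When a year is
    equidistant from two events, prefer_previous decides the sign.
    """
    offsets = []
    for y in years:
        best = None
        for e in event_years:
            d = y - e
            if best is None or abs(d) < abs(best):
                best = d
            elif abs(d) == abs(best) and d != best:
                # Equidistant from a past and a future event.
                best = abs(d) if prefer_previous else -abs(d)
        offsets.append(best)
    return offsets


years = list(range(2000, 2007))
events = [2002, 2006]
print(nearest_event_offset(years, events))
# [-2, -1, 0, 1, 2, -1, 0]
print(nearest_event_offset(years, events, prefer_previous=False))
# [-2, -1, 0, 1, -2, -1, 0]
```

Only the value for 2004 changes between the two calls, which corresponds to swapping the two replace lines inside the loop.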

Related

How to Count Distinct for SAS PROC SQL with Rolling Date Window of 5 years?

I want to count the distinct values of a variable grouped by MEMBER_ID and a rolling date range of 5 years. I have seen a similar post.
How to Count Distinct for SAS PROC SQL with Rolling Date Window?
When I change h2.DATE BETWEEN h.DATE - 180 AND h.DATE to h2.year BETWEEN h.year-5 AND h.year, should it give me the correct distinct count within the last 5 years? Thank you in advance.
data have;
input permno year Cand_ID$;
datalines;
1 2000 1
1 2001 2
1 2002 3
1 2003 1
1 2004 3
1 2005 1
2 2000 1
2 2001 3
2 2002 1
2 2003 2
2 2004 2
2 2005 2
2 2006 1
2 2007 1
3 2001 3
3 2002 3
3 2003 3
3 2004 1
3 2005 1
;
run;

Here's how you can do it with a data step. This assumes you have values for all years. If you do not, fill it in with zeros.
Use the lag function to keep a rolling, sorted array of the IDs from the last 5 years. Counting the distinct values in that array on each row gives the rolling 5-year count.
In other words, we're going to create and count a list that looks like this:
permno year id1 id2 id3 id4 id5
1 2000 . . . . 1
1 2001 . . . 1 2
1 2002 . . 1 2 3
1 2003 . 1 1 2 3
Code:
data want;
    set have;
    by permno year;

    array lagid[4] $;
    array id[5] $;

    id1 = cand_id;
    lagid1 = lag1(cand_id);
    lagid2 = lag2(cand_id);
    lagid3 = lag3(cand_id);
    lagid4 = lag4(cand_id);

    /* Reset the counter for the first group */
    if(first.permno) then n = 0;

    /* Count the number of rows within a group */
    n+1;

    /* Save the last 5 years by using the lag function,
       but do not get lags from previous groups
    */
    do i = 1 to 4;
        if(i < n) then id[i+1] = lagid[i];
    end;

    /* Sort the array of IDs into ascending order */
    call sortc(of id:);

    /* Count the number of distinct IDs in the array. Do not count
       missing values.
    */
    n_distinct = 1;
    do i = 2 to dim(id);
        if(id[i] > id[i-1] AND NOT missing(id[i-1]) ) then n_distinct+1;
    end;

    drop lag: n i;
run;
Output (without id: dropped):
permno year Cand_ID id1 id2 id3 id4 id5 n_distinct
1 2000 1 . . . . 1 1
1 2001 2 . . . 1 2 2
1 2002 3 . . 1 2 3 3
1 2003 1 . 1 1 2 3 3
1 2004 3 1 1 2 3 3 3
1 2005 1 1 1 2 3 3 3
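The same rolling window can be sketched in Python as a cross-check. This is an illustration rather than a translation of the data step: the function name rolling_distinct is hypothetical, and it makes the same assumption as the SAS code, namely one row per group and year with no gaps, sorted by group and then year.

```python
from collections import deque


def rolling_distinct(rows, window=5):
    """Rolling distinct count per group (illustrative sketch).

    rows: (group, year, value) tuples sorted by group, then year,
    with exactly one row per year. Returns (group, year, n_distinct).
    """
    out = []
    recent = deque(maxlen=window)  # the oldest value drops out automatically
    current = None
    for group, year, value in rows:
        if group != current:
            recent.clear()  # do not carry values across groups
            current = group
        recent.append(value)
        out.append((group, year, len(set(recent))))
    return out


have = [(1, 2000, "1"), (1, 2001, "2"), (1, 2002, "3"),
        (1, 2003, "1"), (1, 2004, "3"), (1, 2005, "1")]
for row in rolling_distinct(have):
    print(row)
```

For permno 1 this yields the counts 1, 2, 3, 3, 3, 3, in line with the n_distinct column above.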

Summarize which event came first

I have panel data of individuals, their marital status (0 = not married, 1 = married) and one random shock (0 = No shock, 1 = Shock). Now for the people who experience the shock (Everyone except id1), I would like to know which person was already married when they experienced the shock (n=2, id3, id5), who was not married when they experienced the shock but subsequently got married (n=1, id2) and who was not married when they experienced the shock and did not get married subsequently (n=1, id4).
* Example generated by -dataex-. For more info, type help dataex
clear
input int year str3 id float(shock maritalstatus)
2010 "id1" 0 1
2011 "id1" 0 1
2012 "id1" 0 1
2013 "id1" 0 0
2014 "id1" 0 0
2015 "id1" 0 0
2010 "id2" 1 0
2011 "id2" 0 1
2012 "id2" 0 1
2013 "id2" 0 1
2014 "id2" 0 1
2015 "id2" 0 1
2010 "id3" 0 1
2011 "id3" 0 1
2012 "id3" 0 1
2013 "id3" 1 1
2014 "id3" 0 1
2015 "id3" 0 1
2010 "id4" 1 0
2011 "id4" 0 0
2012 "id4" 0 0
2013 "id4" 0 0
2014 "id4" 0 0
2015 "id4" 0 0
2010 "id5" 0 1
2011 "id5" 0 1
2012 "id5" 1 1
2013 "id5" 0 1
2014 "id5" 0 1
2015 "id5" 0 1
end

Thanks for the data example.
Being married when the shock arrived is identifiable by looking at each observation, but the trick lies in spreading that to all observations for the same identifier.
egen married_at_shock = total(marital == 1 & shock == 1), by(id)
The next variable is a variation on the same theme.
egen not_married_at_shock = total(marital == 0 & shock == 1), by(id)
The last variable seems harder to me. I think you have to work out explicitly when the shock occurred
egen when_shock = mean(cond(shock == 1, year, .)), by(id)
and then check what happened afterwards
egen never_married_after_shock = total(marital & year > when_shock), by(id)
replace never_married_after_shock = never_married == 0 if when_shock < .
tabdisp id, c(*married*)
----------------------------------------------------------------------------
id | married_at_shock not_married_at_shock never_married_afte~k
----------+-----------------------------------------------------------------
id1 | 0 0 0
id2 | 0 1 0
id3 | 1 0 0
id4 | 0 1 1
id5 | 1 0 0
----------------------------------------------------------------------------
There are no doubt other ways to approach this.
Any reading list starts with underlining that true and false conditions yield 1 and 0 respectively, as discussed in this FAQ, which has many applications, such as to "any" and "all" questions (which include "ever" and "never").
The use of egen as a workhorse here is natural given your need to work both on observations for each identifier and over each history. Some tricks are covered in this paper.
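As a language-neutral cross-check of the three groups asked about, here is a small Python sketch. The names classify and make are hypothetical, and the sketch assumes at most one shock per person, as in the example data.

```python
def classify(history):
    """Classify one person's history (illustrative sketch).

    history: list of (year, shock, married) tuples.
    """
    shock_years = [year for year, shock, _ in history if shock == 1]
    if not shock_years:
        return "no shock"
    when = min(shock_years)  # assumes a single shock per person
    if any(m == 1 for y, _, m in history if y == when):
        return "married at shock"
    if any(m == 1 for y, _, m in history if y > when):
        return "married after shock"
    return "never married after shock"


def make(shocks, married):
    """Build a 2010-2015 history from two lists of 0/1 flags."""
    return [(2010 + k, s, m) for k, (s, m) in enumerate(zip(shocks, married))]


panel = {
    "id1": make([0, 0, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0]),
    "id2": make([1, 0, 0, 0, 0, 0], [0, 1, 1, 1, 1, 1]),
    "id3": make([0, 0, 0, 1, 0, 0], [1, 1, 1, 1, 1, 1]),
    "id4": make([1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]),
    "id5": make([0, 0, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1]),
}
for pid, history in panel.items():
    print(pid, classify(history))
```

This reproduces the grouping stated in the question: id3 and id5 married at the shock, id2 married afterwards, id4 never married afterwards, and id1 had no shock.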

Stata: sum of variable given other variable conditions

I'm trying to generate different 'total count' variables by companyid & year.
One 'total count' for subs, and one total count for loans.
Basically I'm trying to extend this question: Stata: Calculate sum of any x in y?
* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 companyid int year float sub_num double sub_amt float(sub_year_total loan_num) double loan_amt float loan_year_total
"001004" 1999 . 0 425000 . 0 0
"001004" 1999 2 425000 425000 . 0 0
"001004" 2004 . 0 0 . 0 0
"001004" 2005 1 4232000 4232000 . 0 0
"001004" 2006 1 16000000 1.60e+07 . 0 0
"001004" 2007 3 58354 182444 . 0 0
"001078" 2006 . 0 471529 . 0 0
"001078" 2006 . 0 471529 . 0 0
"001078" 2006 . 0 471529 . 0 0
"001078" 2006 6 29872 471529 . 0 0
"001078" 2006 6 59748 471529 . 0 0
"001078" 2006 6 381909 471529 . 0 0
"001078" 2007 . 0 768825 7 270000 2580000
"001078" 2007 . 0 768825 7 360000 2580000
"001078" 2007 . 0 768825 7 1500000 2580000
"001078" 2007 . 0 768825 7 450000 2580000
"001078" 2007 . 0 768825 . 0 2580000
"001078" 2007 7 359454 768825 . 0 2580000
"001078" 2007 7 409371 768825 . 0 2580000
"001078" 2008 . 0 1751832 5 450000 2450000
"001078" 2008 . 0 1751832 5 2000000 2450000
"001078" 2008 5 47957 1751832 . 0 2450000
"001078" 2008 5 485631 1751832 . 0 2450000
"001078" 2008 5 1218244 1751832 . 0 2450000
end
To note: If sub_num = 0 then loan_num != 0, and vice versa.
I've tried bysort cik year: gen sub_num = _N if loan_amt != 0
and bysort cik year loan_amt: gen sub_num = _N but neither really does it. I've left my failed count variables in the examples for reference.
i.e. company #001078 in 2007 would have loan_num = 4 and sub_num = 2
I just noticed this example has one observation that has 0 for both, I can just eliminate entries that have 0 for both so no need to comment on that.
How can I make company total annual counts for my 'sub' and 'loan' variables?

This is a little hard to follow.
There is reference to cik in your code but it is not in your data example.
It is hard to know what is original data and what is the result of calculations you have tried.
The example seems more complicated than necessary.
Although the title refers to sums, it is also clear that you are interested in counting loans of certain kinds.
A count is a sum of indicators, so this shows some technique rather than necessarily being an answer. Feed to egen, total() a true-or-false expression and the result will be the count of observations for which the expression is true (1); arguments that are false (0) are ignored in the sense that they make no difference to the sum.
bysort companyid year : egen wanted1 = total(loan_amt > 0)
bysort companyid year : egen wanted2 = total(loan_amt > 0 & sub_num < .)
_N is just the number of observations, sometimes conditional on other variables (for example, within a by: group). You can naturally assign that number to a variable, but adding an if qualifier doesn't make the calculation ignore the excluded observations; it only affects which observations receive non-missing values. Consider this experiment:
. clear
. set obs 1000
number of observations (_N) was 0, now 1,000
. gen count = _N if _n == 1
(999 missing values generated)
. l count in 1
+-------+
| count |
|-------|
1. | 1000 |
+-------+
Otherwise put, _N is not as general a counting method as you need here.
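The "count is a sum of indicators" idea is not specific to Stata. As a minimal Python sketch (the function name group_counts is hypothetical, and the rows are the 2007 observations for company 001078 from the data example, reduced to the two amount columns):

```python
from collections import defaultdict


def group_counts(rows):
    """Count positive sub and loan amounts per (companyid, year) (sketch).

    rows: (companyid, year, sub_amt, loan_amt) tuples.
    """
    subs = defaultdict(int)
    loans = defaultdict(int)
    for companyid, year, sub_amt, loan_amt in rows:
        key = (companyid, year)
        subs[key] += sub_amt > 0    # True adds 1, False adds 0
        loans[key] += loan_amt > 0
    return dict(subs), dict(loans)


rows = [
    ("001078", 2007, 0, 270000),
    ("001078", 2007, 0, 360000),
    ("001078", 2007, 0, 1500000),
    ("001078", 2007, 0, 450000),
    ("001078", 2007, 0, 0),
    ("001078", 2007, 359454, 0),
    ("001078", 2007, 409371, 0),
]
subs, loans = group_counts(rows)
print(subs[("001078", 2007)], loans[("001078", 2007)])  # 2 4
```

The row with zero for both amounts contributes nothing to either count, so it does not need to be dropped first.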

I think I found a workaround:
gen lc = 0
replace lc = 1 if loan_sum != 0
bysort cik year lc: gen lcount = _N if lc != 0
then just do the same for other variables.

Stata: How to count the number of 'active' cases in a group when new case is opened?

I'm relatively new to Stata and am trying to count the number of active cases an employee has open over time in my dataset (see link below for example). I tried writing a loop using forvalues based on an example I found online, but keep getting
invalid syntax
For each EmpID I want to count the number of cases that employee had open when a new case was added to the queue. So if a case is added with an OpenDate of 03/15/2015 and the EmpID has two other cases open at the time, the code would assign a value of 2 to the NumActiveWhenOpened field. A case is considered active if (1) its OpenDate is less than the new case's OpenDate and (2) its CloseDate is greater than the new case's OpenDate.
The link below provides an example. I'm trying to write a loop that creates the NumActiveWhenOpened column. Any help would be greatly appreciated. Thanks!
http://i.stack.imgur.com/z4iyR.jpg
EDIT
Here is the code that is not working. I'm sure there are several things wrong with it and I'm not sure how to store the count in the [NumActiveWhenOpen] field.
by EmpID: generate CaseNum = _n
egen group = group(EmpID)
su group, meanonly
gen NumActiveWhenOpen = 0
forvalues i = 1/ 'r(max)' {
forvalues x = 1/CaseNum if group == `i'{
count if OpenDate[_n] > OpenDate[_n-x] & CloseDate[_n-x] > OpenDate[_n]
}
}

This sounds like a problem discussed in http://www.stata-journal.com/article.html?article=dm0068 but let's try to be self-contained. I am not sure that I understand the definitions, but this may help.
I'll steal part of Roberto Ferrer's sandbox.
clear
set more off
input ///
caseid str15(open close) empid
1 "1/1/2010" "3/1/2010" 1
2 "2/5/2010" "" 1
3 "2/15/2010" "4/7/2010" 1
4 "3/5/2010" "" 1
5 "3/15/2010" "6/15/2010" 1
6 "3/24/2010" "3/24/2010" 1
1 "1/1/2010" "3/1/2010" 2
2 "2/5/2010" "" 2
3 "2/15/2010" "4/7/2010" 2
4 "3/5/2010" "" 2
5 "3/15/2010" "6/15/2010" 2
end
gen d1 = date(open, "MDY")
gen d2 = date(close, "MDY")
format %td d1 d2
drop open close
reshape long d, i(empid caseid) j(status)
replace status = -1 if status == 2
replace status = . if missing(d)
bysort empid (d) : gen nopen = sum(status)
bysort empid d : replace nopen = nopen[_N]
l
The idea is to reshape so that each pair of dates becomes two observations. Then if we code each opening by 1 and each closing by -1 the total number of active cases is their cumulative sum. That's all. Here are the results:
. l, sepby(empid)
+---------------------------------------------+
| empid caseid status d nopen |
|---------------------------------------------|
1. | 1 1 1 01jan2010 1 |
2. | 1 2 1 05feb2010 2 |
3. | 1 3 1 15feb2010 3 |
4. | 1 1 -1 01mar2010 2 |
5. | 1 4 1 05mar2010 3 |
6. | 1 5 1 15mar2010 4 |
7. | 1 6 1 24mar2010 4 |
8. | 1 6 -1 24mar2010 4 |
9. | 1 3 -1 07apr2010 3 |
10. | 1 5 -1 15jun2010 2 |
11. | 1 2 . . 2 |
12. | 1 4 . . 2 |
|---------------------------------------------|
13. | 2 1 1 01jan2010 1 |
14. | 2 2 1 05feb2010 2 |
15. | 2 3 1 15feb2010 3 |
16. | 2 1 -1 01mar2010 2 |
17. | 2 4 1 05mar2010 3 |
18. | 2 5 1 15mar2010 4 |
19. | 2 3 -1 07apr2010 3 |
20. | 2 5 -1 15jun2010 2 |
21. | 2 4 . . 2 |
22. | 2 2 . . 2 |
+---------------------------------------------+
The bottom line is no loops needed, but by: helps mightily. A detail useful here is that the cumulative sum function sum() ignores missings.
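The +1/-1 bookkeeping can be sketched in Python as follows. This is an illustration of the idea rather than a port of the reshape: the function name active_counts is hypothetical, a missing close date is represented by None, and the data are the six cases of employee 1 from the sandbox above.

```python
from datetime import date


def active_counts(cases):
    """Number of open cases at the end of each event date (sketch).

    cases: (open_date, close_date or None) pairs for one employee.
    Returns a date-sorted list of (date, open_count) pairs.
    """
    events = []
    for opened, closed in cases:
        events.append((opened, +1))      # an opening adds one active case
        if closed is not None:
            events.append((closed, -1))  # a closing removes one
    events.sort()

    running = 0
    totals = {}
    for day, change in events:
        running += change
        totals[day] = running  # last value per day wins, like nopen[_N]
    return sorted(totals.items())


cases = [
    (date(2010, 1, 1), date(2010, 3, 1)),
    (date(2010, 2, 5), None),
    (date(2010, 2, 15), date(2010, 4, 7)),
    (date(2010, 3, 5), None),
    (date(2010, 3, 15), date(2010, 6, 15)),
    (date(2010, 3, 24), date(2010, 3, 24)),
]
for day, n_open in active_counts(cases):
    print(day, n_open)
```

The counts come out as 1, 2, 3, 2, 3, 4, 4, 3, 2, matching the nopen column for empid 1 on the dated rows.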

Try something along the lines of
clear
set more off
*----- example data -----
input ///
caseid str15(open close) empid numact
1 "1/1/2010" "3/1/2010" 1 0
2 "2/5/2010" "" 1 1
3 "2/15/2010" "4/7/2010" 1 2
4 "3/5/2010" "" 1 2
5 "3/15/2010" "6/15/2010" 1 3
6 "3/24/2010" "3/24/2010" 1 .
1 "1/1/2010" "3/1/2010" 2 0
2 "2/5/2010" "" 2 1
3 "2/15/2010" "4/7/2010" 2 2
4 "3/5/2010" "" 2 2
5 "3/15/2010" "6/15/2010" 2 3
end
gen opend = date(open, "MDY")
gen closed = date(close, "MDY")
format %td opend closed
drop open close
order empid
list, sepby(empid)
*----- what you want -----
gen numact2 = .
sort empid caseid
forvalues i = 1/`=_N' {
count if empid[`i'] == empid & /// a different count for each employee
opend[`i'] <= closed /// the date condition
in 1/`i' // no need to look at cases that have not yet occurred
replace numact2 = r(N) - 1 in `i'
}
list, sepby(empid)
This is resource intensive so if you have a large data set, it will take some time. The reason is it loops over observations checking conditions. See help stored results and help return for an explanation of r(N).
A good read is
Stata tip 51: Events in intervals, The Stata Journal, by Nicholas J. Cox.
Note how I provided an example data set within the code (see help input). That is how I recommend you do it for future questions. It saves other people's time and increases your chances of getting an answer.

Count observations within dynamic range

Consider the following example:
input group day month year number treatment NUM
1 1 2 2000 1 1 2
1 1 6 2000 2 0 .
1 1 9 2000 3 0 .
1 1 5 2001 4 0 .
1 1 1 2010 5 1 1
1 1 5 2010 6 0 .
2 1 1 2001 1 1 0
2 1 3 2002 2 1 0
end
gen date = mdy(month,day,year)
format date %td
drop day month year
For each group, I have a varying number of observations. Each observations refers to an event that is specified with a date. Variable number is the numbering within each group.
Now, I want to count the number of observations that occur one year starting from the date of each treatment observation (excluding itself) within this group. This means, I want to create the variable NUM that I have already put into my example above. I do not care about the number of observations with treatment = 0.
EDIT Begin: The following information was found to be missing but necessary to tackle this problem: The treatment variable will have a value of 1 if there is no observation within the same group in the last year. Thus it is also not possible that the variable NUM will have to consider observations with treatment = 1. In principle, it is possible that there are two observations within a group that have identical dates. EDIT End
I have looked into Stata tip 51: Events in intervals. It seems to work; however, my dataset is huge (more than 1 million observations), so that approach is very inefficient, especially because I do not care about any of the treatment = 0 observations.
I was wondering if there is any alternative. My approach was to look for the observation with the latest date within each group that is still in the range of 1 year (and maybe store it in variable latestDate). Then I would simply subtract the value in variable number of the observation found from the value in count of the treatment = 0 variable.
Note: My "inefficient" code looks as follows
gsort -treatment
gen treatment_id = _n
replace treatment_id = . if treatment==0
gen count=.
sum treatment_id, meanonly
qui forval i = 1/`r(max)'{
count if inrange(date-date[`i'],1,365) & group == group[`i']
replace count = r(N) in `i'
}
sort group date

I am assuming that treatment can't occur within 1 year of the previous treatment (in the group). This is true in your example data, but may not be true in general. But, assuming that it is the case, then this should work. I'm using carryforward which is on SSC (ssc install carryforward). Like your latestDate thought, I determine one year after the most recent treatment and count the number of observations in that window.
sort group date
gen yrafter = (date + 365) if treatment == 1
by group: carryforward yrafter, replace
format yrafter %td
gen in_window = date <= yrafter & treatment == 0
egen answer = sum(in_window), by(group yrafter)
replace answer = . if treatment == 0
I can't promise this will be faster than a loop but I suspect that it will be.

The question is not completely clear.
Consider the following data with two different results, num2 and num3:
+-----------------------------------------+
| date2 group treat num2 num3 |
|-----------------------------------------|
| 01feb2000 1 1 3 2 |
| 01jun2000 1 0 . . |
| 01sep2000 1 0 . . |
| 01nov2000 1 1 0 0 |
| 01may2002 1 0 . . |
| 01jan2010 1 1 1 1 |
| 01may2010 1 0 . . |
|-----------------------------------------|
| 01jan2001 2 1 0 0 |
| 01mar2002 2 1 0 0 |
+-----------------------------------------+
The variable num2 is computed assuming you are interested in counting all observations that are within a one-year period after a treated observation (treat == 1), be those observations equal to 0 or 1 for treat. For example, after 01feb2000, there are three observations that comply with the time span condition; two have treat==0 and one has treat == 1, and they are all counted.
The variable num3 is also counting observations that are within a one-year period after a treated observation, but only the cases for which treat == 0.
num2 is computed with code in the spirit of the article you have cited. The use of in makes the run more efficient and there is no gsort (as in your code), which is quite slow. I have assumed that in each group there are no repeated dates:
clear
set more off
input ///
group str15 date count treat num
1 01.02.2000 1 1 2
1 01.06.2000 2 0 .
1 01.09.2000 3 0 .
1 01.11.2000 3 1 .
1 01.05.2002 4 0 .
1 01.01.2010 5 1 1
1 01.05.2010 6 0 .
2 01.01.2001 1 1 0
2 01.03.2002 2 1 0
end
list
gen date2 = date(date,"DMY")
format date2 %td
drop date count num
order date
list, sepby(group)
*----- what you want -----
gen num2 = .
isid group date, sort
forvalues j = 1/`=_N' {
count in `j'/L if inrange(date2 - date2[`j'], 1, 365) & group == group[`j']
replace num2 = r(N) in `j'
}
replace num2 = . if !treat
list, sepby(group)
num3 is computed with code similar in spirit (and results) to that posted by @jfeigenbaum:
<snip>
*----- what you want -----
isid group date, sort
by group: gen indicat = sum(treat)
sort group indicat, stable
by group indicat: egen num3 = total(inrange(date2 - date2[1], 1, 365))
replace num3 = . if !treat
list, sepby(group)
Even more than two interpretations are possible for your problem, but I'll leave it at that.
(Note that I have changed your example data to include cases that probably make the problem more realistic.)
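Since efficiency was the concern, here is a Python sketch of the num2 interpretation that uses binary search on sorted dates instead of scanning every row for every treated observation. The function name count_following is hypothetical, and the sketch assumes the dates of one group are already sorted; the data are group 1 from the modified example above.

```python
from bisect import bisect_right
from datetime import date, timedelta


def count_following(dates, treated, days=365):
    """Rows dated 1..days after each treated row (illustrative sketch).

    dates: sorted dates for one group; treated: matching 0/1 flags.
    Untreated rows get None.
    """
    out = []
    for d, t in zip(dates, treated):
        if not t:
            out.append(None)
            continue
        lo = bisect_right(dates, d)                         # first date after d
        hi = bisect_right(dates, d + timedelta(days=days))  # past the window end
        out.append(hi - lo)
    return out


dates = [date(2000, 2, 1), date(2000, 6, 1), date(2000, 9, 1),
         date(2000, 11, 1), date(2002, 5, 1), date(2010, 1, 1),
         date(2010, 5, 1)]
treat = [1, 0, 0, 1, 0, 1, 0]
print(count_following(dates, treat))
# [3, None, None, 0, None, 1, None]
```

Each treated row costs two binary searches, so one group of n rows takes on the order of n log n steps rather than n squared. Rows sharing the treated row's own date are excluded, which mirrors inrange(date2 - date2[`j'], 1, 365).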
