Stata: How to count the number of 'active' cases in a group when new case is opened? - stata

I'm relatively new to Stata and am trying to count the number of active cases an employee has open over time in my dataset (see link below for example). I tried writing a loop using forvalues based on an example I found online, but keep getting
invalid syntax
For each EmpID I want to count the number of cases that employee had open when a new case was added to the queue. So if a case is added with an OpenDate of 03/15/2015 and the EmpID has two other cases open at the time, the code would assign a value of 2 to NumActiveWhenOpened field. A case is considered active if (1) its OpenDate is less then the new case's OpenDate & (2) its CloseDate is greater than the new case's OpenDate.
The link below provides an example. I'm trying to write a loop that creates the NumActiveWhenOpened column. Any help would be greatly appreciated. Thanks!
http://i.stack.imgur.com/z4iyR.jpg
EDIT
Here is the code that is not working. I'm sure there are several things wrong with it and I'm not sure how to store the count in the [NumActiveWhenOpen] field.
by EmpID: generate CaseNum = _n
egen group = group(EmpID)
su group, meanonly
gen NumActiveWhenOpen = 0
forvalues i = 1/ 'r(max)' {
forvalues x = 1/CaseNum if group == `i'{
count if OpenDate[_n] > OpenDate[_n-x] & CloseDate[_n-x] > OpenDate[_n]
}
}

This sounds like a problem discussed in http://www.stata-journal.com/article.html?article=dm0068 but let's try to be self-contained. I am not sure that I understand the definitions, but this may help.
I'll steal part of Roberto Ferrer's sandbox.
clear
set more off
input ///
caseid str15(open close) empid
1 "1/1/2010" "3/1/2010" 1
2 "2/5/2010" "" 1
3 "2/15/2010" "4/7/2010" 1
4 "3/5/2010" "" 1
5 "3/15/2010" "6/15/2010" 1
6 "3/24/2010" "3/24/2010" 1
1 "1/1/2010" "3/1/2010" 2
2 "2/5/2010" "" 2
3 "2/15/2010" "4/7/2010" 2
4 "3/5/2010" "" 2
5 "3/15/2010" "6/15/2010" 2
end
gen d1 = date(open, "MDY")
gen d2 = date(close, "MDY")
format %td d1 d2
drop open close
reshape long d, i(empid caseid) j(status)
replace status = -1 if status == 2
replace status = . if missing(d)
bysort empid (d) : gen nopen = sum(status)
bysort empid d : replace nopen = nopen[_N]
l
The idea is to reshape so that each pair of dates becomes two observations. Then if we code each opening by 1 and each closing by -1 the total number of active cases is their cumulative sum. That's all. Here are the results:
. l, sepby(empid)
+---------------------------------------------+
| empid caseid status d nopen |
|---------------------------------------------|
1. | 1 1 1 01jan2010 1 |
2. | 1 2 1 05feb2010 2 |
3. | 1 3 1 15feb2010 3 |
4. | 1 1 -1 01mar2010 2 |
5. | 1 4 1 05mar2010 3 |
6. | 1 5 1 15mar2010 4 |
7. | 1 6 1 24mar2010 4 |
8. | 1 6 -1 24mar2010 4 |
9. | 1 3 -1 07apr2010 3 |
10. | 1 5 -1 15jun2010 2 |
11. | 1 2 . . 2 |
12. | 1 4 . . 2 |
|---------------------------------------------|
13. | 2 1 1 01jan2010 1 |
14. | 2 2 1 05feb2010 2 |
15. | 2 3 1 15feb2010 3 |
16. | 2 1 -1 01mar2010 2 |
17. | 2 4 1 05mar2010 3 |
18. | 2 5 1 15mar2010 4 |
19. | 2 3 -1 07apr2010 3 |
20. | 2 5 -1 15jun2010 2 |
21. | 2 4 . . 2 |
22. | 2 2 . . 2 |
+---------------------------------------------+
The bottom line is no loops needed, but by: helps mightily. A detail useful here is that the cumulative sum function sum() ignores missings.

Try something along the lines of
clear
set more off
*----- example data -----
input ///
caseid str15(open close) empid numact
1 "1/1/2010" "3/1/2010" 1 0
2 "2/5/2010" "" 1 1
3 "2/15/2010" "4/7/2010" 1 2
4 "3/5/2010" "" 1 2
5 "3/15/2010" "6/15/2010" 1 3
6 "3/24/2010" "3/24/2010" 1 .
1 "1/1/2010" "3/1/2010" 2 0
2 "2/5/2010" "" 2 1
3 "2/15/2010" "4/7/2010" 2 2
4 "3/5/2010" "" 2 2
5 "3/15/2010" "6/15/2010" 2 3
end
gen opend = date(open, "MDY")
gen closed = date(close, "MDY")
format %td opend closed
drop open close
order empid
list, sepby(empid)
*----- what you want -----
gen numact2 = .
sort empid caseid
forvalues i = 1/`=_N' {
count if empid[`i'] == empid & /// a different count for each employee
opend[`i'] <= closed /// the date condition
in 1/`i' // no need to look at cases that have not yet occurred
replace numact2 = r(N) - 1 in `i'
}
list, sepby(empid)
This is resource intensive so if you have a large data set, it will take some time. The reason is it loops over observations checking conditions. See help stored results and help return for an explanation of r(N).
A good read is
Stata tip 51: Events in intervals, The Stata Journal, by Nicholas J. Cox.
Note how I provided an example data set within the code (see help input). That is how I recommend you do it for future questions. This will save other people's time and increase the probabilities of you getting an answer.

Related

How to recode separate variables from a multiple response survey question into one variable

I am trying to recode a variable that indicates total number of responses to a multiple response survey question. Question 4 has options 1, 2, 3, 4, 5, 6, and participants may choose one or more options when submitting a response. The data is currently coded as binary outputs for each option: var Q4___1 = yes or no (1/0), var Q4___2 = yes or no (1/0), and so forth.
This is the tabstat of all yes (1) responses to the 6 Q4___* variables
Variable | Sum
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
I would like to create a new variable that encapsulates these values.
Can someone help me figure out how to create this variable, and if coding a variable in this manner for a multiple option survey question is valid?
When I used the replace command the total number of responses were not adding up, as shown below
gen q4=.
replace q4 =1 if q4___1 == 1
replace q4 =2 if q4___2 == 1
replace q4 =3 if q4___3 == 1
replace q4 =4 if q4___4 == 1
replace q4 =5 if q4___5 == 1
replace q4 =6 if q4___6 == 1
label values q4 primarysource`
q4 | Freq. Percent Cum.
------------+-----------------------------------
1 | 46 48.94 48.94
2 | 31 32.98 81.91
3 | 6 6.38 88.30
4 | 1 1.06 89.36
5 | 3 3.19 92.55
6 | 7 7.45 100.00
------------+-----------------------------------
Total | 94 100.00
UPDATE
to specify I am trying to create a new variable that captures the column sum of each question, not the rowtotal across all questions. I know that 63 participants responded yes to question 4 a) and 33 to question 4 b) so I want my new variable to reflect that.
This is what I want my new variable's values to look like.
q4
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
The fallacy here is ignoring the possibility of multiple 1s as answers to the various Q4???? variables. For example if someone answers 1 1 1 1 1 1 to all questions, they appear in your final variable only in respect of their answer to the 6th question. Otherwise put, your code overwrites and so ignores all positive answers before the last positive answer.
What is likely to be more useful are
(1) the total across all 6 questions which is just
egen Q4_total = rowtotal(Q4????)
where the 4 instances of ? mean that by eye I count 3 underscores and 1 numeral.
(2) a concatenation of responses that is just
egen Q4_concat = concat(Q4????)
(3) a variable that is a concatenation of questions with positive responses, so 246 if those questions were answered 1 and the others were answered 0.
gen Q4_pos = ""
forval j = 1/6 {
replace Q4_pos = Q4_pos + "`j'" if Q4____`j' == 1
}
EDIT
Here is a test script giving concrete examples.
clear
set obs 6
forval j = 1/6 {
gen Q`j' = _n <= `j'
}
list
egen rowtotal = rowtotal(Q?)
su rowtotal, meanonly
di r(sum)
* install from tab_chi on SSC
tabm Q?
Results:
. list
+-----------------------------+
| Q1 Q2 Q3 Q4 Q5 Q6 |
|-----------------------------|
1. | 1 1 1 1 1 1 |
2. | 0 1 1 1 1 1 |
3. | 0 0 1 1 1 1 |
4. | 0 0 0 1 1 1 |
5. | 0 0 0 0 1 1 |
|-----------------------------|
6. | 0 0 0 0 0 1 |
+-----------------------------+
. egen rowtotal = rowtotal(Q?)
. su rowtotal, meanonly
. di r(sum)
21
. tabm Q?
| values
variable | 0 1 | Total
-----------+----------------------+----------
Q1 | 5 1 | 6
Q2 | 4 2 | 6
Q3 | 3 3 | 6
Q4 | 2 4 | 6
Q5 | 1 5 | 6
Q6 | 0 6 | 6
-----------+----------------------+----------
Total | 15 21 | 36

Combine overlapping categorical variables

I am trying to "combine" two categorical variables in Stata (say var1 and var2) into a new (also categorical) variable (say res).
The example below illustrates what I am trying to achieve:
var1 var2 res
1 1 A
1 2 A
2 1 A
3 3 B
4 2 A
5 4 D
What this example does is to combine all categories of var1 and var2 that "overlap".
Here, the pair var1 == 1 and var2 == 1 initially form a group (res== A). All other pairs containing var1 == 1 or var2 == 1 should belong to the same group (hence res== A in rows 2 and 3). Because in row 2 we have var2==2, any pair with containing var2==2 should belong to the same group. That's why in row 4 res== A.
Another way to look at this problem is using the following matrix:
| 1 2 3 4
-----------------------
1 | 1 1
2 | 1
3 | 1
4 | 1
5 | 1
Because the element [1,1] is not empty (or zero), all elements in row 1 and column 1 must belong to the same group. Because [1,2] is not empty, the same is true for row 1, column 2. And so on and so forth. It does not matter which row/column you decide to start from.
egen group alone doesn't cut it.
Any ideas?
Sounds like you want to further group var1 if the values of var2 are the same. If that's the case, then you can use a program I wrote called group_id that's available from SSC. To install it, type in Stata's Command window:
ssc install group_id
Here's an example of how you would use it:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(var1 var2) str1 res
1 1 "A"
1 2 "A"
2 1 "A"
3 3 "B"
4 2 "A"
5 4 "D"
end
gen long wanted = var1
group_id wanted, matchby(var2)
list, sep(0)
and the results:
. list, sep(0)
+----------------------------+
| var1 var2 res wanted |
|----------------------------|
1. | 1 1 A 1 |
2. | 1 2 A 1 |
3. | 2 1 A 1 |
4. | 3 3 B 3 |
5. | 4 2 A 1 |
6. | 5 4 D 5 |
+----------------------------+

Populating new variable using vlookup with multiple criteria in another variable

1) A new variable should be created for each unique observation listed in variable sku, which contains repeated values.
2) These newly created variables should be assigned the value of own product's price at the store/week level, as long as observations' sku value is in the same subcategory (subc) as the variable itself. For example, in eta2,3, observations in line 3, 4, and 5 have the same value because they all belong to the same subcategory as sku #3. [eta2,3 indicates sku 3, subc 2.]
3) x indicates that this is the original value for the product/subcategory that is currently being replicated.
4) If an observation doesn't belong to the same subcategory, it should reflect "0".
Orange is the given data. In green are the values from the steps 1, 2, and 3. White cells are step 4.
I am unable to offer a solution of my own, as searching for a
way to generate a variable using existing observations hasn't given me results.
I also understand that it must be a combination of forvalues, foreach, and levelsof commands?
clear
input units price sku week store subc
3 4.3 1 1 1 1
2 3 2 1 1 1
1 2.5 3 1 1 2
4 12 5 1 1 2
5 12 6 1 1 3
35 4.3 1 1 2 1
23 3 2 1 2 1
12 2.5 3 1 2 2
35 12 5 1 2 2
35 12 6 1 2 3
3 20 1 2 1 1
2 30 2 2 1 1
4 40 3 2 2 2
1 50 4 2 2 2
9 10 5 2 2 2
2 90 6 2 2 3
end
UPDATE
Based on Nick Cox' feedback, this is the final code that gives the result I have been looking for:
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
35 5.3 1 2 1 1
23 4 2 2 1 1
12 3.5 3 2 1 2
10 2 4 2 1 2
35 13 5 2 1 2
35 13 6 2 1 3
end
egen joint = group(subc sku), label
bysort store week : gen freq = _N
su freq, meanonly
local jmax = r(max)
drop freq
tostring subc sku, replace
gen new = subc + "_"+sku
su joint, meanonly
forval j = 1/`r(max)'{
local J = new[`j']
gen eta`J' = .
}
sort subc week store sku
egen joint1 = group(subc week store), label
gen long id = _n
su joint1, meanonly
quietly forval i = 1/`r(max)' {
su id if joint1 == `i', meanonly
local jmin = r(min)
local jmax = r(max)
forval j = `jmin'/`jmax' {
local subc = subc[`j']
local sku = sku[`j']
replace eta`subc'_`sku' = price[`j'] in `jmin'/`jmax'
replace eta`subc'_`sku' = 0 in `j'/`j'
}
}
I worry on your behalf that in a dataset of any size what you ask for would mean many, many extra variables. I wonder on your behalf whether you need all of them any way for whatever you want to do with them.
That aside, this seems to be what you want. Naturally your column headers in your spreadsheet view aren't legal variable names. Disclosure: despite being the original author of levelsof I wouldn't prefer its use here.
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
end
sort subc sku
* subc identifiers guaranteed to be integers 1 up
egen subc_id = group(subc), label
* observation numbers in a variable
gen long id = _n
* how many subc? loop over the range
su subc_id, meanonly
forval i = 1/`r(max)' {
* which subc is this one? look it up using -summarize-
* assuming that subc is numeric!
su subc if subc_id == `i', meanonly
local I = r(min)
* which observation numbers for this subc?
* given the prior sort, they are all contiguous
su id if subc_id == `i', meanonly
* for each observation in the subc, find out the sku and copy its price
* to all observations in that subc
forval j = `r(min)'/`r(max)' {
local J = sku[`j']
gen eta_`I'_`J' = cond(subc_id == `i', price[`j'], 0)
}
}
list subc eta*, sepby(subc)
+------------------------------------------------------------------+
| subc eta_1_1 eta_1_2 eta_2_3 eta_2_4 eta_2_5 eta_3_6 |
|------------------------------------------------------------------|
1. | 1 4.3 3 0 0 0 0 |
2. | 1 4.3 3 0 0 0 0 |
|------------------------------------------------------------------|
3. | 2 0 0 2.5 1 12 0 |
4. | 2 0 0 2.5 1 12 0 |
5. | 2 0 0 2.5 1 12 0 |
|------------------------------------------------------------------|
6. | 3 0 0 0 0 0 12 |
+------------------------------------------------------------------+
Notes:
N1. In your example, subc is numbered 1, 2, etc. My extra variable subc_id ensures that to be true even if in your real data the identifiers are not so clean.
N2. The expression
cond(subc_id == `i', price[`j'], 0)
could also be
(subc_id == `i') * price[`j']
N3. It seems possible that a different data structure would be much more efficient.
EDIT: Here is code and results for another data structure.
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
end
sort subc sku
egen subc_id = group(subc), label
bysort subc : gen freq = _N
su freq, meanonly
local jmax = r(max)
drop freq
forval j = 1/`jmax' {
gen eta`j' = .
gen which`j' = .
}
gen long id = _n
su subc_id, meanonly
quietly forval i = 1/`r(max)' {
su id if subc_id == `i', meanonly
local jmin = r(min)
local jmax = r(max)
local k = 1
forval j = `jmin'/`jmax' {
replace which`k' = sku[`j'] in `jmin'/`jmax'
replace eta`k' = price[`j'] in `jmin'/`jmax'
local ++k
}
}
list subc sku *1 *2 *3 , sepby(subc)
+------------------------------------------------------------+
| subc sku eta1 which1 eta2 which2 eta3 which3 |
|------------------------------------------------------------|
1. | 1 1 4.3 1 3 2 . . |
2. | 1 2 4.3 1 3 2 . . |
|------------------------------------------------------------|
3. | 2 3 2.5 3 1 4 12 5 |
4. | 2 4 2.5 3 1 4 12 5 |
5. | 2 5 2.5 3 1 4 12 5 |
|------------------------------------------------------------|
6. | 3 6 12 6 . . . . |
+------------------------------------------------------------+
I am adding another answer that tackles combinations of subc and week. Previous discussion establishes that what you are trying to do would add an extra variable for every observation. This can't be a good idea! At best, you might just have many new variables, mostly zeros. At worst, you will run into Stata's limits.
Hence I won't support your endeavour to go further down the same road, but show how the second data structure I discuss in my previous answer can be produced. Indeed, you haven't indicated (a) why you want all these variables, which are just the existing data redistributed; (b) what your strategy is for dealing with them; (c) why rangestat (SSC) or some other program could not remove the need to create them in the first place.
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
35 5.3 1 2 1 1
23 4 2 2 1 1
12 3.5 3 2 1 2
10 2 4 2 1 2
35 13 5 2 1 2
35 13 6 2 1 3
end
sort subc week sku
egen joint = group(subc week), label
bysort joint : gen freq = _N
su freq, meanonly
local jmax = r(max)
drop freq
forval j = 1/`jmax' {
gen eta`j' = .
gen which`j' = .
}
gen long id = _n
su joint, meanonly
quietly forval i = 1/`r(max)' {
su id if joint == `i', meanonly
local jmin = r(min)
local jmax = r(max)
local k = 1
forval j = `jmin'/`jmax' {
replace which`k' = sku[`j'] in `jmin'/`jmax'
replace eta`k' = price[`j'] in `jmin'/`jmax'
local ++k
}
}
list subc week sku *1 *2 *3 , sepby(subc week)
+-------------------------------------------------------------------+
| subc week sku eta1 which1 eta2 which2 eta3 which3 |
|-------------------------------------------------------------------|
1. | 1 1 1 4.3 1 3 2 . . |
2. | 1 1 2 4.3 1 3 2 . . |
|-------------------------------------------------------------------|
3. | 1 2 1 5.3 1 4 2 . . |
4. | 1 2 2 5.3 1 4 2 . . |
|-------------------------------------------------------------------|
5. | 2 1 3 2.5 3 1 4 12 5 |
6. | 2 1 4 2.5 3 1 4 12 5 |
7. | 2 1 5 2.5 3 1 4 12 5 |
|-------------------------------------------------------------------|
8. | 2 2 3 3.5 3 2 4 13 5 |
9. | 2 2 4 3.5 3 2 4 13 5 |
10. | 2 2 5 3.5 3 2 4 13 5 |
|-------------------------------------------------------------------|
11. | 3 1 6 12 6 . . . . |
|-------------------------------------------------------------------|
12. | 3 2 6 13 6 . . . . |
+-------------------------------------------------------------------+
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
35 5.3 1 2 1 1
23 4 2 2 1 1
12 3.5 3 2 1 2
10 2 4 2 1 2
35 13 5 2 1 2
35 13 6 2 1 3
end
egen joint = group(subc sku), label
bysort store week : gen freq = _N
su freq, meanonly
local jmax = r(max)
drop freq
tostring subc sku, replace
gen new = subc + "_"+sku
su joint, meanonly
forval j = 1/`r(max)'{
local J = new[`j']
gen eta`J' = .
}
sort subc week store sku
egen joint1 = group(subc week store), label
gen long id = _n
su joint1, meanonly
quietly forval i = 1/`r(max)' {
su id if joint1 == `i', meanonly
local jmin = r(min)
local jmax = r(max)
forval j = `jmin'/`jmax' {
local subc = subc[`j']
local sku = sku[`j']
replace eta`subc'_`sku' = price[`j'] in `jmin'/`jmax'
replace eta`subc'_`sku' = 0 in `j'/`j'
}
}
list subc sku store week eta*, sepby(subc)
+---------------------------------------------------------------------------------+
| store week subc sku eta1_1 eta1_2 eta2_3 eta2_4 eta2_5 eta3_6 |
|---------------------------------------------------------------------------------|
1. | 1 1 1 2 4.3 0 . . . . |
2. | 1 1 1 1 0 3 . . . . |
|---------------------------------------------------------------------------------|
3. | 1 1 2 4 . . 2.5 0 12 . |
4. | 1 1 2 3 . . 0 1 12 . |
5. | 1 1 2 5 . . 2.5 1 0 . |
|---------------------------------------------------------------------------------|
6. | 1 1 3 6 . . . . . 0 |
|---------------------------------------------------------------------------------|
7. | 1 2 1 2 5.3 0 . . . . |
8. | 1 2 1 1 0 4 . . . . |
|---------------------------------------------------------------------------------|
9. | 1 2 2 3 . . 0 2 13 . |
10. | 1 2 2 5 . . 3.5 2 0 . |
11. | 1 2 2 4 . . 3.5 0 13 . |
|---------------------------------------------------------------------------------|
12. | 1 2 3 6 . . . . . 0 |
+---------------------------------------------------------------------------------+

Creating complete data in stata

I have the following purchasing data
clear
input id productid purchase
1 1 1
2 1 1
3 2 1
1 3 1
end
I want to add a row for every id-productid combo to create the following dataset
id productid purchase
1 1 1
2 1 1
3 1 0
1 2 0
2 2 0
3 2 1
1 3 1
2 3 0
3 3 0
end
I have tried a lot that has not work. This is my latest.
qui sum id, d
local obs = r(N)
expand = `obs'
levelsof productid, local(id)
local j = 1
foreach i of local id {
replace productid = `i' if `j' == id
local j = `j' + 1
}
The fillin command (see help fillin) is the tool for this task.
Starting with your sample data in memory:
fillin id productid
replace purchase = 0 if _fillin
drop _fillin
sort productid id
list, sepby(productid) abbreviate(12)
produces
+---------------------------+
| id productid purchase |
|---------------------------|
1. | 1 1 1 |
2. | 2 1 1 |
3. | 3 1 0 |
|---------------------------|
4. | 1 2 0 |
5. | 2 2 0 |
6. | 3 2 1 |
|---------------------------|
7. | 1 3 1 |
8. | 2 3 0 |
9. | 3 3 0 |
+---------------------------+

Count observations within dynamic range

Consider the following example:
input group day month year number treatment NUM
1 1 2 2000 1 1 2
1 1 6 2000 2 0 .
1 1 9 2000 3 0 .
1 1 5 2001 4 0 .
1 1 1 2010 5 1 1
1 1 5 2010 6 0 .
2 1 1 2001 1 1 0
2 1 3 2002 2 1 0
end
gen date = mdy(month,day,year)
format date %td
drop day month year
For each group, I have a varying number of observations. Each observations refers to an event that is specified with a date. Variable number is the numbering within each group.
Now, I want to count the number of observations that occur one year starting from the date of each treatment observation (excluding itself) within this group. This means, I want to create the variable NUM that I have already put into my example above. I do not care about the number of observations with treatment = 0.
EDIT Begin: The following information was found to be missing but necessary to tackle this problem: The treatment variable will have a value of 1 if there is no observation within the same group in the last year. Thus it is also not possible that the variable NUM will have to consider observations with treatment = 1. In principal, it is possible that there are two observations within a group that have identical dates. EDIT End
I have looked into Stata tip 51: Events in intervals. It seems to work out however my dataset is huge (> 1 mio observations) such that it is really really inefficient - especially because I do not care about all treatment = 0 observations.
I was wondering if there is any alternative. My approach was to look for the observation with the latest date within each group that is still in the range of 1 year (and maybe store it in variable latestDate). Then I would simply subtract the value in variable number of the observation found from the value in count of the treatment = 0 variable.
Note: My "inefficient" code looks as follows
gsort -treatment
gen treatment_id = _n
replace treatment_id = . if treatment==0
gen count=.
sum treatment_id, meanonly
qui forval i = 1/`r(max)'{
count if inrange(date-date[`i'],1,365) & group == group[`i']
replace count = r(N) in `i'
}
sort group date
I am assuming that treatment can't occur within 1 year of the previous treatment (in the group). This is true in your example data, but may not be true in general. But, assuming that it is the case, then this should work. I'm using carryforward which is on SSC (ssc install carryforward). Like your latestDate thought, I determine one year after the most recent treatment and count the number of observations in that window.
sort group date
gen yrafter = (date + 365) if treatment == 1
by group: carryforward yrafter, replace
format yrafter %td
gen in_window = date <= yrafter & treatment == 0
egen answer = sum(in_window), by(group yrafter)
replace answer = . if treatment == 0
I can't promise this will be faster than a loop but I suspect that it will be.
The question is not completely clear.
Consider the following data with two different results, num2 and num3:
+-----------------------------------------+
| date2 group treat num2 num3 |
|-----------------------------------------|
| 01feb2000 1 1 3 2 |
| 01jun2000 1 0 . . |
| 01sep2000 1 0 . . |
| 01nov2000 1 1 0 0 |
| 01may2002 1 0 . . |
| 01jan2010 1 1 1 1 |
| 01may2010 1 0 . . |
|-----------------------------------------|
| 01jan2001 2 1 0 0 |
| 01mar2002 2 1 0 0 |
+-----------------------------------------+
The variable num2 is computed assuming you are interested in counting all observations that are within a one-year period after a treated observation (treat == 1), be those observations equal to 0 or 1 for treat. For example, after 01feb2000, there are three observations that comply with the time span condition; two have treat==0 and one has treat == 1, and they are all counted.
The variable num3 is also counting observations that are within a one-year period after a treated observation, but only the cases for which treat == 0.
num2 is computed with code in the spirit of the article you have cited. The use of in makes the run more efficient and there is no gsort (as in your code), which is quite slow. I have assumed that in each group there are no repeated dates:
clear
set more off
input ///
group str15 date count treat num
1 01.02.2000 1 1 2
1 01.06.2000 2 0 .
1 01.09.2000 3 0 .
1 01.11.2000 3 1 .
1 01.05.2002 4 0 .
1 01.01.2010 5 1 1
1 01.05.2010 6 0 .
2 01.01.2001 1 1 0
2 01.03.2002 2 1 0
end
list
gen date2 = date(date,"DMY")
format date2 %td
drop date count num
order date
list, sepby(group)
*----- what you want -----
gen num2 = .
isid group date, sort
forvalues j = 1/`=_N' {
count in `j'/L if inrange(date2 - date2[`j'], 1, 365) & group == group[`j']
replace num2 = r(N) in `j'
}
replace num2 = . if !treat
list, sepby(group)
num3 is computed with code similar in spirit (and results) as that posted by #jfeigenbaum:
<snip>
*----- what you want -----
isid group date, sort
by group: gen indicat = sum(treat)
sort group indicat, stable
by group indicat: egen num3 = total(inrange(date2 - date2[1], 1, 365))
replace num3 = . if !treat
list, sepby(group)
Even more than two interpretations are possible for your problem, but I'll leave it at that.
(Note that I have changed your example data to include cases that probably make the problem more realistic.)