I have a dataset in long form that lists observations by month. I want to identify if consecutive rows for a variable can cancel out (in other words, have the same absolute value). And if so, I want to change both observations to zero. In addition, I want to have an additional dummy variable that tells me if I've changed anything for that row. How can I structure the code?
For example,
Date Var1 Var 2
Jan2010 5 6
Feb2010 6 0
Mar2010 -6 1
In the above example, I want to make the dataset into below
Date Var1 Var 2 Dummy
Jan2010 5 6 0
Feb2010 0 0 1
Mar2010 0 0 1
This (seemingly) meets the criteria described, but other considerations may come into play if there are other factors not explicitly mentioned (e.g., do you need to consider whether Var2 "cancels out"? What if Apr2010 is 6? etc.).
clear
input str7 Date Var1 Var2
"Jan2010" 5 6
"Feb2010" 6 0
"Mar2010" -6 1
end
gen Dummy = Var1 == Var1[_n+1] * -1 | Var1 == Var1[_n-1] * -1
replace Var1 = 0 if Dummy
replace Var2 = 0 if Dummy
li , noobs
yielding
+-------------------------------+
| Date Var1 Var2 Dummy |
|-------------------------------|
| Jan2010 5 6 0 |
| Feb2010 0 0 1 |
| Mar2010 0 0 1 |
+-------------------------------+
Or perhaps more correctly, Dummy should be generated with respect to actual months and not observations:
gen Month = monthly(Date, "MY")
format Month %tm
tsset Month , monthly
gen Dummy = Var1 == Var1[_n+1] * -1 | Var1 == Var1[_n-1] * -1
Edit: As Roberto rightly points out, the previous code (using abs()) was written based on the example posted, but multiplying by -1 is more robust and yields the same result (for the sample data posted). And the suggestion to preserve the original variables is of course a generally good idea.
Related
My dataset contains multiple variables called avar_1 to bvar_10 referring to the history of an individual. For some reasons, the history is not always complete and there are some "gaps" (e.g. avar_1 and avar_4 are non-missing, but avar_2 and avar_3 are missing). For each individual, I want to store the first non-missing value in a new variable called var1 the second non-missing in var2 etc, so that I have a history without missing values.
I've tried the following code
local x=1
foreach wave in a b {
forval i=1/10 {
capture drop var`x'
generate var`x'=.
capture replace var`x'=`wave'var`i' if !mi(`wave'`var'`i')
if (!mi(var`x')) {
local x=1+`x'
}
}
}
var1 is generated properly but var2 only contains missings and following variables are not generated. However, I set trace on and saw that the var2 is actually replaced for all variables from avar_1 to bvar_10.
My guess is that the local x is not correctly updated as its value change for the whole dataset but should be different for each observation.
Is that the problem and if so, how can I avoid it?
A concise concrete data example is worth more than a long explanation. Your description seems consistent with an example like this:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 id float(avar_1 avar_2 avar_3 bvar_1 bvar_2)
"A" 1 . 6 8 10
"B" 2 4 . 9 .
"C" 3 5 7 . 11
end
* 4 is specific to this example.
rename (bvar_*) (avar_#), renumber(4)
reshape long avar_, i(id) j(which)
(note: j = 1 2 3 4 5)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 3 -> 15
Number of variables 6 -> 3
j variable (5 values) -> which
xij variables:
avar_1 avar_2 ... avar_5 -> avar_
-----------------------------------------------------------------------------
drop if missing(avar_)
bysort id (which) : replace which = _n
list, sepby(id)
+--------------------+
| id which avar_ |
|--------------------|
1. | A 1 1 |
2. | A 2 6 |
3. | A 3 8 |
4. | A 4 10 |
|--------------------|
5. | B 1 2 |
6. | B 2 4 |
7. | B 3 9 |
|--------------------|
8. | C 1 3 |
9. | C 2 5 |
10. | C 3 7 |
11. | C 4 11 |
+--------------------+
Positive points:
Your data layout cries out for some structure given by a rename and especially by a reshape long. I don't give here code for a reshape wide as for the great majority of Stata purposes, you'd be better off with this layout.
Negative points:
!mi(var`x')
returns whether the first value of a variable is not missing. If foo were a variable in the dataset, !mi(foo) is evaluated as !mi(foo[1]). That is not what you want here. See https://www.stata.com/support/faqs/programming/if-command-versus-if-qualifier/ for the full story.
I'd recommend more evocative variable names.
My data
I am working on a spell dataset in the following format:
cls
clear all
set more off
input id spellnr str7 bdate_str str7 edate_str employed
1 1 2008m1 2008m9 1
1 2 2008m12 2009m8 0
1 3 2009m11 2010m9 1
1 4 2010m10 2011m9 0
///
2 1 2007m4 2009m12 1
2 2 2010m4 2011m4 1
2 3 2011m6 2011m8 0
end
* translate to Stata monthly dates
gen bdate = monthly(bdate_str,"YM")
gen edate = monthly(edate_str,"YM")
drop *_str
format %tm bdate edate
list, sepby(id)
Corresponding to:
+---------------------------------------------+
| id spellnr employed bdate edate |
|---------------------------------------------|
1. | 1 1 1 2008m1 2008m9 |
2. | 1 2 0 2008m12 2009m8 |
3. | 1 3 1 2009m11 2010m9 |
4. | 1 4 0 2010m10 2011m9 |
|---------------------------------------------|
5. | 2 1 1 2007m4 2009m12 |
6. | 2 2 1 2010m4 2011m4 |
7. | 2 3 0 2011m6 2011m8 |
+---------------------------------------------+
Here a given person (id) can have multiple spells (spellnr) of two types (unempl: 1 for unemployment; 0 for employment). the start-end dates of each spell are definied by bdate and edate, respectively.
Imagine the data was already cleaned, and is such that no spells overlap with each other.
There might be "missing" periods in between any two spells though.
This is captured by the dummy dataset above.
My question:
For each unemployment spell, I need to compute the number of months spent in employment in the last 6 months, 12 months, and 24 months.
Note that, importantly, each id can go in and out from employment, and all past employment spells should be taken into account (not just the last one).
In my example, this would lead to the following desired output:
+--------------------------------------------------------------+
| id spellnr employed bdate edate m6 m24 m48 |
|--------------------------------------------------------------|
1. | 1 1 1 2008m1 2008m9 . . . |
2. | 1 2 0 2008m12 2009m8 4 9 9 |
3. | 1 3 1 2009m11 2010m9 . . . |
4. | 1 4 0 2010m10 2011m9 6 11 20 |
|--------------------------------------------------------------|
5. | 2 1 1 2007m4 2009m12 . . . |
6. | 2 2 1 2010m4 2011m4 . . . |
7. | 2 3 0 2011m6 2011m8 5 20 44 |
+--------------------------------------------------------------+
My (working) attempt:
The following code returns the desired result.
* expand each spell to one observation per time unit (here "months"; works also for days)
expand edate-bdate+1
bysort id spellnr: gen spell_date = bdate + _n - 1
format %tm spell_date
list, sepby(id spellnr)
* fill-in empty months (not covered by spells)
xtset id spell_date, monthly
tsfill
* compute cumulative time spent in employment and lagged values
bysort id (spell_date): gen cum_empl = sum(employed) if employed==1
bysort id (spell_date): replace cum_empl = cum_empl[_n-1] if cum_empl==.
bysort id (spell_date): gen lag_7 = L7.cum_empl if employed==0
bysort id (spell_date): gen lag_24 = L25.cum_empl if employed==0
bysort id (spell_date): gen lag_48 = L49.cum_empl if employed==0
qui replace lag_7=0 if lag_7==. & employed==0 // fix computation for first spell of each "id" (if not enough time to go back with "L.")
qui replace lag_24=0 if lag_24==. & employed==0
qui replace lag_48=0 if lag_48==. & employed==0
* compute time spent in employment in the last 6, 24, 48 months, at the beginning of each unemployment spell
bysort id (spell_date): gen m6 = cum_empl - lag_7 if employed==0
bysort id (spell_date): gen m24 = cum_empl - lag_24 if employed==0
bysort id (spell_date): gen m48 = cum_empl - lag_48 if employed==0
qui drop if (spellnr==.)
qui bysort id spellnr (spell_date): keep if _n == 1
drop spell_date cum_empl lag_*
list
This works fine, but becomes quite inefficient when using (several millions of) daily data. Can you suggest any alternative approach that does not involve expanding the dataset?
In words what I do above is:
I expand data to have one row per month;
I fill-in the "gaps" in between the spells with -tsfill-
I Compute the running time spent in employment, and use lag operators to get the three quantities of interest.
This is in the vein of what done here, in a past question that I posted. However the working example there was unnecessarily complicated and with some mistakes.
SOLUTIONS PERFORMANCE
I tried different approaches suggested in the accepted answer below (including using joinby as suggested in an earlier version of the answer). In order to create a larger dataset I used:
expand 500000
bysort id spellnr: gen new_id = _n
drop id
rename new_id id
which creates a dataset with 500,000 id's (for a total of 3,500,000 spells).
The first solution largely dominates the ones that use joinby or rangejoin (see also the comments to the accepted answer below).
Below code might save some running time.
bys id (employed): gen tag = _n if !employed
sum tag, meanonly
local maxtag = `r(max)'
foreach i in 6 24 48 {
gen m`i' = .
forval d = 1/`maxtag' {
by id: gen x = 1 + min(bdate[`d'],edate) - max(bdate[`d']-`i',bdate) if employed
egen y = total(x*(x>0)), by(id)
replace m`i' = y if tag == `d'
drop x y
}
}
sort id bdate
The same logic, along with -rangejoin- (ssc) should also deserve a try. Please kindly provide some feedback after testing with your (large) actual data.
preserve
keep if employed
replace employed = 0
tempfile em
save `em'
restore
foreach i in 6 24 48 {
gen _bd = bdate - `i'
rangejoin edate _bd bdate using `em', by(id employed) p(_)
egen m`i' = total(_edate - max(_bd,_bdate)+1) if !employed, by(id bdate)
bys id bdate: keep if _n==1
drop _*
}
Suppose I have 100 variables named ID, var1, var2, ..., var99. I have 1000 rows. I want to browse all the rows and columns that contain a 0.
I wanted to just do this:
browse ID, var* if var* == 0
but it doesn't work. I don't want to hardcode all 99 variables obviously.
I wanted to essentially write an if like this:
gen has0 = 0
forvalues n = 1/99 {
if var`n' does not contain 0 {
drop v
} // pseudocode I know doesn't work
has0 = has0 | var`n' == 0
}
browse if has0 == 1
but obviously that doesn't work.
Do I just need to reshape the data so it has 2 columns ID, var with 100,000 rows total?
My dear colleague #NickCox forces me to reply to this (duplicate) question because he is claiming that downloading, installing and running a new command is better than using built-in ones when you "need to select from 99 variables".
Consider the following toy example:
clear
input var1 var2 var3 var4 var5
1 4 9 5 0
1 8 6 3 7
0 6 5 6 8
4 5 1 8 3
2 1 0 2 1
4 6 7 1 9
end
list
+----------------------------------+
| var1 var2 var3 var4 var5 |
|----------------------------------|
1. | 1 4 9 5 0 |
2. | 1 8 6 3 7 |
3. | 0 6 5 6 8 |
4. | 4 5 1 8 3 |
5. | 2 1 0 2 1 |
6. | 4 6 7 1 9 |
+----------------------------------+
Actually you don't have to download anything:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
list var obsno value if value == 0, noobs
+----------------------+
| var obsno value |
|----------------------|
| var5 1 0 |
| var1 3 0 |
| var3 5 0 |
+----------------------+
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
var1 var3 var5
restore
This is the approach i recommended in the linked question for identifying negative values. Using levelsof one can do the same thing with findname using a built-in command.
This solution can also be adapted for browse:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
browse var obsno value if value == 0
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
pause
restore
Although i do not see why one would want to browse the results when can simply list them.
EDIT:
Here's an example more closely resembling the OP's dataset:
clear
set seed 12345
set obs 1000
generate id = int((_n - 1) / 300) + 1
forvalues i = 1 / 100 {
generate var`i' = rnormal(0, 150)
}
ds var*
foreach var in `r(varlist)' {
generate rr = runiform()
replace `var' = 0 if rr < 0.0001
drop rr
}
Applying the above solution yields:
display "`selectedvars'"
var13 var19 var35 var36 var42 var86 var88 var90
list id var obsno value if value == 0, noobs sepby(id)
+----------------------------+
| id var obsno value |
|----------------------------|
| 1 var86 18 0 |
| 1 var19 167 0 |
| 1 var13 226 0 |
|----------------------------|
| 2 var88 351 0 |
| 2 var36 361 0 |
| 2 var35 401 0 |
|----------------------------|
| 3 var42 628 0 |
| 3 var90 643 0 |
+----------------------------+
Short answer: wildcards for bunches of variables can't be inserted in if qualifiers. (The if command is different from the if qualifier.)
Your question is contradictory on what you want. At one point your pseudocode has you dropping variables! drop has a clear, destructive meaning to Stata programmers: it doesn't mean "ignore".
But let's stick to the emphasis on browse.
findname, any(# == 0)
finds variables for which any value is 0. search findname, sj to find the latest downloadable version.
Note also that
findname, type(numeric)
will return the numeric variables in r(varlist) (and also a local macro if you so specify).
Then several egen functions compete for finding 0s in each observation for a specified varlist: the command findname evidently helps you identify which varlist.
Let's create a small sandbox to show technique:
clear
set obs 5
gen ID = _n
forval j = 1/5 {
gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3
list
findname var*, any(# == 0) local(which)
egen zero = anymatch(`which'), value(0)
list `which' if zero
+-------------+
| var2 var3 |
|-------------|
2. | 0 1 |
3. | 1 0 |
+-------------+
So, the problem is split into two: finding the observations with any zeros and finding the observations with any zeros, and then putting the information together.
Naturally, the use of findname is dispensable as you can just write your own loop to identify the variables of interest:
local wanted
quietly foreach v of var var* {
count if `v' == 0
if r(N) > 0 local wanted `wanted' `v'
}
Equally naturally, you can browse as well as list: the difference is just in the command name.
I'm relatively new to Stata and am trying to count the number of active cases an employee has open over time in my dataset (see link below for example). I tried writing a loop using forvalues based on an example I found online, but keep getting
invalid syntax
For each EmpID I want to count the number of cases that employee had open when a new case was added to the queue. So if a case is added with an OpenDate of 03/15/2015 and the EmpID has two other cases open at the time, the code would assign a value of 2 to NumActiveWhenOpened field. A case is considered active if (1) its OpenDate is less then the new case's OpenDate & (2) its CloseDate is greater than the new case's OpenDate.
The link below provides an example. I'm trying to write a loop that creates the NumActiveWhenOpened column. Any help would be greatly appreciated. Thanks!
http://i.stack.imgur.com/z4iyR.jpg
EDIT
Here is the code that is not working. I'm sure there are several things wrong with it and I'm not sure how to store the count in the [NumActiveWhenOpen] field.
by EmpID: generate CaseNum = _n
egen group = group(EmpID)
su group, meanonly
gen NumActiveWhenOpen = 0
forvalues i = 1/ 'r(max)' {
forvalues x = 1/CaseNum if group == `i'{
count if OpenDate[_n] > OpenDate[_n-x] & CloseDate[_n-x] > OpenDate[_n]
}
}
This sounds like a problem discussed in http://www.stata-journal.com/article.html?article=dm0068 but let's try to be self-contained. I am not sure that I understand the definitions, but this may help.
I'll steal part of Roberto Ferrer's sandbox.
clear
set more off
input ///
caseid str15(open close) empid
1 "1/1/2010" "3/1/2010" 1
2 "2/5/2010" "" 1
3 "2/15/2010" "4/7/2010" 1
4 "3/5/2010" "" 1
5 "3/15/2010" "6/15/2010" 1
6 "3/24/2010" "3/24/2010" 1
1 "1/1/2010" "3/1/2010" 2
2 "2/5/2010" "" 2
3 "2/15/2010" "4/7/2010" 2
4 "3/5/2010" "" 2
5 "3/15/2010" "6/15/2010" 2
end
gen d1 = date(open, "MDY")
gen d2 = date(close, "MDY")
format %td d1 d2
drop open close
reshape long d, i(empid caseid) j(status)
replace status = -1 if status == 2
replace status = . if missing(d)
bysort empid (d) : gen nopen = sum(status)
bysort empid d : replace nopen = nopen[_N]
l
The idea is to reshape so that each pair of dates becomes two observations. Then if we code each opening by 1 and each closing by -1 the total number of active cases is their cumulative sum. That's all. Here are the results:
. l, sepby(empid)
+---------------------------------------------+
| empid caseid status d nopen |
|---------------------------------------------|
1. | 1 1 1 01jan2010 1 |
2. | 1 2 1 05feb2010 2 |
3. | 1 3 1 15feb2010 3 |
4. | 1 1 -1 01mar2010 2 |
5. | 1 4 1 05mar2010 3 |
6. | 1 5 1 15mar2010 4 |
7. | 1 6 1 24mar2010 4 |
8. | 1 6 -1 24mar2010 4 |
9. | 1 3 -1 07apr2010 3 |
10. | 1 5 -1 15jun2010 2 |
11. | 1 2 . . 2 |
12. | 1 4 . . 2 |
|---------------------------------------------|
13. | 2 1 1 01jan2010 1 |
14. | 2 2 1 05feb2010 2 |
15. | 2 3 1 15feb2010 3 |
16. | 2 1 -1 01mar2010 2 |
17. | 2 4 1 05mar2010 3 |
18. | 2 5 1 15mar2010 4 |
19. | 2 3 -1 07apr2010 3 |
20. | 2 5 -1 15jun2010 2 |
21. | 2 4 . . 2 |
22. | 2 2 . . 2 |
+---------------------------------------------+
The bottom line is no loops needed, but by: helps mightily. A detail useful here is that the cumulative sum function sum() ignores missings.
Try something along the lines of
clear
set more off
*----- example data -----
input ///
caseid str15(open close) empid numact
1 "1/1/2010" "3/1/2010" 1 0
2 "2/5/2010" "" 1 1
3 "2/15/2010" "4/7/2010" 1 2
4 "3/5/2010" "" 1 2
5 "3/15/2010" "6/15/2010" 1 3
6 "3/24/2010" "3/24/2010" 1 .
1 "1/1/2010" "3/1/2010" 2 0
2 "2/5/2010" "" 2 1
3 "2/15/2010" "4/7/2010" 2 2
4 "3/5/2010" "" 2 2
5 "3/15/2010" "6/15/2010" 2 3
end
gen opend = date(open, "MDY")
gen closed = date(close, "MDY")
format %td opend closed
drop open close
order empid
list, sepby(empid)
*----- what you want -----
gen numact2 = .
sort empid caseid
forvalues i = 1/`=_N' {
count if empid[`i'] == empid & /// a different count for each employee
opend[`i'] <= closed /// the date condition
in 1/`i' // no need to look at cases that have not yet occurred
replace numact2 = r(N) - 1 in `i'
}
list, sepby(empid)
This is resource intensive so if you have a large data set, it will take some time. The reason is it loops over observations checking conditions. See help stored results and help return for an explanation of r(N).
A good read is
Stata tip 51: Events in intervals, The Stata Journal, by Nicholas J. Cox.
Note how I provided an example data set within the code (see help input). That is how I recommend you do it for future questions. This will save other people's time and increase the probabilities of you getting an answer.
Consider the following example:
input group day month year number treatment NUM
1 1 2 2000 1 1 2
1 1 6 2000 2 0 .
1 1 9 2000 3 0 .
1 1 5 2001 4 0 .
1 1 1 2010 5 1 1
1 1 5 2010 6 0 .
2 1 1 2001 1 1 0
2 1 3 2002 2 1 0
end
gen date = mdy(month,day,year)
format date %td
drop day month year
For each group, I have a varying number of observations. Each observations refers to an event that is specified with a date. Variable number is the numbering within each group.
Now, I want to count the number of observations that occur one year starting from the date of each treatment observation (excluding itself) within this group. This means, I want to create the variable NUM that I have already put into my example above. I do not care about the number of observations with treatment = 0.
EDIT Begin: The following information was found to be missing but necessary to tackle this problem: The treatment variable will have a value of 1 if there is no observation within the same group in the last year. Thus it is also not possible that the variable NUM will have to consider observations with treatment = 1. In principal, it is possible that there are two observations within a group that have identical dates. EDIT End
I have looked into Stata tip 51: Events in intervals. It seems to work out however my dataset is huge (> 1 mio observations) such that it is really really inefficient - especially because I do not care about all treatment = 0 observations.
I was wondering if there is any alternative. My approach was to look for the observation with the latest date within each group that is still in the range of 1 year (and maybe store it in variable latestDate). Then I would simply subtract the value in variable number of the observation found from the value in count of the treatment = 0 variable.
Note: My "inefficient" code looks as follows
gsort -treatment
gen treatment_id = _n
replace treatment_id = . if treatment==0
gen count=.
sum treatment_id, meanonly
qui forval i = 1/`r(max)'{
count if inrange(date-date[`i'],1,365) & group == group[`i']
replace count = r(N) in `i'
}
sort group date
I am assuming that treatment can't occur within 1 year of the previous treatment (in the group). This is true in your example data, but may not be true in general. But, assuming that it is the case, then this should work. I'm using carryforward which is on SSC (ssc install carryforward). Like your latestDate thought, I determine one year after the most recent treatment and count the number of observations in that window.
sort group date
gen yrafter = (date + 365) if treatment == 1
by group: carryforward yrafter, replace
format yrafter %td
gen in_window = date <= yrafter & treatment == 0
egen answer = sum(in_window), by(group yrafter)
replace answer = . if treatment == 0
I can't promise this will be faster than a loop but I suspect that it will be.
The question is not completely clear.
Consider the following data with two different results, num2 and num3:
+-----------------------------------------+
| date2 group treat num2 num3 |
|-----------------------------------------|
| 01feb2000 1 1 3 2 |
| 01jun2000 1 0 . . |
| 01sep2000 1 0 . . |
| 01nov2000 1 1 0 0 |
| 01may2002 1 0 . . |
| 01jan2010 1 1 1 1 |
| 01may2010 1 0 . . |
|-----------------------------------------|
| 01jan2001 2 1 0 0 |
| 01mar2002 2 1 0 0 |
+-----------------------------------------+
The variable num2 is computed assuming you are interested in counting all observations that are within a one-year period after a treated observation (treat == 1), be those observations equal to 0 or 1 for treat. For example, after 01feb2000, there are three observations that comply with the time span condition; two have treat==0 and one has treat == 1, and they are all counted.
The variable num3 is also counting observations that are within a one-year period after a treated observation, but only the cases for which treat == 0.
num2 is computed with code in the spirit of the article you have cited. The use of in makes the run more efficient and there is no gsort (as in your code), which is quite slow. I have assumed that in each group there are no repeated dates:
clear
set more off
input ///
group str15 date count treat num
1 01.02.2000 1 1 2
1 01.06.2000 2 0 .
1 01.09.2000 3 0 .
1 01.11.2000 3 1 .
1 01.05.2002 4 0 .
1 01.01.2010 5 1 1
1 01.05.2010 6 0 .
2 01.01.2001 1 1 0
2 01.03.2002 2 1 0
end
list
gen date2 = date(date,"DMY")
format date2 %td
drop date count num
order date
list, sepby(group)
*----- what you want -----
gen num2 = .
isid group date, sort
forvalues j = 1/`=_N' {
count in `j'/L if inrange(date2 - date2[`j'], 1, 365) & group == group[`j']
replace num2 = r(N) in `j'
}
replace num2 = . if !treat
list, sepby(group)
num3 is computed with code similar in spirit (and results) as that posted by #jfeigenbaum:
<snip>
*----- what you want -----
isid group date, sort
by group: gen indicat = sum(treat)
sort group indicat, stable
by group indicat: egen num3 = total(inrange(date2 - date2[1], 1, 365))
replace num3 = . if !treat
list, sepby(group)
Even more than two interpretations are possible for your problem, but I'll leave it at that.
(Note that I have changed your example data to include cases that probably make the problem more realistic.)