Using date range to populate year variables - stata

I have three variables containing a designation date, a last update date, and a geographic identifier (see below):
designation designationupdate fipscode
11/2/12 9/10/21 10001
5/15/02 6/29/12 10005
11/6/12 7/7/21 10005
12/15/20 9/22/22 10005
10/4/22 10/4/22 1001
7/14/97 2/4/10 1001
I am trying to create separate year variables that take on a value of 1 if the date range includes that year and 0 if not (see below).
designation designationupdate fipscode yr2000 yr2001 yr2002 yr2003
01/01/2000 01/01/2002 3004 1 1 1 0
Is there a way to do this?

Your data example is helpful but needs surgery to be useful. See the stata tag wiki for detailed advice about asking Stata questions.
This layout (called "wide") is likely to prove awkward for many Stata needs, but here is what you ask for.
* Example generated by -dataex-.
clear
input float(date1 date2)
19299 22533
15475 19173
19303 22468
22264 22910
22922 22922
13709 18297
end
format %td date1
format %td date2
forval y = 2000/2003 {
gen yr`y' = inrange(`y', year(date1), year(date2))
}
list , sep(0)
+-----------------------------------------------------------+
| date1 date2 yr2000 yr2001 yr2002 yr2003 |
|-----------------------------------------------------------|
1. | 02nov2012 10sep2021 0 0 0 0 |
2. | 15may2002 29jun2012 0 0 1 1 |
3. | 06nov2012 07jul2021 0 0 0 0 |
4. | 15dec2020 22sep2022 0 0 0 0 |
5. | 04oct2022 04oct2022 0 0 0 0 |
6. | 14jul1997 04feb2010 1 1 1 1 |
+-----------------------------------------------------------+
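If the wide layout does prove awkward, one alternative is a long layout with one observation per date range and year. This is only a sketch under assumptions: it reuses the first two date pairs from the example above, and id, nyears and year are names invented here for illustration.

```stata
clear
input float(date1 date2)
19299 22533
15475 19173
end
format %td date1 date2

* identifier for each original date range (hypothetical name)
gen long id = _n

* number of calendar years each range touches
gen nyears = year(date2) - year(date1) + 1

* one copy of each observation per year touched
expand nyears

* number the copies within each range to get the calendar year
bysort id : gen year = year(date1) + _n - 1

list id date1 date2 year, sepby(id)
```

With this structure a condition such as year == 2002 plays the role of the indicator yr2002, and no new variable is needed for each extra year.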

How to recode separate variables from a multiple response survey question into one variable

I am trying to recode a variable that indicates total number of responses to a multiple response survey question. Question 4 has options 1, 2, 3, 4, 5, 6, and participants may choose one or more options when submitting a response. The data is currently coded as binary outputs for each option: var Q4___1 = yes or no (1/0), var Q4___2 = yes or no (1/0), and so forth.
This is the tabstat of all yes (1) responses to the 6 Q4___* variables
Variable | Sum
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
I would like to create a new variable that encapsulates these values.
Can someone help me figure out how to create this variable, and if coding a variable in this manner for a multiple option survey question is valid?
When I used the replace command, the total number of responses did not add up, as shown below:
gen q4=.
replace q4 =1 if q4___1 == 1
replace q4 =2 if q4___2 == 1
replace q4 =3 if q4___3 == 1
replace q4 =4 if q4___4 == 1
replace q4 =5 if q4___5 == 1
replace q4 =6 if q4___6 == 1
label values q4 primarysource
q4 | Freq. Percent Cum.
------------+-----------------------------------
1 | 46 48.94 48.94
2 | 31 32.98 81.91
3 | 6 6.38 88.30
4 | 1 1.06 89.36
5 | 3 3.19 92.55
6 | 7 7.45 100.00
------------+-----------------------------------
Total | 94 100.00
UPDATE
To clarify: I am trying to create a new variable that captures the column sum of each question, not the row total across all questions. I know that 63 participants responded yes to question 4 a) and 33 to question 4 b), so I want my new variable to reflect that.
This is what I want my new variable's values to look like.
q4
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
The fallacy here is ignoring the possibility of multiple 1s as answers to the various Q4???? variables. For example if someone answers 1 1 1 1 1 1 to all questions, they appear in your final variable only in respect of their answer to the 6th question. Otherwise put, your code overwrites and so ignores all positive answers before the last positive answer.
What is likely to be more useful are
(1) the total across all 6 questions which is just
egen Q4_total = rowtotal(Q4????)
where the 4 instances of ? mean that by eye I count 3 underscores and 1 numeral.
(2) a concatenation of responses that is just
egen Q4_concat = concat(Q4????)
(3) a variable that is a concatenation of questions with positive responses, so 246 if those questions were answered 1 and the others were answered 0.
gen Q4_pos = ""
forval j = 1/6 {
replace Q4_pos = Q4_pos + "`j'" if Q4___`j' == 1
}
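To make the overwriting concrete, here is a small sketch with three hypothetical respondents (the data are invented for illustration; the variable names follow the Q4___1 to Q4___6 pattern in the question). It compares the chain of replace statements with the row total and the concatenation of positive responses.

```stata
clear
* hypothetical answers: rows are respondents, columns are options 1 to 6
input byte(Q4___1 Q4___2 Q4___3 Q4___4 Q4___5 Q4___6)
1 1 0 0 0 0
0 1 0 1 0 1
1 0 0 0 0 0
end

* the overwriting approach: each replace can clobber an earlier one
gen q4 = .
forval j = 1/6 {
    replace q4 = `j' if Q4___`j' == 1
}

* number of options chosen by each respondent
egen Q4_total = rowtotal(Q4___?)

* which options were chosen, as a string
gen Q4_pos = ""
forval j = 1/6 {
    replace Q4_pos = Q4_pos + "`j'" if Q4___`j' == 1
}

list q4 Q4_total Q4_pos
```

The second respondent chose options 2, 4 and 6, but q4 records only 6, which is why the frequencies of q4 fall short of the column sums.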
EDIT
Here is a test script giving concrete examples.
clear
set obs 6
forval j = 1/6 {
gen Q`j' = _n <= `j'
}
list
egen rowtotal = rowtotal(Q?)
su rowtotal, meanonly
di r(sum)
* install from tab_chi on SSC
tabm Q?
Results:
. list
+-----------------------------+
| Q1 Q2 Q3 Q4 Q5 Q6 |
|-----------------------------|
1. | 1 1 1 1 1 1 |
2. | 0 1 1 1 1 1 |
3. | 0 0 1 1 1 1 |
4. | 0 0 0 1 1 1 |
5. | 0 0 0 0 1 1 |
|-----------------------------|
6. | 0 0 0 0 0 1 |
+-----------------------------+
. egen rowtotal = rowtotal(Q?)
. su rowtotal, meanonly
. di r(sum)
21
. tabm Q?
| values
variable | 0 1 | Total
-----------+----------------------+----------
Q1 | 5 1 | 6
Q2 | 4 2 | 6
Q3 | 3 3 | 6
Q4 | 2 4 | 6
Q5 | 1 5 | 6
Q6 | 0 6 | 6
-----------+----------------------+----------
Total | 15 21 | 36
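If the column sums are wanted as data rather than as a table, collapse followed by reshape is one possibility. This is a sketch on the same test data, not the only route; one, question and total are names made up here, and collapse replaces the dataset in memory, so preserve first if the original data are still needed.

```stata
clear
set obs 6
forval j = 1/6 {
    gen Q`j' = _n <= `j'
}

* one observation holding the sum of each variable
collapse (sum) Q?

* reshape needs an identifier; a constant will do for a single observation
gen byte one = 1
reshape long Q, i(one) j(question)
rename Q total

list question total, sep(0)
```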

Browse all the rows and columns that contain a zero

Suppose I have 100 variables named ID, var1, var2, ..., var99. I have 1000 rows. I want to browse all the rows and columns that contain a 0.
I wanted to just do this:
browse ID, var* if var* == 0
but it doesn't work. I don't want to hardcode all 99 variables obviously.
I wanted to essentially write an if like this:
gen has0 = 0
forvalues n = 1/99 {
if var`n' does not contain 0 {
drop v
} // pseudocode I know doesn't work
has0 = has0 | var`n' == 0
}
browse if has0 == 1
but obviously that doesn't work.
Do I just need to reshape the data so it has 2 columns ID, var with 100,000 rows total?
My dear colleague @NickCox forces me to reply to this (duplicate) question because he is claiming that downloading, installing and running a new command is better than using built-in ones when you "need to select from 99 variables".
Consider the following toy example:
clear
input var1 var2 var3 var4 var5
1 4 9 5 0
1 8 6 3 7
0 6 5 6 8
4 5 1 8 3
2 1 0 2 1
4 6 7 1 9
end
list
+----------------------------------+
| var1 var2 var3 var4 var5 |
|----------------------------------|
1. | 1 4 9 5 0 |
2. | 1 8 6 3 7 |
3. | 0 6 5 6 8 |
4. | 4 5 1 8 3 |
5. | 2 1 0 2 1 |
6. | 4 6 7 1 9 |
+----------------------------------+
Actually you don't have to download anything:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
list var obsno value if value == 0, noobs
+----------------------+
| var obsno value |
|----------------------|
| var5 1 0 |
| var1 3 0 |
| var3 5 0 |
+----------------------+
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
var1 var3 var5
restore
This is the approach I recommended in the linked question for identifying negative values. Using levelsof, a built-in command, one can do the same thing that findname does.
This solution can also be adapted for browse:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
browse var obsno value if value == 0
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
pause
restore
Although I do not see why one would want to browse the results when one can simply list them.
EDIT:
Here's an example more closely resembling the OP's dataset:
clear
set seed 12345
set obs 1000
generate id = int((_n - 1) / 300) + 1
forvalues i = 1 / 100 {
generate var`i' = rnormal(0, 150)
}
ds var*
foreach var in `r(varlist)' {
generate rr = runiform()
replace `var' = 0 if rr < 0.0001
drop rr
}
Applying the above solution yields:
display "`selectedvars'"
var13 var19 var35 var36 var42 var86 var88 var90
list id var obsno value if value == 0, noobs sepby(id)
+----------------------------+
| id var obsno value |
|----------------------------|
| 1 var86 18 0 |
| 1 var19 167 0 |
| 1 var13 226 0 |
|----------------------------|
| 2 var88 351 0 |
| 2 var36 361 0 |
| 2 var35 401 0 |
|----------------------------|
| 3 var42 628 0 |
| 3 var90 643 0 |
+----------------------------+
Short answer: wildcards for bunches of variables can't be inserted in if qualifiers. (The if command is different from the if qualifier.)
Your question is contradictory on what you want. At one point your pseudocode has you dropping variables! drop has a clear, destructive meaning to Stata programmers: it doesn't mean "ignore".
But let's stick to the emphasis on browse.
findname, any(# == 0)
finds variables for which any value is 0. search findname, sj to find the latest downloadable version.
Note also that
findname, type(numeric)
will return the numeric variables in r(varlist) (and also a local macro if you so specify).
Then several egen functions compete for finding 0s in each observation for a specified varlist: the command findname evidently helps you identify which varlist.
Let's create a small sandbox to show technique:
clear
set obs 5
gen ID = _n
forval j = 1/5 {
gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3
list
findname var*, any(# == 0) local(which)
egen zero = anymatch(`which'), value(0)
list `which' if zero
+-------------+
| var2 var3 |
|-------------|
2. | 0 1 |
3. | 1 0 |
+-------------+
So, the problem is split into two: finding the variables with any zeros and finding the observations with any zeros, and then putting the information together.
Naturally, the use of findname is dispensable as you can just write your own loop to identify the variables of interest:
local wanted
quietly foreach v of var var* {
count if `v' == 0
if r(N) > 0 local wanted `wanted' `v'
}
Equally naturally, you can browse as well as list: the difference is just in the command name.
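Putting the pieces together without findname might look like the following. It is a sketch on the same sandbox as above, with the loop supplying the variable list and egen's anymatch() flagging the observations; wanted and zero are just illustrative names.

```stata
clear
set obs 5
gen ID = _n
forval j = 1/5 {
    gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3

* step 1: collect the variables that contain at least one 0
local wanted
foreach v of varlist var* {
    quietly count if `v' == 0
    if r(N) > 0 local wanted `wanted' `v'
}

* step 2: flag the observations with a 0 in any of those variables
egen zero = anymatch(`wanted'), values(0)

* step 3: show only the relevant rows and columns
list ID `wanted' if zero
```

Swapping list for browse in the last line gives the display asked for in the question.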

Stata: How to count the number of 'active' cases in a group when new case is opened?

I'm relatively new to Stata and am trying to count the number of active cases an employee has open over time in my dataset (see link below for example). I tried writing a loop using forvalues based on an example I found online, but keep getting
invalid syntax
For each EmpID I want to count the number of cases that employee had open when a new case was added to the queue. So if a case is added with an OpenDate of 03/15/2015 and the EmpID has two other cases open at the time, the code would assign a value of 2 to the NumActiveWhenOpened field. A case is considered active if (1) its OpenDate is less than the new case's OpenDate & (2) its CloseDate is greater than the new case's OpenDate.
The link below provides an example. I'm trying to write a loop that creates the NumActiveWhenOpened column. Any help would be greatly appreciated. Thanks!
http://i.stack.imgur.com/z4iyR.jpg
EDIT
Here is the code that is not working. I'm sure there are several things wrong with it and I'm not sure how to store the count in the [NumActiveWhenOpen] field.
by EmpID: generate CaseNum = _n
egen group = group(EmpID)
su group, meanonly
gen NumActiveWhenOpen = 0
forvalues i = 1/ 'r(max)' {
forvalues x = 1/CaseNum if group == `i'{
count if OpenDate[_n] > OpenDate[_n-x] & CloseDate[_n-x] > OpenDate[_n]
}
}
This sounds like a problem discussed in http://www.stata-journal.com/article.html?article=dm0068 but let's try to be self-contained. I am not sure that I understand the definitions, but this may help.
I'll steal part of Roberto Ferrer's sandbox.
clear
set more off
input ///
caseid str15(open close) empid
1 "1/1/2010" "3/1/2010" 1
2 "2/5/2010" "" 1
3 "2/15/2010" "4/7/2010" 1
4 "3/5/2010" "" 1
5 "3/15/2010" "6/15/2010" 1
6 "3/24/2010" "3/24/2010" 1
1 "1/1/2010" "3/1/2010" 2
2 "2/5/2010" "" 2
3 "2/15/2010" "4/7/2010" 2
4 "3/5/2010" "" 2
5 "3/15/2010" "6/15/2010" 2
end
gen d1 = date(open, "MDY")
gen d2 = date(close, "MDY")
format %td d1 d2
drop open close
reshape long d, i(empid caseid) j(status)
replace status = -1 if status == 2
replace status = . if missing(d)
bysort empid (d) : gen nopen = sum(status)
bysort empid d : replace nopen = nopen[_N]
l
The idea is to reshape so that each pair of dates becomes two observations. Then if we code each opening by 1 and each closing by -1 the total number of active cases is their cumulative sum. That's all. Here are the results:
. l, sepby(empid)
+---------------------------------------------+
| empid caseid status d nopen |
|---------------------------------------------|
1. | 1 1 1 01jan2010 1 |
2. | 1 2 1 05feb2010 2 |
3. | 1 3 1 15feb2010 3 |
4. | 1 1 -1 01mar2010 2 |
5. | 1 4 1 05mar2010 3 |
6. | 1 5 1 15mar2010 4 |
7. | 1 6 1 24mar2010 4 |
8. | 1 6 -1 24mar2010 4 |
9. | 1 3 -1 07apr2010 3 |
10. | 1 5 -1 15jun2010 2 |
11. | 1 2 . . 2 |
12. | 1 4 . . 2 |
|---------------------------------------------|
13. | 2 1 1 01jan2010 1 |
14. | 2 2 1 05feb2010 2 |
15. | 2 3 1 15feb2010 3 |
16. | 2 1 -1 01mar2010 2 |
17. | 2 4 1 05mar2010 3 |
18. | 2 5 1 15mar2010 4 |
19. | 2 3 -1 07apr2010 3 |
20. | 2 5 -1 15jun2010 2 |
21. | 2 4 . . 2 |
22. | 2 2 . . 2 |
+---------------------------------------------+
The bottom line is no loops needed, but by: helps mightily. A detail useful here is that the cumulative sum function sum() ignores missings.
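To get from the running total back to one number per case, one possibility is to keep only the opening events and subtract 1 for the newly opened case itself. This is a sketch under assumptions: it uses only the first five cases of one employee, so no case opens and closes on the same day, and numactive is a name invented here. Ties on dates would need a decision about what counts as active.

```stata
clear
input caseid str15(open close) empid
1 "1/1/2010"  "3/1/2010"  1
2 "2/5/2010"  ""          1
3 "2/15/2010" "4/7/2010"  1
4 "3/5/2010"  ""          1
5 "3/15/2010" "6/15/2010" 1
end
gen d1 = date(open, "MDY")
gen d2 = date(close, "MDY")
format %td d1 d2
drop open close

* two observations per case: opening (+1) and closing (-1)
reshape long d, i(empid caseid) j(status)
replace status = -1 if status == 2
replace status = . if missing(d)

* running number of open cases, made constant within each date
bysort empid (d) : gen nopen = sum(status)
bysort empid d : replace nopen = nopen[_N]

* keep openings only; the new case itself is not counted as already active
keep if status == 1
gen numactive = nopen - 1
sort empid caseid
list empid caseid d numactive, sep(0)
```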
Try something along the lines of
clear
set more off
*----- example data -----
input ///
caseid str15(open close) empid numact
1 "1/1/2010" "3/1/2010" 1 0
2 "2/5/2010" "" 1 1
3 "2/15/2010" "4/7/2010" 1 2
4 "3/5/2010" "" 1 2
5 "3/15/2010" "6/15/2010" 1 3
6 "3/24/2010" "3/24/2010" 1 .
1 "1/1/2010" "3/1/2010" 2 0
2 "2/5/2010" "" 2 1
3 "2/15/2010" "4/7/2010" 2 2
4 "3/5/2010" "" 2 2
5 "3/15/2010" "6/15/2010" 2 3
end
gen opend = date(open, "MDY")
gen closed = date(close, "MDY")
format %td opend closed
drop open close
order empid
list, sepby(empid)
*----- what you want -----
gen numact2 = .
sort empid caseid
forvalues i = 1/`=_N' {
count if empid[`i'] == empid & /// a different count for each employee
opend[`i'] <= closed /// the date condition
in 1/`i' // no need to look at cases that have not yet occurred
replace numact2 = r(N) - 1 in `i'
}
list, sepby(empid)
This is resource intensive so if you have a large data set, it will take some time. The reason is it loops over observations checking conditions. See help stored results and help return for an explanation of r(N).
A good read is
Stata tip 51: Events in intervals, The Stata Journal, by Nicholas J. Cox.
Note how I provided an example data set within the code (see help input). That is how I recommend you do it in future questions. It will save other people time and increase your chances of getting an answer.

How can I create a trailing count for binary data in Stata?

In Stata, I currently have a data set that looks like:
I am trying to create a "trailing counter" in column B so that it looks like:
Here, the counter starts at 1 and for every time a "1" appears in A, B adds on a value.
This seems to be very simple, but I am not sure how to do this exactly. Here is what I have done so far:
Assuming the column A is called "A" in Stata,
I use:
gen B = A + A[_n - 1]
But, this gives me something off. I am not sure how to proceed, would anyone have any tips?
Here's one way:
clear all
set more off
*----- example data -----
input ///
var1
0
0
0
0
1
0
0
1
0
0
0
end
list, sep(0)
*----- what you want -----
gen counter = sum(var1) + 1
list, sep(0)
The sum() function will give you a cumulative sum. See help sum(). This is a very basic Stata function. A search sum would have gotten you there quickly.
Your approach fails because you are only adding up, for each observation, the "current" value of A with the previous value of itself. That might sound like a cumulative sum, but think about it and you will see that it isn't.
With your code and my data, the result would be:
+----------------+
| var1 counter |
|----------------|
1. | 0 . |
2. | 0 0 |
3. | 0 0 |
4. | 0 0 |
5. | 1 1 |
6. | 0 1 |
7. | 0 0 |
8. | 1 1 |
9. | 0 1 |
10. | 0 0 |
11. | 0 0 |
+----------------+
The first observation for counter is missing (.). That is because there is no previous value for the first observation of var1, so Stata evaluates var1[1] + var1[0] = 0 + . = . (missing).
The second observation for counter is var1[2] + var1[1] = 0 + 0 = 0.
The fifth observation for counter is var1[5] + var1[4] = 1 + 0 = 1.
The seventh observation for counter is var1[7] + var1[6] = 0 + 0 = 0. And so on.
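The difference is that a cumulative sum adds the current value to the previous result, not to the previous value of the original variable. A sketch that builds the same counter by hand, relying on replace working through the observations in order (counter2 is an illustrative name):

```stata
clear
input var1
0
0
0
0
1
0
0
1
0
0
0
end

* built-in cumulative sum, starting the counter at 1
gen counter = sum(var1) + 1

* the same thing by recursion: previous value of the result plus current var1
gen counter2 = 1 + var1 in 1
replace counter2 = counter2[_n-1] + var1 if _n > 1

list, sep(0)
```

Both variables should agree in every observation; sum() is simply the shorter way to write it.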

Updated exposure variables in Stata

I'm trying to create a variable for updated body mass index (bmi) through 4 visits of a study. I've tried the below but it only lists the value from the last visit. My data is in wide format where visit_v1 = 1 if the participant was present for visit 1 and bmi_v1 = bmi at visit 1. I want bmi_su to equal bmi_v1 if visit_v1=1, bmi_v2 if visit_v2==1, etc. Any thoughts where I'm going wrong?
gen bmi_su = .
replace bmi_su = bmi_v4 if visit_v4==1
replace bmi_su = bmi_v3 if visit_v3==1 & visit_v4==0
replace bmi_su = bmi_v2 if visit_v2==1 & visit_v4==0 & visit_v3==0
replace bmi_su = bmi_v1 if visit_v1==1 & visit_v4==0 & visit_v3==0 & visit_v2==0
Do you seek something like this:
. clear all
. set more off
.
. * Assumed data structure
. input ///
> id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3
id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3
1. 1 20 1 0 0 20 0 0
2. 1 . 0 1 0 0 25 0
3. 1 . 0 0 1 0 0 28
4. end
.
. list, noobs
+----------------------------------------------------------+
| id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3 |
|----------------------------------------------------------|
| 1 20 1 0 0 20 0 0 |
| 1 . 0 1 0 0 25 0 |
| 1 . 0 0 1 0 0 28 |
+----------------------------------------------------------+
.
. * What you want?
. gen bmisu = bmi1 + bmi2 + bmi3
.
. list, noobs
+------------------------------------------------------------------+
| id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3 bmisu |
|------------------------------------------------------------------|
| 1 20 1 0 0 20 0 0 20 |
| 1 . 0 1 0 0 25 0 25 |
| 1 . 0 0 1 0 0 28 28 |
+------------------------------------------------------------------+
?
Panel or longitudinal data are usually much better off in a long data structure or shape (some say format).
In your case, the definitions imply that the last measurement will trump earlier measurements, so it is not clear why you seem surprised.
Here are some more systematic ways to do calculations. First,
gen bmi_su = bmi_v4
forval j = 3(-1)1 {
replace bmi_su = bmi_v`j' if visit_v`j' == 1
}
Second,
gen bmi_su2 = bmi_v1
forval j = 2/4 {
replace bmi_su2 = bmi_v`j' if visit_v`j' == 1
}
Consider also variants of the above with if missing(bmi_su) or if missing(bmi_su2) rather than the if conditions shown.
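As an illustration of how loop direction and the if condition interact, here is a sketch on hypothetical data: three invented participants, with variable names following the visit_v1 and bmi_v1 pattern in the question. It assumes BMI is recorded whenever a visit was attended. bmi_last and bmi_last2 are names made up here.

```stata
clear
* hypothetical data: three participants, four possible visits
input id visit_v1 visit_v2 visit_v3 visit_v4 bmi_v1 bmi_v2 bmi_v3 bmi_v4
1 1 1 0 0 22 23 . .
2 1 0 1 1 30 . 29 28
3 1 0 0 0 25 . . .
end

* looping forwards and overwriting: the most recent attended visit wins
gen bmi_last = .
forval j = 1/4 {
    replace bmi_last = bmi_v`j' if visit_v`j' == 1
}

* looping backwards and filling only while missing: same answer
gen bmi_last2 = .
forval j = 4(-1)1 {
    replace bmi_last2 = bmi_v`j' if missing(bmi_last2) & visit_v`j' == 1
}

list id bmi_last bmi_last2, sep(0)
```

Reversing either loop without changing its condition would instead pick the earliest attended visit, which is the kind of surprise the definitions in the question can produce.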