Updated exposure variables in Stata - stata

I'm trying to create a variable for updated body mass index (bmi) through 4 visits of a study. I've tried the below but it only lists the value from the last visit. My data is in wide format where visit_v1 = 1 if the participant was present for visit 1 and bmi_v1 = bmi at visit 1. I want bmi_su to equal bmi_v1 if visit_v1=1, bmi_v2 if visit_v2==1, etc. Any thoughts where I'm going wrong?
gen bmi_su = .
replace bmi_su = bmi_v4 if visit_v4==1
replace bmi_su = bmi_v3 if visit_v3==1 & visit_v4==0
replace bmi_su = bmi_v2 if visit_v2==1 & visit_v4==0 & visit_v3==0
replace bmi_su = bmi_v1 if visit_v1==1 & visit_v4==0 & visit_v3==0 & visit_v2==0

Do you seek something like this:
. clear all
. set more off
.
. * Assumed data structure
. input ///
> id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3
id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3
1. 1 20 1 0 0 20 0 0
2. 1 . 0 1 0 0 25 0
3. 1 . 0 0 1 0 0 28
4. end
.
. list, noobs
+----------------------------------------------------------+
| id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3 |
|----------------------------------------------------------|
| 1 20 1 0 0 20 0 0 |
| 1 . 0 1 0 0 25 0 |
| 1 . 0 0 1 0 0 28 |
+----------------------------------------------------------+
.
. * What you want?
. gen bmisu = bmi1 + bmi2 + bmi3
.
. list, noobs
+------------------------------------------------------------------+
| id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3 bmisu |
|------------------------------------------------------------------|
| 1 20 1 0 0 20 0 0 20 |
| 1 . 0 1 0 0 25 0 25 |
| 1 . 0 0 1 0 0 28 28 |
+------------------------------------------------------------------+
?

Panel or longitudinal data are usually much better off in a long data structure or shape (some say format).
In your case, the definitions imply that the last measurement will trump earlier measurements, so it is not clear why you seem surprised.
Here are some more systematic ways to do calculations. First,
gen bmi_su = bmi_v4
forval j = 3(-1)1 {
replace bmi_su = bmi_v`j' if visit`j'
}
Second,
gen bmi_su2 = bmi_v1
forval j = 2/4 {
replace bmi_su2 = bmi_v`j' if visit`j'
}
Consider also variants of the above with if missing(bmi_su) or if missing(bmi_su2) rather than the if conditions shown.

Related

Using date range to populate year variables

I have three variables containing a designation date, a last update date, and a geographic identifier (see below)
designation designationupdate fipscode
11/2/12 9/10/21 10001
5/15/02 6/29/12 10005
11/6/12 7/7/21 10005
12/15/20 9/22/22 10005
10/4/22 10/4/22 1001
7/14/97 2/4/10 1001
I am trying to create separate year variables that take on a value of 1 if the date range includes that year and 0 if not (see below).
designation designationupdate fipscode yr2000 yr2001 yr2002 yr2003
01/01/2000 01/01/2002 3004 1 1 1 0
Is there a way to do this?
Your data example is helpful but needs surgery to be useful. See the stata tag wiki for detailed advice about asking Stata questions.
This layout (called "wide") is likely to prove awkward for many Stata needs, but here is what you ask for.
* Example generated by -dataex-.
clear
input float(date1 date2)
19299 22533
15475 19173
19303 22468
22264 22910
22922 22922
13709 18297
end
format %td date1
format %td date2
forval y = 2000/2003 {
gen yr`y' = inrange(`y', year(date1), year(date2))
}
list , sep(0)
+-----------------------------------------------------------+
| date1 date2 yr2000 yr2001 yr2002 yr2003 |
|-----------------------------------------------------------|
1. | 02nov2012 10sep2021 0 0 0 0 |
2. | 15may2002 29jun2012 0 0 1 1 |
3. | 06nov2012 07jul2021 0 0 0 0 |
4. | 15dec2020 22sep2022 0 0 0 0 |
5. | 04oct2022 04oct2022 0 0 0 0 |
6. | 14jul1997 04feb2010 1 1 1 1 |
+-----------------------------------------------------------+

Stata: Reverse Sort

I have a data set that looks like this:
id varA varB varC
1 0 10 .
1 0 20 .
1 0 35 .
2 1 60 76
2 1 76 60
2 0 32 .
I want to create the varC that reverses the order of varB only for the values varA=1 and missing otherwise.
This may help:
clear
input id varA varB varC
1 0 10 .
1 0 20 .
1 0 35 .
2 1 60 76
2 1 76 60
2 0 32 .
end
gen group = sum(id != id[_n-1] | varA != varA[_n-1])
sort group, stable
by group: gen wanted = cond(varA == 1, varB[_N - _n + 1], .)
list id var* wanted, sepby(id varA)
+----------------------------------+
| id varA varB varC wanted |
|----------------------------------|
1. | 1 0 10 . . |
2. | 1 0 20 . . |
3. | 1 0 35 . . |
|----------------------------------|
4. | 2 1 60 76 76 |
5. | 2 1 76 60 60 |
|----------------------------------|
6. | 2 0 32 . . |
+----------------------------------+

Aggregate dummy variables to multiple categorical variables

I have 8 dummy variables (0/1). Those 8 variables have to be aggregated to one categorical variable with 8 items (categories). Normally, people should have just marked one out of the 8 dummy variables, but some marked multiple ones.
When a Person has marked two items, the first value should go into the first categorical variable, whereas the second value should go to the second categorical variable. When there are 3 items marked, the third values should go into a third categorical variable and so on (up to 3).
I know how to aggregate the dummies to a categorical variable, but I do not know which approach there is to divide the values to different variables, based on the number of marked dummies.
If the problem is not clear, please tell me. It was difficult for me to describe it properly.
Edit:
My approach is the follwoing:
local MCM_zahl4 F0801 F0802 F0803 F0804 F0805 F0806 F0807 F0808
gen MCM_zaehl_4 = 0
foreach var of varlist `MCM_zahl4' {
replace MCM_zaehl_4 = MCM_zaehl_4 + 1 if `var' == 1
}
tab MCM_zaehl_4
/*
MCM_zaehl_4 | Freq. Percent Cum.
------------+-----------------------------------
0 | 31 4.74 4.74
1 | 598 91.44 96.18
2 | 22 3.36 99.54
3 | 3 0.46 100.00
------------+-----------------------------------
Total | 654 100.00
*/
gen bildu2 = -999999
gen bildu2_D = -999999
replace bildu2 = 1 if F0801 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 2 if F0802 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 3 if F0803 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 4 if F0804 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 5 if F0805 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 6 if F0806 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 7 if F0807 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 8 if F0808 == 1 & MCM_zaehl_4 == 1
Then I split all cases MCM_zaehl_4 > 1 manually in three variables.
E. g. for two mcm:
replace bildu2 = 5 if ID == XXX
replace bildu2_D = 2 if ID == XXX
For that approach I'd need an auomation, because for more observations I won't be able to do it manually.
If I understood you correctly, you could try the following to aggregate your multiples dummy variables into multiple aggregate columns based on the number of answers that the person marked. It assumes the repeated answers are consecutive. I reduced your problem to 6 dummy (a1-a6) and people can answer up to 3 questions.
clear
input id a1 a2 a3 a4 a5 a6
1 1 0 0 0 0 0
2 1 1 0 0 0 0
3 1 1 1 0 0 0
4 1 1 1 0 0 0
5 0 1 0 0 0 0
6 1 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 1
end
egen n_asnwers = rowtotal(a*)
gen wanted_1 = .
gen wanted_2 = .
gen wanted_3 = .
local i = 1
foreach v of varlist a* {
replace wanted_1 = `v' if `v' == 1 & n_asnwers == 1
replace wanted_2 = `v' if `v' == 1 & n_asnwers == 2
replace wanted_3 = `v' if `v' == 1 & n_asnwers == 3
local ++i
}
list
/*
+------------------------------------------------------------------------------+
| id a1 a2 a3 a4 a5 a6 n_asnw~s wanted_1 wanted_2 wanted_3 |
|------------------------------------------------------------------------------|
1. | 1 1 0 0 0 0 0 1 1 . . |
2. | 2 1 1 0 0 0 0 2 . 1 . |
3. | 3 1 1 1 0 0 0 3 . . 1 |
4. | 4 1 1 1 0 0 0 3 . . 1 |
5. | 5 0 1 0 0 0 0 1 1 . . |
|------------------------------------------------------------------------------|
6. | 6 1 0 0 0 0 0 1 1 . . |
7. | 7 0 0 0 0 1 0 1 1 . . |
8. | 8 0 0 0 0 0 1 1 1 . . |
+------------------------------------------------------------------------------+
*/

How can I create a trailing count for binary data in Stata?

In Stata, I currently have a data set that looks like:
I am trying to create a "trailing counter" in column B so that it looks like:
Here, the counter starts at 1 and for every time a "1" appears in A, B adds on a value.
This seems to be very simple, but I am not sure how to do this exactly. Here is what I have done so far:
Assuming the column A is called "A" in Stata,
I use:
gen B = A + A[_n - 1]
But, this gives me something off. I am not sure how to proceed, would anyone have any tips?
Here's one way:
clear all
set more off
*----- example data -----
input ///
var1
0
0
0
0
1
0
0
1
0
0
0
end
list, sep(0)
*----- what you want -----
gen counter = sum(var1) + 1
list, sep(0)
The sum() function will give you a cumulative sum. See help sum(). This is a very basic Stata function. A search sum would have gotten you there quickly.
Your approach fails because you are only adding up, for each observation, the "current" value of A with the previous value of itself. That might sound like a cumulative sum, but think about it and you will see that it isn't.
With your code and my data, the result would be:
+----------------+
| var1 counter |
|----------------|
1. | 0 . |
2. | 0 0 |
3. | 0 0 |
4. | 0 0 |
5. | 1 1 |
6. | 0 1 |
7. | 0 0 |
8. | 1 1 |
9. | 0 1 |
10. | 0 0 |
11. | 0 0 |
+----------------+
The first observation for counter is missing (.). That is because there's no previous value for the first observation of var1, so Stata does something like var1[1] + var1[0] = 0 + . = ..
The second observation for counter is var1[2] + var1[1] = 0 + 0 = 0.
The fifth observation for counter is var1[5] + var1[4] = 1 + 0 = 1.
The seventh observation for counter is var1[7] + var1[6] = 0 + 0 = 0. And so on.

Stata: Maximum number of consecutive occurrences of the same value across variables

Observations in my dataset are players, and binary variables temp1 up are equal to 1 if the player made a move, and equal to zero otherwise.
I would like to to calculate the maximum number of consecutive moves per player.
+------------+------------+-------+-------+-------+-------+-------+-------+
| simulation | playerlist | temp1 | temp2 | temp3 | temp4 | temp5 | temp6 |
+------------+------------+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 1 |
+------------+------------+-------+-------+-------+-------+-------+-------+
My idea was to generate auxiliary variables in a loop, which would count consecutive duplicates and then apply egen, rowmax():
+------------+------------+------+------+------+------+------+------+------+
| simulation | playerlist | aux1 | aux2 | aux3 | aux4 | aux5 | aux6 | _max |
+------------+------------+------+------+------+------+------+------+------+
| 1 | 1 | 0 | 1 | 2 | 3 | 0 | 0 | 3 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 2 | 2 |
+------------+------------+------+------+------+------+------+------+------+
I am struggling with introducing a local counter variable that would be incrementally increased by 1 if consecutive move is made, and would be reset to zero otherwise (the code below keeps auxiliary variables fixed..):
quietly forval i = 1/42 { /*42 is max number of variables temp*/
local j = 1
gen aux`i'=.
local j = `j'+1
replace aux`i'= `j' if temp`i'!=0
}
Tactical answer
You can concatenate your move* variables into a single string and look for the longest substring of 1s.
egen history = concat(move*)
gen max = 0
quietly forval j = 1/6 {
replace max = `j' if strpos(history, substr("111111", 1, `j'))
}
If the number is much more than 6, use something like
 local lookfor : di _dup(42) "1" 
quietly forval j = 1/42 {
replace max = `j' if strpos(history, substr("`lookfor'", 1, `j'))
}
Compare also http://www.stata-journal.com/article.html?article=dm0056
Strategic answer
Storing a sequence rowwise is working against the grain so far as Stata is concerned. Much more flexibility is available if you reshape long and tsset your data as panel data. Note that the code here uses tsspell which must be installed from SSC using ssc inst tsspell.
tsspell is dedicated to identifying spells or runs in which some condition remains true. Here the condition is that a variable is 1 and since the only other allowed value is 0 that is equivalent to a variable being positive. tsspell creates three variables, giving spell identifier, sequence within spell and whether the spell is ending. Here the maximum length of spell is just the maximum sequence number for each game.
. input simulation playerlist temp1 temp2 temp3 temp4 temp5 temp6
simulat~n playerl~t temp1 temp2 temp3 temp4 temp5 temp6
1. 1 1 0 1 1 1 0 0
2. 1 2 1 0 0 0 1 1
3. end
. reshape long temp , i(sim playerlist) j(seq)
(note: j = 1 2 3 4 5 6)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 2 -> 12
Number of variables 8 -> 4
j variable (6 values) -> seq
xij variables:
temp1 temp2 ... temp6 -> temp
-----------------------------------------------------------------------------
. egen id = group(sim playerlist)
. tsset id seq
panel variable: id (strongly balanced)
time variable: seq, 1 to 6
delta: 1 unit
. tsspell, p(temp)
. egen max = max(_seq), by(id)
. l
+--------------------------------------------------------------------+
| simula~n player~t seq temp id _seq _spell _end max |
|--------------------------------------------------------------------|
1. | 1 1 1 0 1 0 0 0 3 |
2. | 1 1 2 1 1 1 1 0 3 |
3. | 1 1 3 1 1 2 1 0 3 |
4. | 1 1 4 1 1 3 1 1 3 |
5. | 1 1 5 0 1 0 0 0 3 |
|--------------------------------------------------------------------|
6. | 1 1 6 0 1 0 0 0 3 |
7. | 1 2 1 1 2 1 1 1 2 |
8. | 1 2 2 0 2 0 0 0 2 |
9. | 1 2 3 0 2 0 0 0 2 |
10. | 1 2 4 0 2 0 0 0 2 |
|--------------------------------------------------------------------|
11. | 1 2 5 1 2 1 2 0 2 |
12. | 1 2 6 1 2 2 2 1 2 |
+--------------------------------------------------------------------+