Aggregate dummy variables to multiple categorical variables - stata

I have 8 dummy variables (0/1). Those 8 variables have to be aggregated to one categorical variable with 8 items (categories). Normally, people should have just marked one out of the 8 dummy variables, but some marked multiple ones.
When a Person has marked two items, the first value should go into the first categorical variable, whereas the second value should go to the second categorical variable. When there are 3 items marked, the third values should go into a third categorical variable and so on (up to 3).
I know how to aggregate the dummies to a categorical variable, but I do not know which approach there is to divide the values to different variables, based on the number of marked dummies.
If the problem is not clear, please tell me. It was difficult for me to describe it properly.
Edit:
My approach is the follwoing:
local MCM_zahl4 F0801 F0802 F0803 F0804 F0805 F0806 F0807 F0808
gen MCM_zaehl_4 = 0
foreach var of varlist `MCM_zahl4' {
replace MCM_zaehl_4 = MCM_zaehl_4 + 1 if `var' == 1
}
tab MCM_zaehl_4
/*
MCM_zaehl_4 | Freq. Percent Cum.
------------+-----------------------------------
0 | 31 4.74 4.74
1 | 598 91.44 96.18
2 | 22 3.36 99.54
3 | 3 0.46 100.00
------------+-----------------------------------
Total | 654 100.00
*/
gen bildu2 = -999999
gen bildu2_D = -999999
replace bildu2 = 1 if F0801 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 2 if F0802 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 3 if F0803 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 4 if F0804 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 5 if F0805 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 6 if F0806 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 7 if F0807 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 8 if F0808 == 1 & MCM_zaehl_4 == 1
Then I split all cases MCM_zaehl_4 > 1 manually in three variables.
E. g. for two mcm:
replace bildu2 = 5 if ID == XXX
replace bildu2_D = 2 if ID == XXX
For that approach I'd need an auomation, because for more observations I won't be able to do it manually.

If I understood you correctly, you could try the following to aggregate your multiples dummy variables into multiple aggregate columns based on the number of answers that the person marked. It assumes the repeated answers are consecutive. I reduced your problem to 6 dummy (a1-a6) and people can answer up to 3 questions.
clear
input id a1 a2 a3 a4 a5 a6
1 1 0 0 0 0 0
2 1 1 0 0 0 0
3 1 1 1 0 0 0
4 1 1 1 0 0 0
5 0 1 0 0 0 0
6 1 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 1
end
egen n_asnwers = rowtotal(a*)
gen wanted_1 = .
gen wanted_2 = .
gen wanted_3 = .
local i = 1
foreach v of varlist a* {
replace wanted_1 = `v' if `v' == 1 & n_asnwers == 1
replace wanted_2 = `v' if `v' == 1 & n_asnwers == 2
replace wanted_3 = `v' if `v' == 1 & n_asnwers == 3
local ++i
}
list
/*
+------------------------------------------------------------------------------+
| id a1 a2 a3 a4 a5 a6 n_asnw~s wanted_1 wanted_2 wanted_3 |
|------------------------------------------------------------------------------|
1. | 1 1 0 0 0 0 0 1 1 . . |
2. | 2 1 1 0 0 0 0 2 . 1 . |
3. | 3 1 1 1 0 0 0 3 . . 1 |
4. | 4 1 1 1 0 0 0 3 . . 1 |
5. | 5 0 1 0 0 0 0 1 1 . . |
|------------------------------------------------------------------------------|
6. | 6 1 0 0 0 0 0 1 1 . . |
7. | 7 0 0 0 0 1 0 1 1 . . |
8. | 8 0 0 0 0 0 1 1 1 . . |
+------------------------------------------------------------------------------+
*/

Related

Connect IDs based on values in rows, ignoring connection between identical IDs

This is a follow-up to my previous question: Connect IDs based on values in rows.
I would now like to consider the case, where connections between identical idb's should be classified as 0.
The output is similar to the matrix in my previous post but with diagonal elements equal to 0:
62014 62015 62016 62017 62018
62014 0 1 0 1 1
62015 1 0 0 0 0
62016 0 0 0 0 1
62017 1 0 0 0 1
62018 1 0 1 1 0
How can I do this in Stata?
You can easily change the values in the diagonal of a matrix as follows:
: B
[symmetric]
1 2 3 4 5
+---------------------+
1 | 1 |
2 | 1 1 |
3 | 0 0 1 |
4 | 1 0 0 1 |
5 | 1 0 1 1 1 |
+---------------------+
: _diag(B, 0)
: B
[symmetric]
1 2 3 4 5
+---------------------+
1 | 0 |
2 | 1 0 |
3 | 0 0 0 |
4 | 1 0 0 0 |
5 | 1 0 1 1 0 |
+---------------------+
In the context of your question, you can simply do the following:
mata: B = foo1(A)
mata: _diag(B, 0)
getmata (idb*) = B
list
+------------------------------------------------------------------------+
| idb idd1 idd2 idd3 idb1 idb2 idb3 idb4 idb5 |
|------------------------------------------------------------------------|
1. | 62014 370490 879271 1112878 0 1 0 1 1 |
2. | 62015 457013 1112878 370490 1 0 0 0 0 |
3. | 62016 341863 1366174 533773 0 0 0 0 1 |
4. | 62017 879271 327069 341596 1 0 0 0 1 |
5. | 62018 1391443 1366174 879271 1 0 1 1 0 |
+------------------------------------------------------------------------+

Check whether a range contains specific values

Var1 is given. Var2 should take value 1 if the Observation or one of the previous 5 observations is a missing value or 0. What is the Syntax for Var2?
I know how to do it with a lot of if Statements. But when I need to do it for the previous 50 observations that gets too inconvenient.
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end
The question is similar to your previous --Finding the second smallest value -- which you should quote. So is this answer. rangestat is from SSC.
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end
gen long id = _n
gen Bad = inlist(Var1, 0, .)
rangestat (sum) Bad, int(id -5 0)
list, sepby(Bad_sum)
+----------------------------------+
| Var1 Var2 id Bad Bad_sum |
|----------------------------------|
1. | 5 0 1 0 0 |
|----------------------------------|
2. | . 1 2 1 1 |
3. | 2 1 3 0 1 |
4. | 5 1 4 0 1 |
5. | 7 1 5 0 1 |
6. | 9 1 6 0 1 |
7. | 5 1 7 0 1 |
|----------------------------------|
8. | 9 0 8 0 0 |
|----------------------------------|
9. | 0 1 9 1 1 |
10. | 2 1 10 0 1 |
11. | 7 1 11 0 1 |
12. | 5 1 12 0 1 |
13. | 3 1 13 0 1 |
14. | 2 1 14 0 1 |
|----------------------------------|
15. | 5 0 15 0 0 |
+----------------------------------+

Use Stata to generate new variable, based on combination types of variables

Say I have a dataset with three variables a, b, c, and having 5 observations. Like the following:
a b c
1 0 1
1 1 1
0 1 0
0 1 1
1 0 0
Now I want to generate a new variable called type, which is a possible combination of variable a, b and c. Specifically,
type=1 if a=b=c=0
type=2 if a=c=0 & b=1
type=3 if a=b=0 & c=1
type=4 if a=0 & b=c=1
type=5 if a=1 & b=c=0
type=6 if a=b=1 & c=0
type=7 if a=c=1 & b=0
type=8 if a=b=c=1
The new dataset I want to get is:
a b c type
1 0 1 7
1 1 1 8
0 1 0 2
0 1 1 4
1 0 0 5
Are there any general ways to realize this in Stata? It's better if this can also be extended when type is large, say 100 types. Thx a lot.
If the specific values of type don't matter, egen's group function works.
E.g.:
clear
input a b c
1 0 1
1 1 1
0 1 0
0 1 1
1 0 0
0 1 0
1 1 1
end
sort a b c // not necessary, but clearer presentation
egen type = group(a b c)
li
with the result
+------------------+
| a b c type |
|------------------|
1. | 0 1 0 1 |
2. | 0 1 0 1 |
3. | 0 1 1 2 |
4. | 1 0 0 3 |
5. | 1 0 1 4 |
|------------------|
6. | 1 1 1 5 |
7. | 1 1 1 5 |
+------------------+

Updated exposure variables in Stata

I'm trying to create a variable for updated body mass index (bmi) through 4 visits of a study. I've tried the below but it only lists the value from the last visit. My data is in wide format where visit_v1 = 1 if the participant was present for visit 1 and bmi_v1 = bmi at visit 1. I want bmi_su to equal bmi_v1 if visit_v1=1, bmi_v2 if visit_v2==1, etc. Any thoughts where I'm going wrong?
gen bmi_su = .
replace bmi_su = bmi_v4 if visit_v4==1
replace bmi_su = bmi_v3 if visit_v3==1 & visit_v4==0
replace bmi_su = bmi_v2 if visit_v2==1 & visit_v4==0 & visit_v3==0
replace bmi_su = bmi_v1 if visit_v1==1 & visit_v4==0 & visit_v3==0 & visit_v2==0
Do you seek something like this:
. clear all
. set more off
.
. * Assumed data structure
. input ///
> id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3
id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3
1. 1 20 1 0 0 20 0 0
2. 1 . 0 1 0 0 25 0
3. 1 . 0 0 1 0 0 28
4. end
.
. list, noobs
+----------------------------------------------------------+
| id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3 |
|----------------------------------------------------------|
| 1 20 1 0 0 20 0 0 |
| 1 . 0 1 0 0 25 0 |
| 1 . 0 0 1 0 0 28 |
+----------------------------------------------------------+
.
. * What you want?
. gen bmisu = bmi1 + bmi2 + bmi3
.
. list, noobs
+------------------------------------------------------------------+
| id bmi visit1 visit2 visit3 bmi1 bmi2 bmi3 bmisu |
|------------------------------------------------------------------|
| 1 20 1 0 0 20 0 0 20 |
| 1 . 0 1 0 0 25 0 25 |
| 1 . 0 0 1 0 0 28 28 |
+------------------------------------------------------------------+
?
Panel or longitudinal data are usually much better off in a long data structure or shape (some say format).
In your case, the definitions imply that the last measurement will trump earlier measurements, so it is not clear why you seem surprised.
Here are some more systematic ways to do calculations. First,
gen bmi_su = bmi_v4
forval j = 3(-1)1 {
replace bmi_su = bmi_v`j' if visit`j'
}
Second,
gen bmi_su2 = bmi_v1
forval j = 2/4 {
replace bmi_su2 = bmi_v`j' if visit`j'
}
Consider also variants of the above with if missing(bmi_su) or if missing(bmi_su2) rather than the if conditions shown.

Stata: Maximum number of consecutive occurrences of the same value across variables

Observations in my dataset are players, and binary variables temp1 up are equal to 1 if the player made a move, and equal to zero otherwise.
I would like to to calculate the maximum number of consecutive moves per player.
+------------+------------+-------+-------+-------+-------+-------+-------+
| simulation | playerlist | temp1 | temp2 | temp3 | temp4 | temp5 | temp6 |
+------------+------------+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 1 |
+------------+------------+-------+-------+-------+-------+-------+-------+
My idea was to generate auxiliary variables in a loop, which would count consecutive duplicates and then apply egen, rowmax():
+------------+------------+------+------+------+------+------+------+------+
| simulation | playerlist | aux1 | aux2 | aux3 | aux4 | aux5 | aux6 | _max |
+------------+------------+------+------+------+------+------+------+------+
| 1 | 1 | 0 | 1 | 2 | 3 | 0 | 0 | 3 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 2 | 2 |
+------------+------------+------+------+------+------+------+------+------+
I am struggling with introducing a local counter variable that would be incrementally increased by 1 if consecutive move is made, and would be reset to zero otherwise (the code below keeps auxiliary variables fixed..):
quietly forval i = 1/42 { /*42 is max number of variables temp*/
local j = 1
gen aux`i'=.
local j = `j'+1
replace aux`i'= `j' if temp`i'!=0
}
Tactical answer
You can concatenate your move* variables into a single string and look for the longest substring of 1s.
egen history = concat(move*)
gen max = 0
quietly forval j = 1/6 {
replace max = `j' if strpos(history, substr("111111", 1, `j'))
}
If the number is much more than 6, use something like
 local lookfor : di _dup(42) "1" 
quietly forval j = 1/42 {
replace max = `j' if strpos(history, substr("`lookfor'", 1, `j'))
}
Compare also http://www.stata-journal.com/article.html?article=dm0056
Strategic answer
Storing a sequence rowwise is working against the grain so far as Stata is concerned. Much more flexibility is available if you reshape long and tsset your data as panel data. Note that the code here uses tsspell which must be installed from SSC using ssc inst tsspell.
tsspell is dedicated to identifying spells or runs in which some condition remains true. Here the condition is that a variable is 1 and since the only other allowed value is 0 that is equivalent to a variable being positive. tsspell creates three variables, giving spell identifier, sequence within spell and whether the spell is ending. Here the maximum length of spell is just the maximum sequence number for each game.
. input simulation playerlist temp1 temp2 temp3 temp4 temp5 temp6
simulat~n playerl~t temp1 temp2 temp3 temp4 temp5 temp6
1. 1 1 0 1 1 1 0 0
2. 1 2 1 0 0 0 1 1
3. end
. reshape long temp , i(sim playerlist) j(seq)
(note: j = 1 2 3 4 5 6)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 2 -> 12
Number of variables 8 -> 4
j variable (6 values) -> seq
xij variables:
temp1 temp2 ... temp6 -> temp
-----------------------------------------------------------------------------
. egen id = group(sim playerlist)
. tsset id seq
panel variable: id (strongly balanced)
time variable: seq, 1 to 6
delta: 1 unit
. tsspell, p(temp)
. egen max = max(_seq), by(id)
. l
+--------------------------------------------------------------------+
| simula~n player~t seq temp id _seq _spell _end max |
|--------------------------------------------------------------------|
1. | 1 1 1 0 1 0 0 0 3 |
2. | 1 1 2 1 1 1 1 0 3 |
3. | 1 1 3 1 1 2 1 0 3 |
4. | 1 1 4 1 1 3 1 1 3 |
5. | 1 1 5 0 1 0 0 0 3 |
|--------------------------------------------------------------------|
6. | 1 1 6 0 1 0 0 0 3 |
7. | 1 2 1 1 2 1 1 1 2 |
8. | 1 2 2 0 2 0 0 0 2 |
9. | 1 2 3 0 2 0 0 0 2 |
10. | 1 2 4 0 2 0 0 0 2 |
|--------------------------------------------------------------------|
11. | 1 2 5 1 2 1 2 0 2 |
12. | 1 2 6 1 2 2 2 1 2 |
+--------------------------------------------------------------------+