I would like to find a way to create an indicator flag across rows such that once a criterion has been met, the flag persists across all cases within a group.
In the sample data below, I have a variable _p that defines statistical significance of the comparison of values in _mar across levels of _m. I also have a grouping variable _g that indicates the comparisons are made within a group.
The variables _f_s and _f_n represent the end result that I would like to have.
clear
input _mar _m _p _g _f_s _f_n
2.99 0 0.00000 0 1 0
3.03 1 0.00000 0 1 0
3.05 2 0.00000 0 1 1
3.06 3 0.22179 0 0 1
3.07 4 0.18044 0 0 1
3.07 5 0.58009 0 0 1
3.06 6 0.40620 0 0 1
3.06 7 0.47257 0 0 1
3.06 8 0.91196 0 0 1
3.05 9 0.68560 0 0 1
2.65 0 0.00000 1 1 0
2.70 1 0.00000 1 1 0
2.73 2 0.00103 1 1 0
2.75 3 0.00944 1 1 1
2.75 4 0.64713 1 0 1
2.76 5 0.55476 1 0 1
2.77 6 0.32807 1 0 1
2.78 7 0.03271 1 0 1
2.78 8 0.00219 1 0 1
2.79 9 0.57361 1 0 1
end
I would like to use the flag to indicate in a graph where statistical significance "stops", ignoring any later comparisons that happen to be significant.
Below you can also find the code that I have attempted up to this point:
Snippet 1 - graph works, lines are structured as desired
snapshot save, label("import")
snapshot list
twoway ///
(line _mar _m if _g == 0 & _f_s==1, lcolor(orange) lpattern(solid)) ///
(line _mar _m if _g == 0 & _f_n==1, lcolor(orange) lpattern(dash )) ///
(scatter _mar _m if _g == 0, mcolor(orange) msymbol(o) mlabel(_mar) mlabcolor(orange) mlabsize(vsmall) mlabposition(11)) ///
///
(line _mar _m if _g == 1 & _f_s==1, lcolor(blue*2) lpattern(solid)) ///
(line _mar _m if _g == 1 & _f_n==1, lcolor(blue*2) lpattern(dash )) ///
(scatter _mar _m if _g == 1, mcolor(blue*2) msymbol(o) mlabel(_mar) mlabcolor(blue*2) mlabsize(vsmall) mlabposition(11)) ///
, legend(off) ///
xlabel(-1(1)9 -1 " " 0 "0 " 9 "9+" ) ///
ylabel(2.5(0.10)3.5, angle(horizontal) format(%5.2f) ) ymlabel(2.5(0.10)3.5, grid nolabel) ///
xtitle( "Levels" ) ytitle("Adjusted First Year GPA", height(8) ) ///
name(good)
Snippet 2 - graph does not work, lines are not structured as desired
snapshot restore 1
sort _g _m
gen x_f_s = (_p <= .05)
replace x_f_s = 0 if x_f_s ==1 & x_f_s[_n-1]==0 & x_f_s[_n+1]==0
replace x_f_s = 1 if _m == 0
gen x_f_n = x_f_s == 0
replace x_f_n = 1 if x_f_s ==1 & x_f_s[_n+1]==0
/***** the created flags are not correct *****/
list, sepby(_g)
twoway ///
(line _mar _m if _g == 0 & x_f_s==1, lcolor(orange) lpattern(solid)) ///
(line _mar _m if _g == 0 & x_f_n==1, lcolor(orange) lpattern(dash )) ///
(scatter _mar _m if _g == 0, mcolor(orange) msymbol(o) mlabel(_mar) mlabcolor(orange) mlabsize(vsmall) mlabposition(11)) ///
///
(line _mar _m if _g == 1 & x_f_s==1, lcolor(blue*2) lpattern(solid)) ///
(line _mar _m if _g == 1 & x_f_n==1, lcolor(blue*2) lpattern(dash )) ///
(scatter _mar _m if _g == 1, mcolor(blue*2) msymbol(o) mlabel(_mar) mlabcolor(blue*2) mlabsize(vsmall) mlabposition(11)) ///
, legend(off) ///
xlabel(-1(1)9 -1 " " 0 "0 " 9 "9+" ) ///
ylabel(2.5(0.10)3.5, angle(horizontal) format(%5.2f) ) ymlabel(2.5(0.10)3.5, grid nolabel) ///
xtitle( "Levels" ) ytitle("Adjusted First Year GPA", height(8) ) ///
name(not_good)
The variables that I have tried to calculate are noted with x_f_s and x_f_n.
The flags work when there are no subsequent statistical comparisons that happen to be significant. However, when there is a significant comparison after the initial "stop" the plotting does not work.
There should also be a second flag that indicates where "non-significance" starts. This would carry forward in a similar way to the first flag.
I am using solid and dashed lines to indicate where significance exists, and then stops.
Ultimately, I would like to create flags within groups for plotting purposes.
This is how I would do it. Start by flagging each row on its own p-value:
bysort _g (_m): generate x_f_s = (_p <= .05)
bysort _g (_m): generate x_f_n = x_f_s == 0
list, sepby(_g)
+-------------------------------------------------------+
| _mar _m _p _g _f_s _f_n x_f_s x_f_n |
|-------------------------------------------------------|
1. | 2.99 0 0 0 1 0 1 0 |
2. | 3.03 1 0 0 1 0 1 0 |
3. | 3.05 2 0 0 1 1 1 0 |
4. | 3.06 3 .22179 0 0 1 0 1 |
5. | 3.07 4 .18044 0 0 1 0 1 |
6. | 3.07 5 .58009 0 0 1 0 1 |
7. | 3.06 6 .4062 0 0 1 0 1 |
8. | 3.06 7 .47257 0 0 1 0 1 |
9. | 3.06 8 .91196 0 0 1 0 1 |
10. | 3.05 9 .6856 0 0 1 0 1 |
|-------------------------------------------------------|
11. | 2.65 0 0 1 1 0 1 0 |
12. | 2.7 1 0 1 1 0 1 0 |
13. | 2.73 2 .00103 1 1 0 1 0 |
14. | 2.75 3 .00944 1 1 1 1 0 |
15. | 2.75 4 .64713 1 0 1 0 1 |
16. | 2.76 5 .55476 1 0 1 0 1 |
17. | 2.77 6 .32807 1 0 1 0 1 |
18. | 2.78 7 .03271 1 0 1 1 0 |
19. | 2.78 8 .00219 1 0 1 1 0 |
20. | 2.79 9 .57361 1 0 1 0 1 |
+-------------------------------------------------------+
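Starting again from the original data (so that x_f_s and x_f_n do not already exist), this is how you can automate the application of the first rule:
* Initial flag: each row judged on its own p-value
bysort _g (_m): generate x_f_s = (_p <= .05)
clonevar tag = x_f_s
local i 1
* Repeat until no significant row directly follows a non-significant one
while `i'== 1 {
    capture noisily {
        bysort _g (_m): assert x_f_s == 0 if _p <= .05 & (tag == 1 & tag[_n-1] == 0)
    }
    if _rc {
        * Switch off the first row of each later significant run, then refresh tag
        bysort _g (_m): replace x_f_s = 0 if _p <= .05 & (tag == 1 & tag[_n-1] == 0)
        drop tag
        clonevar tag = x_f_s
    }
    else local i 0
}
drop tag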
Which produces the desired output for x_f_s:
list
+-----------------------------------------------+
| _mar _m _p _g _f_s _f_n x_f_s |
|-----------------------------------------------|
1. | 2.99 0 0 0 1 0 1 |
2. | 3.03 1 0 0 1 0 1 |
3. | 3.05 2 0 0 1 1 1 |
4. | 3.06 3 .22179 0 0 1 0 |
5. | 3.07 4 .18044 0 0 1 0 |
|-----------------------------------------------|
6. | 3.07 5 .58009 0 0 1 0 |
7. | 3.06 6 .4062 0 0 1 0 |
8. | 3.06 7 .47257 0 0 1 0 |
9. | 3.06 8 .91196 0 0 1 0 |
10. | 3.05 9 .6856 0 0 1 0 |
|-----------------------------------------------|
11. | 2.65 0 0 1 1 0 1 |
12. | 2.7 1 0 1 1 0 1 |
13. | 2.73 2 .00103 1 1 0 1 |
14. | 2.75 3 .00944 1 1 1 1 |
15. | 2.75 4 .64713 1 0 1 0 |
|-----------------------------------------------|
16. | 2.76 5 .55476 1 0 1 0 |
17. | 2.77 6 .32807 1 0 1 0 |
18. | 2.78 7 .03271 1 0 1 0 |
19. | 2.78 8 .00219 1 0 1 0 |
20. | 2.79 9 .57361 1 0 1 0 |
+-----------------------------------------------+
The second rule is more straightforward as you only need to replace just before the cut-off point:
bysort _g (_m): generate x_f_n = x_f_s == 0
bysort _g (_m): replace x_f_n = 1 if x_f_s == 1 & x_f_s[_n+1]== 0
list
+-------------------------------------------------------+
| _mar _m _p _g _f_s _f_n x_f_s x_f_n |
|-------------------------------------------------------|
1. | 2.99 0 0 0 1 0 1 0 |
2. | 3.03 1 0 0 1 0 1 0 |
3. | 3.05 2 0 0 1 1 1 1 |
4. | 3.06 3 .22179 0 0 1 0 1 |
5. | 3.07 4 .18044 0 0 1 0 1 |
|-------------------------------------------------------|
6. | 3.07 5 .58009 0 0 1 0 1 |
7. | 3.06 6 .4062 0 0 1 0 1 |
8. | 3.06 7 .47257 0 0 1 0 1 |
9. | 3.06 8 .91196 0 0 1 0 1 |
10. | 3.05 9 .6856 0 0 1 0 1 |
|-------------------------------------------------------|
11. | 2.65 0 0 1 1 0 1 0 |
12. | 2.7 1 0 1 1 0 1 0 |
13. | 2.73 2 .00103 1 1 0 1 0 |
14. | 2.75 3 .00944 1 1 1 1 1 |
15. | 2.75 4 .64713 1 0 1 0 1 |
|-------------------------------------------------------|
16. | 2.76 5 .55476 1 0 1 0 1 |
17. | 2.77 6 .32807 1 0 1 0 1 |
18. | 2.78 7 .03271 1 0 1 0 1 |
19. | 2.78 8 .00219 1 0 1 0 1 |
20. | 2.79 9 .57361 1 0 1 0 1 |
+-------------------------------------------------------+
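As a possible shortcut, and only as a sketch: the loop may not be needed, because within each group a running sum of non-significant rows stays at zero exactly until significance first stops. The version below assumes, as in the sample data, that rows are ordered by _m within _g and that a missing _p should count as non-significant (in Stata a missing value is greater than .05). The two assert lines check the result against the _f_s and _f_n columns from the question.

```stata
clear
input _mar _m _p _g _f_s _f_n
2.99 0 0.00000 0 1 0
3.03 1 0.00000 0 1 0
3.05 2 0.00000 0 1 1
3.06 3 0.22179 0 0 1
3.07 4 0.18044 0 0 1
3.07 5 0.58009 0 0 1
3.06 6 0.40620 0 0 1
3.06 7 0.47257 0 0 1
3.06 8 0.91196 0 0 1
3.05 9 0.68560 0 0 1
2.65 0 0.00000 1 1 0
2.70 1 0.00000 1 1 0
2.73 2 0.00103 1 1 0
2.75 3 0.00944 1 1 1
2.75 4 0.64713 1 0 1
2.76 5 0.55476 1 0 1
2.77 6 0.32807 1 0 1
2.78 7 0.03271 1 0 1
2.78 8 0.00219 1 0 1
2.79 9 0.57361 1 0 1
end

* Solid-line flag: sum() is a running sum within each by-group, so it is
* still 0 only while every p-value seen so far has been <= .05
bysort _g (_m): generate byte x_f_s = sum(_p > .05) == 0

* Dashed-line flag: the last significant row plus every row after it.
* Under by:, x_f_s[_n+1] is missing on the last row of a group, so it never equals 0 there
by _g: generate byte x_f_n = x_f_s == 0 | x_f_s[_n+1] == 0

* Compare with the wanted flags supplied in the question
assert x_f_s == _f_s
assert x_f_n == _f_n
```

Because the running sum can never return to zero, a later significant comparison (rows 18 and 19 in the listing) cannot switch the solid flag back on.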
I have 8 dummy variables (0/1). Those 8 variables have to be aggregated into one categorical variable with 8 categories. Normally, people should have marked just one of the 8 dummy variables, but some marked several.
When a person has marked two items, the first value should go into the first categorical variable and the second value into a second categorical variable. When 3 items are marked, the third value should go into a third categorical variable (3 is the maximum).
I know how to aggregate the dummies into a categorical variable, but I do not know how to distribute the values across different variables based on the number of marked dummies.
If the problem is not clear, please tell me. It was difficult for me to describe it properly.
Edit:
My approach is the following:
local MCM_zahl4 F0801 F0802 F0803 F0804 F0805 F0806 F0807 F0808
gen MCM_zaehl_4 = 0
foreach var of varlist `MCM_zahl4' {
replace MCM_zaehl_4 = MCM_zaehl_4 + 1 if `var' == 1
}
tab MCM_zaehl_4
/*
MCM_zaehl_4 | Freq. Percent Cum.
------------+-----------------------------------
0 | 31 4.74 4.74
1 | 598 91.44 96.18
2 | 22 3.36 99.54
3 | 3 0.46 100.00
------------+-----------------------------------
Total | 654 100.00
*/
gen bildu2 = -999999
gen bildu2_D = -999999
replace bildu2 = 1 if F0801 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 2 if F0802 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 3 if F0803 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 4 if F0804 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 5 if F0805 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 6 if F0806 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 7 if F0807 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 8 if F0808 == 1 & MCM_zaehl_4 == 1
Then I split all cases with MCM_zaehl_4 > 1 manually into three variables.
E.g. for a case with two marks:
replace bildu2 = 5 if ID == XXX
replace bildu2_D = 2 if ID == XXX
I need to automate this approach, because with more observations I won't be able to do it manually.
If I understood you correctly, you could try the following to aggregate your multiple dummy variables into several aggregate columns, based on the number of answers each person marked. It assumes the repeated answers are consecutive. I reduced your problem to 6 dummies (a1-a6), where people can mark up to 3 of them.
clear
input id a1 a2 a3 a4 a5 a6
1 1 0 0 0 0 0
2 1 1 0 0 0 0
3 1 1 1 0 0 0
4 1 1 1 0 0 0
5 0 1 0 0 0 0
6 1 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 1
end
egen n_asnwers = rowtotal(a*)
gen wanted_1 = .
gen wanted_2 = .
gen wanted_3 = .
local i = 1
foreach v of varlist a* {
replace wanted_1 = `v' if `v' == 1 & n_asnwers == 1
replace wanted_2 = `v' if `v' == 1 & n_asnwers == 2
replace wanted_3 = `v' if `v' == 1 & n_asnwers == 3
local ++i
}
list
/*
+------------------------------------------------------------------------------+
| id a1 a2 a3 a4 a5 a6 n_asnw~s wanted_1 wanted_2 wanted_3 |
|------------------------------------------------------------------------------|
1. | 1 1 0 0 0 0 0 1 1 . . |
2. | 2 1 1 0 0 0 0 2 . 1 . |
3. | 3 1 1 1 0 0 0 3 . . 1 |
4. | 4 1 1 1 0 0 0 3 . . 1 |
5. | 5 0 1 0 0 0 0 1 1 . . |
|------------------------------------------------------------------------------|
6. | 6 1 0 0 0 0 0 1 1 . . |
7. | 7 0 0 0 0 1 0 1 1 . . |
8. | 8 0 0 0 0 0 1 1 1 . . |
+------------------------------------------------------------------------------+
*/
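If the aim is to store the category number of the first, second and third marked dummy (rather than a 1 in the column matching the number of marks), one possible sketch is to keep a running count of marks while looping over the dummies in order. The names n_marked and choice_1 to choice_3 are made up for this illustration, and the toy data are the same as above. Nothing here relies on the marked dummies being consecutive.

```stata
clear
input id a1 a2 a3 a4 a5 a6
1 1 0 0 0 0 0
2 1 1 0 0 0 0
3 1 1 1 0 0 0
4 1 1 1 0 0 0
5 0 1 0 0 0 0
6 1 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 1
end

* Running count of marks seen so far (hypothetical helper variable)
generate byte n_marked = 0

* choice_k will hold the category number of the k-th marked dummy
forvalues k = 1/3 {
    generate byte choice_`k' = .
}

forvalues j = 1/6 {
    * Count this dummy if it is marked ...
    replace n_marked = n_marked + 1 if a`j' == 1
    * ... and store its index in the slot matching the running count
    forvalues k = 1/3 {
        replace choice_`k' = `j' if a`j' == 1 & n_marked == `k'
    }
}

* Spot checks: id 3 marked a1-a3, id 7 marked only a5
assert choice_1 == 1 & choice_2 == 2 & choice_3 == 3 if id == 3
assert choice_1 == 5 & missing(choice_2) if id == 7

list id choice_*
```

After the loop n_marked equals the total number of marks per person, so it can double as the count variable. A person with more than three marks would keep only the first three in this sketch.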
This is a follow-up to my previous question: Connect IDs based on values in rows.
I would now like to consider the case, where connections between identical idb's should be classified as 0.
The output is similar to the matrix in my previous post but with diagonal elements equal to 0:
        62014  62015  62016  62017  62018
62014       0      1      0      1      1
62015       1      0      0      0      0
62016       0      0      0      0      1
62017       1      0      0      0      1
62018       1      0      1      1      0
How can I do this in Stata?
You can easily change the values in the diagonal of a matrix as follows:
: B
[symmetric]
1 2 3 4 5
+---------------------+
1 | 1 |
2 | 1 1 |
3 | 0 0 1 |
4 | 1 0 0 1 |
5 | 1 0 1 1 1 |
+---------------------+
: _diag(B, 0)
: B
[symmetric]
1 2 3 4 5
+---------------------+
1 | 0 |
2 | 1 0 |
3 | 0 0 0 |
4 | 1 0 0 0 |
5 | 1 0 1 1 0 |
+---------------------+
In the context of your question, you can simply do the following:
mata: B = foo1(A)
mata: _diag(B, 0)
getmata (idb*) = B
list
+------------------------------------------------------------------------+
| idb idd1 idd2 idd3 idb1 idb2 idb3 idb4 idb5 |
|------------------------------------------------------------------------|
1. | 62014 370490 879271 1112878 0 1 0 1 1 |
2. | 62015 457013 1112878 370490 1 0 0 0 0 |
3. | 62016 341863 1366174 533773 0 0 0 0 1 |
4. | 62017 879271 327069 341596 1 0 0 0 1 |
5. | 62018 1391443 1366174 879271 1 0 1 1 0 |
+------------------------------------------------------------------------+
Var1 is given. Var2 should take the value 1 if the current observation or any of the previous 5 observations is missing or 0. What is the syntax for Var2?
I know how to do it with a lot of if statements, but when I need to do it for the previous 50 observations that becomes too inconvenient.
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end
This question is similar to your previous one, "Finding the second smallest value", which you should cite, and so is this answer. rangestat is a community-contributed command from SSC (install it with ssc install rangestat).
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end
gen long id = _n
gen Bad = inlist(Var1, 0, .)
rangestat (sum) Bad, int(id -5 0)
list, sepby(Bad_sum)
+----------------------------------+
| Var1 Var2 id Bad Bad_sum |
|----------------------------------|
1. | 5 0 1 0 0 |
|----------------------------------|
2. | . 1 2 1 1 |
3. | 2 1 3 0 1 |
4. | 5 1 4 0 1 |
5. | 7 1 5 0 1 |
6. | 9 1 6 0 1 |
7. | 5 1 7 0 1 |
|----------------------------------|
8. | 9 0 8 0 0 |
|----------------------------------|
9. | 0 1 9 1 1 |
10. | 2 1 10 0 1 |
11. | 7 1 11 0 1 |
12. | 5 1 12 0 1 |
13. | 3 1 13 0 1 |
14. | 2 1 14 0 1 |
|----------------------------------|
15. | 5 0 15 0 0 |
+----------------------------------+
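Bad_sum counts the bad values in each window, so the wanted indicator would be Bad_sum > 0. If installing rangestat is not an option, a rough alternative is to loop over the lags with explicit subscripts. This is only a sketch: the local macro window and the variable wanted are names invented here, and the _n > k guard is needed because Var1[_n-k] is missing before the first observation and would otherwise be counted as bad.

```stata
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end

* Number of previous observations to look back over (change to 50 if needed)
local window 5

generate byte wanted = 0
forvalues k = 0/`window' {
    * Flag the row if the value k rows back exists and is missing or zero
    replace wanted = 1 if _n > `k' & (missing(Var1[_n-`k']) | Var1[_n-`k'] == 0)
}

* Check against the Var2 column supplied in the question
assert wanted == Var2
```

The loop makes one pass over the data per lag, which is fine for a window of 5 or 50, though rangestat is likely to be the more convenient tool for large or irregular windows.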
I have a dataset with:
A unique person_id.
Different subjects that the person took in the past (humanities, IT, business etc.).
The Degree of each subject.
This looks as follows:
person_id  humanities  business  IT  Degree
        1           0         1   0  BSc
        1           0         0   1  MSc
        2           1         0   0  PhD
        2           0         1   0  MSc
        2           0         0   1  BSc
        3           0         0   1  BSc
I would like to transform this dataset so that I have variables consisting of each possible combination of degree and subject for each person_id.
The idea is that when I collapse it later by person_id, I will have one value for each person (namely 0 or 1). I have twelve different subjects and four main degrees.
person_id  humanities  business  IT  Degree  BSc_humanities  MSc_Hum
        1           0         1   0  BSc                  0        0
        1           0         0   1  MSc                  0        0
        2           1         0   0  PhD                  0        1
        2           1         0   0  MSc                  0        1
        2           0         0   1  BSc                  0        1
        3           0         0   1  BSc                  0        0
What would be the best possible way to achieve this?
You could use fillin:
clear
input person_id humanities business IT str3 Degree
1 0 1 0 BSc
1 0 0 1 MSc
2 1 0 0 PhD
2 0 1 0 MSc
2 0 0 1 BSc
3 0 0 1 BSc
end
fillin person_id humanities business Degree
list person_id humanities business Degree
+-----------------------------------------+
| person~d humani~s business Degree |
|-----------------------------------------|
1. | 1 0 0 BSc |
2. | 1 0 0 MSc |
3. | 1 0 0 PhD |
4. | 1 0 1 BSc |
5. | 1 0 1 MSc |
|-----------------------------------------|
6. | 1 0 1 PhD |
7. | 1 1 0 BSc |
8. | 1 1 0 MSc |
9. | 1 1 0 PhD |
10. | 1 1 1 BSc |
|-----------------------------------------|
11. | 1 1 1 MSc |
12. | 1 1 1 PhD |
13. | 2 0 0 BSc |
14. | 2 0 0 MSc |
15. | 2 0 0 PhD |
|-----------------------------------------|
16. | 2 0 1 BSc |
17. | 2 0 1 MSc |
18. | 2 0 1 PhD |
19. | 2 1 0 BSc |
20. | 2 1 0 MSc |
|-----------------------------------------|
21. | 2 1 0 PhD |
22. | 2 1 1 BSc |
23. | 2 1 1 MSc |
24. | 2 1 1 PhD |
25. | 3 0 0 BSc |
|-----------------------------------------|
26. | 3 0 0 MSc |
27. | 3 0 0 PhD |
28. | 3 0 1 BSc |
29. | 3 0 1 MSc |
30. | 3 0 1 PhD |
|-----------------------------------------|
31. | 3 1 0 BSc |
32. | 3 1 0 MSc |
33. | 3 1 0 PhD |
34. | 3 1 1 BSc |
35. | 3 1 1 MSc |
|-----------------------------------------|
36. | 3 1 1 PhD |
+-----------------------------------------+
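If the end goal is one 0/1 variable per degree-subject combination and one row per person, another route (a sketch, not necessarily what the asker had in mind) is to generate the indicators row by row and then collapse with the max statistic. The degree list BSc MSc PhD is typed in by hand here and is an assumption. With twelve subjects and four degrees, both lists would need extending.

```stata
clear
input person_id humanities business IT str3 Degree
1 0 1 0 BSc
1 0 0 1 MSc
2 1 0 0 PhD
2 0 1 0 MSc
2 0 0 1 BSc
3 0 0 1 BSc
end

* One indicator per degree-subject pair, e.g. PhD_humanities, BSc_IT
foreach d in BSc MSc PhD {
    foreach s of varlist humanities business IT {
        generate byte `d'_`s' = (`s' == 1 & Degree == "`d'")
    }
}

* One row per person: an indicator is 1 if any of that person's rows had it
collapse (max) BSc_* MSc_* PhD_*, by(person_id)

* Spot checks against the input data
assert PhD_humanities == (person_id == 2)
assert BSc_IT == inlist(person_id, 2, 3)

list
```

The generated names follow the pattern degree_subject, so they stay valid as long as the degree strings contain no spaces or other characters that are illegal in Stata variable names.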
The following command can generate dummy variables:
tabulate age, generate(I)
Nevertheless, when I want a dummy based on multiple variables, what should I do?
For example, I would like to do the following concisely:
generate I1=1 if age==1 & year==2000
generate I2=1 if age==1 & year==2001
generate I3=1 if age==2 & year==2000
generate I4=1 if age==2 & year==2001
I have already tried this:
tabulate age year, generate(I)
However, it did not work.
You can get what you want as follows:
sysuse auto, clear
keep if !missing(rep78)
egen rf = group(rep78 foreign)
tabulate rf, generate(I)
group(rep78 |
foreign) | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 2.90 2.90
2 | 8 11.59 14.49
3 | 27 39.13 53.62
4 | 3 4.35 57.97
5 | 9 13.04 71.01
6 | 9 13.04 84.06
7 | 2 2.90 86.96
8 | 9 13.04 100.00
------------+-----------------------------------
Total | 69 100.00
list I* in 1 / 10
+---------------------------------------+
| I1 I2 I3 I4 I5 I6 I7 I8 |
|---------------------------------------|
1. | 0 0 1 0 0 0 0 0 |
2. | 0 0 1 0 0 0 0 0 |
3. | 0 0 1 0 0 0 0 0 |
4. | 0 0 0 0 1 0 0 0 |
5. | 0 0 1 0 0 0 0 0 |
6. | 0 0 1 0 0 0 0 0 |
7. | 0 0 1 0 0 0 0 0 |
8. | 0 0 1 0 0 0 0 0 |
9. | 0 0 1 0 0 0 0 0 |
10. | 0 1 0 0 0 0 0 0 |
+---------------------------------------+
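Applied to the variables in the question, the same idea might look like the sketch below. The five-row dataset is invented purely for illustration and ay is a made-up name for the group variable. The label option of egen's group() function attaches the underlying age and year values to each group, which makes the tabulate output easier to read.

```stata
clear
input age year
1 2000
1 2001
2 2000
2 2001
2 2001
end

* One integer code per observed (age, year) combination, in sorted order
egen ay = group(age year), label

* One 0/1 dummy per combination: I1, I2, I3, I4
tabulate ay, generate(I)

* The dummies line up with the combinations listed in the question
assert I1 == (age == 1 & year == 2000)
assert I4 == (age == 2 & year == 2001)

list
```

Unlike generate I1=1 if ..., which leaves missing values on the non-matching rows, these dummies are 0 there. Only combinations that actually occur in the data get a dummy.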