I have a dataset with:
A unique person_id.
Different subjects that the person took in the past (humanities, IT, business etc.).
The Degree of each subject.
This looks as follows:
person_id humanities business IT Degree
1 0 1 0 BSc
1 0 0 1 MSc
2 1 0 0 PhD
2 0 1 0 MSc
2 0 0 1 BSc
3 0 0 1 BSc
I would like to transform this dataset so that I have variables consisting of each possible combination of degree and subject for each person_id.
The idea is that when I later collapse the data by person_id, I will have one value (0 or 1) per person for each combination. I have twelve different subjects and four main degrees. The desired result would look like this:
person_id humanities business IT Degree BSc_humanities MSc_Hum
1 0 1 0 BSc 0 0
1 0 0 1 MSc 0 0
2 1 0 0 PhD 0 1
2 1 0 0 MSc 0 1
2 0 0 1 BSc 0 1
3 0 0 1 BSc 0 0
What would be the best possible way to achieve this?
You could use fillin:
clear
input person_id humanities business IT str3 Degree
1 0 1 0 BSc
1 0 0 1 MSc
2 1 0 0 PhD
2 0 1 0 MSc
2 0 0 1 BSc
3 0 0 1 BSc
end
fillin person_id humanities business Degree
list person_id humanities business Degree
+-----------------------------------------+
| person~d humani~s business Degree |
|-----------------------------------------|
1. | 1 0 0 BSc |
2. | 1 0 0 MSc |
3. | 1 0 0 PhD |
4. | 1 0 1 BSc |
5. | 1 0 1 MSc |
|-----------------------------------------|
6. | 1 0 1 PhD |
7. | 1 1 0 BSc |
8. | 1 1 0 MSc |
9. | 1 1 0 PhD |
10. | 1 1 1 BSc |
|-----------------------------------------|
11. | 1 1 1 MSc |
12. | 1 1 1 PhD |
13. | 2 0 0 BSc |
14. | 2 0 0 MSc |
15. | 2 0 0 PhD |
|-----------------------------------------|
16. | 2 0 1 BSc |
17. | 2 0 1 MSc |
18. | 2 0 1 PhD |
19. | 2 1 0 BSc |
20. | 2 1 0 MSc |
|-----------------------------------------|
21. | 2 1 0 PhD |
22. | 2 1 1 BSc |
23. | 2 1 1 MSc |
24. | 2 1 1 PhD |
25. | 3 0 0 BSc |
|-----------------------------------------|
26. | 3 0 0 MSc |
27. | 3 0 0 PhD |
28. | 3 0 1 BSc |
29. | 3 0 1 MSc |
30. | 3 0 1 PhD |
|-----------------------------------------|
31. | 3 1 0 BSc |
32. | 3 1 0 MSc |
33. | 3 1 0 PhD |
34. | 3 1 1 BSc |
35. | 3 1 1 MSc |
|-----------------------------------------|
36. | 3 1 1 PhD |
+-----------------------------------------+
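As an alternative to fillin, the indicators themselves can be built with two nested loops and then collapsed. This is only a sketch on the example data: it assumes the degree labels are exactly BSc, MSc and PhD and that only the three subjects shown exist, and the variable names of the form Degree_subject are chosen here for illustration.

```stata
clear
input person_id humanities business IT str3 Degree
1 0 1 0 BSc
1 0 0 1 MSc
2 1 0 0 PhD
2 0 1 0 MSc
2 0 0 1 BSc
3 0 0 1 BSc
end

* One 0/1 indicator per degree-subject pair (names are illustrative)
foreach d in BSc MSc PhD {
    foreach s of varlist humanities business IT {
        generate byte `d'_`s' = (Degree == "`d'") & (`s' == 1)
    }
}

* One row per person: 1 if the person ever had that combination
collapse (max) BSc_* MSc_* PhD_*, by(person_id)
list
```

With twelve subjects and four degrees, the same loops would apply once the two lists are extended, at the cost of 48 new variables.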
This is a follow-up to my previous question: Connect IDs based on values in rows.
I would now like to consider the case where connections between identical idb values are classified as 0.
The output is similar to the matrix in my previous post but with diagonal elements equal to 0:
62014 62015 62016 62017 62018
62014 0 1 0 1 1
62015 1 0 0 0 0
62016 0 0 0 0 1
62017 1 0 0 0 1
62018 1 0 1 1 0
How can I do this in Stata?
You can easily change the values in the diagonal of a matrix as follows:
: B
[symmetric]
1 2 3 4 5
+---------------------+
1 | 1 |
2 | 1 1 |
3 | 0 0 1 |
4 | 1 0 0 1 |
5 | 1 0 1 1 1 |
+---------------------+
: _diag(B, 0)
: B
[symmetric]
1 2 3 4 5
+---------------------+
1 | 0 |
2 | 1 0 |
3 | 0 0 0 |
4 | 1 0 0 0 |
5 | 1 0 1 1 0 |
+---------------------+
In the context of your question, you can simply do the following:
mata: B = foo1(A)
mata: _diag(B, 0)
getmata (idb*) = B
list
+------------------------------------------------------------------------+
| idb idd1 idd2 idd3 idb1 idb2 idb3 idb4 idb5 |
|------------------------------------------------------------------------|
1. | 62014 370490 879271 1112878 0 1 0 1 1 |
2. | 62015 457013 1112878 370490 1 0 0 0 0 |
3. | 62016 341863 1366174 533773 0 0 0 0 1 |
4. | 62017 879271 327069 341596 1 0 0 0 1 |
5. | 62018 1391443 1366174 879271 1 0 1 1 0 |
+------------------------------------------------------------------------+
Var1 is given. Var2 should take the value 1 if the current observation or any of the previous 5 observations is missing or 0. What is the syntax for creating Var2?
I know how to do it with a lot of if statements, but that becomes too inconvenient when I need to look at the previous 50 observations.
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end
This question is similar to your previous one, "Finding the second smallest value", which you should have cited. So is this answer. rangestat is from SSC (ssc install rangestat).
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end
gen long id = _n
gen Bad = inlist(Var1, 0, .)
rangestat (sum) Bad, int(id -5 0)
list, sepby(Bad_sum)
+----------------------------------+
| Var1 Var2 id Bad Bad_sum |
|----------------------------------|
1. | 5 0 1 0 0 |
|----------------------------------|
2. | . 1 2 1 1 |
3. | 2 1 3 0 1 |
4. | 5 1 4 0 1 |
5. | 7 1 5 0 1 |
6. | 9 1 6 0 1 |
7. | 5 1 7 0 1 |
|----------------------------------|
8. | 9 0 8 0 0 |
|----------------------------------|
9. | 0 1 9 1 1 |
10. | 2 1 10 0 1 |
11. | 7 1 11 0 1 |
12. | 5 1 12 0 1 |
13. | 3 1 13 0 1 |
14. | 2 1 14 0 1 |
|----------------------------------|
15. | 5 0 15 0 0 |
+----------------------------------+
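With Bad_sum in hand, the requested indicator is just Bad_sum > 0. If installing an SSC package is not an option, the same flag can be built with a short loop over lags. The following is a sketch using only built-in commands; the names bad, wanted and window are chosen here for illustration, and window can be set to 50 for the larger case.

```stata
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end

* Flag values that are 0 or missing
generate byte bad = inlist(Var1, 0, .)

* wanted = 1 if the current or any of the previous `window' observations is bad
local window 5
generate byte wanted = 0
forvalues k = 0/`window' {
    replace wanted = 1 if _n > `k' & bad[_n - `k'] == 1
}

* Check against the Var2 column supplied in the question
assert wanted == Var2
```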
The following command can generate dummy variables:
tabulate age, generate(I)
Nevertheless, when I want a dummy based on multiple variables, what should I do?
For example, I would like to do the following concisely:
generate I1=1 if age==1 & year==2000
generate I2=1 if age==1 & year==2001
generate I3=1 if age==2 & year==2000
generate I4=1 if age==2 & year==2001
I have already tried this:
tabulate age year, generate(I)
However, it did not work.
You can get what you want as follows:
sysuse auto, clear
keep if !missing(rep78)
egen rf = group(rep78 foreign)
tabulate rf, generate(I)
group(rep78 |
foreign) | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 2.90 2.90
2 | 8 11.59 14.49
3 | 27 39.13 53.62
4 | 3 4.35 57.97
5 | 9 13.04 71.01
6 | 9 13.04 84.06
7 | 2 2.90 86.96
8 | 9 13.04 100.00
------------+-----------------------------------
Total | 69 100.00
list I* in 1 / 10
+---------------------------------------+
| I1 I2 I3 I4 I5 I6 I7 I8 |
|---------------------------------------|
1. | 0 0 1 0 0 0 0 0 |
2. | 0 0 1 0 0 0 0 0 |
3. | 0 0 1 0 0 0 0 0 |
4. | 0 0 0 0 1 0 0 0 |
5. | 0 0 1 0 0 0 0 0 |
6. | 0 0 1 0 0 0 0 0 |
7. | 0 0 1 0 0 0 0 0 |
8. | 0 0 1 0 0 0 0 0 |
9. | 0 0 1 0 0 0 0 0 |
10. | 0 1 0 0 0 0 0 0 |
+---------------------------------------+
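If the dummies should carry the values of both variables in their names, explicit loops are one alternative to egen, group(). The following is a sketch on the same auto data; the I_rep78_foreign naming pattern is an assumption made here. Unlike generate I1=1 if ..., which leaves missing values where the condition is false, each variable is a true 0/1 indicator. Combinations that never occur (such as rep78 == 1 with foreign == 1) still get a variable, which is 0 throughout.

```stata
sysuse auto, clear
keep if !missing(rep78)

* Distinct values of rep78; foreign is known to be 0/1 in this dataset
levelsof rep78, local(reps)

foreach r of local reps {
    forvalues f = 0/1 {
        generate byte I_`r'_`f' = (rep78 == `r') & (foreign == `f')
    }
}

* Every observation belongs to exactly one combination
egen byte check = rowtotal(I_*)
assert check == 1
```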
Observations in my dataset are players, and the binary variables temp1, temp2, and so on are equal to 1 if the player made a move and 0 otherwise.
I would like to calculate the maximum number of consecutive moves per player.
+------------+------------+-------+-------+-------+-------+-------+-------+
| simulation | playerlist | temp1 | temp2 | temp3 | temp4 | temp5 | temp6 |
+------------+------------+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 1 |
+------------+------------+-------+-------+-------+-------+-------+-------+
My idea was to generate auxiliary variables in a loop, which would count consecutive duplicates and then apply egen, rowmax():
+------------+------------+------+------+------+------+------+------+------+
| simulation | playerlist | aux1 | aux2 | aux3 | aux4 | aux5 | aux6 | _max |
+------------+------------+------+------+------+------+------+------+------+
| 1 | 1 | 0 | 1 | 2 | 3 | 0 | 0 | 3 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 2 | 2 |
+------------+------------+------+------+------+------+------+------+------+
I am struggling with introducing a local counter that would be increased by 1 whenever a consecutive move is made and reset to zero otherwise (the code below keeps the auxiliary variables fixed):
quietly forval i = 1/42 { /*42 is max number of variables temp*/
local j = 1
gen aux`i'=.
local j = `j'+1
replace aux`i'= `j' if temp`i'!=0
}
Tactical answer
You can concatenate your temp* variables into a single string and look for the longest substring of 1s.
egen history = concat(temp*)
gen max = 0
quietly forval j = 1/6 {
replace max = `j' if strpos(history, substr("111111", 1, `j'))
}
If the number is much more than 6, use something like
local lookfor : di _dup(42) "1"
quietly forval j = 1/42 {
replace max = `j' if strpos(history, substr("`lookfor'", 1, `j'))
}
Compare also http://www.stata-journal.com/article.html?article=dm0056
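For reference, here is the tactical approach applied to the two example rows from the question, as a self-contained sketch. It assumes the move indicators are named temp1 to temp6 as in the example data; the local n is introduced here only so that the number of variables is set in one place.

```stata
clear
input simulation playerlist temp1 temp2 temp3 temp4 temp5 temp6
1 1 0 1 1 1 0 0
1 2 1 0 0 0 1 1
end

* One string of 0s and 1s per player, e.g. "011100"
egen history = concat(temp*)

* A string of n 1s; its first j characters are the run being searched for
local n 6
local lookfor : display _dup(`n') "1"

generate max = 0
quietly forvalues j = 1/`n' {
    replace max = `j' if strpos(history, substr("`lookfor'", 1, `j'))
}

list playerlist history max
```

Because j increases through the loop, max ends up holding the longest run of 1s found in each history.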
Strategic answer
Storing a sequence rowwise is working against the grain so far as Stata is concerned. Much more flexibility is available if you reshape long and tsset your data as panel data. Note that the code here uses tsspell which must be installed from SSC using ssc inst tsspell.
tsspell is dedicated to identifying spells or runs in which some condition remains true. Here the condition is that a variable is 1 and since the only other allowed value is 0 that is equivalent to a variable being positive. tsspell creates three variables, giving spell identifier, sequence within spell and whether the spell is ending. Here the maximum length of spell is just the maximum sequence number for each game.
. input simulation playerlist temp1 temp2 temp3 temp4 temp5 temp6
simulat~n playerl~t temp1 temp2 temp3 temp4 temp5 temp6
1. 1 1 0 1 1 1 0 0
2. 1 2 1 0 0 0 1 1
3. end
. reshape long temp , i(sim playerlist) j(seq)
(note: j = 1 2 3 4 5 6)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 2 -> 12
Number of variables 8 -> 4
j variable (6 values) -> seq
xij variables:
temp1 temp2 ... temp6 -> temp
-----------------------------------------------------------------------------
. egen id = group(sim playerlist)
. tsset id seq
panel variable: id (strongly balanced)
time variable: seq, 1 to 6
delta: 1 unit
. tsspell, p(temp)
. egen max = max(_seq), by(id)
. l
+--------------------------------------------------------------------+
| simula~n player~t seq temp id _seq _spell _end max |
|--------------------------------------------------------------------|
1. | 1 1 1 0 1 0 0 0 3 |
2. | 1 1 2 1 1 1 1 0 3 |
3. | 1 1 3 1 1 2 1 0 3 |
4. | 1 1 4 1 1 3 1 1 3 |
5. | 1 1 5 0 1 0 0 0 3 |
|--------------------------------------------------------------------|
6. | 1 1 6 0 1 0 0 0 3 |
7. | 1 2 1 1 2 1 1 1 2 |
8. | 1 2 2 0 2 0 0 0 2 |
9. | 1 2 3 0 2 0 0 0 2 |
10. | 1 2 4 0 2 0 0 0 2 |
|--------------------------------------------------------------------|
11. | 1 2 5 1 2 1 2 0 2 |
12. | 1 2 6 1 2 2 2 1 2 |
+--------------------------------------------------------------------+
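If tsspell cannot be installed, a running counter in long layout gives the same maximum using built-in commands only. This is a sketch, not part of tsspell; the variable name run is chosen here. It relies on replace working through the observations in order, so run[_n-1] already holds the updated count of the previous observation.

```stata
clear
input simulation playerlist temp1 temp2 temp3 temp4 temp5 temp6
1 1 0 1 1 1 0 0
1 2 1 0 0 0 1 1
end

reshape long temp, i(simulation playerlist) j(seq)

* Start each counter at temp (0 or 1), then extend runs of 1s
bysort simulation playerlist (seq): generate run = temp
by simulation playerlist: replace run = run[_n-1] + 1 if temp == 1 & _n > 1

* Longest run per player
egen max = max(run), by(simulation playerlist)
list, sepby(playerlist)
```

This is the counter the question was trying to build with a local macro: it increases by 1 on each consecutive move and resets to 0 otherwise.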
There is a puzzle that asks for an equivalent of bitwise & using only the | and ~ operators.
I've been brute-forcing combinations of | and ~ using 6 (0110) and 5 (0101), trying to get 4 (0100), but I still cannot get the answer.
At most 8 operations may be used.
Can someone please give me hints?
What helps you here is De Morgan's Law, which basically says:
~(a & b) == ~a | ~b
Thus we can just negate this and get:
a & b == ~(~a | ~b) //4 operations
And looking at the truth table (binary logic is simple enough that there are only four possible combinations of inputs to check), we can see that both are equivalent (last two columns):
a | b | ~a | ~b | ~a OR ~b | ~(~a OR ~b) | a AND b
--|---|----|----|----------|-------------|--------
0 | 0 | 1 | 1 | 1 | 0 | 0
1 | 0 | 0 | 1 | 1 | 0 | 0
0 | 1 | 1 | 0 | 1 | 0 | 0
1 | 1 | 0 | 0 | 0 | 1 | 1
Truth table time...
A B A&B !A !B !A|!B !(!A|!B)
0 0 0 1 1 1 0
0 1 0 1 0 1 0
1 0 0 0 1 1 0
1 1 1 0 0 0 1
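Applied to the numbers in the question with 4-bit values: ~6 is 1001 and ~5 is 1010, their OR is 1011, and negating that gives 0100, which is 4. The truth table can also be checked mechanically. The sketch below uses Stata to stay with the language of the other examples here; note that Stata's !, | and & are logical rather than bitwise operators, so this verifies the single-bit case, which is what a bitwise operator applies at each bit position. The variable names direct and demorgan are chosen here.

```stata
clear
input byte(a b)
0 0
0 1
1 0
1 1
end

* a AND b computed directly, and via De Morgan using only NOT and OR
generate byte direct   = a & b
generate byte demorgan = !(!a | !b)

* The two columns agree on all four input combinations
assert direct == demorgan
list
```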