Identify group with two variables - stata

Suppose I have the following data in Stata:
clear
input id tna ret str2 name
1 2 3 "X"
1 3 2 "X"
1 5 3 "X"
1 6 -1 "X"
2 4 2 "X"
2 6 -1 "X"
2 8 -2 "X"
2 9 3 "P"
2 11 -2 "P"
3 3 1 "Y"
3 4 0 "Y"
3 6 -1 "Y"
3 8 1 "Z"
3 6 1 "Z"
end
I want to make an ID for new groups. These new groups should incorporate the observations with the same name (for example X), but should also incorporate all the observations of the same ID if the name started in that ID. For example:
X is in the data set under two IDs: 1 and 2. The group of X should incorporate all the observations with the name X, but also the two observations of the name P (since X started in ID 2 and the two observations with value P belong to group X)
Y started in ID 3, so the group should incorporate every observation with ID 3.

This is a tricky problem to solve because it may take several pass to completely stabilize identifiers. Fortunately, you can use group_id (from SSC) to solve this. To install group_id, type in Stata's Command window:
ssc install group_id
Here's a more complicated data example where "P" also appears in ID == 4 and that ID also contains "A" as a name:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id tna ret) str2 name
1 2 3 "X"
1 3 2 "X"
1 5 3 "X"
1 6 -1 "X"
2 4 2 "X"
2 6 -1 "X"
2 8 -2 "X"
2 9 3 "P"
2 11 -2 "P"
3 3 1 "Y"
3 4 0 "Y"
3 6 -1 "Y"
3 8 1 "Z"
3 6 1 "Z"
4 9 3 "P"
4 11 -2 "P"
4 12 0 "A"
end
clonevar newid = id
group_id newid, match(name)

I am not sure that I understand the definitions here (e.g. tna and ret are not explained; conversely, omit them from a question if they are irrelevant; does "start" imply a process in time?), but why not copy first values of name within each id, and then classify on first names? (With your example data, results are the same.)
clear
input id tna ret str2 name
1 2 3 "X"
1 3 2 "X"
1 5 3 "X"
1 6 -1 "X"
2 4 2 "X"
2 6 -1 "X"
2 8 -2 "X"
2 9 3 "P"
2 11 -2 "P"
3 3 1 "Y"
3 4 0 "Y"
3 6 -1 "Y"
3 8 1 "Z"
3 6 1 "Z"
end
sort id, stable
by id: gen first = name[1]
egen group = group(first), label
list, sepby(group)
+---------------------------------------+
| id tna ret name first group |
|---------------------------------------|
1. | 1 2 3 X X X |
2. | 1 3 2 X X X |
3. | 1 5 3 X X X |
4. | 1 6 -1 X X X |
5. | 2 4 2 X X X |
6. | 2 6 -1 X X X |
7. | 2 8 -2 X X X |
8. | 2 9 3 P X X |
9. | 2 11 -2 P X X |
|---------------------------------------|
10. | 3 3 1 Y Y Y |
11. | 3 4 0 Y Y Y |
12. | 3 6 -1 Y Y Y |
13. | 3 8 1 Z Y Y |
14. | 3 6 1 Z Y Y |
+---------------------------------------+

Related

Generating variable by groups taking values of certain observations

I have a dataset with only variable values:
clear
input value new_var
1 1
3 3
5 5
30 1
40 3
50 5
11 1
12 3
13 5
end
How can I generate a new variable new_var containing a repeating sequence of the first three observations in value?
Many ways to do it: here are two:
clear
input value new_var
1 1
3 3
5 5
30 1
40 3
50 5
11 1
12 3
13 5
end
egen index = seq(), to(3)
generate wanted = value[index]
generate direct = cond(mod(_n, 3) == 1, 1, cond(mod(_n, 3) == 2, 3, 5))
list, sep(3)
+-------------------------------------------+
| value new_var index wanted direct |
|-------------------------------------------|
1. | 1 1 1 1 1 |
2. | 3 3 2 3 3 |
3. | 5 5 3 5 5 |
|-------------------------------------------|
4. | 30 1 1 1 1 |
5. | 40 3 2 3 3 |
6. | 50 5 3 5 5 |
|-------------------------------------------|
7. | 11 1 1 1 1 |
8. | 12 3 2 3 3 |
9. | 13 5 3 5 5 |
+-------------------------------------------+

Adding observations between rows

I would like to create new observations as follows:
A B C
1 1 1
1 2 2
1 3 4
1 4 5
1 5 2
2 1 1
2 2 5
2 3 3
2 4 3
*3* 1 .
*3* 2 .
*3* 3 .
*3* 4 .
*3* 5 .
4 1 4
4 2 3
4 3 1
The new lines are indicated by asterisks.
How can I create new observations for variable A and B?
This is a simple expand:
clear
input A B C
1 1 1
1 2 2
1 3 4
1 4 5
1 5 2
2 1 1
2 2 5
2 3 3
2 4 3
4 1 4
4 2 3
4 3 1
end
generate id = _n
expand 6 if id == 10
replace id = 11 if _n == _N
replace A = 3 if id == 10
replace C = . if id == 10
bysort id: replace B = cond(_n == 1, 1, B[_n-1]+1) if id == 10
Which will produce the desired output:
list, sepby(A)
+----------------+
| A B C id |
|----------------|
1. | 1 1 1 1 |
2. | 1 2 2 2 |
3. | 1 3 4 3 |
4. | 1 4 5 4 |
5. | 1 5 2 5 |
|----------------|
6. | 2 1 1 6 |
7. | 2 2 5 7 |
8. | 2 3 3 8 |
9. | 2 4 3 9 |
|----------------|
10. | 3 1 . 10 |
11. | 3 2 . 10 |
12. | 3 3 . 10 |
13. | 3 4 . 10 |
14. | 3 5 . 10 |
|----------------|
15. | 4 1 4 11 |
16. | 4 2 3 11 |
17. | 4 3 1 12 |
+----------------+
The code could be shorter.
expand 2 if _n < 6
replace A = 3 if _n > _N - 5
*replace B = _n + 5 - _N if A == 3
replace C = . if A == 3
sort A B

Calculate average of all distinct values within group

I have two columns with data.
One has labels for a group and a second displays values for items in each group. I would like to calculate for each group, the average of only those values that are distinct.
How can I do this in Stata?
EDIT:
See my dataset and desired result below:
Group_label Value
x 12
x 12
x 2
x 1
y 5
y 5
y 5
y 2
y 2
I want to generate the following average:
Group_label Value Average
x 12 5
x 12 5
x 2 5
x 1 5
y 5 3.5
y 5 3.5
y 5 3.5
y 2 3.5
y 2 3.5
So the average for x = (12 + 2 + 1) / 3 and for y = (5 + 2) / 2
I have tried the egen(mean) command but it selects all values for each group label.
I only want to select the distinct values.
This is a two-step solution. You first need to tag distinct values using tag() within egen. Then you use mean() within egen.
The most delicate point is that something like ... if tag will leave missing values in the result for observations not selected. How can you omit duplicated values from the calculation yet also spread the result to their observations? See Section 9 of this paper for the use of cond() together with mean() which is one way to do it, exemplified in the code, and perhaps the most transparent way too. See Section 10 of the same paper for another method, which amuses some people.
For a fairly detailed review of distinct observations, see https://www.stata-journal.com/sjpdf.html?articlenum=dm0042
clear
input str1 Group_label Value
x 12
x 12
x 2
x 1
y 5
y 5
y 5
y 2
y 2
end
egen tag = tag(Group_label Value)
egen mean = mean(cond(tag, Value, .)), by(Group_label)
list, sepby(Group_label)
+-------------------------------+
| Group_~l Value tag mean |
|-------------------------------|
1. | x 12 1 5 |
2. | x 12 0 5 |
3. | x 2 1 5 |
4. | x 1 1 5 |
|-------------------------------|
5. | y 5 1 3.5 |
6. | y 5 0 3.5 |
7. | y 5 0 3.5 |
8. | y 2 1 3.5 |
9. | y 2 0 3.5 |
+-------------------------------+
The following works for me:
clear
input str1 vlab val
"x" 12
"x" 12
"x" 2
"x" 1
"y" 5
"y" 5
"y" 5
"y" 2
"y" 2
end
bysort vlab: generate tag = val != val[_n-1]
bysort vlab: egen mean_val = mean(val) if tag == 1
list
+-----------------------------+
| vlab val tag mean_val |
|-----------------------------|
1. | x 12 1 5 |
2. | x 12 0 . |
3. | x 2 1 5 |
4. | x 1 1 5 |
5. | y 5 1 3.5 |
|-----------------------------|
6. | y 5 0 . |
7. | y 5 0 . |
8. | y 2 1 3.5 |
9. | y 2 0 . |
+-----------------------------+
EDIT:
If you also do:
bysort vlab: replace mean_val = mean_val[_n-1] if mean_val == .
You will get:
list
+-----------------------------+
| vlab val tag mean_val |
|-----------------------------|
1. | x 12 1 5 |
2. | x 12 0 5 |
3. | x 2 1 5 |
4. | x 1 1 5 |
5. | y 5 1 3.5 |
|-----------------------------|
6. | y 5 0 3.5 |
7. | y 5 0 3.5 |
8. | y 2 1 3.5 |
9. | y 2 0 3.5 |
+-----------------------------+

Combine overlapping categorical variables

I am trying to "combine" two categorical variables in Stata (say var1 and var2) into a new (also categorical) variable (say res).
The example below illustrates what I am trying to achieve:
var1 var2 res
1 1 A
1 2 A
2 1 A
3 3 B
4 2 A
5 4 D
What this example does is to combine all categories of var1 and var2 that "overlap".
Here, the pair var1 == 1 and var2 == 1 initially form a group (res== A). All other pairs containing var1 == 1 or var2 == 1 should belong to the same group (hence res== A in rows 2 and 3). Because in row 2 we have var2==2, any pair with containing var2==2 should belong to the same group. That's why in row 4 res== A.
Another way to look at this problem is using the following matrix:
| 1 2 3 4
-----------------------
1 | 1 1
2 | 1
3 | 1
4 | 1
5 | 1
Because the element [1,1] is not empty (or zero), all elements in row 1 and column 1 must belong to the same group. Because [1,2] is not empty, the same is true for row 1, column 2. And so on and so forth. It does not matter which row/column you decide to start from.
egen group alone doesn't cut it.
Any ideas?
Sounds like you want to further group var1 if the values of var2 are the same. If that's the case, then you can use a program I wrote called group_id that's available from SSC. To install it, type in Stata's Command window:
ssc install group_id
Here's an example of how you would use it:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(var1 var2) str1 res
1 1 "A"
1 2 "A"
2 1 "A"
3 3 "B"
4 2 "A"
5 4 "D"
end
gen long wanted = var1
group_id wanted, matchby(var2)
list, sep(0)
and the results:
. list, sep(0)
+----------------------------+
| var1 var2 res wanted |
|----------------------------|
1. | 1 1 A 1 |
2. | 1 2 A 1 |
3. | 2 1 A 1 |
4. | 3 3 B 3 |
5. | 4 2 A 1 |
6. | 5 4 D 5 |
+----------------------------+

Check whether a range contains specific values

Var1 is given. Var2 should take value 1 if the Observation or one of the previous 5 observations is a missing value or 0. What is the Syntax for Var2?
I know how to do it with a lot of if Statements. But when I need to do it for the previous 50 observations that gets too inconvenient.
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end
The question is similar to your previous --Finding the second smallest value -- which you should quote. So is this answer. rangestat is from SSC.
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end
gen long id = _n
gen Bad = inlist(Var1, 0, .)
rangestat (sum) Bad, int(id -5 0)
list, sepby(Bad_sum)
+----------------------------------+
| Var1 Var2 id Bad Bad_sum |
|----------------------------------|
1. | 5 0 1 0 0 |
|----------------------------------|
2. | . 1 2 1 1 |
3. | 2 1 3 0 1 |
4. | 5 1 4 0 1 |
5. | 7 1 5 0 1 |
6. | 9 1 6 0 1 |
7. | 5 1 7 0 1 |
|----------------------------------|
8. | 9 0 8 0 0 |
|----------------------------------|
9. | 0 1 9 1 1 |
10. | 2 1 10 0 1 |
11. | 7 1 11 0 1 |
12. | 5 1 12 0 1 |
13. | 3 1 13 0 1 |
14. | 2 1 14 0 1 |
|----------------------------------|
15. | 5 0 15 0 0 |
+----------------------------------+