Calculate average of all distinct values within group - stata

I have two columns with data.
One has labels for a group and a second displays values for items in each group. I would like to calculate for each group, the average of only those values that are distinct.
How can I do this in Stata?
EDIT:
See my dataset and desired result below:
Group_label Value
x 12
x 12
x 2
x 1
y 5
y 5
y 5
y 2
y 2
I want to generate the following average:
Group_label Value Average
x 12 5
x 12 5
x 2 5
x 1 5
y 5 3.5
y 5 3.5
y 5 3.5
y 2 3.5
y 2 3.5
So the average for x = (12 + 2 + 1) / 3 and for y = (5 + 2) / 2
I have tried the egen(mean) command but it selects all values for each group label.
I only want to select the distinct values.

This is a two-step solution. You first need to tag distinct values using tag() within egen. Then you use mean() within egen.
The most delicate point is that something like ... if tag will leave missing values in the result for observations not selected. How can you omit duplicated values from the calculation yet also spread the result to their observations? See Section 9 of this paper for the use of cond() together with mean() which is one way to do it, exemplified in the code, and perhaps the most transparent way too. See Section 10 of the same paper for another method, which amuses some people.
For a fairly detailed review of distinct observations, see https://www.stata-journal.com/sjpdf.html?articlenum=dm0042
clear
input str1 Group_label Value
x 12
x 12
x 2
x 1
y 5
y 5
y 5
y 2
y 2
end
egen tag = tag(Group_label Value)
egen mean = mean(cond(tag, Value, .)), by(Group_label)
list, sepby(Group_label)
+-------------------------------+
| Group_~l Value tag mean |
|-------------------------------|
1. | x 12 1 5 |
2. | x 12 0 5 |
3. | x 2 1 5 |
4. | x 1 1 5 |
|-------------------------------|
5. | y 5 1 3.5 |
6. | y 5 0 3.5 |
7. | y 5 0 3.5 |
8. | y 2 1 3.5 |
9. | y 2 0 3.5 |
+-------------------------------+

The following works for me:
clear
input str1 vlab val
"x" 12
"x" 12
"x" 2
"x" 1
"y" 5
"y" 5
"y" 5
"y" 2
"y" 2
end
bysort vlab: generate tag = val != val[_n-1]
bysort vlab: egen mean_val = mean(val) if tag == 1
list
+-----------------------------+
| vlab val tag mean_val |
|-----------------------------|
1. | x 12 1 5 |
2. | x 12 0 . |
3. | x 2 1 5 |
4. | x 1 1 5 |
5. | y 5 1 3.5 |
|-----------------------------|
6. | y 5 0 . |
7. | y 5 0 . |
8. | y 2 1 3.5 |
9. | y 2 0 . |
+-----------------------------+
EDIT:
If you also do:
bysort vlab: replace mean_val = mean_val[_n-1] if mean_val == .
You will get:
list
+-----------------------------+
| vlab val tag mean_val |
|-----------------------------|
1. | x 12 1 5 |
2. | x 12 0 5 |
3. | x 2 1 5 |
4. | x 1 1 5 |
5. | y 5 1 3.5 |
|-----------------------------|
6. | y 5 0 3.5 |
7. | y 5 0 3.5 |
8. | y 2 1 3.5 |
9. | y 2 0 3.5 |
+-----------------------------+

Related

How to recode separate variables from a multiple response survey question into one variable

I am trying to recode a variable that indicates total number of responses to a multiple response survey question. Question 4 has options 1, 2, 3, 4, 5, 6, and participants may choose one or more options when submitting a response. The data is currently coded as binary outputs for each option: var Q4___1 = yes or no (1/0), var Q4___2 = yes or no (1/0), and so forth.
This is the tabstat of all yes (1) responses to the 6 Q4___* variables
Variable | Sum
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
I would like to create a new variable that encapsulates these values.
Can someone help me figure out how to create this variable, and if coding a variable in this manner for a multiple option survey question is valid?
When I used the replace command the total number of responses were not adding up, as shown below
gen q4=.
replace q4 =1 if q4___1 == 1
replace q4 =2 if q4___2 == 1
replace q4 =3 if q4___3 == 1
replace q4 =4 if q4___4 == 1
replace q4 =5 if q4___5 == 1
replace q4 =6 if q4___6 == 1
label values q4 primarysource`
q4 | Freq. Percent Cum.
------------+-----------------------------------
1 | 46 48.94 48.94
2 | 31 32.98 81.91
3 | 6 6.38 88.30
4 | 1 1.06 89.36
5 | 3 3.19 92.55
6 | 7 7.45 100.00
------------+-----------------------------------
Total | 94 100.00
UPDATE
to specify I am trying to create a new variable that captures the column sum of each question, not the rowtotal across all questions. I know that 63 participants responded yes to question 4 a) and 33 to question 4 b) so I want my new variable to reflect that.
This is what I want my new variable's values to look like.
q4
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
The fallacy here is ignoring the possibility of multiple 1s as answers to the various Q4???? variables. For example if someone answers 1 1 1 1 1 1 to all questions, they appear in your final variable only in respect of their answer to the 6th question. Otherwise put, your code overwrites and so ignores all positive answers before the last positive answer.
What is likely to be more useful are
(1) the total across all 6 questions which is just
egen Q4_total = rowtotal(Q4????)
where the 4 instances of ? mean that by eye I count 3 underscores and 1 numeral.
(2) a concatenation of responses that is just
egen Q4_concat = concat(Q4????)
(3) a variable that is a concatenation of questions with positive responses, so 246 if those questions were answered 1 and the others were answered 0.
gen Q4_pos = ""
forval j = 1/6 {
replace Q4_pos = Q4_pos + "`j'" if Q4____`j' == 1
}
EDIT
Here is a test script giving concrete examples.
clear
set obs 6
forval j = 1/6 {
gen Q`j' = _n <= `j'
}
list
egen rowtotal = rowtotal(Q?)
su rowtotal, meanonly
di r(sum)
* install from tab_chi on SSC
tabm Q?
Results:
. list
+-----------------------------+
| Q1 Q2 Q3 Q4 Q5 Q6 |
|-----------------------------|
1. | 1 1 1 1 1 1 |
2. | 0 1 1 1 1 1 |
3. | 0 0 1 1 1 1 |
4. | 0 0 0 1 1 1 |
5. | 0 0 0 0 1 1 |
|-----------------------------|
6. | 0 0 0 0 0 1 |
+-----------------------------+
. egen rowtotal = rowtotal(Q?)
. su rowtotal, meanonly
. di r(sum)
21
. tabm Q?
| values
variable | 0 1 | Total
-----------+----------------------+----------
Q1 | 5 1 | 6
Q2 | 4 2 | 6
Q3 | 3 3 | 6
Q4 | 2 4 | 6
Q5 | 1 5 | 6
Q6 | 0 6 | 6
-----------+----------------------+----------
Total | 15 21 | 36

How can I compare values of adjacent observations?

I have two datasets that I have appended together in Stata.
There is one variable, say Age in both data sets. I sorted the data so that the ages are in ascending order. I want to delete the observations in each dataset where the corresponding ages don't match.
Dataset 1:
Obs Age
1 7
2 8
3 10
4 5
Dataset 2:
Obs Age
1 10
2 5
3 9
4 7
Combined and sorted in ascending order:
Obs Age
1 5
2 5
3 7
4 7
5 8
6 9
7 10
8 10
So because the ages when sorted don't match up for observations 5 and 6, I want to delete them. Essentially I want a way to loop through pairs of adjacent numbers and compare their values so that I'm only left with pairs with the same ages.
Looping over observations is inefficient and in the vast majority of cases not necessary.
The following works for me:
clear
input age
5
5
7
7
8
9
10
10
end
generate tag = age != age[_n+1] & age != age[_n-1]
list
+-----------+
| age tag |
|-----------|
1. | 5 0 |
2. | 5 0 |
3. | 7 0 |
4. | 7 0 |
5. | 8 1 |
|-----------|
6. | 9 1 |
7. | 10 0 |
8. | 10 0 |
+-----------+
After getting rid of the relevant observations you get the desired result:
keep if tag == 0
list
+-----------+
| age tag |
|-----------|
1. | 5 0 |
2. | 5 0 |
3. | 7 0 |
4. | 7 0 |
5. | 10 0 |
|-----------|
6. | 10 0 |
+-----------+

Fill with values from an earlier time point - Stata

I am trying to generate a variable that is filled using a sequence of values starting at time==1.
The sequence changes everytime the variable rest1w changes from 0 to 1 or vice versa.
Firstly, I think I need to generate x, that is where the sequence restarts (see below example dataset). In my example, this is uniform, but in my full dataset the change varies (i.e. it does not change at every 5th observation).
list time restload trainload rest1w x in 1/15
+-----------------------------------------+
| time restload trainload rest1w x |
|-----------------------------------------|
1. | 1 .1994715 .4780615 0 1 |
2. | 2 .2077734 .471063 0 2 |
3. | 3 .2157595 .4641159 0 3 |
4. | 4 .2234298 .4572202 0 4 |
5. | 5 .2307843 .4503757 0 5 |
|-----------------------------------------|
6. | 6 .2378229 .4435827 1 1 |
7. | 7 .2445457 .436841 1 2 |
8. | 8 .2509527 .4301506 1 3 |
9. | 9 .2570438 .4235116 1 4 |
10. | 10 .2628191 .4169239 1 5 |
|-----------------------------------------|
11. | 11 .2682785 .4103876 0 1 |
12. | 12 .2734221 .4039026 0 2 |
13. | 13 .2782499 .397469 0 3 |
14. | 14 .2827618 .3910867 0 4 |
15. | 15 .2869579 .3847558 0 5 |
+-----------------------------------------+
Secondly, I need to generate a variable load. Which as per below shows how I would like to restart from time==1 everytime the sequence restarts. That is, at the second sequence where rest1w==0, load!=trainload.
The rule is that for each new sequence of 0's the value for load again goes back to the start of time (where time==1). This is demonstrated by the load values in the second sequence of 0's being exactly the same as the first sequence. In other words, where time==1, trainload==.478 then load==.478; BUT where time==11, then load==.478 (the clock essentially restarts for load so time==1) and in sequence where time==15, load==.450 (the same load as for where time==5). This is why I wanted to generate x, as I think I could just use that as my new time variable.
+-----------------------------------------+
| time restload trainload rest1w x load
|-----------------------------------------
1. | 1 .1994715 .4780615 0 1 .4780615
2. | 2 .2077734 .471063 0 2 .471063
3. | 3 .2157595 .4641159 0 3 .4641159
4. | 4 .2234298 .4572202 0 4 .4572202
5. | 5 .2307843 .4503757 0 5 .4503757
|-----------------------------------------
6. | 6 .2378229 .4435827 1 1 .1994715
7. | 7 .2445457 .436841 1 2 .2077734
8. | 8 .2509527 .4301506 1 3 .2157595
9. | 9 .2570438 .4235116 1 4 .2234298
10. | 10 .2628191 .4169239 1 5 .2307843
|-----------------------------------------
11. | 11 .2682785 .4103876 0 1 .4780615
12. | 12 .2734221 .4039026 0 2 .471063
13. | 13 .2782499 .397469 0 3 .4641159
14. | 14 .2827618 .3910867 0 4 .4572202
15. | 15 .2869579 .3847558 0 5 .4503757
+-----------------------------------------+
The below code only gives me an entry for where _n==1:
gen load==.
replace load = restload[_n==1] if rest1w==1
And I like the use of levelsof but haven't been able to get it to work (although it might work once I have generated x, but when using time it doesn't restart the sequence obviously).
gen load=.
levelsof x, local(levels)
foreach l of local levels {
replace load=trainload if rest1w==0
replace load=restload if rest1w==1
}
Thanks for any help!
I ended up cross-posting this on statalist.org and got two workable answers.
http://www.statalist.org/forums/forum/general-stata-discussion/general/1355917-fill-with-values-from-an-earlier-time-point
These were:
gen newtime = 1 if rest1w[_n - 1] != rest1w
replace newtime = newtime[_n - 1] + 1 if newtime == .
gen newload = cond(rest1w == 0, trainload[newtime], restload[newtime])
and...
gen newtime = 1
replace newtime = newtime[_n-1] + 1 if rest1w == rest1w[_n-1]
gen newload = .
replace newload = restload[newtime] if rest1w == 1
replace newload = trainload[newtime] if rest1w == 0

get new data from old one SAS

I'm new in SAS
and I have this example :
proc iml;
x={1 2 3 4 5 6 7 8 9};
y={2,3,5,4,8,6,4,2,2};
z={1,1,1,1,2,2,2,2,2};
data=t(x)||y||z;
print data;
run;
quit;
data
1 2 1
2 3 1
3 5 1
4 4 1
5 8 2
6 6 2
7 4 2
8 2 2
9 2 2
How can I creat new data with only Z=1 and only Z=2 ?
Thank you.
You could use the loc function to subset your data matrix. The following is the description of the function, snipped from Indexing matrices in Introduction to SAS/IML.
the LOC function is often very useful for subsetting vectors and matrices. This function is used for locating elements which meet a given condition. The positions of the elements are returned in row-major order. For vectors, this is simply the position of the element. For matrices, some manipulation is often required in order to use the result of the LOC function as an index. The syntax of the function is:
matrix2=LOC(matrix1=value);
Applied to your example:
proc iml;
x={1 2 3 4 5 6 7 8 9};
y={2,3,5,4,8,6,4,2,2};
z={1,1,1,1,2,2,2,2,2};
data=t(x)||y||z;
print data;
z1rows=loc(data[,3]= 1);
z1=data[z1rows,];
print z1;
z2rows=loc(data[,3]= 2);
z2=data[z2rows,];
print z2;
run;
quit;
The result for print z1;
+------------+
| z1 |
+---+----+---+
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 5 | 1 |
| 4 | 4 | 1 |
+---+----+---+
The result for print z2;
+------------+
| z2 |
+---+----+---+
| 5 | 8 | 2 |
| 6 | 6 | 2 |
| 7 | 4 | 2 |
| 8 | 2 | 2 |
| 9 | 2 | 2 |
+---+----+---+

Stata: Maximum number of consecutive occurrences of the same value across variables

Observations in my dataset are players, and binary variables temp1 up are equal to 1 if the player made a move, and equal to zero otherwise.
I would like to to calculate the maximum number of consecutive moves per player.
+------------+------------+-------+-------+-------+-------+-------+-------+
| simulation | playerlist | temp1 | temp2 | temp3 | temp4 | temp5 | temp6 |
+------------+------------+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 1 |
+------------+------------+-------+-------+-------+-------+-------+-------+
My idea was to generate auxiliary variables in a loop, which would count consecutive duplicates and then apply egen, rowmax():
+------------+------------+------+------+------+------+------+------+------+
| simulation | playerlist | aux1 | aux2 | aux3 | aux4 | aux5 | aux6 | _max |
+------------+------------+------+------+------+------+------+------+------+
| 1 | 1 | 0 | 1 | 2 | 3 | 0 | 0 | 3 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 2 | 2 |
+------------+------------+------+------+------+------+------+------+------+
I am struggling with introducing a local counter variable that would be incrementally increased by 1 if consecutive move is made, and would be reset to zero otherwise (the code below keeps auxiliary variables fixed..):
quietly forval i = 1/42 { /*42 is max number of variables temp*/
local j = 1
gen aux`i'=.
local j = `j'+1
replace aux`i'= `j' if temp`i'!=0
}
Tactical answer
You can concatenate your move* variables into a single string and look for the longest substring of 1s.
egen history = concat(move*)
gen max = 0
quietly forval j = 1/6 {
replace max = `j' if strpos(history, substr("111111", 1, `j'))
}
If the number is much more than 6, use something like
 local lookfor : di _dup(42) "1" 
quietly forval j = 1/42 {
replace max = `j' if strpos(history, substr("`lookfor'", 1, `j'))
}
Compare also http://www.stata-journal.com/article.html?article=dm0056
Strategic answer
Storing a sequence rowwise is working against the grain so far as Stata is concerned. Much more flexibility is available if you reshape long and tsset your data as panel data. Note that the code here uses tsspell which must be installed from SSC using ssc inst tsspell.
tsspell is dedicated to identifying spells or runs in which some condition remains true. Here the condition is that a variable is 1 and since the only other allowed value is 0 that is equivalent to a variable being positive. tsspell creates three variables, giving spell identifier, sequence within spell and whether the spell is ending. Here the maximum length of spell is just the maximum sequence number for each game.
. input simulation playerlist temp1 temp2 temp3 temp4 temp5 temp6
simulat~n playerl~t temp1 temp2 temp3 temp4 temp5 temp6
1. 1 1 0 1 1 1 0 0
2. 1 2 1 0 0 0 1 1
3. end
. reshape long temp , i(sim playerlist) j(seq)
(note: j = 1 2 3 4 5 6)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 2 -> 12
Number of variables 8 -> 4
j variable (6 values) -> seq
xij variables:
temp1 temp2 ... temp6 -> temp
-----------------------------------------------------------------------------
. egen id = group(sim playerlist)
. tsset id seq
panel variable: id (strongly balanced)
time variable: seq, 1 to 6
delta: 1 unit
. tsspell, p(temp)
. egen max = max(_seq), by(id)
. l
+--------------------------------------------------------------------+
| simula~n player~t seq temp id _seq _spell _end max |
|--------------------------------------------------------------------|
1. | 1 1 1 0 1 0 0 0 3 |
2. | 1 1 2 1 1 1 1 0 3 |
3. | 1 1 3 1 1 2 1 0 3 |
4. | 1 1 4 1 1 3 1 1 3 |
5. | 1 1 5 0 1 0 0 0 3 |
|--------------------------------------------------------------------|
6. | 1 1 6 0 1 0 0 0 3 |
7. | 1 2 1 1 2 1 1 1 2 |
8. | 1 2 2 0 2 0 0 0 2 |
9. | 1 2 3 0 2 0 0 0 2 |
10. | 1 2 4 0 2 0 0 0 2 |
|--------------------------------------------------------------------|
11. | 1 2 5 1 2 1 2 0 2 |
12. | 1 2 6 1 2 2 2 1 2 |
+--------------------------------------------------------------------+