My table has some leading and trailing observations that I am trying to remove. I want to remove the rows that come before every 'begin' event and after every 'end' event for every single group. The table resembles the below:
| Time | Group | Event | Value |
|------|-------|-------|-------|
| 1 | 1 | NA | 0 |
| 2 | 1 | NA | 0 |
| 3 | 1 | Begin | 1.1 |
| 4 | 1 | NA | 1.2 |
| 5 | 1 | NA | 1.3 |
| 6 | 1 | End | 1.4 |
| 7 | 1 | NA | 0 |
| 1 | 2 | NA | 0 |
| 2 | 2 | Begin | 1.1 |
| 3 | 2 | NA | 1.2 |
| 4 | 2 | End | 1.3 |
| 5 | 2 | NA | 1.4 |
On the presumption that the incoming data is already sorted and that there are zero or more serially bounded ranges of Begin to End within each group:
data want;
do until (last.group);
set have;
by group time;
if event = 'Begin' then _keeprow = 1;
if _keeprow then output;
if event = 'End' then _keeprow = 0;
end;
drop _keeprow;
run;
I came up with an easy way, but it is limited by the actual data size.
data have;
input Time Group Event $ Value ;
datalines;
1 1 NA 0
2 1 NA 0
3 1 Begin 1.1
4 1 NA 1.2
5 1 NA 1.3
6 1 End 1.4
7 1 NA 0
1 2 NA 0
2 2 Begin 1.1
3 2 NA 1.2
4 2 End 1.3
5 2 NA 1.4
;
run;
proc sort data = have;
by group time;
run;
data have1;
set have;
count + 1;
by group;
if first.group then count = -100;
if event = 'Begin' then count = 0;
if event = 'End' then count = 100;
if count < 0 or count >100 then delete;
run;
This code only works on small data: each group must have fewer than 100 observations before 'Begin' and fewer than 100 observations between 'Begin' and 'End'. You can adjust the sentinel values (-100 and 100) according to the true data size.
One way to do it is:
data have;
input Time Group Event $ Value ;
datalines;
1 1 NA 0
2 1 NA 0
3 1 Begin 1.1
4 1 NA 1.2
5 1 NA 1.3
6 1 End 1.4
7 1 NA 0
1 2 NA 0
2 2 Begin 1.1
3 2 NA 1.2
4 2 End 1.3
5 2 NA 1.4
;
data have2(keep= Group min_var max_var);
set have;
by group;
retain min_var max_var;
if trim(Event)= "Begin" then min_var =_n_ ;
if trim(Event)= "End" then max_var =_n_;
if last.group;
run;
data want;
merge have have2;
by group;
if _n_ ge min_var and _n_ le max_var ;
drop min_var max_var;
run;
I have table in SAS Enterprise Guide like below:
ID | COL1 | VAL1 |
----|------|------|
111 | A | 10 |
111 | A | 5 |
111 | B | 10 |
222 | B | 20 |
333 | C | 25 |
... | ... | ... |
And I need to aggregate above table to know:
count of each value of COL1 per ID
sum of values from VAL1 per COL1 per ID
So, as a result I need something like below:
ID | COL1_A | COL1_B | COL1_C | COL1_A_VAL1_SUM | COL1_B_VAL1_SUM | COL1_C_VAL1_SUM
----|--------|--------|---------|-----------------|-----------------|------------------
111 | 2 | 1 | 0 | 15 | 10 | 0
222 | 0 | 1 | 0 | 0 | 20 | 0
333 | 0 | 0 | 1 | 0 | 0 | 25
for example because:
COL1_A = 2 for ID 111, because ID=111 has 2 times "A" in COL1
COL1_A_VAL1_SUM = 15 for ID 111, because ID=111 has 10+5=15 in VAL1 for "A" in COL1
How can I do that in SAS Enterprise Guide or in PROC SQL?
First, we'll create the counts that we need by group with SQL:
proc sql;
create table totals_by_group as
select id
, col1
, count(col1) as count_col1
, sum(val1) as sum_val1
from have
group by id, col1
;
quit;
This produces the following table:
id col1 count_col1 sum_val1
111 A 2 15
111 B 1 10
222 B 1 20
333 C 1 25
Now we need to transpose this into the way we want it. We'll do this with two transpose steps: one for count_col1, and one for sum_val1. proc transpose has a few handy options to make this easy, namely the id, prefix, and suffix options.
First, we'll consider our ID variable col1. This creates columns named A, B, and C. For example:
id A B C
111 2 1 .
222 . 1 .
333 . . 1
The prefix and suffix options let us add a prefix and suffix to these names.
proc transpose
data = totals_by_group
out = count_by_group(drop=_NAME_)
prefix = COL1_;
by id;
id col1;
var count_col1;
run;
proc transpose
data = totals_by_group
out = sum_by_group(drop=_NAME_)
prefix = COL1_
suffix = _VAL1_SUM;
by id;
id col1;
var sum_val1;
run;
This gives us two tables:
COUNT_BY_GROUP
id COL1_A COL1_B COL1_C
111 2 1 .
222 . 1 .
333 . . 1
SUM_BY_GROUP
id COL1_A_VAL1_SUM COL1_B_VAL1_SUM COL1_C_VAL1_SUM
111 15 10 .
222 . 20 .
333 . . 25
Now we just need to merge them together, then set all missing values to 0 by iterating over each numeric column and checking if it's missing.
data want;
merge count_by_group
sum_by_group
;
by id;
array numvars[*] _NUMERIC_;
do i = 1 to dim(numvars);
if(missing(numvars[i])) then numvars[i] = 0;
end;
drop i;
run;
Final table:
id COL1_A COL1_B COL1_C COL1_A_VAL1_SUM COL1_B_VAL1_SUM COL1_C_VAL1_SUM
111 2 1 0 15 10 0
222 0 1 0 0 20 0
333 0 0 1 0 0 25
I am trying to recode a variable that indicates total number of responses to a multiple response survey question. Question 4 has options 1, 2, 3, 4, 5, 6, and participants may choose one or more options when submitting a response. The data is currently coded as binary outputs for each option: var Q4___1 = yes or no (1/0), var Q4___2 = yes or no (1/0), and so forth.
This is the tabstat of all yes (1) responses to the 6 Q4___* variables
Variable | Sum
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
I would like to create a new variable that encapsulates these values.
Can someone help me figure out how to create this variable, and if coding a variable in this manner for a multiple option survey question is valid?
When I used the replace command, the total number of responses did not add up, as shown below:
gen q4=.
replace q4 =1 if q4___1 == 1
replace q4 =2 if q4___2 == 1
replace q4 =3 if q4___3 == 1
replace q4 =4 if q4___4 == 1
replace q4 =5 if q4___5 == 1
replace q4 =6 if q4___6 == 1
label values q4 primarysource
q4 | Freq. Percent Cum.
------------+-----------------------------------
1 | 46 48.94 48.94
2 | 31 32.98 81.91
3 | 6 6.38 88.30
4 | 1 1.06 89.36
5 | 3 3.19 92.55
6 | 7 7.45 100.00
------------+-----------------------------------
Total | 94 100.00
UPDATE
To clarify, I am trying to create a new variable that captures the column sum of each question, not the row total across all questions. I know that 63 participants responded yes to question 4 a) and 33 to question 4 b), so I want my new variable to reflect that.
This is what I want my new variable's values to look like.
q4
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
The fallacy here is ignoring the possibility of multiple 1s as answers to the various Q4???? variables. For example if someone answers 1 1 1 1 1 1 to all questions, they appear in your final variable only in respect of their answer to the 6th question. Otherwise put, your code overwrites and so ignores all positive answers before the last positive answer.
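The overwriting is easy to see on a toy dataset. The sketch below uses three invented respondents and only three options (the values are made up for illustration and are not taken from the survey) and reruns the replace chain from the question.

```stata
* Toy data, invented for illustration: three respondents, three options
clear
input q4___1 q4___2 q4___3
1 1 0
1 0 0
0 1 1
end

* The replace chain from the question: later options overwrite earlier ones
generate q4 = .
forvalues j = 1/3 {
    replace q4 = `j' if q4___`j' == 1
}

* Number of yes answers given by each respondent
egen nyes = rowtotal(q4___?)

* q4 keeps only the highest-numbered option each respondent chose
list
```

Here the first respondent chose options 1 and 2 but ends up with q4 equal to 2, so a tabulation of q4 counts respondents (3), not yes answers (5).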
What is likely to be more useful are
(1) the total across all 6 questions which is just
egen Q4_total = rowtotal(Q4????)
where the 4 instances of ? mean that by eye I count 3 underscores and 1 numeral.
(2) a concatenation of responses that is just
egen Q4_concat = concat(Q4????)
(3) a variable that is a concatenation of questions with positive responses, so 246 if those questions were answered 1 and the others were answered 0.
gen Q4_pos = ""
forval j = 1/6 {
replace Q4_pos = Q4_pos + "`j'" if Q4___`j' == 1
}
EDIT
Here is a test script giving concrete examples.
clear
set obs 6
forval j = 1/6 {
gen Q`j' = _n <= `j'
}
list
egen rowtotal = rowtotal(Q?)
su rowtotal, meanonly
di r(sum)
* install from tab_chi on SSC
tabm Q?
Results:
. list
+-----------------------------+
| Q1 Q2 Q3 Q4 Q5 Q6 |
|-----------------------------|
1. | 1 1 1 1 1 1 |
2. | 0 1 1 1 1 1 |
3. | 0 0 1 1 1 1 |
4. | 0 0 0 1 1 1 |
5. | 0 0 0 0 1 1 |
|-----------------------------|
6. | 0 0 0 0 0 1 |
+-----------------------------+
. egen rowtotal = rowtotal(Q?)
. su rowtotal, meanonly
. di r(sum)
21
. tabm Q?
| values
variable | 0 1 | Total
-----------+----------------------+----------
Q1 | 5 1 | 6
Q2 | 4 2 | 6
Q3 | 3 3 | 6
Q4 | 2 4 | 6
Q5 | 1 5 | 6
Q6 | 0 6 | 6
-----------+----------------------+----------
Total | 15 21 | 36
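If the column sums are wanted as a small dataset rather than as displayed output, one possible route is collapse followed by reshape long. This is only a sketch on invented toy data with three options; the names id, option and nyes are placeholders chosen here, and because collapse replaces the data in memory, you would probably want preserve and restore around it in real use.

```stata
* Toy data, invented for illustration
clear
input q4___1 q4___2 q4___3
1 1 0
1 0 0
0 1 1
end

* One observation holding the sum of each indicator
collapse (sum) q4___*

* reshape long needs an identifier, so create a constant one
generate byte id = 1
reshape long q4___, i(id) j(option)
rename q4___ nyes

* One row per option with its count of yes answers
list option nyes, noobs
```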
I am using a dataset with about 100 variables and 1000 rows, similar to the one below:
. var1 var2 var3 var4
AL 10 11 12 13
AK -1 0 0 18
AZ 5 -5 -2 22
VA 15 16 0 0
How can I list the variables / observations that have a negative value?
For example, I would like to list that AK has negative var1 and AZ has negative var2 and var3.
Here's an example of how you can create a marker variable for each of your var variables:
clear
input str2 state var1 var2 var3 var4
AL 10 11 12 13
AK -1 0 0 18
AZ 5 -5 -2 22
VA 15 16 0 0
end
foreach var in var1 var2 var3 var4 {
generate tag_`var' = `var' < 0
}
list
+-------------------------------------------------------------------------------+
| state var1 var2 var3 var4 tag_var1 tag_var2 tag_var3 tag_var4 |
|-------------------------------------------------------------------------------|
1. | AL 10 11 12 13 0 0 0 0 |
2. | AK -1 0 0 18 1 0 0 0 |
3. | AZ 5 -5 -2 22 0 1 1 0 |
4. | VA 15 16 0 0 0 0 0 0 |
+-------------------------------------------------------------------------------+
You can then do the following:
list state var1 if tag_var1 == 1
+--------------+
| state var1 |
|--------------|
2. | AK -1 |
+--------------+
or
list state var* if tag_var1 == 1 | tag_var2 == 1 | tag_var3 == 1 | tag_var4 == 1
+-----------------------------------+
| state var1 var2 var3 var4 |
|-----------------------------------|
2. | AK -1 0 0 18 |
3. | AZ 5 -5 -2 22 |
+-----------------------------------+
If you do not need the extra flexibility of a marker variable you can simply do:
list state var1 if var1 < 0
EDIT:
Alternatively you could do the following:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
list state var obsno value if value < 0, noobs sepby(state)
+------------------------------+
| state var obsno value |
|------------------------------|
| AK var1 2 -1 |
|------------------------------|
| AZ var2 3 -5 |
| AZ var3 3 -2 |
+------------------------------+
restore
There are two other techniques that can be mentioned. One is to calculate the minimum in each observation (row) and then list if and only if that minimum is negative. That way, you get any zeros, positives and missings too in the same observations.
The other is just to loop over the variables and list separately.
clear
input str2 state var1 var2 var3 var4
AL 10 11 12 13
AK -1 0 0 18
AZ 5 -5 -2 22
VA 15 16 0 0
end
egen min = rowmin(var*)
list if min < 0
+-----------------------------------------+
| state var1 var2 var3 var4 min |
|-----------------------------------------|
2. | AK -1 0 0 18 -1 |
3. | AZ 5 -5 -2 22 -5 |
+-----------------------------------------+
foreach v of var var* {
quietly count if `v' < 0
if r(N) list `v' if `v' < 0
}
+------+
| var1 |
|------|
2. | -1 |
+------+
+------+
| var2 |
|------|
3. | -5 |
+------+
+------+
| var3 |
|------|
3. | -2 |
+------+
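With about 100 variables, it may be more convenient to collect the names of the negative variables in one string variable per observation, which gives the "AK has negative var1" style of listing directly. A minimal sketch, assuming the variables all match var* as in the example; negvars is a name made up here.

```stata
clear
input str2 state var1 var2 var3 var4
AL 10 11 12 13
AK -1 0 0 18
AZ 5 -5 -2 22
VA 15 16 0 0
end

* Append the name of each variable that is negative in that observation
generate negvars = ""
foreach v of varlist var* {
    replace negvars = trim(negvars + " `v'") if `v' < 0
}

* Show only the observations with at least one negative value
list state negvars if negvars != "", noobs
```

Missing values are not flagged, because Stata treats numeric missing as larger than any number.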
I have the following purchasing data
clear
input id productid purchase
1 1 1
2 1 1
3 2 1
1 3 1
end
I want to add a row for every id-productid combo to create the following dataset
id productid purchase
1 1 1
2 1 1
3 1 0
1 2 0
2 2 0
3 2 1
1 3 1
2 3 0
3 3 0
end
I have tried a lot that has not worked. This is my latest attempt.
qui sum id, d
local obs = r(N)
expand = `obs'
levelsof productid, local(id)
local j = 1
foreach i of local id {
replace productid = `i' if `j' == id
local j = `j' + 1
}
The fillin command (see help fillin) is the tool for this task.
Starting with your sample data in memory:
fillin id productid
replace purchase = 0 if _fillin
drop _fillin
sort productid id
list, sepby(productid) abbreviate(12)
produces
+---------------------------+
| id productid purchase |
|---------------------------|
1. | 1 1 1 |
2. | 2 1 1 |
3. | 3 1 0 |
|---------------------------|
4. | 1 2 0 |
5. | 2 2 0 |
6. | 3 2 1 |
|---------------------------|
7. | 1 3 1 |
8. | 2 3 0 |
9. | 3 3 0 |
+---------------------------+
Observations in my dataset are players, and the binary variables temp1, temp2, and so on are equal to 1 if the player made a move, and equal to zero otherwise.
I would like to calculate the maximum number of consecutive moves per player.
+------------+------------+-------+-------+-------+-------+-------+-------+
| simulation | playerlist | temp1 | temp2 | temp3 | temp4 | temp5 | temp6 |
+------------+------------+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 1 |
+------------+------------+-------+-------+-------+-------+-------+-------+
My idea was to generate auxiliary variables in a loop, which would count consecutive duplicates and then apply egen, rowmax():
+------------+------------+------+------+------+------+------+------+------+
| simulation | playerlist | aux1 | aux2 | aux3 | aux4 | aux5 | aux6 | _max |
+------------+------------+------+------+------+------+------+------+------+
| 1 | 1 | 0 | 1 | 2 | 3 | 0 | 0 | 3 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 2 | 2 |
+------------+------------+------+------+------+------+------+------+------+
I am struggling with introducing a local counter variable that would be incrementally increased by 1 if consecutive move is made, and would be reset to zero otherwise (the code below keeps auxiliary variables fixed..):
quietly forval i = 1/42 { /*42 is max number of variables temp*/
local j = 1
gen aux`i'=.
local j = `j'+1
replace aux`i'= `j' if temp`i'!=0
}
Tactical answer
You can concatenate your temp* variables into a single string and look for the longest substring of 1s.
egen history = concat(temp*)
gen max = 0
quietly forval j = 1/6 {
replace max = `j' if strpos(history, substr("111111", 1, `j'))
}
If the number is much more than 6, use something like
local lookfor : di _dup(42) "1"
quietly forval j = 1/42 {
replace max = `j' if strpos(history, substr("`lookfor'", 1, `j'))
}
Compare also http://www.stata-journal.com/article.html?article=dm0056
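The auxiliary-variable idea from the question can also be made to work. A local macro holds a single value for the whole dataset, not one value per observation, so the running count has to live in the aux variables themselves. A sketch on the example data, assuming six temp variables (change 6 to 42 for the full data); maxrun is a name chosen here.

```stata
clear
input simulation playerlist temp1 temp2 temp3 temp4 temp5 temp6
1 1 0 1 1 1 0 0
1 2 1 0 0 0 1 1
end

* aux`i' is the length of the run of 1s ending at position i
generate aux1 = temp1
forvalues i = 2/6 {
    local prev = `i' - 1
    generate aux`i' = cond(temp`i' == 1, aux`prev' + 1, 0)
}

* The longest run is the largest running count in each row
egen maxrun = rowmax(aux*)
list playerlist aux* maxrun
```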
Strategic answer
Storing a sequence rowwise is working against the grain so far as Stata is concerned. Much more flexibility is available if you reshape long and tsset your data as panel data. Note that the code here uses tsspell which must be installed from SSC using ssc inst tsspell.
tsspell is dedicated to identifying spells or runs in which some condition remains true. Here the condition is that a variable is 1 and since the only other allowed value is 0 that is equivalent to a variable being positive. tsspell creates three variables, giving spell identifier, sequence within spell and whether the spell is ending. Here the maximum length of spell is just the maximum sequence number for each player.
. input simulation playerlist temp1 temp2 temp3 temp4 temp5 temp6
simulat~n playerl~t temp1 temp2 temp3 temp4 temp5 temp6
1. 1 1 0 1 1 1 0 0
2. 1 2 1 0 0 0 1 1
3. end
. reshape long temp , i(sim playerlist) j(seq)
(note: j = 1 2 3 4 5 6)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 2 -> 12
Number of variables 8 -> 4
j variable (6 values) -> seq
xij variables:
temp1 temp2 ... temp6 -> temp
-----------------------------------------------------------------------------
. egen id = group(sim playerlist)
. tsset id seq
panel variable: id (strongly balanced)
time variable: seq, 1 to 6
delta: 1 unit
. tsspell, p(temp)
. egen max = max(_seq), by(id)
. l
+--------------------------------------------------------------------+
| simula~n player~t seq temp id _seq _spell _end max |
|--------------------------------------------------------------------|
1. | 1 1 1 0 1 0 0 0 3 |
2. | 1 1 2 1 1 1 1 0 3 |
3. | 1 1 3 1 1 2 1 0 3 |
4. | 1 1 4 1 1 3 1 1 3 |
5. | 1 1 5 0 1 0 0 0 3 |
|--------------------------------------------------------------------|
6. | 1 1 6 0 1 0 0 0 3 |
7. | 1 2 1 1 2 1 1 1 2 |
8. | 1 2 2 0 2 0 0 0 2 |
9. | 1 2 3 0 2 0 0 0 2 |
10. | 1 2 4 0 2 0 0 0 2 |
|--------------------------------------------------------------------|
11. | 1 2 5 1 2 1 2 0 2 |
12. | 1 2 6 1 2 2 2 1 2 |
+--------------------------------------------------------------------+
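If installing tsspell is not an option, the same run lengths can be built in the long layout with by: and subscripts. This is a sketch rather than a replacement for tsspell; it relies on replace working through the observations in order, so that run[_n-1] is already updated when the next observation is processed. The names run and maxrun are chosen here.

```stata
clear
input simulation playerlist temp1 temp2 temp3 temp4 temp5 temp6
1 1 0 1 1 1 0 0
1 2 1 0 0 0 1 1
end

reshape long temp, i(simulation playerlist) j(seq)

* Running length of the current run of 1s within each player
bysort simulation playerlist (seq): generate run = temp if _n == 1
by simulation playerlist: replace run = cond(temp == 1, run[_n-1] + 1, 0) if _n > 1

* Longest run per player
egen maxrun = max(run), by(simulation playerlist)
list, sepby(playerlist)
```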