Identify and delete observations that do not meet conditions in Stata

I need help identifying and removing observations that meet certain conditions. My data looks like this:
ID caseID set Var1 Var2
1 1 1 1 0
1 2 1 2 0
1 3 1 3 1
1 4 2 1 0
1 5 2 2 0
1 6 2 3 1
2 7 3 1 0
2 8 3 2 0
2 9 3 3 1
2 10 4 1 0
2 11 4 2 0
2 12 4 3 0
For every set, I want one observation in which Var2=1 and two observations in which Var2=0. If a set does not meet this condition, I want to delete all of its observations. For example, I would delete set=4 because Var2=0 for all of its observations. How can I do this in Stata?

Consider the following new variables:
egen count1 = total(Var2 == 1), by(set)
egen count0 = total(Var2 == 0), by(set)
egen total = total(Var2), by(set)
A literal reading of your question implies that you want to
keep if count1 == 1 & count0 == 2
But if sets are always of size 3 and no values other than 0 or 1 are possible, then any one of count1 == 1, count0 == 2, or total == 1 suffices as the condition.
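Putting this together, a minimal self-contained sketch under those assumptions (sets of size 3, Var2 strictly 0 or 1):
* with 0/1 values and sets of 3 rows, a valid set has a total of Var2 equal to 1
egen total = total(Var2), by(set)
keep if total == 1
* housekeeping: drop the helper variable
drop total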

Related

Creating variables based on other variables in SAS

I'm looking to create a variable based on this data sample:
Video Subject Pre_post Pre_Post_ID
1 1 0 1
1 2 0 1
1 2 0 1
1 3 0 1
1 3 0 1
2 1 1 1
2 1 1 1
2 2 1 1
2 2 1 1
2 3 1 1
4 1 0 2
4 2 0 2
4 2 0 2
4 3 0 2
4 3 0 2
5 1 1 2
5 1 1 2
5 2 1 2
5 2 1 2
5 3 1 2
The goal is to create an ID that links the pre_post variable to the subject, on the condition that the pre_post_id is the same:
Video Subject Pre_post Pre_Post_ID Subject_P_P_ID
1 1 0 1 1
1 2 0 1 2
1 2 0 1 2
1 3 0 1 3
1 3 0 1 3
2 1 1 1 1
2 1 1 1 1
2 2 1 1 2
2 2 1 1 2
2 3 1 1 3
4 1 0 2 4
4 2 0 2 5
4 2 0 2 5
4 3 0 2 6
4 3 0 2 6
5 1 1 2 4
5 1 1 2 4
5 2 1 2 5
5 2 1 2 5
5 3 1 2 6
Thank you in advance for the help!
You will want to track the pairs (<pre_post_id>,<subject>) as a composite key and increment the Subject_P_P_ID every time a new pair (or key) is encountered.
To simplify the discussion, call the two items in the pair item1 and item2
Here are two ways:
Sort by item1 item2, step through with BY item1 item2, and track the pair count using logic based on the automatic first. variable: pair_id + (first.item2), or
Track pairs as keys of a hash object and assign the new id as <hash>.num_items + 1 when the key lookup fails.
Sort + Data Step + Revert Sort (here item1 corresponds to Pre_Post_ID and item2 to Subject)
proc sort data=have out=have_sorted;
by item1 item2;
run;
data have_sequenced;
set have_sorted;
by item1 item2;
item1_item2_pair_id + (first.item2);
run;
proc sort data=have_sequenced out=want;
by video subject pre_post pre_post_id item1_item2_pair_id;
run;
Hash
data want;
set have;
if _n_=1 then do;
declare hash lookup();
lookup.defineKey('item1', 'item2');
lookup.defineData('item1_item2_pair_id');
lookup.defineDone();
end;
if lookup.find() ne 0 then do;
item1_item2_pair_id = lookup.num_items+1;
lookup.add();
end;
run;

SAS - Split single column into two based value of non-binary ID column

I have data which is as follows:
data have;
length
group 8
replicate $ 1
day 8
observation 8
;
input (_all_) (:);
datalines;
1 A 1 0
1 A 1 5
1 A 1 3
1 A 1 3
1 A 2 7
1 A 2 2
1 A 2 4
1 A 2 2
1 B 1 1
1 B 1 3
1 B 1 8
1 B 1 0
1 B 2 3
1 B 2 8
1 B 2 1
1 B 2 3
1 C 1 1
1 C 1 5
1 C 1 2
1 C 1 7
1 C 2 2
1 C 2 1
1 C 2 4
1 C 2 1
2 A 1 7
2 A 1 5
2 A 1 3
2 A 1 1
2 A 2 0
2 A 2 5
2 A 2 3
2 A 2 0
2 B 1 0
2 B 1 3
2 B 1 4
2 B 1 8
2 B 2 1
2 B 2 3
2 B 2 4
2 B 2 0
2 C 1 0
2 C 1 4
2 C 1 3
2 C 1 1
2 C 2 2
2 C 2 3
2 C 2 0
2 C 2 1
3 A 1 4
3 A 1 5
3 A 1 6
3 A 1 7
3 A 2 3
3 A 2 1
3 A 2 5
3 A 2 2
3 B 1 2
3 B 1 0
3 B 1 2
3 B 1 3
3 B 2 0
3 B 2 6
3 B 2 3
3 B 2 7
3 C 1 7
3 C 1 5
3 C 1 3
3 C 1 1
3 C 2 0
3 C 2 3
3 C 2 2
3 C 2 1
;
run;
I want to split observation into two columns based on day.
Obs group replicate observation_day_1 observation_day_2
1 1 A 0 7
2 1 A 5 2
3 1 A 3 4
4 1 A 3 2
5 1 B 1 3
6 1 B 3 8
7 1 B 8 1
8 1 B 0 3
9 1 C 1 2
10 1 C 5 1
11 1 C 2 4
12 1 C 7 1
13 2 A 7 0
14 2 A 5 5
15 2 A 3 3
16 2 A 1 0
17 2 B 0 1
18 2 B 3 3
19 2 B 4 4
20 2 B 8 0
21 2 C 0 2
22 2 C 4 3
23 2 C 3 0
24 2 C 1 1
25 3 A 4 3
26 3 A 5 1
27 3 A 6 5
28 3 A 7 2
29 3 B 2 0
30 3 B 0 6
31 3 B 2 3
32 3 B 3 7
33 3 C 7 0
34 3 C 5 3
35 3 C 3 2
36 3 C 1 1
The observant SO reader will notice that I have asked essentially the same question previously. However, because of SAS's obsession with "levels" and "by groups", that solution doesn't generalize when the variable used to split the variable of interest isn't binary.
Trying it directly, the following occurs:
proc sort data = have out = sorted;
by
group
replicate
;
run;
proc transpose data = sorted out = test;
by
group
replicate
;
var observation;
id day;
run;
ERROR: The ID value "_1" occurs twice in the same BY group.
I can use the LET option to suppress the errors, but in addition to cluttering up the log, SAS retains only the last observation of each BY group.
proc sort data = have out = sorted;
by
group
replicate
;
run;
proc transpose data = sorted out = test let;
by
group
replicate
;
var observation;
id day;
run;
Obs group replicate _NAME_ _1 _2
1 1 A observation 3 2
2 1 B observation 0 3
3 1 C observation 7 1
4 2 A observation 1 0
5 2 B observation 8 0
6 2 C observation 1 1
7 3 A observation 7 2
8 3 B observation 3 7
9 3 C observation 1 1
I don't doubt there's some kludgy way it could be done, such as splitting each group into a separate data set and then re-merging them. It seems like it should be doable with PROC TRANSPOSE, although how escapes me. Any ideas?
Not sure what you're talking about with "SAS's obsession...", but the issue here is fairly straightforward: you need to tell SAS that the four rows (or however many) are separate, distinct rows. by tells SAS what the row-level ID is, but you're lying to it when you say by group replicate, since there are still multiple rows under that. So you need a unique key. (This would be true in any database-like language; nothing unique to SAS here.)
I would do this: make a day_row field, then sort by that.
data have_id;
set have;
by group replicate day;
if first.day then day_row = 0;
day_row+1;
run;
proc sort data=have_id;
by group replicate day_row;
run;
proc transpose data=have_id out=want(drop=_name_) prefix=observation_day_;
by group replicate day_row;
var observation;
id day;
run;
Your output looks like you don't want to transpose the data but instead just want to split it into DAY1 and DAY2 sets and merge them back together. This simply pairs the multiple readings per BY group in the order they appear, which is what it looks like you did in your example.
data want ;
merge
have(where=(day=1) rename=(observation=day_1))
have(where=(day=2) rename=(observation=day_2))
;
by group replicate;
drop day ;
run;
You can read the source data as many times as needed, once for each value of DAY.
If you think that you might not have the same number of observations per BY group for each DAY, then you should add these statements at the end of the data step:
output;
call missing(of day_:);

Create a dummy variable for the last rows based on on another variable

I would like to create a dummy variable that uses the variable "count" to label the last rows of each id as 1. For example, ID 1 has a count of 3, so its last three rows are flagged, giving the pattern 0,0,1,1,1. Similarly, ID 4, which has a count of 1, will have 0,0,0,1. The IDs have different numbers of rows. The variable "wish" shows what I want to obtain as the final output.
input byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
For future questions, you should provide your failed attempts. This shows that you have done your part, namely, researching your problem.
One way is:
clear
set more off
*----- example data -----
input ///
byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
list, sepby(id)
*----- what you want -----
bysort id: gen wish2 = _n > (_N - count)
list, sepby(id)
I assume your data are already sorted by date within ids.
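If the dates are not already in order within id, a minimal sketch that enforces the order explicitly (assuming date is the string variable from the example, readable with a "DMY" mask):
* convert the string date to a Stata date and sort within id before generating the dummy
gen ddate = daily(date, "DMY")
format ddate %td
bysort id (ddate): gen wish2 = _n > (_N - count)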
One way to accomplish this is to use within-group row numbers, created with 'bysort'-type logic:
***Create variable of within-group row numbers.
bysort id: gen obsnum = _n
***Calculate total number of rows within each group.
by id: egen max_obsnum = max(obsnum)
***Subtract the count variable from the group row count.
***This is the number of rows where we want the dummy to equal zero.
gen max_obsnum_less_count = max_obsnum - count
***Create the dummy to equal one when the row number is
***greater than this last variable.
gen dummy = (obsnum > max_obsnum_less_count)
***Clean up.
drop obsnum max_obsnum max_obsnum_less_count

If current week has missing value, how to replace it with the value from previous week?

I have a dataset that shows how much was paid ("cenoz" - cents per ounce) per product category during specific week and in a specific store.
clear
set more off
input week store cenoz category
1 1 2 1
1 1 4 2
1 1 3 3
1 2 5 1
1 2 7 2
1 2 8 3
2 1 4 1
2 1 1 2
2 1 10 3
2 2 3 1
2 2 4 2
2 2 7 3
3 1 5 1
3 1 3 2
3 2 5 1
3 2 4 2
end
I create a new variable cenoz3 that indicates how much, on average, was paid for category 3 in a given week and store. The same goes for cenoz1 and cenoz2.
egen cenoz1 = mean(cenoz/ (category == 1)), by(week store)
egen cenoz2 = mean(cenoz/ (category == 2)), by(week store)
egen cenoz3 = mean(cenoz/ (category == 3)), by(week store)
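The division trick works because cenoz/ (category == 1) is missing whenever the category test is false (division by zero), so only the matching rows enter each mean. An equivalent, perhaps more explicit, formulation uses cond() (cenoz1_alt is just an illustrative name):
* same result: use cenoz where the category matches, missing otherwise
egen cenoz1_alt = mean(cond(category == 1, cenoz, .)), by(week store)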
It turns out that category 3 was not sold in any of the stores (1 and 2) in week 3. As a result, missing values are generated.
week store cenoz category cenoz1 cenoz2 cenoz3
1 1 2 1 2 4 3
1 1 4 2 2 4 3
1 1 3 3 2 4 3
1 2 5 1 5 7 8
1 2 7 2 5 7 8
1 2 8 3 5 7 8
2 1 4 1 4 1 10
2 1 1 2 4 1 10
2 1 10 3 4 1 10
2 2 3 1 3 4 7
2 2 4 2 3 4 7
2 2 7 3 3 4 7
3 1 5 1 5 3 .
3 1 3 2 5 3 .
3 2 5 1 5 4 .
3 2 4 2 5 4 .
I would like to replace missing values of a particular week with values from the previous week for the matching store. That is to say:
replace missing values for category 3 in week 3 in store 1
with values for category 3 in week 2 in store 1
and
replace missing values for category 3 in week 3 in store 2
with values for category 3 in week 2 in store 2
Can I use the replace command, or is it something more complicated than that?
Something like:
replace cenoz1 = cenoz1[_n-1] if missing(cenoz1)
But I also need the stores to match, not just the time variable week.
I found this code provided by Nicholas Cox at
http://www.stata.com/support/faqs/data-management/replacing-missing-values/:
by id (time), sort: replace myvar = myvar[_n-1] if myvar >= .
Do you think
by store (week), sort: replace cenoz1 = cenoz1[_n-1] if missing(cenoz1)
is sufficient?
UPDATE:
When I use the code
by store (category week), sort: replace cenoz3 = cenoz3[_n-1] if missing(cenoz3)
It seems it delivers correct values:
week store cenoz category cenoz1 cenoz2 cenoz3
1 1 2 1 2 4 3
1 1 4 2 2 4 3
1 1 3 3 2 4 3
1 2 5 1 5 7 8
1 2 7 2 5 7 8
1 2 8 3 5 7 8
2 1 4 1 4 1 10
2 1 1 2 4 1 10
2 1 10 3 4 1 10
2 2 3 1 3 4 7
2 2 4 2 3 4 7
2 2 7 3 3 4 7
3 1 5 1 5 3 10
3 1 3 2 5 3 10
3 2 5 1 5 4 7
3 2 4 2 5 4 7
Is there any way to double-check this code, given that my dataset is quite large?
How can I make this code less specific, so that it applies to any cenoz variable with missing values (cenoz1, cenoz2, cenoz3, cenoz4 ... cenoz12)?
If you want to use the previous information for the same store and the same category, that should be
by store category (week), sort: replace cenoz3 = cenoz3[_n-1] if missing(cenoz3)
A generalization could be
sort store category week
forval j = 1/12 {
by store category: replace cenoz`j' = cenoz`j'[_n-1] if missing(cenoz`j')
}
However, this carrying forward is a fairly crude method of interpolation. Consider linear, cubic, cubic spline, or PCHIP interpolation instead. Use search to find Stata programs.
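For example, a minimal sketch of linear interpolation over week within store and category, using cenoz3 as an illustration (the epolate option also extrapolates, which the trailing week-3 gap would need):
* linear interpolation (and extrapolation) of cenoz3 over week
sort store category week
by store category: ipolate cenoz3 week, gen(cenoz3_li) epolate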
A quick note on why your code
by store (category week), sort: replace cenoz3 = cenoz3[_n-1] if missing(cenoz3)
won't work.
It will work for the example dataset you give. But a slight modification can give unexpected results. Consider the following example:
clear all
set more off
input week store cenoz category
1 1 2 1
1 1 4 2 /*
1 1 3 3 deleted observation */
1 2 5 1
1 2 7 2
1 2 8 3
2 1 4 1
2 1 1 2
2 1 10 3
2 2 3 1
2 2 4 2
2 2 7 3
3 1 5 1
3 1 3 2
3 1 999 3 // new observation
3 2 5 1
3 2 4 2
end
egen cenoz1 = mean(cenoz/ (category == 1)), by(week store)
egen cenoz2 = mean(cenoz/ (category == 2)), by(week store)
egen cenoz3 = mean(cenoz/ (category == 3)), by(week store)
order store category week
sort store category week
list, sepby(store category)
*----- method 1 (your code) -----
gen cenoz3x1 = cenoz3
by store (category week), sort: replace cenoz3x1 = cenoz3x1[_n-1] if missing(cenoz3x1)
*----- method 2 (Nick's code) -----
gen cenoz3x2 = cenoz3
by store category (week), sort: replace cenoz3x2 = cenoz3x2[_n-1] if missing(cenoz3x2)
list, sepby(store category)
Method 1 will assign the price of a category 1 article to a category 2 article (observation 4 of cenoz3x1), presumably something you don't want. If you want to avoid this, then the groups should be based on store category and not just store.
The best place to start reading is help and the manuals.

Stata: Capture p-value from ranksum test

When I run return list, all after running a ranksum test, the count and z-score are available, but not the p-value. Is there any way of picking it up?
clear
input eventtime prefflag winner stakechange
1 1 1 10
1 2 1 5
2 1 0 50
2 2 0 31
2 1 1 51
2 2 1 20
1 1 0 10
2 2 1 10
2 1 0 5
3 2 0 8
4 2 0 8
5 2 0 8
5 2 1 8
3 1 1 8
4 1 1 8
5 1 1 8
5 1 1 8
end
bysort eventtime winner: tabstat stakechange, stat(mean median n) columns(statistics)
ranksum stakechange if inlist(eventtime, 1, 2) & inlist(winner, 0, .), by(eventtime)
return list, all
Try computing it after ranksum:
scalar pval = 2 * normprob(-abs(r(z)))
display pval
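(normal() is the current name for the older normprob(); either works.) As a quick check against the two-sided p-value that ranksum displays:
quietly ranksum stakechange if inlist(eventtime, 1, 2) & inlist(winner, 0, .), by(eventtime)
display 2 * normal(-abs(r(z)))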
The answer is by Nick Cox:
http://www.stata.com/statalist/archive/2004-12/msg00622.html
The Statalist archive is a valuable resource.