Creating variables based on other variables in SAS

Creating variables based on other variables in SAS - sas

I'm looking to create a variable based on this data sample:
Video Subject Pre_post Pre_Post_ID
1 1 0 1
1 2 0 1
1 2 0 1
1 3 0 1
1 3 0 1
2 1 1 1
2 1 1 1
2 2 1 1
2 2 1 1
2 3 1 1
4 1 0 2
4 2 0 2
4 2 0 2
4 3 0 2
4 3 0 2
5 1 1 2
5 1 1 2
5 2 1 2
5 2 1 2
5 3 1 2
The goal of the variable will be to create an ID that links the pre_post variable to the subject on the condition that the pre_post_id is the same:
Video Subject Pre_post Pre_Post_ID Subject_P_P_ID
1 1 0 1 1
1 2 0 1 2
1 2 0 1 2
1 3 0 1 3
1 3 0 1 3
2 1 1 1 1
2 1 1 1 1
2 2 1 1 2
2 2 1 1 2
2 3 1 1 3
4 1 0 2 4
4 2 0 2 5
4 2 0 2 5
4 3 0 2 6
4 3 0 2 6
5 1 1 2 4
5 1 1 2 4
5 2 1 2 5
5 2 1 2 5
5 3 1 2 6
Thank you in advance for the help!

You will want to track the pairs (<pre_post_id>,<subject>) as a composite key and increment the Subject_P_P_ID every time a new pair (or key) is encountered.
To simplify the discussion, call the two items in the pair item1 and item2
Here are two ways:
Sort by item1 item2, step through BY item1 item2 and track pair count using logic based on an automatic first. variable -- pair_id + (first.item2), or
Track pairs as keys of a hash and assign new id as <hash>.num_items + 1 when key lookup fails.
Sort + Data Step + Revert Sort
proc sort data=have out=have_sorted;
by item1 item2;
run;
data have_sequenced;
set have_sorted;
by item1 item2;
item1_item2_pair_id + (first.item2);
run;
proc sort data=have_sequenced out=want;
by video subject pre_post pre_post_id item1_item2_pair_id;
run;
Hash
data want;
set have;
if _n_=1 then do;
declare hash lookup();
lookup.defineKeys('item1', 'item2');
lookup.defineData('item1_item2_pair_id');
lookup.defineDone();
end;
if lookup.find() ne 0 then do;
item1_item2_pair_id = lookup.num_items+1;
lookup.add();
end;
end;

Related

Identify and delete observations that do not meet conditions in Stata

I need help identifying and removing observations that meet certain conditions. My data looks like this:
ID caseID set Var1 Var2
1 1 1 1 0
1 2 1 2 0
1 3 1 3 1
1 4 2 1 0
1 5 2 2 0
1 6 2 3 1
2 7 3 1 0
2 8 3 2 0
2 9 3 3 1
2 10 4 1 0
2 11 4 2 0
2 12 4 3 0
For every set, I want to have one observation in which Var2=1 and two observations in which Var2=0. If they do not meet this condition, I want to delete all observations from the set. For example, I would delete set=4 because Var2=0 for all observations. How can I do this in Stata?

Consider the following new variables:
egen count1 = total(Var2 == 1), by(set)
egen count0 = total(Var2 == 0), by(set)
egen total = total(Var2), by(set)
A literal reading of your question implies that you want to
keep if count1 == 1 & count0 == 2
But if sets are always of size 3 and no values other than 0 or 1 are possible, then you need only count1 == 1 OR count0 == 2 OR total == 1 as a condition.

Pandas: count when condition is met in subgroups

I have a dataframe that looks like:
subgroup value
0 1 0
1 1 1
2 1 1
3 1 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 1
9 3 0
10 3 0
I need to add a column that add 1 whenever there is at least one value different than 0 in the different subgroups. Please, note that if the value 1 is repeated more than once in the same subgroup, it doesn't affect the count.
The result should be:
subgroup value count
0 1 0 1
1 1 1 1
2 1 1 1
3 1 1 1
4 2 0 1
5 2 0 1
6 2 0 1
7 3 0 2
8 3 1 2
9 3 0 2
10 3 0 2
Thank you in advance for your help!

Using shift with -1 and 1 and cumsum the result
mask=(df.value.ne(df.value.shift()))&(df.value.ne(df.value.shift(-1)))
mask.cumsum()
Out[18]:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 2
9 2
10 2
Name: value, dtype: int32

Using merge and groupby
df.merge(df.groupby('subgroup').value.sum().gt(0).cumsum().reset_index(name='out'))
subgroup value out
0 1 0 1
1 1 1 1
2 1 1 1
3 1 0 1
4 2 0 1
5 2 0 1
6 2 0 1
7 3 0 2
8 3 1 2
9 3 0 2
10 3 0 2

SAS - Split single column into two based value of non-binary ID column

I have data which is as follows:
data have;
length
group 8
replicate $ 1
day 8
observation 8
;
input (_all_) (:);
datalines;
1 A 1 0
1 A 1 5
1 A 1 3
1 A 1 3
1 A 2 7
1 A 2 2
1 A 2 4
1 A 2 2
1 B 1 1
1 B 1 3
1 B 1 8
1 B 1 0
1 B 2 3
1 B 2 8
1 B 2 1
1 B 2 3
1 C 1 1
1 C 1 5
1 C 1 2
1 C 1 7
1 C 2 2
1 C 2 1
1 C 2 4
1 C 2 1
2 A 1 7
2 A 1 5
2 A 1 3
2 A 1 1
2 A 2 0
2 A 2 5
2 A 2 3
2 A 2 0
2 B 1 0
2 B 1 3
2 B 1 4
2 B 1 8
2 B 2 1
2 B 2 3
2 B 2 4
2 B 2 0
2 C 1 0
2 C 1 4
2 C 1 3
2 C 1 1
2 C 2 2
2 C 2 3
2 C 2 0
2 C 2 1
3 A 1 4
3 A 1 5
3 A 1 6
3 A 1 7
3 A 2 3
3 A 2 1
3 A 2 5
3 A 2 2
3 B 1 2
3 B 1 0
3 B 1 2
3 B 1 3
3 B 2 0
3 B 2 6
3 B 2 3
3 B 2 7
3 C 1 7
3 C 1 5
3 C 1 3
3 C 1 1
3 C 2 0
3 C 2 3
3 C 2 2
3 C 2 1
;
run;
I want to split observation into two columns based on day.
observation_ observation_
Obs group replicate day_1 day_2
1 1 A 0 7
2 1 A 5 2
3 1 A 3 4
4 1 A 3 2
5 1 B 1 3
6 1 B 3 8
7 1 B 8 1
8 1 B 0 3
9 1 C 1 2
10 1 C 5 1
11 1 C 2 4
12 1 C 7 1
13 2 A 7 0
14 2 A 5 5
15 2 A 3 3
16 2 A 1 0
17 2 B 0 1
18 2 B 3 3
19 2 B 4 4
20 2 B 8 0
21 2 C 0 2
22 2 C 4 3
23 2 C 3 0
24 2 C 1 1
25 3 A 4 3
26 3 A 5 1
27 3 A 6 5
28 3 A 7 2
29 3 B 2 0
30 3 B 0 6
31 3 B 2 3
32 3 B 3 7
33 3 C 7 0
34 3 C 5 3
35 3 C 3 2
36 3 C 1 1
The observant SO reader will notice that I have asked essentially the same question previously. However, because of SAS's obsession with "levels" and "by groups", since the variable being used to split the variable of interest isn't binary, that solution doesn't generalize.
Trying it directly, the following occurs:
proc sort data = have out = sorted;
by
group
replicate
;
run;
proc transpose data = sorted out = test;
by
group
replicate
;
var observation;
id day;
run;
ERROR: The ID value "_1" occurs twice in the same BY group.
I can use a LET statement to repress the errors, but in addition to cluttering up the log, SAS retains only the last observation of each BY group.
proc sort data = have out = sorted;
by
group
replicate
;
run;
proc transpose data = sorted out = test let;
by
group
replicate
;
var observation;
id day;
run;
Obs group replicate _NAME_ _1 _2
1 1 A observation 3 2
2 1 B observation 0 3
3 1 C observation 7 1
4 2 A observation 1 0
5 2 B observation 8 0
6 2 C observation 1 1
7 3 A observation 7 2
8 3 B observation 3 7
9 3 C observation 1 1
I don't doubt there's some kludgy way it could be done, such as splitting each group into a separate data set and then re-merging them. It seems like it should be doable with PROC TRANSPOSE, although how escapes me. Any ideas?

Not sure what you're talking about with "SAS's obsession...", but the issue here is fairly straightforward; you need to tell SAS about the four rows (or whatever) being separate, distinct rows. by tells SAS what the row-level ID is, but you're lying to it when you say by group replicate, since there are still multiple rows under that. So you need to have a unique key. (This would be true in any database-like language, nothing unique to SAS here. )
I would do this - make a day_row field, then sort by that.
data have_id;
set have;
by group replicate day;
if first.day then day_row = 0;
day_row+1;
run;
proc sort data=have_id;
by group replicate day_row;
run;
proc transpose data=have_id out=want(drop=_name_) prefix=observation_day_;
by group replicate day_row;
var observation;
id day;
run;

Your output looks like you don't want to transpose the data but instead just want split it into DAY1 and DAY2 sets and merge them back together. This will just pair the multiple readings per BY group in the same order that they appear, which is what it looks like you did in your example.
data want ;
merge
have(where=(day=1) rename=(observation=day_1))
have(where=(day=2) rename=(observation=day_2))
;
by group replicate;
drop day ;
run;
You can read the source data as many times as you need for the number of values of DAY.
If you think that you might not have the same number of observations per BY group for each DAY then you should add these statements at the end of the data step.
output;
call missing(of day_:);

Create a dummy variable for the last rows based on on another variable

I would like to create a dummy variable that will look at the variable "count" and label the rows as 1 starting from the last row of each id. As an example ID 1 has count of 3 and the last three rows of this id will have such pattern: 0,0,1,1,1 Similarly, ID 4 which has a count of 1 will have 0,0,0,1. The IDs have different number of rows. The variable "wish" shows what I want to obtain as a final output.
input byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end

For future questions, you should provide your failed attempts. This shows that you have done your part, namely, research your problem.
One way is:
clear
set more off
*----- example data -----
input ///
byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
list, sepby(id)
*----- what you want -----
bysort id: gen wish2 = _n > (_N - count)
list, sepby(id)
I assume you already sorted your date variable within ids.

One way to accomplish this would be to use within-group row numbers using 'bysort'-type logic:
***Create variable of within-group row numbers.
bysort id: gen obsnum = _n
***Calculate total number of rows within each group.
by id: egen max_obsnum = max(obsnum)
***Subtract the count variable from the group row count.
***This is the number of rows where we want the dummy to equal zero.
gen max_obsnum_less_count = max_obsnum - count
***Create the dummy to equal one when the row number is
***greater than this last variable.
gen dummy = (obsnum > max_obsnum_less_count)
***Clean up.
drop obsnum max_obsnum max_obsnum_less_count

Stata: Capture p-value from ranksum test

When I run return list, all after running a ranksum test, the count and z-score are available, but not the p-value. Is there any way of picking it up?
clear
input eventtime prefflag winner stakechange
1 1 1 10
1 2 1 5
2 1 0 50
2 2 0 31
2 1 1 51
2 2 1 20
1 1 0 10
2 2 1 10
2 1 0 5
3 2 0 8
4 2 0 8
5 2 0 8
5 2 1 8
3 1 1 8
4 1 1 8
5 1 1 8
5 1 1 8
end
bysort eventtime winner: tabstat stakechange, stat(mean median n) columns(statistics)
ranksum stakechange if inlist(eventtime, 1, 2) & inlist(winner, 0, .), by (eventtime)
return list, all

Try computing it after ranksum:
scalar pval = 2 * normprob(-abs(r(z)))
display pval
The answer is by #NickCox:
http://www.stata.com/statalist/archive/2004-12/msg00622.html
The Statalist archive is a valuable resource.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Creating variables based on other variables in SAS - sas

Related

Identify and delete observations that do not meet conditions in Stata

Pandas: count when condition is met in subgroups

SAS - Split single column into two based value of non-binary ID column

Create a dummy variable for the last rows based on on another variable

Stata: Capture p-value from ranksum test

Categories

Resources