How to recognise a particular sequence in a dataset and mark it? - sas

How can I find the first "1,0" sequence in column "Flag" within each ID group and mark the row where it starts with a 1 in column "Flag2", as in the example below?
ID Flag Flag2
1 1
1 1 1
1 0
1 1
1 0
1 0
2 1
2 1
2 1
2 1 1
2 0
2 0
3 0
3 0
3 0
3 0
4 1
4 1 1
4 0
4 1

The problem requires a 'lead' concept (the value from the next row), the mirror image of what the lag function provides. SAS has no built-in lead function, so you need to be creative.
Merge the data with itself, without a by statement, where the second copy is:
offset by one row using the firstobs= data set option
renamed, so that the lead values can be tested in an if statement
A retained variable tracks whether the 1,0 transition has already been observed within the group.
Sample code:
data have;
input ID Flag;
datalines;
1 1
1 1
1 0
1 1
1 0
1 0
2 1
2 1
2 1
2 1
2 0
2 0
3 0
3 0
3 0
3 0
4 1
4 1
4 0
4 1
;
run;
data want;
  merge
    have
    have(firstobs=2 rename=(id=lead_id flag=lead_flag))
  ;
  retain flagged_id;
  if (id=lead_id)                        /* lead is in same group */
     and (flag=1) and (lead_flag=0)      /* transition identified */
     and (flagged_id ne id) then         /* first such transition for group */
  do;
    flag2=1;          /* flag the row where the 1,0 transition starts */
    flagged_id = id;  /* track id where transition last flagged */
  end;
  drop lead_: flagged:;
run;
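As a language-neutral cross-check of the look-ahead logic, here is a minimal Python sketch. It is not part of the SAS answer; the function name mark_first_one_zero and the tuple layout are assumptions made for illustration. It compares each row with the next one and marks only the first 1,0 transition per ID.

```python
def mark_first_one_zero(rows):
    """rows: list of (id, flag) tuples, already grouped by id.
    Returns one flag2 value per row: 1 on the marked row, None elsewhere."""
    flag2 = [None] * len(rows)
    flagged_ids = set()  # ids whose first 1,0 transition is already marked
    for i in range(len(rows) - 1):
        cur_id, cur_flag = rows[i]
        next_id, next_flag = rows[i + 1]  # the "lead" row
        if (cur_id == next_id and cur_flag == 1 and next_flag == 0
                and cur_id not in flagged_ids):
            flag2[i] = 1
            flagged_ids.add(cur_id)
    return flag2


rows = [(1, 1), (1, 1), (1, 0), (1, 1), (1, 0), (1, 0),
        (2, 1), (2, 1), (2, 1), (2, 1), (2, 0), (2, 0),
        (3, 0), (3, 0), (3, 0), (3, 0),
        (4, 1), (4, 1), (4, 0), (4, 1)]
marked = [i for i, v in enumerate(mark_first_one_zero(rows)) if v == 1]
print(marked)  # zero-based positions of the marked rows
```

With the sample data, the marked positions correspond to the second row of ID 1, the fourth row of ID 2 and the second row of ID 4, which matches the Flag2 column in the question.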

Related

Lag function in SAS for checking previous value

In SAS, I would like to create an indicator that checks the previous sell indicator: if the sell indicator in the previous time period is 1 and the current one is 0, or vice versa (meaning that it has changed), then I assign a value of 1 to the ind variable.
The dataset looks like:
Customer Time Sell_Ind
1 2 1
1 3 0
1 4 0
2 23 0
2 24 0
2 30 0
5 12 1
5 11 0
And so on.
My expected output would be
Customer Time Sell_Ind Ind
1 2 1 0
1 3 0 1
1 4 0 0
2 23 0 0
2 24 0 0
2 30 0 0
5 12 1 0
5 11 0 1
The previous/current comparison should be done within each customer.
I have tried as follows
data mydata;
set original;
By customer;
Lag_sell_ind=lag(sell_ind);
If first.customer then Lag_sell_ind=.;
Run;
But it does not return the expected output.
In sql I would probably use partition by customer over time but I do not know how to do the same in SAS.
You were halfway there; you only need to add one if statement to get the desired output.
data want;
set have;
by customer;
lag=lag(sell_ind);
if first.customer then lag=.;
if sell_ind ne lag and lag ne . then ind = 1;
else ind = 0;
drop lag;
run;
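You can simplify this using the IFN function, as below. IFN evaluates all of its arguments on every row, so lag() is still called for every observation even though its result is discarded on the first row of each customer.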
data have;
input Customer Time Sell_Ind;
datalines;
1 2 1
1 3 0
1 4 0
2 23 0
2 24 0
2 30 0
5 12 1
5 11 0
;
data want;
set have;
by customer;
/* 0 on the first row of each customer, otherwise 1 if sell_ind changed */
ind = ifn(first.customer, 0, sell_ind ne lag(sell_ind));
run;
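To make the per-customer lag comparison in the answers above concrete, here is a small Python sketch of the same logic. It is only an illustration; the function name change_indicator and the (customer, sell_ind) tuple layout are assumptions. The first row of each customer gets 0, and every later row gets 1 only when the value differs from the previous row.

```python
def change_indicator(rows):
    """rows: list of (customer, sell_ind) in input order, grouped by customer."""
    out = []
    prev_customer, prev_value = None, None
    for customer, sell_ind in rows:
        first_in_group = customer != prev_customer
        # No previous value to compare with on the first row of a customer
        out.append(0 if first_in_group else int(sell_ind != prev_value))
        prev_customer, prev_value = customer, sell_ind
    return out


rows = [(1, 1), (1, 0), (1, 0), (2, 0), (2, 0), (2, 0), (5, 1), (5, 0)]
print(change_indicator(rows))
```

For the sample data this reproduces the Ind column of the expected output.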

Identify and delete observations that do not meet conditions in Stata

I need help identifying and removing observations that meet certain conditions. My data looks like this:
ID caseID set Var1 Var2
1 1 1 1 0
1 2 1 2 0
1 3 1 3 1
1 4 2 1 0
1 5 2 2 0
1 6 2 3 1
2 7 3 1 0
2 8 3 2 0
2 9 3 3 1
2 10 4 1 0
2 11 4 2 0
2 12 4 3 0
For every set, I want to have one observation in which Var2=1 and two observations in which Var2=0. If they do not meet this condition, I want to delete all observations from the set. For example, I would delete set=4 because Var2=0 for all observations. How can I do this in Stata?
Consider the following new variables:
egen count1 = total(Var2 == 1), by(set)
egen count0 = total(Var2 == 0), by(set)
egen total = total(Var2), by(set)
A literal reading of your question implies that you want to
keep if count1 == 1 & count0 == 2
But if sets are always of size 3 and no values other than 0 or 1 are possible, then any one of the three checks is enough on its own: count1 == 1, count0 == 2, or total == 1.
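The counting idea can be sketched outside Stata as well. The Python below is a minimal illustration, with the function name keep_valid_sets and the (set, Var2) tuple layout chosen for the example; it counts ones and zeros per set and keeps only the sets with exactly one 1 and two 0s.

```python
from collections import Counter


def keep_valid_sets(rows):
    """rows: list of (set_id, var2). Keep rows whose set has one 1 and two 0s."""
    ones = Counter(s for s, v in rows if v == 1)
    zeros = Counter(s for s, v in rows if v == 0)
    return [(s, v) for s, v in rows if ones[s] == 1 and zeros[s] == 2]


rows = [(1, 0), (1, 0), (1, 1),
        (2, 0), (2, 0), (2, 1),
        (3, 0), (3, 0), (3, 1),
        (4, 0), (4, 0), (4, 0)]
kept = keep_valid_sets(rows)
print(sorted({s for s, _ in kept}))  # sets that survive the filter
```

On the sample data, set 4 is dropped because all of its Var2 values are 0.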

Creating variables based on other variables in SAS

I'm looking to create a variable based on this data sample:
Video Subject Pre_post Pre_Post_ID
1 1 0 1
1 2 0 1
1 2 0 1
1 3 0 1
1 3 0 1
2 1 1 1
2 1 1 1
2 2 1 1
2 2 1 1
2 3 1 1
4 1 0 2
4 2 0 2
4 2 0 2
4 3 0 2
4 3 0 2
5 1 1 2
5 1 1 2
5 2 1 2
5 2 1 2
5 3 1 2
The goal of the variable will be to create an ID that links the pre_post variable to the subject on the condition that the pre_post_id is the same:
Video Subject Pre_post Pre_Post_ID Subject_P_P_ID
1 1 0 1 1
1 2 0 1 2
1 2 0 1 2
1 3 0 1 3
1 3 0 1 3
2 1 1 1 1
2 1 1 1 1
2 2 1 1 2
2 2 1 1 2
2 3 1 1 3
4 1 0 2 4
4 2 0 2 5
4 2 0 2 5
4 3 0 2 6
4 3 0 2 6
5 1 1 2 4
5 1 1 2 4
5 2 1 2 5
5 2 1 2 5
5 3 1 2 6
Thank you in advance for the help!
You will want to track the pairs (<pre_post_id>,<subject>) as a composite key and increment the Subject_P_P_ID every time a new pair (or key) is encountered.
To simplify the discussion, call the two items in the pair item1 and item2
Here are two ways:
Sort by item1 item2, step through BY item1 item2 and track pair count using logic based on an automatic first. variable -- pair_id + (first.item2), or
Track pairs as keys of a hash and assign new id as <hash>.num_items + 1 when key lookup fails.
Sort + Data Step + Revert Sort
proc sort data=have out=have_sorted;
by item1 item2;
run;
data have_sequenced;
set have_sorted;
by item1 item2;
item1_item2_pair_id + (first.item2);
run;
proc sort data=have_sequenced out=want;
by video subject pre_post pre_post_id item1_item2_pair_id;
run;
Hash
data want;
set have;
if _n_=1 then do;
declare hash lookup();
lookup.defineKeys('item1', 'item2');
lookup.defineData('item1_item2_pair_id');
lookup.defineDone();
end;
if lookup.find() ne 0 then do;
item1_item2_pair_id = lookup.num_items+1;
lookup.add();
end;
run;
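The hash approach boils down to "give each new key the next number". A minimal Python sketch of that idea, using a dict in place of the SAS hash object (the function name assign_pair_ids is an assumption for illustration; the pairs are (Pre_Post_ID, Subject) taken from the sample data):

```python
def assign_pair_ids(pairs):
    """pairs: list of (item1, item2) in input order.
    Each previously unseen pair gets the next sequential id."""
    lookup = {}
    ids = []
    for pair in pairs:
        if pair not in lookup:
            lookup[pair] = len(lookup) + 1  # same idea as num_items + 1
        ids.append(lookup[pair])
    return ids


pairs = [(1, 1), (1, 2), (1, 2), (1, 3), (1, 3),
         (1, 1), (1, 1), (1, 2), (1, 2), (1, 3),
         (2, 1), (2, 2), (2, 2), (2, 3), (2, 3),
         (2, 1), (2, 1), (2, 2), (2, 2), (2, 3)]
print(assign_pair_ids(pairs))
```

Unlike the sort-based version, this numbers the pairs in order of first appearance, so no re-sort is needed; for the sample data both orders happen to give the same ids.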

Conditionally delete the most recently inserted observation in SAS

I have two tables A and B that look like below.
Table A
rowno flag1 flag2 flag3
1 1 0 0
2 0 1 1
3 0 0 0
4 0 1 1
5 0 0 1
6 0 0 0
7 0 0 0
8 0 1 0
9 0 0 0
10 1 0 0
Table B
rowno flag1 flag2 flag3
Table A and B have the same column names but B is an empty table initially.
What I want to do is insert the rows from A into B one at a time, using a macro that iterates over rowno. Each time I insert a row from A into B, I want to calculate the sum of each flag column.
If, after inserting a row, sum(flag1) > 1 or sum(flag2) > 1 or sum(flag3) > 1, I need to delete that row from table B. The iteration then continues until the last observation in Table A. The final Table B should contain 5 observations from Table A.
The code I have so far is below:
%macro iteration;
%do rowno=1 %to 10;
proc sql;
insert into table.B
select *
from table.A
where rowno = &rowno;
quit;
set table.B;
if
sum(flag1) > 1
or
sum(flag2) > 1
or
sum(flag3) > 1
then delete;
run;
%end;
%mend iteration;
%iteration
I received a lot of error messages.
Looking forward to your help and suggestions. Thanks.
The ideal output data would look like this
rowno flag1 flag2 flag3
1 1 0 0
2 0 1 1
3 0 0 0
6 0 0 0
7 0 0 0
Instead of a macro, use a single data step that keeps a running sum of each flag. If a row has to be deleted, remember to reverse its increment to the running sums. Based on your data, I think row 9 should also be kept.
data TableA;
input rowno flag1 flag2 flag3;
cards;
1 1 0 0
2 0 1 1
3 0 0 0
4 0 1 1
5 0 0 1
6 0 0 0
7 0 0 0
8 0 1 0
9 0 0 0
10 1 0 0
;
run;
data TableB;
  set TableA;
  retain sum_:;
  *Increment running sum for each flag;
  sum_flag1 + flag1;
  sum_flag2 + flag2;
  sum_flag3 + flag3;
  *Check flag amounts;
  if sum_flag1>1 or sum_flag2>1 or sum_flag3>1 then do;
    *If a flag is tripped, reverse the increments and remove the record;
    sum_flag1 + (-flag1);
    sum_flag2 + (-flag2);
    sum_flag3 + (-flag3);
    delete;
  end;
run;
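To check which rows survive, the running-sum rule can be traced in a few lines of Python. This is a sketch under assumptions (the function name insert_with_cap and the cap parameter are made up for illustration), and it computes trial sums before committing them rather than adding and then reversing, which gives the same result.

```python
def insert_with_cap(rows, cap=1):
    """rows: list of (rowno, flag1, flag2, flag3).
    Keep a row only if no running flag total would exceed cap."""
    sums = [0, 0, 0]
    kept = []
    for rowno, *flags in rows:
        trial = [s + f for s, f in zip(sums, flags)]
        if any(t > cap for t in trial):
            continue  # reject the row; running sums stay unchanged
        sums = trial
        kept.append(rowno)
    return kept


table_a = [(1, 1, 0, 0), (2, 0, 1, 1), (3, 0, 0, 0), (4, 0, 1, 1),
           (5, 0, 0, 1), (6, 0, 0, 0), (7, 0, 0, 0), (8, 0, 1, 0),
           (9, 0, 0, 0), (10, 1, 0, 0)]
print(insert_with_cap(table_a))
```

On the sample data this keeps rows 1, 2, 3, 6, 7 and 9, which agrees with the remark that row 9 should also be kept.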

Create a dummy variable for the last rows based on another variable

I would like to create a dummy variable that looks at the variable "count" and labels that many rows as 1, counting back from the last row of each id. For example, ID 1 has a count of 3, so its five rows get the pattern 0,0,1,1,1. Similarly, ID 4, which has a count of 1, gets 0,0,0,1. The IDs have different numbers of rows. The variable "wish" shows what I want to obtain as the final output.
input byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
For future questions, you should include your failed attempts. This shows that you have done your part, namely, researched your problem.
One way is:
clear
set more off
*----- example data -----
input ///
byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
list, sepby(id)
*----- what you want -----
bysort id: gen wish2 = _n > (_N - count)
list, sepby(id)
I assume you already sorted your date variable within ids.
One way to accomplish this would be to use within-group row numbers using 'bysort'-type logic:
***Create variable of within-group row numbers.
bysort id: gen obsnum = _n
***Calculate total number of rows within each group.
by id: egen max_obsnum = max(obsnum)
***Subtract the count variable from the group row count.
***This is the number of rows where we want the dummy to equal zero.
gen max_obsnum_less_count = max_obsnum - count
***Create the dummy to equal one when the row number is
***greater than this last variable.
gen dummy = (obsnum > max_obsnum_less_count)
***Clean up.
drop obsnum max_obsnum max_obsnum_less_count
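Both answers reduce to the same comparison: a row gets 1 when its position within the id is greater than the group size minus count. A minimal Python sketch of that rule, assuming rows are already sorted by id and date (the function name last_k_dummy is hypothetical):

```python
from itertools import groupby


def last_k_dummy(rows):
    """rows: list of (id, count), sorted by id (and by date within id).
    Returns 1 for the last `count` rows of each id, else 0."""
    out = []
    for _, grp in groupby(rows, key=lambda r: r[0]):
        grp = list(grp)
        n = len(grp)  # plays the role of _N
        for pos, (_, count) in enumerate(grp, start=1):  # pos plays the role of _n
            out.append(int(pos > n - count))
    return out


rows = [(1, 3)] * 5 + [(2, 4)] * 4 + [(4, 1)] * 4
print(last_k_dummy(rows))
```

For ids 1, 2 and 4 from the example this yields 0,0,1,1,1, then 1,1,1,1, then 0,0,0,1, matching the wish column.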