Last Changed row +1 in a group - sas

I've got group data and it has flags created anytime a name is changed within that group. I can pull the last two or first two observations within the group, but I am struggling figuring out how to pull the last observation with a name change AND the row right after.
The below code give me the first or last two observations per group, depending on how I sort the data.
DATA LastTwo;
SET WhatIveGot;
count + 1;
BY group_ID /*data pre sorted*/;
IF FIRST.group_ID THEN count=1;
IF count<=2 THEN OUTPUT;
RUN;
What I need is to be the LAST observation with a name change AND the following row.
group_ID NAME DATE NAME_CHange
1 TOM 1/1/19 0
1 Jill 1/30/19 1
1 Jill 1/20/19 0
1 Bob 2/10/19 1
1 Bob 2/30/19 0
2 TOM 2/1/19 0
2 Jill 2/30/19 1
2 Jill 2/20/19 0
2 Jim 3/10/19 1
2 Jim 3/30/19 0
2 Jim 4/15/19 0
3 Joe 2/20/19 0
3 Kim 3/10/19 1
3 Kim 3/30/19 0
3 Ken 4/15/19 1
4 Tim 3/10/19 0
4 Tim 3/30/19 0
The desired output:
group_ID NAME DATE NAME_CHange
1 Bob 2/10/19 1
1 Bob 2/30/19 0
2 Jim 3/10/19 1
2 Jim 3/30/19 0
3 Ken 4/15/19 1
The cases for Group_ID 2 and 3 are the roadblock. The data is already sorted by date.
Thank you for any help in advance

Use DOW processing to determine where the last name change was. Apply that information in a succeeding loop.
Example:
data want;
do _n_ = 1 by 1 until (last.id);
set have;
by id name notsorted;
if first.name then _index_of_last_name_change = _n_;
end;
do _n_ = 1 to _n_;
set have;
if _index_of_last_name_change <= _n_ <= _index_of_last_name_change+1 then OUTPUT;
end;
drop _:;
run;

Related

Create a running counter based on ID and date

I have 3 variables and a counter has to be created based on them.
Input:
ID window start window end
1 29oct20 12mar21
1 31oct20 08Feb21
1 31oct21 08feb21
1 31oct21 08feb21
2 06Nov20 11Apr21
2 06Nov20 11Apr21
2 27Nov20 01Apr19
Expected output:
ID window start window end priority_count
1 29oct20 12mar21 1
1 31oct20 08Feb21 2
1 31oct21 08feb21 2
1 31oct21 08feb21 2
2 06Nov20 11Apr21 1
2 06Nov20 11Apr21 1
2 27Nov20 01Apr19 2
So for every ID a new count should start once a new date comes.
I have been using this code
data want;
set have;
by ID window_start window_end;
if first.ID and first.window_start and first.window_endthen priority_count=1;
else priority_count+1;
run;
But it gives:
priority_count
1
2
3
4
1
2
3
Not sure if those are typos but there are several observations for which window_start is after window_end.
Using the LAG function
data want;
set have;
by id;
_lag=lag(window_start);
if first.id then priority_count=1;
else do;
if window_start ne _lag then
priority_count + 1;
end;
drop _lag;
run;
ID window_start window_end priority_count
1 29OCT2020 12MAR2021 1
1 31OCT2020 08FEB2021 2
1 31OCT2020 08FEB2021 2
1 31OCT2020 08FEB2021 2
2 06NOV2020 11APR2021 1
2 06NOV2020 11APR2021 1
2 27NOV2020 01APR2019 2
I think you're on the right track but need a slight modifications on your IF statements to reflect the logic.
Set to 0 at first of each ID
Increment if the window_end changes (or window_start since they're consistent in your example). Setting it to 0 initially means you can increment without worrying if it's the first or not.
data want;
set have;
by ID window_start window_end;
if first.ID then priority_count=0;
if first.window_end then priority_count+1;
run;

How do I find first row of last group in SAS, where ordering matters?

I'd like to ask help in this, as I am new to SAS, but a PROC SQL approach is usable as well.
My dataset has IDs, a time variable, and a flag. After I sort by id and time, I need to find the first flagged observation of the last flagged group/streak. As in:
ID TIME FLAG
1 2 1
1 3 1
1 4 1
1 5 0
1 6 1
1 7 0
1 8 1
1 9 1
1 10 1
2 2 0
2 3 1
2 4 1
2 5 1
2 6 1
2 7 1
Here I want my script to return the row where time is 8 for ID 1, as it is the first observation from the last "streak", or flagged group. For ID 2 it should be where time is 3.
Desired output:
ID TIME FLAG
1 8 1
2 3 1
I'm trying to wrap my head around using first. and last. here, but I suppose the problem here is that I view temporally displaced flagged groups/streaks as different groups, while SAS looks at them as they are only separated by flag, so a simple "take first. from last." is not sufficient.
I was also thinking of collapsing the flags to a string and using a regex lookahead, but I couldn't come up with either the method or the pattern.
I would just code a double DOW loop. The first will let you calculate the observation for this ID that you want to output and the second will read through the records again and output the selected observation.
You can use the NOTSORTED keyword on the BY statement to have SAS calculate the FIRST.FLAG variable.
data have;
input ID TIME FLAG;
cards;
1 2 1
1 3 1
1 4 1
1 5 0
1 6 1
1 7 0
1 8 1
1 9 1
1 10 1
2 2 0
2 3 1
2 4 1
2 5 1
2 6 1
2 7 1
;
data want;
do obs=1 by 1 until(last.id);
set have;
by id flag notsorted;
if first.flag then want=obs;
end;
do obs=1 to obs;
set have;
if obs=want then output;
end;
drop obs want;
run;
Loop through the dataset by id. Use the lag function to look at the current and previous value of flag. If the current value is 1 and the previous value is 0, or it's the first observation for that ID, write the value of time to a retained variable. Only output the last observation for each id. The retained variable should contain the time of the first flagged observation of the last flagged group:
data result;
set have;
by id;
retain firstflagged;
prevflag = lag(flag);
if first.id and flag = 1 then firstflagged = time;
else if first.id and flag = 0 then firstflagged = .;
else if flag = 1 and prevflag = 0 then firstflagged = time;
if last.id then output;
keep id firstflagged flag;
rename firstflagged = time;
run;

SAS LOOP - create columns from the records which are having a value

Suppose i have random diagnostic codes, such as 001, v58, ..., 142,.. How can I construct columns from the codes which is 1 for the records?
Input:
id found code
1 1 001
2 0 v58
3 1 v58
4 1 003
5 0 v58
......
......
15000 0 v58
Output:
id code_001 code_v58 code_003 .......
1 1 0 0
2 0 0 0
3 0 1 0
4 1 0 0
5 0 0 0
.........
.........
You will want to TRANSPOSE the values and name the pivoted columns according to data (value of code) with an ID statement.
Example:
In real world data it is often the case that missing diagnoses will be flagged zero, and that has to be done in a subsequent step.
data have;
input id found code $;
datalines;
1 1 001
2 0 v58
2 1 003 /* second diagnosis result for patient 2 */
3 1 v58
4 1 003
5 0 v58
;
proc transpose data=have out=want(drop=_name_) prefix=code_;
by id;
id code; * column name becomes <prefix><code>;
var found;
run;
* missing occurs when an id was not diagnosed with a code;
* if that means the flag should be zero (for logistic modeling perhaps)
* the missings need to be changed to zeroes;
data want;
set want;
array codes code_:;
do _n_ = 1 to dim(codes); /* repurpose automatic variable _n_ for loop index */
if missing(codes(_n_)) then codes(_n_) = 0;
end;
run;

Compare rows within group of size two

I've got the below code that works beautifully for comparing rows in a group when the first row doesnt matter.
data want_Find_Change;
set WORK.IA;
by ID;
array var[*] $ RATING;
array lagvar[*] $ zRATING;
array changeflag[*] RATING_UPDATE;
do i = 1 to dim(var);
lagvar[i] = lag(var[i]);
end;
do i = 1 to dim(var) ;
changeflag[i] = (var[i] NE lagvar[i] AND NOT first.ID);
end;
drop i;
run;
Unfortunately, when I use a dataset that has two rows per group I get incorrect returns, I'm assuming because the first row has to be used in the comparison. How can I compare the only to rows and a return only on the second row. This did not work:
data Change;
set WORK.Two;
by ID;
changeflag = last.RATING NE first.RATING;
run;
Example of the data I have and want
Group Name Sport DogName Eligibility
1 Tom BBALL Toto Yes
1 Tom golf spot Yes
2 Nancy vllyball Jimmy yes
2 Nancy vllyball rover no
want
Group Name Sport DogName Eligibility N_change S_change D_Change E_change
1 Tom BBall Toto Yes 0 0 0 0
1 Tom golf spot Yes 0 1 1 0
2 Nancy vllyball Jimmy yes 0 0 0 0
2 Nancy vllyball rover no 0 0 1 1
If you want only the first row to not be flagged, you first need to create a variable enumerating the rows within each group. You can do so with:
data temp;
set have;
count + 1;
by Group;
if first.Group then count = 1;
run;
In a second step, you can run a proc sql with a subquery, count distinct by groups, and case when:
proc sql;
create table want as
select
Group, Name, Sport, DogName, Eligibility,
case when count_name > 1 and count > 1 then 1 else 0 end as N_change,
case when count_sport > 1 and count > 1 then 1 else 0 end as S_change,
case when count_dog > 1 and count > 1 then 1 else 0 end as D_change,
case when count_E > 1 and count > 1 then 1 else 0 end as E_change
from (select *,
count(distinct(Name)) as count_name,
count(distinct(Sport)) as count_sport,
count(distinct(DogName)) as count_dog,
count(distinct(Eligibility)) as count_E
from temp
group by Group);
quit;
Best,

subset of dataset using first and last in sas

Hi I am trying to subset a dataset which has following
ID sal count
1 10 1
1 10 2
1 10 3
1 10 4
2 20 1
2 20 2
2 20 3
3 30 1
3 30 2
3 30 3
3 30 4
I want to take out only those IDs who are recorded 4 times.
I wrote like
data AN; set BU
if last.count gt 4 and last.count lt 4 then delete;
run;
But there is something wrong.
EDIT - Thanks for clarifying. Based on your needs, PROC SQL will be more direct:
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING MAX(COUNT) = 4
;quit;
For posterity, here is how you could do it with only a data step:
In order to use first. and last., you need to use a by clause, which requires sorting:
proc sort data=BU;
by ID DESCENDING count;
run;
When using a SET statement BY ID, first.ID will be equal to 1 (TRUE) on the first instance of a given ID, 0 (FALSE) for all other records.
data AN;
set BU;
by ID;
retain keepMe;
If first.ID THEN DO;
IF count = 4 THEN keepMe=1;
ELSE keepMe=0;
END;
if keepMe=0 THEN DELETE;
run;
During the datastep BY ID, your data will look like:
ID sal count keepMe first.ID
1 10 4 1 1
1 10 3 1 0
1 10 2 1 0
1 10 1 1 0
2 20 3 0 1
2 20 2 0 0
2 20 1 0 0
3 30 4 1 1
3 30 3 1 0
3 30 2 1 0
3 30 1 1 0
If I understand correct, you are trying to extract all observations are are repeated 4 time or more. if so, your use of last.count and first.count is wrong. last.var is a boolean and it will indicate which observation is last in the group. Have a look at Tim's suggestion.
In order to extract all observations that are repeated four times or more, I would suggest to use the following PROC SQL:
PROC SQL;
CREATE TABLE WORK.WANT AS
SELECT /* COUNT_of_ID */
(COUNT(t1.ID)) AS COUNT_of_ID,
t1.ID,
t1.SAL,
t1.count
FROM WORK.HAVE t1
GROUP BY t1.ID
HAVING (CALCULATED COUNT_of_ID) ge 4
ORDER BY t1.ID,
t1.SAL,
t1.count;
QUIT;
Result:
1 10 1
1 10 2
1 10 3
1 10 4
3 30 1
3 30 2
3 30 3
3 30 4
Slight variation on Tims - assuming you don't necessarily have the count variable.
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING Count(ID) >= 4;
quit;