How do I find first row of last group in SAS, where ordering matters? - sas

I'd like to ask for help with this, as I am new to SAS. A DATA step solution is preferred, but a PROC SQL approach is usable as well.
My dataset has IDs, a time variable, and a flag. After I sort by id and time, I need to find the first flagged observation of the last flagged group/streak. As in:
ID TIME FLAG
1 2 1
1 3 1
1 4 1
1 5 0
1 6 1
1 7 0
1 8 1
1 9 1
1 10 1
2 2 0
2 3 1
2 4 1
2 5 1
2 6 1
2 7 1
Here I want my script to return the row where time is 8 for ID 1, as it is the first observation from the last "streak", or flagged group. For ID 2 it should be where time is 3.
Desired output:
ID TIME FLAG
1 8 1
2 3 1
I'm trying to wrap my head around using first. and last. here, but I suppose the problem is that I view flagged groups/streaks separated in time as different groups, while SAS treats all observations with the same flag value as one group, so a simple "take first. from last." is not sufficient.
I was also thinking of collapsing the flags to a string and using a regex lookahead, but I couldn't come up with either the method or the pattern.

I would just code a double DOW loop. The first will let you calculate the observation for this ID that you want to output and the second will read through the records again and output the selected observation.
You can use the NOTSORTED keyword on the BY statement to have SAS calculate the FIRST.FLAG variable.
data have;
input ID TIME FLAG;
cards;
1 2 1
1 3 1
1 4 1
1 5 0
1 6 1
1 7 0
1 8 1
1 9 1
1 10 1
2 2 0
2 3 1
2 4 1
2 5 1
2 6 1
2 7 1
;
data want;
do obs=1 by 1 until(last.id);
set have;
by id flag notsorted;
if first.flag and flag=1 then want=obs; * remember the start of each flagged streak, the last one wins;
end;
do obs=1 to obs;
set have;
if obs=want then output;
end;
drop obs want;
run;

Loop through the dataset by id. Use the lag function to look at the current and previous value of flag. If the current value is 1 and the previous value is 0, or it's the first observation for that ID, write the value of time to a retained variable. Only output the last observation for each id. The retained variable should contain the time of the first flagged observation of the last flagged group:
data result;
set have;
by id;
retain firstflagged;
prevflag = lag(flag);
if first.id and flag = 1 then firstflagged = time;
else if first.id and flag = 0 then firstflagged = .;
else if flag = 1 and prevflag = 0 then firstflagged = time;
if last.id then output;
keep id firstflagged flag;
rename firstflagged = time;
run;
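Since the question says a PROC SQL approach is usable as well, here is a minimal sketch of one, not taken from either answer above. It assumes TIME is unique within each ID and that FLAG is always 0 or 1; the table name want_sql is made up for illustration. A flagged row belongs to the last streak when no unflagged row lies between it and the last flagged row of the same ID, so the earliest such row is the one wanted.

```sas
proc sql;
  create table want_sql as
  select a.id,
         min(a.time) as time,   /* earliest row of the last flagged streak */
         1 as flag
  from have as a
  where a.flag = 1
    /* keep a flagged row only if no unflagged row sits between it
       and the last flagged row of the same ID */
    and not exists (
      select 1
      from have as b
      where b.id = a.id
        and b.flag = 0
        and b.time > a.time
        and b.time < (select max(c.time)
                      from have as c
                      where c.id = a.id and c.flag = 1)
    )
  group by a.id;
quit;
```

Unlike the DATA step versions, this does not depend on the physical sort order of HAVE, but the correlated subqueries may be slow on large tables.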

Related

Using a counter to find multiple occurences on same day from ID & date

I am trying find when a person has multiple occurences on the same day & when they do not.
My data looks something like this
data have;
input id date $ ;
datalines ;
1 nov10
1 nov15
2 nov11
2 nov11
2 nov14
3 nov12
4 nov17
4 nov19
4 nov19
etc...;
I want to create a new variable to show whether an occurrence happens on the same day as another or not. I want my end result to look like
data want;
input id date occ;
1 nov10 1
1 nov15 1
2 nov11 2
2 nov11 2
2 nov14 1
3 nov12 1
4 nov17 1
4 nov19 2
4 nov19 2
etc...;
This is what I tried, but it does not work for each date: it only sets occ to 2 on the repeated rows, not on the first row of a repeated date. Here is my code
data want ;
set have ;
by id date;
if first.date then occ = 1;
else occ = 2;
run;
Your IF/THEN logic is just a complicated way to do
occ = 1 + not first.date;
Which is just a test of whether or not it is the first observation for this date.
Looks like you want to instead test whether or not there are multiple observations per date.
occ = 1 + not (first.date and last.date) ;
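Put into a complete step, the second expression might look like the sketch below. It assumes HAVE is already sorted (or at least grouped) by id and date, and that date is read as a character variable as in the sample data.

```sas
data want;
  set have;
  by id date;
  /* first.date and last.date are both 1 only when the date occurs once
     for this id, so occ is 1 for a single row and 2 otherwise */
  occ = 1 + not (first.date and last.date);
run;
```

Note that occ is 2 for any date with more than one row, whether it occurs twice or ten times.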

Create a running counter based on ID and date

I have 3 variables and a counter has to be created based on them.
Input:
ID window_start window_end
1 29oct20 12mar21
1 31oct20 08Feb21
1 31oct21 08feb21
1 31oct21 08feb21
2 06Nov20 11Apr21
2 06Nov20 11Apr21
2 27Nov20 01Apr19
Expected output:
ID window_start window_end priority_count
1 29oct20 12mar21 1
1 31oct20 08Feb21 2
1 31oct21 08feb21 2
1 31oct21 08feb21 2
2 06Nov20 11Apr21 1
2 06Nov20 11Apr21 1
2 27Nov20 01Apr19 2
So for every ID a new count should start once a new date comes.
I have been using this code
data want;
set have;
by ID window_start window_end;
if first.ID and first.window_start and first.window_end then priority_count=1;
else priority_count+1;
run;
But it gives:
priority_count
1
2
3
4
1
2
3
Not sure if those are typos but there are several observations for which window_start is after window_end.
One approach uses the LAG function:
data want;
set have;
by id;
_lag=lag(window_start);
if first.id then priority_count=1;
else do;
if window_start ne _lag then
priority_count + 1;
end;
drop _lag;
run;
ID window_start window_end priority_count
1 29OCT2020 12MAR2021 1
1 31OCT2020 08FEB2021 2
1 31OCT2020 08FEB2021 2
1 31OCT2020 08FEB2021 2
2 06NOV2020 11APR2021 1
2 06NOV2020 11APR2021 1
2 27NOV2020 01APR2019 2
I think you're on the right track but need a slight modification to your IF statements to reflect the logic.
Set to 0 at first of each ID
Increment if the window_end changes (or window_start since they're consistent in your example). Setting it to 0 initially means you can increment without worrying if it's the first or not.
data want;
set have;
by ID window_start window_end;
if first.ID then priority_count=0;
if first.window_end then priority_count+1;
run;

SAS and do loop

I'm writing a program in SAS.
Here's the dataset I have:
id huuse days
1 0 4
1 0 3
1 1 12
1 1 1
1 2 15
2 1 13
2 0 16
2 1 18
2 0 44
For each ID, I want to delete the record if variable huuse ne 1, until I get to the first huuse=1. Then I want to keep that record and all subsequent records for that id, no matter what value huuse is. So for id=1, I want to delete the first two records, then keep all records for id=1 starting with the 3rd record. For id=2, the first record has huuse=1, so I want to keep all records for id=2.
The data set I want should look like this:
id huuse days
1 0 4
1 0 3
1 1 12
1 1 1
1 2 15
2 1 13
2 0 16
2 1 18
2 0 44
I tried this code, but it removes all records that have huuse ne 1.
data want;
set have;
by id;
do until (huuse=1);
if huuse = 1 then LEAVE;
if huuse ne 1 then DELETE;
END;
run;
I've tried several variations of do loops, but they all do the same thing.
The DATA step is a program with an implicit loop that reads every record of the data set specified in the SET statement. Any program data vector (pdv) variables not coming from the data set are, by default, reset to missing at the top of the implicit loop. You change that behavior using a RETAIN statement to name variables that should not get reset.
So, in your problem there are two situations in which a tracking variable must change. The variable tracks the state of the condition "Have I seen huuse=1 yet in this group?". Call this variable one_flag.
RETAIN one_flag; so you control when its value changes
At the start of a BY group one_flag needs to be reset to false (0)
When huuse is first seen as 1 set the flag to true (1)
Example:
data want(drop=one_flag);
set have;
by id;
retain one_flag 0;
if first.id then one_flag = 0;
if not one_flag and huuse = 1 then one_flag = 1;
if one_flag then OUTPUT; * want all rows in group starting at first huuse=1;
run;
You can place the SET and BY statement inside an explicit DO and that changes the operating behavior of the program, especially if the explicit loop is terminated according to a LAST.<var> automatic variable. Such a loop is commonly called a DOW loop by SAS programmers. There is no phrase DOW loop in the SAS documentation.
Example:
data want;
do until (last.id);
set have;
by id;
if not one_flag and huuse=1 then one_flag = 1;
if one_flag then OUTPUT; * want all rows in group starting at first huuse=1;
end;
run;
Because the looping is explicit and never reaches the top of the program within the loop, there is no need to RETAIN the flag variable or to reset it. Program variables that are not retained are reset automatically at the top of the program, and the top of the program is only reached at the start of each BY group. Learn more about this programming construct in the SGF 2013 paper "The Magnificent DO" by Paul M. Dorfman.
Your source and result are same :-)
But if I understood your question correctly, the solution is quite simple with RETAIN. I added two rows (id=3) to the example to make it clear that I understood correctly.
The code with example table:
data test;
id=1;huuse=0;days=4;output;
id=1;huuse=0;days=3;output;
id=1;huuse=1;days=12;output;
id=1;huuse=1;days=1;output;
id=1;huuse=2;days=15;output;
id=2;huuse=1;days=13;output;
id=2;huuse=0;days=16;output;
id=2;huuse=1;days=18;output;
id=2;huuse=0;days=44;output;
id=3;huuse=0;days=1;output;
id=3;huuse=1;days=2;output;
run;
data test_output;
set test;
retain keep_id -1;
if (keep_id ne id and huuse = 1) then keep_id=id;
if keep_id = id then output;
run;
/* the results:
id huuse days keep_id
1 1 12 1
1 1 1 1
1 2 15 1
2 1 13 2
2 0 16 2
2 1 18 2
2 0 44 2
3 1 2 3
*/

subset of dataset using first and last in sas

Hi, I am trying to subset a dataset that has the following:
ID sal count
1 10 1
1 10 2
1 10 3
1 10 4
2 20 1
2 20 2
2 20 3
3 30 1
3 30 2
3 30 3
3 30 4
I want to take out only those IDs who are recorded 4 times.
I wrote this:
data AN; set BU
if last.count gt 4 and last.count lt 4 then delete;
run;
But there is something wrong.
EDIT - Thanks for clarifying. Based on your needs, PROC SQL will be more direct:
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING MAX(COUNT) = 4
;quit;
For posterity, here is how you could do it with only a data step:
In order to use first. and last., you need to use a by clause, which requires sorting:
proc sort data=BU;
by ID DESCENDING count;
run;
When using a SET statement BY ID, first.ID will be equal to 1 (TRUE) on the first instance of a given ID, 0 (FALSE) for all other records.
data AN;
set BU;
by ID;
retain keepMe;
If first.ID THEN DO;
IF count = 4 THEN keepMe=1;
ELSE keepMe=0;
END;
if keepMe=0 THEN DELETE;
run;
During the datastep BY ID, your data will look like:
ID sal count keepMe first.ID
1 10 4 1 1
1 10 3 1 0
1 10 2 1 0
1 10 1 1 0
2 20 3 0 1
2 20 2 0 0
2 20 1 0 0
3 30 4 1 1
3 30 3 1 0
3 30 2 1 0
3 30 1 1 0
If I understand correctly, you are trying to extract all observations whose ID is repeated 4 times or more. If so, your use of last.count is wrong. last.var is a boolean that indicates whether an observation is the last one in its BY group. Have a look at Tim's suggestion.
In order to extract all observations that are repeated four times or more, I would suggest using the following PROC SQL:
PROC SQL;
CREATE TABLE WORK.WANT AS
SELECT /* COUNT_of_ID */
(COUNT(t1.ID)) AS COUNT_of_ID,
t1.ID,
t1.SAL,
t1.count
FROM WORK.HAVE t1
GROUP BY t1.ID
HAVING (CALCULATED COUNT_of_ID) ge 4
ORDER BY t1.ID,
t1.SAL,
t1.count;
QUIT;
Result:
1 10 1
1 10 2
1 10 3
1 10 4
3 30 1
3 30 2
3 30 3
3 30 4
Slight variation on Tim's answer, assuming you don't necessarily have the count variable.
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING Count(ID) >= 4;
quit;

SAS identify combinations of two variables using count

I have the following dataset
data input;
input Row$ A B;
datalines;
1 1 2
2 1 2
3 1 1
4 1 1
5 2 3
6 2 3
7 2 3
8 2 2
9 2 2
10 2 1
;
run;
My goal is only to keep records of the first group of data for the variable A. For example I only want records where A=1 and B=2 (lines 1 and 2) and for the next group where A=2 and B=3 and so on...
I tried the following code
data input (rename= (count=rank_b));
set input;
count + 1;
by A descending B;
if first.B then count = 1;
run;
which just gives the number of observations in A (1 to 4) and B (1 to 6). What I would like is
A B rank_b rank_b_desired
1 2 1 1
1 2 2 1
1 1 1 2
1 1 2 2
2 3 1 1
2 3 2 1
2 2 1 2
2 2 2 2
2 1 1 3
So that I can then eliminate all obs where rank_b_desired does not equal 1.
Set a flag to 1 when you encounter a new value of A, then set it to 0 if B changes. retain will preserve the value of the flag when a new line is read from the input.
data want;
set input;
by A descending B;
retain flag;
if first.B then flag = 0;
if first.A then flag = 1;
run;
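To go from the flag to the subset the question asks for, the same step could filter on it. This is a sketch extending the code above; it assumes input is already ordered by A and descending B, as in the sample data, and the output name want_first is made up for illustration.

```sas
data want_first;
  set input;
  by A descending B;
  retain flag;
  if first.B then flag = 0;   /* any new B value clears the flag...          */
  if first.A then flag = 1;   /* ...unless it is also the first row of a new A */
  if flag = 1;                /* subsetting IF: keep only the first B group per A */
  drop flag;
run;
```

With the sample data this keeps rows 1 and 2 (A=1, B=2) and rows 5 to 7 (A=2, B=3).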
The desired result can also be achieved via proc sql, with the added benefit that it does not depend on the data being pre-sorted.
proc sql;
create table want as
select *
from input
group by A
having B = max(B)
order by Row;
quit;
Or to match user234821's output:
proc sql;
create table want as
select
*,
ifn(B = max(B), 1, 0) as flag
from input
group by A
order by Row;
quit;