Retaining max values over multiple columns - sas

I have a dataset like below, and want to collapse a subject so that I can see if they were diagnosed with a disease at all within the past 3 years using SAS. Disease1-3 are binary yes/no flags.
For example - for subject a in 2021, since they had all 3 diseases in the prior year of 2020, they should also have flags for all those diseases in 2021 and 2022.
subject
year
disease1
disease2
disease 3
a
2020
1
1
1
a
2021
0
0
0
a
2022
0
0
0
b
2020
0
1
0
b
2021
1
0
0
b
2022
0
0
1
I'm hoping it would look something like this.
subject
year
disease1
disease2
disease 3
a
2020
1
1
1
a
2021
1
1
1
a
2022
1
1
1
b
2020
0
1
0
b
2021
1
1
0
b
2022
1
1
1
What would be the best way about going to do this? I've tried using a do loop and the retain statement, but get stuck due to the fact that there are multiple columns to consider (disease1-disease3).

Store the max value of disease into a temporary variable. Retain this for each group. If the stored max value is ever 1, set all subsequent values to be 1 for each disease.
data want;
set have;
by subject year;
array disease[*] disease1-disease3;
array disease_max[3] _temporary_;
retain disease_max;
do i = 1 to dim(disease);
if(first.subject) then disease_max[i] = 0; /* Reset disease max counter for each subject */
if(disease[i] = 1) then disease_max[i] = 1; /* Store max disease value */
if(disease_max[i] = 1) then disease[i] = 1; /* Set disease to 1 if disease_max is 1 */
end;
drop i;
run;

data have;
input subject $ year disease1 disease2 disease3;
datalines;
a 2020 1 1 1
a 2021 0 0 0
a 2022 0 0 0
b 2020 0 1 0
b 2021 1 0 0
b 2022 0 0 1
;
data temp;
set have;
array d disease:;
do over d;
if d = 0 then d = .;
end;
run;
data want;
update temp(obs=0) temp;
by subject;
array d disease:;
do over d;
if d = . then d = 0;
end;
output;
run;

Related

How to Count Distinct for SAS PROC SQL with Rolling Date Window of 5 years?

I want to count the distinct values of a variable grouped by MEMBER_ID and a rolling date range of 5 years. I have seen a similar post.
How to Count Distinct for SAS PROC SQL with Rolling Date Window?
When I change h2.DATE BETWEEN h.DATE - 180 AND h.DATE to h2.year BETWEEN h.year-5 AND h.year, should it give me the correct distinct count within the last 5 years? Thank you in advance.
data have;
input permno year Cand_ID$;
datalines;
1 2000 1
1 2001 2
1 2002 3
1 2003 1
1 2004 3
1 2005 1
2 2000 1
2 2001 3
2 2002 1
2 2003 2
2 2004 2
2 2005 2
2 2006 1
2 2007 1
3 2001 3
3 2002 3
3 2003 3
3 2004 1
3 2005 1
;
run;
Here's how you can do it with a data step. This assumes you have values for all years. If you do not, fill it in with zeros.
Keep a rolling list of the last 5 years by using the lag function. If we keep a rolling sorted array list of the last 5 years using lag, we can count the distinct values for each row to get a rolling 5-year count.
In other words, we're going to create and count a list that looks like this:
permno year id1 id2 id3 id4 id5
1 2000 . . . . 1
1 2001 . . . 1 2
1 2002 . . 1 2 3
1 2003 . 1 1 2 3
Code:
data want;
set have;
by permno year;
array lagid[4] $;
array id[5] $;
id1 = cand_id;
lagid1 = lag1(cand_id);
lagid2 = lag2(cand_id);
lagid3 = lag3(cand_id);
lagid4 = lag4(cand_id);
/* Reset the counter for the first group */
if(first.permno) then n = 0;
/* Count the number of rows within a group */
n+1;
/* Save the last 5 years by using the lag function,
but do not get lags from previous groups
*/
do i = 1 to 4;
if(i < n) then id[i+1] = lagid[i];
end;
/* Sort the array of IDs into ascending order */
call sortc(of id:);
/* Count the number of distinct IDs in the array. Do not count
missing values.
*/
n_distinct = 1;
do i = 2 to dim(id);
if(id[i] > id[i-1] AND NOT missing(id[i-1]) ) then n_distinct+1;
end;
drop lag: n i;
run;
Output (without id: dropped):
permno year Cand_ID id1 id2 id3 id4 id5 n_distinct
1 2000 1 . . . . 1 1
1 2001 2 . . . 1 2 2
1 2002 3 . . 1 2 3 3
1 2003 1 . 1 1 2 3 3
1 2004 3 1 1 2 3 3 3
1 2005 1 1 1 2 3 3 3

SAS LOOP - create columns from the records which are having a value

Suppose i have random diagnostic codes, such as 001, v58, ..., 142,.. How can I construct columns from the codes which is 1 for the records?
Input:
id found code
1 1 001
2 0 v58
3 1 v58
4 1 003
5 0 v58
......
......
15000 0 v58
Output:
id code_001 code_v58 code_003 .......
1 1 0 0
2 0 0 0
3 0 1 0
4 1 0 0
5 0 0 0
.........
.........
You will want to TRANSPOSE the values and name the pivoted columns according to data (value of code) with an ID statement.
Example:
In real world data it is often the case that missing diagnoses will be flagged zero, and that has to be done in a subsequent step.
data have;
input id found code $;
datalines;
1 1 001
2 0 v58
2 1 003 /* second diagnosis result for patient 2 */
3 1 v58
4 1 003
5 0 v58
;
proc transpose data=have out=want(drop=_name_) prefix=code_;
by id;
id code; * column name becomes <prefix><code>;
var found;
run;
* missing occurs when an id was not diagnosed with a code;
* if that means the flag should be zero (for logistic modeling perhaps)
* the missings need to be changed to zeroes;
data want;
set want;
array codes code_:;
do _n_ = 1 to dim(codes); /* repurpose automatic variable _n_ for loop index */
if missing(codes(_n_)) then codes(_n_) = 0;
end;
run;

SAS, calculate row difference

data test;
input ID month d_month;
datalines;
1 59 0
1 70 11
1 80 21
2 10 0
2 11 1
2 13 3
3 5 0
3 9 4
4 8 0
;
run;
I have two columns of data ID and Month. Column 1 is the ID, the same ID may have multiple rows (1-5). The second column is the enrolled month. I want to create the third column. It calculates the different between the current month and the initial month for each ID.
you can do it like that.
data test;
input ID month d_month;
datalines;
1 59 0
1 70 11
1 80 21
2 10 0
2 11 1
2 13 3
3 5 0
3 9 4
4 8 0
;
run;
data calc;
set test;
by id;
retain current_month;
if first.id then do;
current_month=month;
calc_month=0;
end;
if ^first.id then do;
calc_month = month - current_month ;
end;
run;
Krs

Filling in gaps between sequentially numbered records and updating a status indicator

In a summarized dataset, I have the status of an event at each hour after baseline in which it was recorded. I also have the last hour the event could have been recorded. I want to create a new dataset with one record for each hour from the first through the last hour, with the status for each record being the one from the last recorded status.
Here is an example dataset:
data new;
input hour status last_hour;
cards;
2 1 12
4 1 12
5 1 12
6 1 12
7 0 12
9 1 12
10 0 12
;
run;
In this case, the first recorded hour was the second, and the last recorded hour was the 10th. The last possible hour to record data was the 12th.
The final dataset should look like so:
0 . 12
1 . 12
2 1 12
3 1 12
4 1 12
5 1 12
6 1 12
7 0 12
8 0 12
9 1 12
10 0 12
11 0 12
12 0 12
I sort of have it working with this series of data steps, but I'm not sure if there's a cleaner way I'm not seeing.
data step1;
set new (keep=id hour);
by id;
do hour = 0 to last_hour;
output;
end;
run;
proc sort data=step1;
by id hour;
run;
proc sql;
create table step2 as
select distinct a.id, a.hour, b.status
from step1 as a
left join new as b
on a.id = b.id
and a.hour = b.hour
order by a.id, a.hour;
quit;
data step3;
set step2;
by id hour;
retain previous_status;
if first.id then do;
previous_status = .;
if status > . then previous_status = status;
end;
if not first.id then do;
if status = . and previous_status > . then status = previous_status;
if status > . then previous_status = status;
end;
run;
Seeing your code, it seems you left out of your question the fact that you also have id's. So this is a newer solution that deals with different id's. See further below for my first solution ignoring id's.
Since last_hour is always 12, I left it out of the have dataset. It will be added later on.
data have;
input id hour status;
cards;
1 2 1
1 4 1
1 5 1
1 6 1
1 7 0
1 9 1
1 10 0
2 2 1
2 4 1
2 5 1
2 6 1
2 7 0
2 9 1
2 10 0
;
Create a hours dataset, just containing numbers 0 thru 12;
data hours;
do i = 0 to 12;
hour = i;
output;
end;
drop i;
run;
Create a temporary dataset that will have the right number of rows (13 rows for every id, with valid hour values where they exist in the have table).
proc sql;
create table tmp as
select distinct t1.id, t2.hour, 12 as last_hour
from have as t1
cross join
(select hour from hours) as t2;
quit;
Then use merge and retain to fill in the missing hour column where appropriate.
data want;
merge have
tmp;
by id hour;
retain status_previous;
if not first.id then do;
if status ne . then status_previous = status;
else if status_previous ne . then status = status_previous;
end;
if last.id then status_previous = .;
drop status_previous;
run;
Previous solution (no id's)
If last_hour is always 12, then this should do it:
data have;
input hour status last_hour;
datalines;
2 1 12
4 1 12
5 1 12
6 1 12
7 0 12
9 1 12
10 0 12
;
data hours;
do i = 0 to 12;
hour = i;
last_hour = 12;
output;
end;
drop i;
run;
data want;
merge have
hours;
by hour;
retain status_previous;
if status ne . then status_previous = status;
else if status_previous ne . then status = status_previous;
drop status_previous;
run;

SAS creating identifier

I have a dataset with patient having multiple courses during a treatment phase.
Data set looks like:
C 1 1 0
C 0 0 1
C 1 1 0
C 0 0 1
The first two rows: patient start at row1 and finishes at row2. This is the first course of patient C.
The second two rows: patient C again starts at row3 and finishes at row four.
How can I create an identifier for these two courses using the first and last statements in SAS.
Expected output should look like this;
C 1 1 0 23
C 0 0 1 23
C 1 1 0 24
C 0 0 1 24
C 1 1 1 25
The counts for one course should be the same and different from courses to courses within he same patient.
Thanks.
Assuming the third variable, whatever it is, is your 'end state' the following works. Probably not the easiest method but hopefully clear. I don't know if First/Last will actually help in this situation except for when the ID switches.
Idea is look for the V3=1 and then set a flag to 1. If the flag is 1, then the next record increments and resets the flag and the process is continued. Retain is used to hold the values of Flag and Course across the rows.
data have;
input ID $ v1-v3;
cards;
C 1 1 0
C 0 0 1
C 1 1 0
C 0 0 1
D 1 0 0
D 0 1 0
D 0 0 1
;
run;
data want;
set have;
BY ID;
retain flag 0 course;
if first.ID then do;
Course=1;
flag=0;
end;
if flag=1 then do;
course=course+1;
flag=0;
end;
else if v3=1 and flag=0 then flag=1;
run;
proc print;
run;