Aggregating multiple observations depending on validity ranges - SAS
I am facing a problem concatenating multiple observations depending on validity ranges. The function I am trying to reproduce is similar to Oracle's LISTAGG() function, but I want to apply it with regard to validity ranges.
Here is a reproducible minimal dataset:
data have;
infile datalines4 delimiter=",";
input id var $ value $ start:datetime20. end:datetime20.;
format start end datetime20.;
datalines4;
1,NAME,AAA,01JAN2014:00:00:00,31DEC2020:00:00:00
1,MEMBER,Y,01JAN2014:00:00:00,31DEC9999:00:00:00
2,NAME,BBB,01JAN2014:00:00:00,31DEC9999:00:00:00
2,MEMBER,Y,01JAN2014:00:00:00,31DEC2016:00:00:00
2,MEMBER,N,01JAN2017:00:00:00,31DEC2019:00:00:00
3,NAME,CCC,01JAN2014:00:00:00,31DEC9999:00:00:00
3,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
3,MEMBER,N,01JAN2014:00:00:00,31DEC2017:00:00:00
4,NAME,DDD,01JAN2014:00:00:00,31DEC9999:00:00:00
4,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
4,MEMBER,N,10JAN2016:00:00:00,31DEC2019:00:00:00
5,NAME,EEE,01JAN2014:00:00:00,31DEC9999:00:00:00
5,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,N,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,Y,01JAN2019:00:00:00,31DEC2019:00:00:00
5,MEMBER,N,01JAN2019:00:00:00,31DEC2019:00:00:00
;;;;
run;
What I would like to do is to concatenate the value variable for each group of var inside an id.
However, there are multiple types of cases:
If there is only one value for a given var inside an id, don't do anything (e.g. case of id=1 in my example)
If validity ranges are consecutive, output every value of var inside an id as-is (e.g. case of id=2)
If validity ranges are identical for the same var inside an id, concatenate the values over that range (e.g. case of id=3)
If validity ranges overlap, concatenate the two values of var over the overlapping period, while the non-overlapping periods keep their single value (e.g. case of id=4)
If there are multiple non-consecutive validity ranges for the same var inside an id, concatenate the values within each shared range separately (e.g. case of id=5)
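To pin down these rules, here is a small Python sketch of the interval logic I am after (illustrative only: the function name and the use of inclusive integer day numbers instead of SAS datetimes are my own). It cuts each id/var group at every range boundary, concatenates the values valid in each elementary interval, and re-merges contiguous intervals that carry the same concatenated value:

```python
def aggregate_ranges(rows):
    """rows: (value, start, end) tuples with inclusive integer day bounds.
    Returns (concatenated_value, start, end) tuples per the cases above."""
    if not rows:
        return []
    # Every start, and every end + 1, is a potential boundary.
    cuts = sorted({s for _, s, _ in rows} | {e + 1 for _, _, e in rows})
    pieces = []
    for lo, hi in zip(cuts, cuts[1:]):
        # Values whose range fully covers the elementary interval [lo, hi-1].
        vals = [v for v, s, e in rows if s <= lo and hi - 1 <= e]
        if vals:
            pieces.append([";".join(vals), lo, hi - 1])
    # Re-merge adjacent pieces that are contiguous and carry the same value.
    merged = [pieces[0]]
    for val, lo, hi in pieces[1:]:
        if val == merged[-1][0] and lo == merged[-1][2] + 1:
            merged[-1][2] = hi
        else:
            merged.append([val, lo, hi])
    return [tuple(p) for p in merged]
```

For example, an overlapping pair like id=4 splits into three pieces, while two identical non-consecutive ranges like id=5 stay as two separate concatenated rows.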
(The desired result is shown as a data step at the end of this question.)
Following @Kiran's answer on how to do the LISTAGG function in SAS and @Joe's answer on List Aggregation and Group Concatenation in SAS Proc SQL, I tried to use the CATX function.
This is my attempt:
proc sort data=have;
by id var start;
run;
data staging1;
set have;
by id var start;
if first.var then group_number+1;
run;
/* Simulate LEAD() function in SAS */
data staging2;
merge staging1 staging1(firstobs = 2
keep=group_number start end
rename=(start=lead_start end=lead_end group_number=nextgrp));
if group_number ne nextgrp then do;
lead_start = .;
lead_end = .;
end;
drop nextgrp;
format lead_: datetime20.;
run;
proc sort data=staging2;
by id var group_number start;
run;
data want;
retain _temp;
set staging2;
by id var group_number;
/* Only one obs for a given variable, output directly */
if first.group_number = 1 and last.group_number = 1 then
output;
else if first.group_number = 1 and last.group_number = 0 then
do;
if lead_start ne . and lead_end ne .
and ((lead_start < end) or (lead_end < start)) then
do;
if (lead_start = start) or (lead_end = end) then
do;
retain _temp;
_temp = value;
end;
if (lead_start ne start) or (lead_end ne end) then
do;
_temp = value;
end = intnx('dtday',lead_start,-1);
output;
end;
end;
else if lead_start ne . and lead_end ne . and intnx('dtday', end, 1) = lead_start then
do;
_temp = value;
output;
end;
else output;
end;
else if first.group_number = 0 and last.group_number = 1 then
do;
/* Concatenate preceded retained value */
value = catx(";",_temp, value);
output;
call missing(_temp);
end;
else output;
drop _temp lead_start lead_end group_number;
run;
My attempt did not solve all of the problems: only the cases of id=1 and id=3 were output correctly. I am starting to think that the use of first. and last. along with the simulated LEAD() function might not be the best approach, and that there is probably a better way to do this.
Desired result as a data step:
data want;
infile datalines4 delimiter=",";
input id var $ value $ start:datetime20. end:datetime20.;
format start end datetime20.;
datalines4;
1,NAME,AAA,01JAN2014:00:00:00,31DEC2020:00:00:00
1,MEMBER,Y,01JAN2014:00:00:00,31DEC9999:00:00:00
2,NAME,BBB,01JAN2014:00:00:00,31DEC9999:00:00:00
2,MEMBER,Y,01JAN2014:00:00:00,31DEC2016:00:00:00
2,MEMBER,N,01JAN2017:00:00:00,31DEC2019:00:00:00
3,NAME,CCC,01JAN2014:00:00:00,31DEC9999:00:00:00
3,MEMBER,Y;N,01JAN2014:00:00:00,31DEC2017:00:00:00
4,NAME,DDD,01JAN2014:00:00:00,31DEC9999:00:00:00
4,MEMBER,Y,01JAN2014:00:00:00,09JAN2016:00:00:00
4,MEMBER,Y;N,10JAN2016:00:00:00,31DEC2017:00:00:00
4,MEMBER,N,01JAN2018:00:00:00,31DEC2019:00:00:00
5,NAME,EEE,01JAN2014:00:00:00,31DEC9999:00:00:00
5,MEMBER,Y;N,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,Y;N,01JAN2019:00:00:00,31DEC2019:00:00:00
;;;;
run;
It's pretty hard to do this in raw SQL without built-in windowing functions; data step SAS will have some better solutions.
Some of this depends on your data size. One example, below, does exactly what you ask for, but it will probably be impractical with your real data. Part of that is the 31DEC9999 dates - they make for a lot of rows - but even without them this produces thousands of rows per person, so if you have a million people or so it will get rather large. Still, it might be the best solution, depending on your needs - it gives you the absolute best control.
* First, expand the dataset to one row per day/value (hopefully you do not need the datetime, just the date);
data daily;
set have;
do datevar = datepart(start) to datepart(end);
output;
end;
format datevar date9.;
drop start end;
run;
proc sort data=daily;
by id var datevar value;
run;
*Now, merge together the rows to one row per day - so days with multiple values will get merged into one.;
data merging;
set daily;
by id var datevar;
retain merge_value;
if (first.datevar and last.datevar) then output;
else do;
if first.datevar then merge_value = value;
else merge_value = catx(',',merge_value,value);
if last.datevar then do;
value = merge_value;
output;
end;
end;
keep id var datevar value;
run;
proc sort data=merging;
by id var value datevar;
run;
*Now, re-condense;
data want;
set merging;
by id var value datevar;
retain start end;
last_datevar = lag(datevar);
if first.value then do;
start = datevar;
end = .;
end;
else if last_datevar ne (datevar - 1) then do;
end = last_datevar;
output;
start = datevar;
end = .;
end;
if last.value then do;
end = datevar;
output;
end;
format start end date9.;
run;
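For illustration only, the expand/merge/condense pipeline above can be sketched compactly outside SAS; in this hedged Python version the function name and the inclusive integer day numbers (standing in for datevar) are my own assumptions:

```python
from collections import defaultdict

def expand_merge_condense(rows):
    """rows: (value, start, end) with inclusive integer days.
    Mirrors the SAS pipeline: explode to one row per day, merge
    same-day values, then re-condense consecutive-day runs."""
    by_day = defaultdict(list)
    for value, start, end in rows:            # step 1: one row per day
        for day in range(start, end + 1):
            by_day[day].append(value)
    # step 2: merge all values present on the same day into one row.
    daily = sorted((d, ",".join(vs)) for d, vs in by_day.items())
    out = []                                  # step 3: condense runs
    for day, val in daily:
        if out and out[-1][0] == val and out[-1][2] == day - 1:
            out[-1][2] = day                  # extend the current run
        else:
            out.append([val, day, day])       # start a new run
    return [tuple(r) for r in out]
```

The memory blow-up the answer warns about is visible here too: step 1 materializes one entry per person per day, which is exactly why 31DEC9999 end dates are a problem.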
I do not necessarily recommend doing this - it's provided for completeness, and in case it turns out it's the only way to do what you do.
Easier, most likely, is to condense things in a data step using an event-level dataset, where 'start' and 'end' are events. Here's an example that does what you require; it translates the original dataset to just two rows per original row, and then uses logic to decide what should happen at each event. This is pretty messy, so you'd want to clean it up for production, but the idea should work.
* First, make event level dataset so we can process the start and end separately;
data events;
set have;
type = 'Start';
dt_event = start;
output;
type = 'End';
dt_event = end;
output;
drop start end;
format dt_event datetime.;
run;
proc sort data=events;
by id var dt_event value;
run;
*Now, for each event, a different action is taken. Starts and Ends have different implications, and do different things based on those.;
data want;
set events(rename=value=in_value);
by id var dt_event;
retain start end value orig_value;
format value new_value $8.;
* First row per var is easy, just start it off with a START;
if first.var then do;
start = dt_event;
value = in_value;
end;
else do; *Now is the harder part;
* For ENDs, we want to remove the current VALUE from the concatenated VALUE string, always, and then if it is the last row for that dt_event, we want to output a new record;
if type='End' then do;
*remove the current (in_)value;
if first.dt_event then orig_value = value;
do _i = 1 to countw(value,',');
if scan(orig_value,_i,',') ne in_value then new_value = catx(',',new_value,scan(orig_value,_i,','));
end;
orig_value = new_value;
if last.dt_event then do;
end = dt_event;
output;
start = dt_event + 86400;
value = new_value;
orig_value = ' ';
end;
end;
else do;
* For START, we want to be more careful about outputting, as this will output lots of unwanted rows if we do not take care;
end = dt_event - 86400;
if start < end and not missing(value) then output;
value = catx(',',value,in_value);
start = dt_event;
end = .;
end;
end;
format start end datetime21.;
keep id var value start end;
run;
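The event-level idea is essentially a sweep line: at each boundary the set of active values changes, and the segment that just finished is emitted. Here is a hedged Python sketch of that logic (the name and the inclusive integer days are my own; the SAS version works on datetimes, hence the +/- 86400 seconds):

```python
def sweep_ranges(rows):
    """rows: (value, start, end) with inclusive integer days.
    Sweep-line version of the event approach above."""
    events = []
    for value, start, end in rows:
        events.append((start, "start", value))
        events.append((end + 1, "end", value))   # exclusive close
    events.sort(key=lambda e: e[0])              # stable: keeps input order per day
    active, out, seg_start = [], [], None
    i = 0
    while i < len(events):
        day = events[i][0]
        # Emit the segment that ran from seg_start up to the day before this event.
        if active and seg_start is not None:
            out.append((",".join(active), seg_start, day - 1))
        # Apply every event that falls on this day.
        while i < len(events) and events[i][0] == day:
            _, kind, value = events[i]
            if kind == "start":
                active.append(value)
            else:
                active.remove(value)
            i += 1
        seg_start = day if active else None
    return out
```

Like the data step, it touches only two events per input row, so it scales with the number of ranges rather than the number of days covered.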
Last, I'll leave you with this: you probably work in insurance, pharma, or banking, and in any of those fields this is a VERY solved problem - this sort of windowing is done a lot. For the most part you shouldn't be writing new code here: first look inside your company, and failing that, look for papers from PharmaSUG, FinSUG, or one of the other SAS user groups where this topic comes up. There are probably several dozen published implementations of code that does this already.