SAS adding new observations for longitudinal data - sas

I have a longitudinal dataset in SAS with periods of time categorized as either at risk for an event, or not at risk. Unfortunately, some time periods overlap, and I would like to recode them to have a dataset of entirely non-overlapping observations. For example, the dataset currently looks like:
Row 1: ID=123; Start=Jan 1, 1999; End=Dec 31, 1999; At_risk="Yes"
Row 2: ID=123; Start=Feb 1, 1999; End=Feb 15, 1999; At_risk="No"
The dataset I would like looks like:
Row 1: ID=123; Start=Jan 1, 1999; End=Feb 1, 1999; At_risk="Yes"
Row 2: ID=123; Start=Feb 1, 1999; End=Feb 15, 1999; At_risk="No"
Row 3: ID=123; Start=Feb 15, 1999; End=Dec 31, 1999; At_risk="Yes"
Thoughts?

Vasja may have been suggesting something like this (date level) as an alternative.
I will assume here that the most recent row read in your longitudinal dataset will have priority over any other rows with overlapping date ranges. If that is not the case then adjust the priority derivation below as appropriate.
Are you sure your start and end dates are correct? Your desired output still has overlapping dates: Feb 1 and Feb 15 are each both at risk and not at risk. Your End date should be at least one day before the next Start date, not the same day; End and Start dates should be contiguous. For that reason, coding a solution that produces your desired output (with overlapping dates) is problematic. The solution below is based on non-overlapping dates. You will need to modify it if you really want overlapping dates as per your required output (a sketch of one such adjustment follows the rebuild step below).
/* Your longitudinal dataset . */
data orig;
format Id 16. Start End Date9.;
Id = 123;Start='1jan1999'd; End='31dec1999'd; At_risk="Yes";output;
Id = 123;Start='1feb1999'd; End='15feb1999'd; At_risk="No";output;
run;
/* generate a row for each date between start and end dates. */
/* Use row number (_n_) to assign priority. */
Data overlapping_dates;
set orig;
format date date9.;
priority = _n_;
do date = start to end by 1;
output;
end;
Run;
/* For each date, keep the at_risk details from the highest-priority (most recently read) row. */
Proc sql;
create table non_overlapping_dates as
select id, date, at_risk
from overlapping_dates
group by id, date
having priority eq max(priority)
order by id, date
;
Quit;
/* Rebuild longitudinal dataset . */
Data longitudinal_dataset
(keep= id start end at_risk)
;
format id 16. Start End Date9. at_risk $3.;
set non_overlapping_dates;
by id at_risk notsorted;
retain start;
if first.at_risk
then start = date;
/* output a row to longitudinal dataset if at_risk is about to change or last row for id. */
if last.at_risk
then do;
end = date;
output;
end;
Run;
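If you do want the boundary-sharing convention from your example (each span's End equal to the next span's Start), here is a minimal sketch building on longitudinal_dataset above; the _nid/_nstart names are just illustrative. Note it cannot reproduce your exact output, because that output is internally inconsistent: under a consistent shared-boundary rule the No span would end 16FEB1999, not 15FEB1999.
/* Pull each span's End forward to the next span's Start within the same id,
   using a look-ahead (firstobs=2) self-merge. */
data longitudinal_shared;
merge longitudinal_dataset
longitudinal_dataset(firstobs=2 keep=id start rename=(id=_nid start=_nstart));
if id = _nid then end = _nstart;
drop _nid _nstart;
run;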

Such tasks are exercises in debugging program logic and fighting data assumptions, playing with old/new values...
Below is my initial code for the exact example you provided; it will surely need some adjustment on real data.
In case the time overlap spans more than the current and next record, I'm not sure this approach is doable with reasonable effort. For such cases you'd probably be more effective splitting the original Start-End intervals to day level and then summarizing the details into new intervals.
data orig;
format Id 16. Start End Date9.;
Id = 123;Start='1jan1999'd; End='31dec1999'd; At_risk="Yes";output;
Id = 123;Start='1feb1999'd; End='15feb1999'd; At_risk="No";output;
run;
proc sort data = orig;
by ID Start;
run;
data modified;
format pStart oStart pEnd oEnd Date9.;
set orig;
length pStart pEnd 8 pAt_risk $3;
by ID descending End ;
retain pStart pEnd pAt_risk;
/* keep original values */
oStart = Start;
oEnd = End;
oAt_risk = At_risk;
if first.id then do;
pStart = Start;
pEnd = End;
pAt_risk = At_risk;
/* no output */
end;
else do;
if pAt_risk ne At_risk then do;
if Start > pStart then do;
put _all_;
Start = pStart;
End = oStart;
At_risk = pAt_risk;
output;/* first part of time span */
Start = oStart;
End = oEnd;
At_risk = oAt_risk;
output;/* second part of time span */
if (End < pEnd ) then do;
Start = End;
End = pEnd;
At_risk = pAt_risk;
output; /*third part of time span */
/* keep current values as previous record values */
pStart = max(oStart, Start);
pEnd = End;
pAt_risk = At_risk;
end;
end;
end;
end;
run;
proc print data=modified; run;
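For this sample data, the PROC PRINT shows exactly the three rows from the desired output: At_risk=Yes from 01JAN1999 to 01FEB1999, At_risk=No from 01FEB1999 to 15FEB1999, and At_risk=Yes from 15FEB1999 to 31DEC1999 (plus the retained helper columns).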

Related

Aggregating multiple observations depending on validity ranges

I am facing a problem regarding the concatenation of multiple observations depending on validity ranges. The function I am trying to reproduce is similar to the Listagg() function in Oracle but I want to use it with regards to validity ranges.
Here is a reproducible minimal dataset:
data have;
infile datalines4 delimiter=",";
input id var $ value $ start:datetime20. end:datetime20.;
format start end datetime20.;
datalines4;
1,NAME,AAA,01JAN2014:00:00:00,31DEC2020:00:00:00
1,MEMBER,Y,01JAN2014:00:00:00,31DEC9999:00:00:00
2,NAME,BBB,01JAN2014:00:00:00,31DEC9999:00:00:00
2,MEMBER,Y,01JAN2014:00:00:00,31DEC2016:00:00:00
2,MEMBER,N,01JAN2017:00:00:00,31DEC2019:00:00:00
3,NAME,CCC,01JAN2014:00:00:00,31DEC9999:00:00:00
3,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
3,MEMBER,N,01JAN2014:00:00:00,31DEC2017:00:00:00
4,NAME,DDD,01JAN2014:00:00:00,31DEC9999:00:00:00
4,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
4,MEMBER,N,10JAN2016:00:00:00,31DEC2019:00:00:00
5,NAME,EEE,01JAN2014:00:00:00,31DEC9999:00:00:00
5,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,N,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,Y,01JAN2019:00:00:00,31DEC2019:00:00:00
5,MEMBER,N,01JAN2019:00:00:00,31DEC2019:00:00:00
;;;;
run;
What I would like to do is to concatenate the value variable for each group of var inside an id.
However, there are multiple types of cases:
If there is only one value for a given var inside an id, don't do anything (e.g. case of id=1 in my example)
If validity ranges are consecutive, output every value of var inside an id (e.g. case of id=2)
If validity ranges are the same for the same var inside an id, concatenate them altogether (e.g. case of id=3)
If validity ranges overlap, the values of var are concatenated over the shared portion of the range, with its corresponding validity range (e.g. case of id=4)
If there are multiple non-consecutive validity ranges for the same var inside an id, concatenate the values that share the same validity range, once per range (e.g. case of id=5)
Here is the desired result (shown as an image in the original post; the same result is given as a data step under "Desired results in data" below).
Following #Kiran's answer on how to do Listagg function in SAS and #Joe's answer on List Aggregation and Group Concatenation in SAS Proc SQL, I tried to use the CATX function.
This is my attempt:
proc sort data=have;
by id var start;
run;
data staging1;
set have;
by id var start;
if first.var then group_number+1;
run;
/* Simulate LEAD() function in SAS */
data staging2;
merge staging1 staging1(firstobs = 2
keep=group_number start end
rename=(start=lead_start end=lead_end group_number=nextgrp));
if group_number ne nextgrp then do;
lead_start = .;
lead_end = .;
end;
drop nextgrp;
format lead_: datetime20.;
run;
proc sort data=staging2;
by id var group_number start;
run;
data want;
retain _temp;
set staging2;
by id var group_number;
/* Only one obs for a given variable, output directly */
if first.group_number = 1 and last.group_number = 1 then
output;
else if first.group_number = 1 and last.group_number = 0 then
do;
if lead_start ne . and lead_end ne .
and ((lead_start < end) or (lead_end < start)) then
do;
if (lead_start = start) or (lead_end = end) then
do;
retain _temp;
_temp = value;
end;
if (lead_start ne start) or (lead_end ne end) then
do;
_temp = value;
end = intnx('dtday',lead_start,-1);
output;
end;
end;
else if lead_start ne . and lead_end ne . and intnx('dtday', end, 1) = lead_start then
do;
_temp = value;
output;
end;
else output;
end;
else if first.group_number = 0 and last.group_number = 1 then
do;
/* Concatenate preceded retained value */
value = catx(";",_temp, value);
output;
call missing(_temp);
end;
else output;
drop _temp lead_start lead_end group_number;
run;
My attempt did not solve all the problems; only the cases of id=1 and id=3 were correctly output. I am starting to think that the use of first. and last. as well as the simulated LEAD() function might not be the most optimal approach, and that there is probably a better way to do this.
Result of my attempt (shown as an image in the original post):
Desired results in data:
data want;
infile datalines4 delimiter=",";
input id var $ value $ start:datetime20. end:datetime20.;
format start end datetime20.;
datalines4;
1,NAME,AAA,01JAN2014:00:00:00,31DEC2020:00:00:00
1,MEMBER,Y,01JAN2014:00:00:00,31DEC9999:00:00:00
2,NAME,BBB,01JAN2014:00:00:00,31DEC9999:00:00:00
2,MEMBER,Y,01JAN2014:00:00:00,31DEC2016:00:00:00
2,MEMBER,N,01JAN2017:00:00:00,31DEC2019:00:00:00
3,NAME,CCC,01JAN2014:00:00:00,31DEC9999:00:00:00
3,MEMBER,Y;N,01JAN2014:00:00:00,31DEC2017:00:00:00
4,NAME,DDD,01JAN2014:00:00:00,31DEC9999:00:00:00
4,MEMBER,Y,01JAN2014:00:00:00,09JAN2016:00:00:00
4,MEMBER,Y;N,10JAN2016:00:00:00,31DEC2017:00:00:00
4,MEMBER,N,01JAN2018:00:00:00,31DEC2019:00:00:00
5,NAME,EEE,01JAN2014:00:00:00,31DEC9999:00:00:00
5,MEMBER,Y;N,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,Y;N,01JAN2019:00:00:00,31DEC2019:00:00:00
;;;;
run;
It's pretty hard to do this in raw SQL, without built in windowing functions; data step SAS will have some better solutions.
Some of this depends on your data size. One example, below, does exactly what you ask for, but it probably will be impractical with your real data. Some of that is the 31DEC9999 dates - that makes for a lot of data - but even without that, this has thousands of rows per person, so if you have a million people or something this will get rather large. But, it might still be the best solution, depending on what you need - it does give you the absolute best control.
* First, expand the dataset to one row per day/value. (Hopefully you do not need the datetime - just the date.);
data daily;
set have;
do datevar = datepart(start) to datepart(end);
output;
end;
format datevar date9.;
drop start end;
run;
proc sort data=daily;
by id var datevar value;
run;
*Now, merge together the rows to one row per day - so days with multiple values will get merged into one.;
data merging;
set daily;
by id var datevar;
retain merge_value;
if (first.datevar and last.datevar) then output;
else do;
if first.datevar then merge_value = value;
else merge_value = catx(',',merge_value,value);
if last.datevar then do;
value = merge_value;
output;
end;
end;
keep id var datevar value;
run;
proc sort data=merging;
by id var value datevar;
run;
*Now, re-condense;
data want;
set merging;
by id var value datevar;
retain start end;
last_datevar = lag(datevar);
if first.value then do;
start = datevar;
end = .;
end;
else if last_datevar ne (datevar - 1) then do;
end = last_datevar;
output;
start = datevar;
end = .;
end;
if last.value then do;
end = datevar;
output;
end;
format start end date9.;
run;
I do not necessarily recommend doing this - it's provided for completeness, and in case it turns out it's the only way to do what you do.
Easier, most likely, is to condense with a data step over an event-level dataset, where 'start' and 'end' are events. Here's an example that does what you require; it translates the original dataset to only 2 rows per original row, and then uses logic to decide what should happen for each event. This is pretty messy, so you'd want to clean it up for production, but the idea should work.
* First, make event level dataset so we can process the start and end separately;
data events;
set have;
type = 'Start';
dt_event = start;
output;
type = 'End';
dt_event = end;
output;
drop start end;
format dt_event datetime.;
run;
proc sort data=events;
by id var dt_event value;
run;
*Now, for each event, a different action is taken. Starts and Ends have different implications, and do different things based on those.;
data want;
set events(rename=value=in_value);
by id var dt_event;
retain start end value orig_value;
format value new_value $8.;
* First row per var is easy, just start it off with a START;
if first.var then do;
start = dt_event;
value = in_value;
end;
else do; *Now is the harder part;
* For ENDs, we want to remove the current VALUE from the concatenated VALUE string, always, and then if it is the last row for that dt_event, we want to output a new record;
if type='End' then do;
*remove the current (in_)value;
if first.dt_event then orig_value = value;
do _i = 1 to countw(value,',');
if scan(orig_value,_i,',') ne in_value then new_value = catx(',',new_value,scan(orig_value,_i,','));
end;
orig_value = new_value;
if last.dt_event then do;
end = dt_event;
output;
start = dt_event + 86400;
value = new_value;
orig_value = ' ';
end;
end;
else do;
* For START, we want to be more careful about outputting, as this will output lots of unwanted rows if we do not take care;
end = dt_event - 86400;
if start < end and not missing(value) then output;
value = catx(',',value,in_value);
start = dt_event;
end = .;
end;
end;
format start end datetime21.;
keep id var value start end;
run;
Last, I'll leave you with this: you probably work in insurance, pharma, or banking, and either way this is a VERY solved problem - it's done a lot (this sort of windowing). You shouldn't really be writing new code here for the most part - first look in your company, and then if not, look for papers in either PharmaSUG or FinSUG or one of the other SAS user groups, where they talk about this. There's probably several dozen implementations of code that does this already published.

How to average a subset of data in SAS conditional on a date?

I'm trying to write SAS code that can loop over a dataset that contains event dates that looks like:
Data event;
input Date;
cards;
20200428
20200429
;
run;
And calculate averages for the prior three-days from another dataset that contains dates and volume that looks like:
Data vol;
input Date Volume;
cards;
20200430 100
20200429 110
20200428 86
20200427 95
20200426 80
20200425 90
;
run;
For example, for date 20200428 the average should be 88.33 [(95+80+90)/3] and for date 20200429 the average should be 87.00 [(86+95+80)/3]. I want these values and the volume of the date to be saved on a new dataset that looks like the following if possible.
Data clean;
input Date Vol Avg;
cards;
20200428 86 88.33
20200429 110 87.00
;
run;
The actual data that I'm working with is from 1970-2010. I may also increase my average period from 3 days prior to 10 days prior, so I want to have flexible code. From what I've read I think a macro and/or call symput might work very well for this, but I'm not sure how to code these to do what I want. Honestly, I don't know where to start. Can anyone point me in the right direction? I'm open to any advice/ideas. Thanks.
A SQL statement is by far the most succinct code for obtaining your result set.
The query will join with 2 independent references to volume data. The first for obtaining the event date volume, and the second for computing the average volume over the three prior days.
The date data should be read in as a SAS date, so that the BETWEEN condition will be correct.
Data event;
input Date: yymmdd8.;
cards;
20200428
20200429
;
run;
Data vol;
input Date: yymmdd8. Volume;
cards;
20200430 100
20200429 110
20200428 86
20200427 95
20200426 80
20200425 90
;
run;
* SQL query with GROUP BY ;
proc sql;
create table want as
select
event.date
, volume_one.volume
, mean(volume_two.volume) as avg
from event
left join vol as volume_one
on event.date = volume_one.date
left join vol as volume_two
on volume_two.date between event.date-3 and event.date-1
group by
event.date, volume_one.volume
;
* alternative query using correlated sub-query;
create table want_2 as
select
event.date
, volume
, ( select mean(volume) as avg from vol where vol.date between event.date-3 and event.date-1 )
as avg
from event
left join vol
on event.date = vol.date
;
quit;
For the case of the volume data being date-gapped, a better solution is to separately compute the rolling average of the N prior volumes. The date gaps could come from weekends, holidays, or dates missing due to data entry problems or operator error. Conceptually, for the averaging, the only role of date is to order the data.
After the rolling averages are computed, a simple join or merge can be done.
Example:
* Simulate some volume data that excludes weekends, holidays, and a 2% rate of missing dates;
data volumes(keep=date volume);
call streaminit(20200502);
do date = '01jan1970'd to today();
length holiday $25;
year = year(date);
holiday = 'NEWYEAR'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'USINDEPENDENCE'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'THANKSGIVING'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'CHRISTMAS'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'MEMORIAL'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'LABOR'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'EASTER'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'USPRESIDENTS'; hdate = holiday(holiday, year); if date=hdate then continue;
if weekday(date) in (1,7) then continue; *1=Sun, 7=Sat;
volume = 100 + ceil(75 * sin (date / 8));
if rand('uniform') < 0.02 then continue;
output;
end;
format date yymmdd10.;
run;
* Compute an N item rolling average from N prior values;
%let ROLLING_N = 5;
data volume_averages;
set volumes;
by date; * enforce sort order requirement;
array v[0:&ROLLING_N] _temporary_; %* <---- &ROLLING_N ;
retain index -1;
avg_prior_&ROLLING_N. = mean (of v(*)); %* <---- &ROLLING_N ;
OUTPUT;
index = mod(index + 1,&ROLLING_N); %* <---- Modular arithmetic, the foundation of rolling ;
v[index] = volume;
format v: 6.;
drop index;
run;
* merge;
data want_merge;
merge event(in=event_date) volume_averages;
by date;
if event_date;
run;
* join;
proc sql;
create table want_join as
select event.*, volume_averages.avg_prior_5
from event join volume_averages
on event.date = volume_averages.date;
quit;
You want to loop over a series of dates in an input data set. Therefore I use a PROC SQL statement where I select the distinct dates in this input data set into a macro variable.
This macro variable is then used to loop over. In your example the macro variable will thus be: 20200428 20200429. You can then use the %SCAN macro function to start looping over these dates.
For each date in the loop, we will then calculate the average: in your example the average of the 3 days prior to the looping date. As the number of days for which you want to calculate the average is variable, this is also passed as a parameter in the macro. I then use the INTNX function to calculate the lower bound of dates you want to select to calculate the average over. Then the PROC MEANS procedure is used to calculate the average volume over the days: lower bound - looping date.
I then put a minor data step in between to attach the looping date again to the calculated average. Finally everything is appended in a final data set.
%macro dayAverage(input = , range = , selectiondata = );
/* Input = input dataset
range = number of days prior to the selected date for which you want to calculate
the average
selectiondata = data where the volumes are in */
/* Create a macro variable with the dates for which you want to calculate the
average, to loop over */
proc sql noprint;
select distinct date into: datesrange separated by " "
from &input.;
quit;
/*Start looping over the dates for which you want to calculate the average */
%let I = 1;
%do %while (%scan(&datesrange.,&I.) ne %str());
/* Assign the current date in the loop to the variable currentdate */
%let currentdate = %scan(&datesrange.,&I.);
/* Create the minimum date in the range based on input parameter range */
%let mindate = %sysfunc(putn(%sysfunc(intnx(day,%sysfunc(inputn(&currentdate.,yymmdd8.)),-&range.)),yymmddn8.));
/* Calculate the mean volume for the selected date and selected range */
proc means data = &selectiondata.(where = (date >= &mindate. and date < &currentdate.)) noprint;
output out = averagecurrent(drop = _type_ _freq_) mean(volume)=average_volume;
run;
/* Add the current date to the calculated average */
data averagecurrent;
retain date average_volume;
set averagecurrent;
date = &currentdate.;
run;
/* Append the result to a final list */
proc datasets nolist;
append base = final data = averagecurrent force;
run;
%let I = %eval(&I. + 1);
%end;
%mend;
This macro can in your example be called as:
%dayAverage(input = event, range = 3, selectiondata = vol);
It will give you a data set in your work library called final
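Note that final holds only date and average_volume; your desired clean dataset also carries the event day's own volume. A small follow-up join can attach it (a sketch, assuming the vol dataset from your question):
proc sql;
create table clean as
select f.date, v.volume, f.average_volume as avg
from final f
left join vol v
on f.date = v.date;
quit;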

SAS Macro error with call execute

I have the following code that is being used generate running totals of features for the past 1 day, 7 days, 1 month, 3 months, and 6 months.
LIBNAME A "C:\Users\James\Desktop\data\Base Data";
LIBNAME DATA "C:\Users\James\Desktop\data\Data1";
%MACRO HELPER(P);
data a1;
set data.final_master_&P. ;
QUERY = '%TEST('||STRIP(DATETIME)||','||STRIP(PARTICIPANT)||');';
CALL EXECUTE(QUERY);
run;
%MEND;
%MACRO TEST(TIME,PAR);
proc sql;
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_24, :APP_2_24, :APP_3_24, :APP_4_24, :APP_5_24
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24) AND &TIME.;
/* 7 Days */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_7DAY, :APP_2_7DAY, :APP_3_7DAY, :APP_4_7DAY, :APP_5_7DAY
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7) AND &TIME.;
/* One Month */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_1MONTH, :APP_2_1MONTH, :APP_3_1MONTH, :APP_4_1MONTH, :APP_5_1MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4) AND &TIME.;
/* Three Months */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_3MONTH, :APP_2_3MONTH, :APP_3_3MONTH, :APP_4_3MONTH, :APP_5_3MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4*3) AND &TIME.;
/* Six Months */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_6MONTH, :APP_2_6MONTH, :APP_3_6MONTH, :APP_4_6MONTH, :APP_5_6MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4*6) AND &TIME.;
quit;
DATA T;
PARTICIPANT = &PAR.;
DATETIME = &TIME;
APP_1_24 = &APP_1_24.;
APP_2_24 = &APP_2_24.;
APP_3_24 = &APP_3_24.;
APP_4_24 = &APP_4_24.;
APP_5_24 = &APP_5_24.;
APP_1_7DAY = &APP_1_7DAY.;
APP_2_7DAY = &APP_2_7DAY.;
APP_3_7DAY = &APP_3_7DAY.;
APP_4_7DAY = &APP_4_7DAY.;
APP_5_7DAY = &APP_5_7DAY.;
APP_1_1MONTH = &APP_1_1MONTH.;
APP_2_1MONTH = &APP_2_1MONTH.;
APP_3_1MONTH = &APP_3_1MONTH.;
APP_4_1MONTH = &APP_4_1MONTH.;
APP_5_1MONTH = &APP_5_1MONTH.;
APP_1_3MONTH = &APP_1_3MONTH.;
APP_2_3MONTH = &APP_2_3MONTH.;
APP_3_3MONTH = &APP_3_3MONTH.;
APP_4_3MONTH = &APP_4_3MONTH.;
APP_5_3MONTH = &APP_5_3MONTH.;
APP_1_6MONTH = &APP_1_6MONTH.;
APP_2_6MONTH = &APP_2_6MONTH.;
APP_3_6MONTH = &APP_3_6MONTH.;
APP_4_6MONTH = &APP_4_6MONTH.;
APP_5_6MONTH = &APP_5_6MONTH.;
FORMAT DATETIME DATETIME.;
RUN;
PROC APPEND BASE=DATA.FLAGS_&par. DATA=T;
RUN;
%MEND;
%helper(1);
This code runs perfectly if I limit the number of observations in the %helper macro, using an (obs=) in the creation of the a1 dataset. However, when I put no limit on the obs number, i.e. execute the %test macro for every row in the dataset a1, I get errors. In SAS EG, I get a "server disconnected" popup after the status bar hangs at "running data step", and on Base SAS 9.4 I get the error that none of the macro variables have been resolved that are created in the proc sql into.
I'm confused as the code works fine for a limited amount of observations, but when trying on the whole dataset it hangs or gives errors. The dataset I'm doing this for has around 130,000 observations.
The answer to your actual question is that you're simply generating too much macro code, and perhaps simply taking too much time. The way you are doing this is going to operate at O(n²), as you're basically doing a cartesian join of every record to every record, and then some. 130,000 * 130,000 is a pretty decent sized number, and on top of that you're actually opening the SQL environment several times for each of the 130,000 rows. Ouch.
The solution is to do it either in a way that isn't too slow, or if it is, in a way that won't have too much overhead at least.
The fast solution is to not do the cartesian join, or to limit how much needs to be joined. One good solution would be to restructure the problem: rather than comparing every record to every other record, treat each calendar day as a period, at least for the over-24h windows (the 24h window you might do the way you do above, but not the other four). For 1 month, 3 months, etc., do you really need the time of day? It probably won't make much difference. If you can get rid of that, then you can use built-in PROCs to precompute all possible 1-month periods, all possible 3-month periods, etc., and then join on the appropriate one. But that won't work with 130,000 of them; it would only work if you could limit it to roughly one per day. A sketch of that precompute-then-join idea follows.
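Here is a minimal sketch of that idea, assuming daily granularity is acceptable; day_var, daily_totals, and window_1mo are illustrative names, and only two of the five APP columns are shown.
* Collapse the detail rows to daily totals;
data daily;
set a1;
day_var = datepart(datetime);
format day_var date9.;
run;
proc summary data=daily nway;
class day_var;
var app_1-app_5;
output out=daily_totals(drop=_type_ _freq_) sum=;
run;
* One trailing one-month window per day, via a self-join on the daily totals;
proc sql;
create table window_1mo as
select a.day_var,
sum(b.app_1) as app_1_1month,
sum(b.app_2) as app_2_1month
from daily_totals a
left join daily_totals b
on b.day_var between intnx('month', a.day_var, -1, 's') and a.day_var
group by a.day_var;
quit;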
If you must do it at the second level (or worse), what you'll want to do is avoid the cartesian join, and instead keep track of the various records you've seen already, and the sums. The short explanation of the algorithm is:
For each row:
Add this row's values to the rolling sums (at the end of the queue)
Check if the current item of the queue is outside of the period; if it is, subtract it from the rolling sums, and check the next item (repeat until not outside of the period), updating the current queue position
Return the sum at this point
This requires checking each row typically twice (except at odd boundaries where no rows are popped off for several iterations, due to months having different numbers of days). This operates in O(n) time, much faster than the cartesian join, and on top of that requires far less memory/space (the cartesian join might need to hit disk).
The hash version of this solution is below. This will be the fastest solution, I think, that compares every row. Note that I intentionally make the test data have 1 for every row and the same number of rows for every day; that lets you see how it works in a row-wise manner very easily. (For example, every 24h period has 481 rows, because I made exactly 480 rows per day, and 481 includes the same time yesterday - if you change lt to le it will be 480, if you prefer not to include the same time yesterday.) You can see that the 'month' based periods will have slightly odd results at the boundaries where months change, because the '01FEB20xx' to '01MAY20xx' period has far fewer days (and thus rows) than the '01JUL20xx' to '01OCT20xx' period, for example; 30/90/180-day periods would be better.
data test_data;
array app[5] app_1-app_5;
do _i = 1 to 130000;
dt_var = datetime() - _i*180;
do _j = 1 to dim(app);
*app[_j] = floor(rand('Uniform')*6); *generate 0 to 5 integer;
app[_j]=1;
end;
output;
end;
format dt_var datetime17.;
run;
proc sort data=test_data;
by dt_var;
run;
%macro add(array=);
do _i = 1 to dim(app);
&array.[_i] + app[_i];
end;
%mend add;
%macro subtract(array=);
do _i = 1 to dim(app);
&array.[_i] + (-1*app[_i]);
end;
%mend subtract;
%macro process_array_add(array=);
array app_&array. app_&array._1-app_&array._5;
%add(array=app_&array.);
%mend process_array_add;
%macro process_array_subtract(array=, period=, number=);
if _n_ eq 1 then do;
declare hiter hi_&array.('td');
rc_&array. = hi_&array..first();
end;
else do;
rc_&array. = hi_&array..setcur(key:firstval_&array.);
end;
do while (intnx("&period.",dt_var,&number.,'s') lt curr_dt_var and rc_&array.=0);
%subtract(array=app_&array.);
rc_&array. = hi_&array..next();
end;
retain firstval_&array.;
firstval_&array. = dt_var;
%mend process_array_subtract;
data want;
set test_data;
* if _n_ > 10000 then stop;
curr_dt_var = dt_var;
array app[5] app_1-app_5;
if _n_ eq 1 then do;
declare hash td(ordered:'a');
td.defineKey('dt_var');
td.defineData('dt_var','app_1','app_2','app_3','app_4','app_5');
td.defineDone();
end;
rc_a = td.add();
*start macro territory;
%process_array_add(array=24h);
%process_array_add(array=1wk);
%process_array_add(array=1mo);
%process_array_add(array=3mo);
%process_array_add(array=6mo);
%process_array_subtract(array=24h,period=DTDay, number=1);
%process_array_subtract(array=1wk,period=DTDay, number=7);
%process_array_subtract(array=1mo,period=DTMonth, number=1);
%process_array_subtract(array=3mo,period=DTMonth, number=3);
%process_array_subtract(array=6mo,period=DTMonth, number=6);
*end macro territory;
rename curr_dt_var=dt_var;
format curr_dt_var datetime21.3;
drop dt_var rc: _:;
output;
run;
Here's a pure data step non-hash version. On my machine it's actually faster than the hash solution; I suspect it's not actually faster on a machine with an HDD (I have an SSD, so point access is not substantially slower than hash access, and I avoid having to load the hash). I would recommend using it if you don't know hashes very well or at all, as it'll be easier to troubleshoot, and it scales similarly. For most rows it accesses 11 rows: the current row, and five other rows twice each (one row, subtract it, then another row), for a total of around a million and a half reads for 130k rows. (Compare that to about 17 billion reads for the cartesian...)
I suffix the macros with "_2" to differentiate them from the macros in the hash solution.
data test_data;
array app[5] app_1-app_5;
do _i = 1 to 130000;
dt_var = datetime() - _i*180;
do _j = 1 to dim(app);
*app[_j] = floor(rand('Uniform')*6); *generate 0 to 5 integer;
app[_j]=1;
end;
output;
end;
format dt_var datetime17.;
run;
proc sort data=test_data;
by dt_var;
run;
%macro add_2(array=);
do _i = 1 to dim(app);
&array.[_i] + app[_i];
end;
%mend add_2;
%macro subtract_2(array=);
do _i = 1 to dim(app);
&array.[_i] + (-1*app[_i]);
end;
%mend subtract_2;
%macro process_array_add_2(array=);
array app_&array. app_&array._1-app_&array._5; *define array;
%add_2(array=app_&array.); *add current row to array;
%mend process_array_add_2;
%macro process_array_sub_2(array=, period=, number=);
if _n_ eq 1 then do; *initialize point variable;
point_&array. = 1;
end;
else do; *do not have to do this _n_=1 as we only have that row;
set test_data point=point_&array.; *set the row that we may be subtracting;
end;
do while (intnx("&period.",dt_var,&number.,'s') lt curr_dt_var and point_&array. < _N_); *until we hit a row that is within the period...;
%subtract_2(array=app_&array.); *subtract the rows values;
point_&array. + 1; *increment the point to look at;
set test_data point=point_&array.; *set the new row;
end;
%mend process_array_sub_2;
data want;
set test_data;
*if _n_ > 10000 then stop; *useful for testing if you want to check time to execute;
curr_dt_var = dt_var; *save dt_var value from originally set record;
array app[5] app_1-app_5; *base array;
*start macro territory;
%process_array_add_2(array=24h); *have to do all of these adds before we start subtracting;
%process_array_add_2(array=1wk); *otherwise we have the wrong record values;
%process_array_add_2(array=1mo);
%process_array_add_2(array=3mo);
%process_array_add_2(array=6mo);
%process_array_sub_2(array=24h,period=DTDay, number=1); *now start checking to subtract what we need to;
%process_array_sub_2(array=1wk,period=DTDay, number=7);
%process_array_sub_2(array=1mo,period=DTMonth, number=1);
%process_array_sub_2(array=3mo,period=DTMonth, number=3);
%process_array_sub_2(array=6mo,period=DTMonth, number=6);
*end macro territory;
rename curr_dt_var=dt_var;
format curr_dt_var datetime21.3;
drop dt_var _:;
output; *unneeded in this version but left for comparison to hash;
run;

calculate length of overlapping time intervals

Location Start End Diff
A 01:02:00 01:05:00 3
A 01:03:00 01:08:00 5
A 01:04:00 01:11:00 7
B 02:00:00 02:17:00 17
B 02:10:00 02:20:00 10
B 02:11:00 02:15:00 4
Ideal Output:
Location OverlapTime(Min) OverlapRecords
A 6 3
B 11 3
If, each time, only two records within the same location overlap, then I can do it via lag:
data want;
set have;
prevend = lag1(End);
if first.location then prevend = .;
if start < prevend then overlap = 1;else overlap = 0;
overlaptime = -(start-prevend);
by location notsorted;
run;
proc sql;
select location, sum(overlaptime), sum(overlap)
from want
group by location;
But the thing is, I have many (an unknown number of) overlapping time intervals within the same location. How can I achieve this?
Here's my solution. There's no need to use the lag function; it can be done with the retain statement along with first. and last. variables.
/* create dummy dataset */
data have;
input Location $ Start :time. End :time.;
format start end time.;
datalines;
A 01:02:00 01:05:00
A 01:03:00 01:08:00
A 01:04:00 01:11:00
A 01:13:00 01:15:00
B 02:00:00 02:17:00
B 02:10:00 02:20:00
B 02:11:00 02:15:00
C 01:25:00 01:30:00
D 01:45:00 01:50:00
D 01:51:00 01:55:00
;
run;
/* sort data if necessary */
proc sort data = have;
by location start;
run;
/* calculate overlap */
data want;
set have;
by location start;
retain _stime _etime _gap overlaprecords; /* retain temporary variables from previous row */
if first.location then do; /* reset temporary and new variables at each change in location */
_stime = start;
_etime = end;
_gap = 0;
_cumul_diff = 0;
overlaprecords=0;
end;
_cumul_diff + (end-start); /* calculate cumulative difference between start and end */
if start>_etime then _gap + (start-_etime); /* calculate gap between start time and previous highest end time */
if not first.location and start<_etime then do; /* count number of overlap records */
if overlaprecords=0 then overlaprecords+2; /* first overlap record gets count of 2 to account for previous record */
else overlaprecords + 1;
end;
if end>_etime then _etime=end; /* update _etime if end is greater than current value */
if last.location then do; /* calculate overlap time when last record for current location is read */
overlaptime = intck('minute',(_etime-_stime-_gap),_cumul_diff);
output;
end;
drop _: start end; /* drop unwanted variables */
run;
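As a sanity check against the ideal output: for location A, the three overlapping rows contribute 3 + 5 + 7 = 15 minutes of cumulative duration against a 9-minute merged span, so overlap = 15 - 9 = 6 minutes (the fourth, disjoint A row in the dummy data adds equally to both sides and cancels out); for B, 17 + 10 + 4 = 31 minutes against a 20-minute span gives 11 minutes. Both match the ideal output.
proc print data=want noobs;
run;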

earliest start date for unique id

ESN is an id column that has multiple observations per esn, so repeated values of esn occur. For a given esn, I want to find the earliest service start date (and call it first), and I want to find the proper end date (called last). The if/then statement for how "last" is chosen is correct, but I get the following errors when I run the code below:
340 first = min(of start(*));
---
71
ERROR 71-185: The MIN function call does not have enough arguments.
here is the code I used
data three_1; /*first and last date created ?? used to ignore ? in data*/
set three;
format first MMDDYY10. last MMDDYY10.;
by esn;
array start(*) service_start_date;
array stop(*) service_end_date entry_date_est ;
do i=1 to dim(start);
first = min(of start(*));
end;
do i=1 to dim(stop);
if esn_status = 'Cancelled' then last = min(input(service_end_date, MMDDYY10.), input(entry_date_est, MMDDYY10.));
else last = max(input(service_end_date, MMDDYY10.), input(entry_date_est, MMDDYY10.));
end;
run;
"esn" "service_start_date" "service_end_date" "entry_date_est" "esn_status"
1 10/12/2010 01/01/2100 10/12/2012 cancelled
1 05/02/2009 02/12/2010 10/09/2012 cancelled
1 04/05/2011 03/04/2100 10/02/2012 cancelled
the results should be first= 05/02/2009 and last=10/12/2012
Arrays and the min(), max(), etc. functions operate horizontally across the variables of a single observation, not vertically across multiple observations.
Assuming esn_status is constant for a given esn, then you need to sort your input by esn and service_start_date. You can use a data step to collect the values you want.
data three; /*thanks Joe for the data step to create the example data*/
length esn_status $10;
format service_start_date service_end_date entry_date_est MMDDYY10.;
input esn (service_start_date service_end_date entry_date_est) (:mmddyy10.) esn_status $;
datalines;
1 10/12/2010 01/01/2100 10/12/2012 cancelled
1 05/02/2009 02/12/2010 10/09/2012 cancelled
1 04/05/2011 03/04/2100 10/02/2012 cancelled
;;;;
run;
proc sort data=three;
by esn service_start_date;
run;
data three_1(keep=esn esn_status start last);
set three;
format start last date9.;
by esn;
retain start last;
if first.esn then do;
start = service_start_date;
last = service_end_date;
end;
if esn_status = "cancelled" then
last = min(last,service_end_date,entry_date_est);
else
last = max(last,service_end_date,entry_date_est);
if last.esn then
output;
run;
A DoW loop will get you where you want, or you could do it in SQL. Your desired results don't match the actual results as far as I can tell, so you may need to make some adjustments. You'd need a second WANT dataset for the non-cancelled folks; I don't think there's an easy way to put it in one data step (a hypothetical sketch for that second dataset follows the code below).
data have;
length esn_status $10;
format service_start_date service_end_date entry_date_est MMDDYY10.;
input esn (service_start_date service_end_date entry_date_est) (:mmddyy10.) esn_status $;
datalines;
1 10/12/2010 01/01/2100 10/12/2012 cancelled
1 05/02/2009 02/12/2010 10/09/2012 cancelled
1 04/05/2011 03/04/2100 10/02/2012 cancelled
;;;;
run;
data want_cancelled;
first = 99999;
last = 99999;
do _n_ = 1 by 1 until (last.esn);
set have(where=(esn_status='cancelled'));
by esn;
first = min(first,service_start_date);
last = min(last,service_end_date,entry_date_est);
end;
output;
keep first last esn;
format first last mmddyy10.;
run;
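For completeness, here is a hypothetical sketch of the second dataset for the non-cancelled rows, mirroring the structure above but with max() for the end date; the sample data contains only cancelled rows, so this is purely illustrative.
data want_not_cancelled;
first = .; /* min() and max() ignore missing, so missing is a safe seed */
last = .;
do _n_ = 1 by 1 until (last.esn);
set have(where=(esn_status ne 'cancelled'));
by esn;
first = min(first,service_start_date);
last = max(last,service_end_date,entry_date_est);
end;
output;
keep first last esn;
format first last mmddyy10.;
run;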