SAS outputting new variable missing full name - sas

I am rearranging some data to run in a linear regression and used the code below to change. When I look at the new dataset however the New Names are incomplete. For example Day11 shows are Day1, Day14 as Day1, Day20 as Day2 ect. Where is my code causing this problem?
Data EMU116Linear;
Set EMU116;
TumorVolume = Day0; Day = "Day0"; output;
TumorVolume = Day3; Day = "Day3"; output;
TumorVolume = Day7; Day = "Day7"; output;
TumorVolume = Day9; Day = "Day9"; output;
TumorVolume = Day11; Day = "Day11"; output;
TumorVolume = Day14; Day = "Day14"; output;
TumorVolume = Day16; Day = "Day16"; output;
TumorVolume = Day18; Day = "Day18"; output;
TumorVolume = Day21; Day = "Day21"; output;
drop Day0 Day3 Day7 Day9 Day11 Day14 Day16 Day18 Day21 TumorWeightDay21;
Run;
Original Data Set
New Data Set

SAS will truncate character values to the variable's designated length. Since you do not explicitly specify length, the very first assignment of Day0 designates the length of the new Day variable to 4 characters. Afterwards, any supplied, longer value to Day variable will truncate to 4 (i.e., Day11 to Day1). To accommodate 5 characters and to initialize new variables use length directive:
Data EMU116Linear;
Set EMU116;
length TumorVolume Day $5;
...
run
However, from your code and desired result a better less repetitive solution may be proc_transpose to reshape your wide data to long format:
proc sort data=EMU116Linear;
by Treatment Treatment2 TreatCode Mouse;
run;
proc transpose
data=EMU116Linear
out=EMU116;
by Treatment Treatment2 TreatCode Mouse;
run;
data EMU116;
set EMU116;
rename col1=TumorVolume
_name_=Day;
run;

Related

Aggregating multiple observations depending on validity ranges

I am facing a problem regarding the concatenation of multiple observations depending on validity ranges. The function I am trying to reproduce is similar to the Listagg() function in Oracle but I want to use it with regards to validity ranges.
Here is a reproducible minimal dataset:
data have;
infile datalines4 delimiter=",";
input id var $ value $ start:datetime20. end:datetime20.;
format start end datetime20.;
datalines4;
1,NAME,AAA,01JAN2014:00:00:00,31DEC2020:00:00:00
1,MEMBER,Y,01JAN2014:00:00:00,31DEC9999:00:00:00
2,NAME,BBB,01JAN2014:00:00:00,31DEC9999:00:00:00
2,MEMBER,Y,01JAN2014:00:00:00,31DEC2016:00:00:00
2,MEMBER,N,01JAN2017:00:00:00,31DEC2019:00:00:00
3,NAME,CCC,01JAN2014:00:00:00,31DEC9999:00:00:00
3,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
3,MEMBER,N,01JAN2014:00:00:00,31DEC2017:00:00:00
4,NAME,DDD,01JAN2014:00:00:00,31DEC9999:00:00:00
4,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
4,MEMBER,N,10JAN2016:00:00:00,31DEC2019:00:00:00
5,NAME,EEE,01JAN2014:00:00:00,31DEC9999:00:00:00
5,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,N,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,Y,01JAN2019:00:00:00,31DEC2019:00:00:00
5,MEMBER,N,01JAN2019:00:00:00,31DEC2019:00:00:00
;;;;
run;
                            
What I would like to do is to concatenate the value variable for each group of var inside an id.
However, there a multiple types of cases:
If there is only one value for a given var inside an id, don't do anything (e.g. case of id=1 in my example)
If validity ranges are consecutive, output every value of var inside an id (e.g. case of id=2)
If validity ranges are the same for the same var inside an id, concatenate them altogether (e.g. case of id=3)
If validity ranges are overlapping, the range that shares the two value of var are concatenated with the corresponding validity range (e.g. case of id=4)
If there are multiple validity ranges that are non consecutive for the same value in var inside an id, concatenate each that shares the same validity ranges (e.g. case of id=5)
Here is the desired result:
                            
Following #Kiran's answer on how to do Listagg function in SAS and #Joe's answer on List Aggregation and Group Concatenation in SAS Proc SQL, I tried to use the CATX function.
This is my attempt:
proc sort data=have;
by id var start;
run;
data staging1;
set have;
by id var start;
if first.var then group_number+1;
run;
/* Simulate LEAD() function in SAS */
data staging2;
merge staging1 staging1(firstobs = 2
keep=group_number start end
rename=(start=lead_start end=lead_end group_number=nextgrp));
if group_number ne nextgrp then do;
lead_start = .;
lead_end = .;
end;
drop nextgrp;
format lag_: datetime20.;
run;
proc sort data=staging2;
by id var group_number start;
run;
data want;
retain _temp;
set staging2;
by id var group_number;
/* Only one obs for a given variable, output directly */
if first.group_number = 1 and last.group_number = 1 then
output;
else if first.group_number = 1 and last.group_number = 0 then
do;
if lead_start ne . and lead_end ne .
and ((lead_start < end) or (lead_end < start)) then
do;
if (lead_start = start) or (lead_end = end) then
do;
retain _temp;
_temp = value;
end;
if (lead_start ne start) or (lead_end ne end) then
do;
_temp = value;
end = intnx('dtday',lead_start,-1);
output;
end;
end;
else if lead_start ne . and lead_end ne . and intnx('dtday', end, 1) = lead_start then
do;
_temp = value;
output;
end;
else output;
end;
else if first.group_number = 0 and last.group_number = 1 then
do;
/* Concatenate preceded retained value */
value = catx(";",_temp, value);
output;
call missing(_temp);
end;
else output;
drop _temp lead_start lead_end group_number;
run;
My attempt did not solve all the problems. Only the cases of id=1 and id=3 were correctly output. I am starting to think that the use of first. and last. as well as the simulated LEAD() function might not be the most optimal one and that there is a probably a better way to do this.
Result of my attempt:
                            
Desired results in data:
data want;
infile datalines4 delimiter=",";
input id var $ value $ start:datetime20. end:datetime20.;
format start end datetime20.;
datalines4;
1,NAME,AAA,01JAN2014:00:00:00,31DEC2020:00:00:00
1,MEMBER,Y,01JAN2014:00:00:00,31DEC9999:00:00:00
2,NAME,BBB,01JAN2014:00:00:00,31DEC9999:00:00:00
2,MEMBER,Y,01JAN2014:00:00:00,31DEC2016:00:00:00
2,MEMBER,N,01JAN2017:00:00:00,31DEC2019:00:00:00
3,NAME,CCC,01JAN2014:00:00:00,31DEC9999:00:00:00
3,MEMBER,Y;N,01JAN2014:00:00:00,31DEC2017:00:00:00
4,NAME,DDD,01JAN2014:00:00:00,31DEC9999:00:00:00
4,MEMBER,Y,01JAN2014:00:00:00,09JAN2016:00:00:00
4,MEMBER,Y;N,10JAN2016:00:00:00,31DEC2017:00:00:00
4,MEMBER,N,01JAN2018:00:00:00,31DEC2019:00:00:00
5,NAME,EEE,01JAN2014:00:00:00,31DEC9999:00:00:00
5,MEMBER,Y;N,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,Y;N,01JAN2019:00:00:00,31DEC2019:00:00:00
;;;;
run;
It's pretty hard to do this in raw SQL, without built in windowing functions; data step SAS will have some better solutions.
Some of this depends on your data size. One example, below, does exactly what you ask for, but it probably will be impractical with your real data. Some of that is the 31DEC9999 dates - that makes for a lot of data - but even without that, this has thousands of rows per person, so if you have a million people or something this will get rather large. But, it might still be the best solution, depending on what you need - it does give you the absolute best control.
* First, expand the dataset to one row per day/value. (Hopefully you do not need the datetime - just the date.)
data daily;
set have;
do datevar = datepart(start) to datepart(end);
output;
end;
format datevar date9.;
drop start end;
run;
proc sort data=daily;
by id var datevar value;
run;
*Now, merge together the rows to one row per day - so days with multiple values will get merged into one.;
data merging;
set daily;
by id var datevar;
retain merge_value;
if (first.datevar and last.datevar) then output;
else do;
if first.datevar then merge_value = value;
else merge_value = catx(',',merge_value,value);
if last.datevar then do;
value = merge_value;
output;
end;
end;
keep id var datevar value;
run;
proc sort data=merging;
by id var value datevar;
run;
*Now, re-condense;
data want;
set merging;
by id var value datevar;
retain start end;
last_datevar = lag(datevar);
if first.value then do;
start = datevar;
end = .;
end;
else if last_datevar ne (datevar - 1) then do;
end = last_datevar;
output;
start = datevar;
end = .;
end;
if last.value then do;
end = datevar;
output;
end;
format start end date9.;
run;
I do not necessarily recommend doing this - it's provided for completeness, and in case it turns out it's the only way to do what you do.
Easier, most likely, is to condense using the data step using an event level dataset, where 'start' and 'end' are events. Here's an example that does what you require; it translates the original dataset to only 2 rows per original row, and then uses logic to decide what should happen for each event. This is pretty messy, so you'd want to clean it up for production, but the idea should work.
* First, make event level dataset so we can process the start and end separately;
data events;
set have;
type = 'Start';
dt_event = start;
output;
type = 'End';
dt_event = end;
output;
drop start end;
format dt_event datetime.;
run;
proc sort data=events;
by id var dt_event value;
run;
*Now, for each event, a different action is taken. Starts and Ends have different implications, and do different things based on those.;
data want;
set events(rename=value=in_value);
by id var dt_event;
retain start end value orig_value;
format value new_value $8.;
* First row per var is easy, just start it off with a START;
if first.var then do;
start = dt_event;
value = in_value;
end;
else do; *Now is the harder part;
* For ENDs, we want to remove the current VALUE from the concatenated VALUE string, always, and then if it is the last row for that dt_event, we want to output a new record;
if type='End' then do;
*remove the current (in_)value;
if first.dt_event then orig_value = value;
do _i = 1 to countw(value,',');
if scan(orig_value,_i,',') ne in_value then new_value = catx(',',new_value,scan(orig_value,_i,','));
end;
orig_value = new_value;
if last.dt_event then do;
end = dt_event;
output;
start = dt_event + 86400;
value = new_value;
orig_value = ' ';
end;
end;
else do;
* For START, we want to be more careful about outputting, as this will output lots of unwanted rows if we do not take care;
end = dt_event - 86400;
if start < end and not missing(value) then output;
value = catx(',',value,in_value);
start = dt_event;
end = .;
end;
end;
format start end datetime21.;
keep id var value start end;
run;
Last, I'll leave you with this: you probably work in insurance, pharma, or banking, and either way this is a VERY solved problem - it's done a lot (this sort of windowing). You shouldn't really be writing new code here for the most part - first look in your company, and then if not, look for papers in either PharmaSUG or FinSUG or one of the other SAS user groups, where they talk about this. There's probably several dozen implementations of code that does this already published.

How to average a subset of data in SAS conditional on a date?

I'm trying to write SAS code that can loop over a dataset that contains event dates that looks like:
Data event;
input Date;
cards;
20200428
20200429
;
run;
And calculate averages for the prior three-days from another dataset that contains dates and volume that looks like:
Data vol;
input Date Volume;
cards;
20200430 100
20200429 110
20200428 86
20200427 95
20200426 80
20200425 90
;
run;
For example, for date 20200428 the average should be 88.33 [(95+80+90)/3] and for date 20200429 the average should be 87.00 [(86+95+80)/3]. I want these values and the volume of the date to be saved on a new dataset that looks like the following if possible.
Data clean;
input Date Vol Avg;
cards;
20200428 86 88.33
20200429 110 87.00
;
run;
The actual data that I'm working with is from 1970-2010. I may also increase my average period from 3 days prior to 10 days prior, so I want to have flexible code. From what I've read I think a macro and/or call symput might work very well for this, but I'm not sure how to code these to do what I want. Honestly, I don't know where to start. Can anyone point me in the right direction? I'm open to any advice/ideas. Thanks.
A SQL statement is by far the most succinct code for obtaining your result set.
The query will join with 2 independent references to volume data. The first for obtaining the event date volume, and the second for computing the average volume over the three prior days.
The date data should be read in as a SAS date, so that the BETWEEN condition will be correct.
Data event;
input Date: yymmdd8.;
cards;
20200428
20200429
;
run;
Data vol;
input Date: yymmdd8. Volume;
cards;
20200430 100
20200429 110
20200428 86
20200427 95
20200426 80
20200425 90
;
run;
* SQL query with GROUP BY ;
proc sql;
create table want as
select
event.date
, volume_one.volume
, mean(volume_two.volume) as avg
from event
left join vol as volume_one
on event.date = volume_one.date
left join vol as volume_two
on volume_two.date between event.date-1 and event.date-3
group by
event.date, volume_one.volume
;
* alternative query using correlated sub-query;
create table want_2 as
select
event.date
, volume
, ( select mean(volume) as avg from vol where vol.date between event.date-1 and event.date-3 )
as avg
from event
left join vol
on event.date = vol.date
;
For the case of the Volumes data being date gapped, a better solution would be to separately compute the rolling average of N prior volumes. The date gaps could be from weekends, holidays, or a date not present due to data entry problems or operator error. Conceptually, for the averaging, the only role of date is only to order the data.
After the rolling averages are computed, a simple join or merge can be done.
Example:
* Simulate some volume data that excludes weekends, holidays, and a 2% rate of missing dates;
data volumes(keep=date volume);
call streaminit(20200502);
do date = '01jan1970'd to today();
length holiday $25;
year = year(date);
holiday = 'NEWYEAR'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'USINDEPENDENCE'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'THANKSGIVING'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'CHRISTMAS'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'MEMORIAL'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'LABOR'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'EASTER'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'USPRESIDENTS'; hdate = holiday(holiday, year); if date=hdate then continue;
if weekday(date) in (1,7) then continue; *1=Sun, 7=Sat;
volume = 100 + ceil(75 * sin (date / 8));
if rand('uniform') < 0.02 then continue;
output;
end;
format date yymmdd10.;
run;
* Compute an N item rolling average from N prior values;
%let ROLLING_N = 5;
data volume_averages;
set volumes;
by date; * enforce sort order requirement;
array v[0:&ROLLING_N] _temporary_; %* <---- &ROLLING_N ;
retain index -1;
avg_prior_&ROLLING_N. = mean (of v(*)); %* <---- &ROLLING_N ;
OUTPUT;
index = mod(index + 1,&ROLLING_N); %* <---- Modular arithmetic, the foundation of rolling ;
v[index] = volume;
format v: 6.;
drop index;
run;
* merge;
data want_merge;
merge events(in=event_date) volume_averages;
by date;
if event_date;
run;
* join;
proc sql;
create table want_join as
select events.*, volume_averages.avg_prior_5
from events join volume_averages
on events.date = volume_averages.date;
quit;
You want to loop over a series of dates in an input data set. Therefore I use a PROC SQL statement where I select the distinct dates in this input data set into a macro variable.
This macro variable is then used to loop over. In your example the macro variable will thus be: 20200428 20200429. You can then use the %SCAN macro function to start looping over these dates.
For each date in the loop, we will then calculate the average: in your example the average of the 3 days prior to the looping date. As the number of days for which you want to calculate the average is variable, this is also passed as a parameter in the macro. I then use the INTNX function to calculate the lower bound of dates you want to select to calculate the average over. Then the PROC MEANS procedure is used to calculate the average volume over the days: lower bound - looping date.
I then put a minor data step in between to attach the looping date again to the calculated average. Finally everything is appended in a final data set.
%macro dayAverage(input = , range = , selectiondata = );
/* Input = input dataset
range = number of days prior to the selected date for which you want to calculate
the average
selectiondata = data where the volumes are in */
/* Create a macro variable with the dates for which you want to calculate the
average, to loop over */
proc sql noprint;
select distinct date into: datesrange separated by " "
from &input.;
quit;
/*Start looping over the dates for which you want to calculate the average */
%let I = 1;
%do %while (%scan(&datesrange.,&I.) ne %str());
/* Assign the current date in the loop to the variable currentdate */
%let currentdate = %scan(&datesrange.,&I.);
/* Create the minimum date in the range based on input parameter range */
%let mindate =
%sysfunc(putn(%sysfunc(intnx(day,%sysfunc(inputn(&currentdate.,yymmdd8.)),-
&range.)),yymmddn8.));
/* Calculate the mean volume for the selected date and selected range */
proc means data = &selectiondata.(where = (date >= &mindate. and date <
&currentdate.)) noprint ;
output out = averagecurrent(drop = _type_ _freq_) mean(volume)=avgerage_volume;
run;
/* Add the current date to the calculated average */
data averagecurrent;
retain date average_volume;
set averagecurrent;
date = &currentdate.;
run;
/* Append the result to a final list */
proc datasets nolist;
append base = final data = averagecurrent force;
run;
%let I = %eval(&I. + 1);
%end;
%mend;
This macro can in your example be called as:
%dayAverage(input = event, range = 3, selectiondata = vol);
It will give you a data set in your work library called final

how to vertically sum a range of dynamic variables in sas?

I have a dataset in SAS in which the months would be dynamically updated each month. I need to calculate the sum vertically each month and paste the sum below, as shown in the image.
Proc means/ proc summary and proc print are not doing the trick for me.
I was given the following code before:
`%let month = month name;
%put &month.;
data new_totals;
set Final_&month. end=end;
&month._sum + &month._final;
/*feb_sum + &month._final;*/
output;
if end then do;
measure = 'Total';
&month._final = &month._sum;
/*Feb_final = feb_sum;*/
output;
end;
drop &month._sum;
run; `
The problem is this has all the months hardcoded, which i don't want. I am not too familiar with loops or arrays, so need a solution for this, please.
enter image description here
It may be better to use a reporting procedure such as PRINT or REPORT to produce the desired output.
data have;
length group $20;
do group = 'A', 'B', 'C';
array month_totals jan2020 jan2019 feb2020 feb2019 mar2019 apr2019 may2019 jun2019 jul2019 aug2019 sep2019 oct2019 oct2019 nov2019 dec2019;
do over month_totals;
month_totals = 10 + floor(rand('uniform', 60));
end;
output;
end;
run;
ods excel file='data_with_total_row.xlsx';
proc print noobs data=have;
var group ;
sum jan2020--dec2019;
run;
proc report data=have;
columns group jan2020--dec2019;
define group / width=20;
rbreak after / summarize;
compute after;
group = 'Total';
endcomp;
run;
ods excel close;
Data structure
The data sets you are working with are 'difficult' because the date aspect of the data is actually in the metadata, i.e. the column name. An even better approach, in SAS, is too have a categorical data with columns
group (categorical role)
month (categorical role)
total (continuous role)
Such data can be easily filtered with a where clause, and reporting procedures such as REPORT and TABULATE can use the month variable in a class statement.
Example:
data have;
length group $20;
do group = 'A', 'B', 'C';
do _n_ = 0 by 1 until (month >= '01feb2020'd);
month = intnx('month', '01jan2018'd, _n_);
total = 10 + floor(rand('uniform', 60));
output;
end;
end;
format month monyy5.;
run;
proc tabulate data=have;
class group month;
var total;
table
group all='Total'
,
month='' * total='' * sum=''*f=comma9.
;
where intck('month', month, '01feb2020'd) between 0 and 13;
run;
proc report data=have;
column group (month,total);
define group / group;
define month / '' across order=data ;
define total / '' ;
where intck('month', month, '01feb2020'd) between 0 and 13;
run;
Here is a basic way. Borrowed sample data from Richard.
data have;
length group $20;
do group = 'A', 'B';
array months jan2020 jan2019 feb2020 feb2019 mar2019 apr2019 may2019 jun2019 jul2019 aug2019 sep2019 oct2019 oct2019 nov2019 dec2019;
do over months;
months = 10 + floor(rand('uniform', 60, 1));
end;
output;
end;
run;
proc summary data=have;
var _numeric_;
output out=temp(drop=_:) sum=;
run;
data want;
set have temp (in=t);
if t then group='Total';
run;

in a SAS data step referencing another data set without a merge?

Grateful for feedback, I'm still a notice programmer. I'm trying to code the below in SAS.
I have two data sets a) and b), containing the following variables:
a) Bene_ID, county_id_1, county_id_2, county_id_3 etc (it's 12 months)
b) county_ID, rural (yes/no)
What I would normally do is create an array in a data step:
Array country (12) county_ID_1- county_ID_12
and use by group processing on bene_ID, to output a long (normalized) data set like this:
bene_id, month 1, county_id
bene_id, month 2, county_id
bene_id, month 3, county_id
etc.
BUT, how do I access the other data set b) within a data step? to pull in the rural variable? This is what I want:
bene_id, month 1, county_id, if rural = "yes"
bene_id, month 2, county_id, if rural = "yes"
bene_id, month 3, county_id, if rural = "yes"
I tried looking for other similar questions on this bulletin board but I wasn't even sure of the correct terms to search for. The reason I don't want to do a full merge is: how to filter on an array value? e.g. when rural = "no"?
Thanks everyone,
Lori
This is an example where using a FORMAT would help. You can use your second dataset to create a format
data formats;
retain fmtname 'rural';
set b;
rename county_id=start rural=label;
run;
proc format cntlin=formats ;
run;
and then use the format when processing the first dataset.
data want ;
set A;
array county_id_ [12];
do month=1 to dim(county_id_);
county=county_id_[month];
rural = put(county,rural3.);
output;
end;
drop county_id_: ;
run;
You are transforming the data structure from wide (array form) to tall (categorical form). This is generally known as a pivot or transpose. The transformation turns the information stored in each array element name (columns) into data that becomes accessible at the row-level.
You can merge the transpose with the counties to select rural ones.
* 80% of counties are rural;
data counties;
do countyId = 1 to 50;
if ranuni(123) < 0.80 then rural='Yes'; else rural='No';
output;
end;
run;
* for 10 people track with county they are in each month;
data have;
do personId = 1 to 10;
array countyId (12);
countyId(1) = ceil(50*ranuni(123));
do _n_ = 2 to dim(countyId);
if ranuni(123) < 0.15 then
countyId(_n_) = ceil(50*ranuni(123)); * simulate 15% chance of moving;
else
countyId(_n_) = countyId(_n_-1) ;
end;
output;
end;
run;
proc transpose data=have out=have_transpose(rename=(col1=countyId)) ;
by personId;
var countyId:;
run;
proc sort data=have_transpose;
by countyId personId;
run;
data want_rural;
merge have_transpose(in=tracking) counties;
by countyId;
if tracking and rural='Yes';
month = input(substr(_name_, length('countyId')+1), 8.);
drop _name_;
run;
If your wide data also has an additional a set of 12 columns, for say an array of amounts disbursed in each month, the best approach is to do 'DATA Step' transpose like #Tom showed, with an additional assignment inside the loop
amount = amount_[month];

SAS adding new observations for longitudinal data

I have a longitudinal dataset in SAS with periods of time categorized as either at risk for an event, or not at risk. Unfortunately, some time periods overlap, and I would like to recode them to have a dataset of entirely non-overlapping observations. For example, the dataset currently looks like:
Row 1: ID=123; Start=Jan 1, 1999; End=Dec 31, 1999; At_risk="Yes"
Row 2: ID=123; Start=Feb 1, 1999; End=Feb 15, 1999; At_risk="No"
The dataset I would like looks like:
Row 1: ID=123; Start=Jan 1, 1999; End=Feb 1, 1999; At_risk="Yes"
Row 2: ID=123; Start=Feb 1, 1999; End=Feb 15, 1999; At_risk="No"
Row 3: ID=123; Start=Feb 15, 1999; End=Dec 31, 1999; At_risk="Yes"
Thoughts?
Vasja may have been suggesting something like this (date level) as an alternative.
I will assume here that the most recent row read in your longitudinal dataset will have priority over any other rows with overlapping date ranges. If that is not the case then adjust the priority derivation below as appropriate.
Are you sure your start and end dates are correct. Your desired output still has overlapping dates. Feb 1 & 15 are both At Risk and not At Risk. Your End date should be at least one day before the next start date. Not the same day. End and Start dates should be contiguous. For that reason coding a solution that produces your desired output (with overlapping dates) is problematic. The solution below is based on no overlapping dates. You will need to modify it to include overlapping dates as per your required output.
/* Your longitudinal dataset . */
data orig;
format Id 16. Start End Date9.;
Id = 123;Start='1jan1999'd; End='31dec1999'd; At_risk="Yes";output;
Id = 123;Start='1feb1999'd; End='15feb1999'd; At_risk="No";output;
run;
/* generate a row for each date between start and end dates. */
/* Use row number (_n_) to assign priorioty. */
Data overlapping_dates;
set orig;
foramt date date9.;
priority = _n_;
do date = start to end by 1;
output;
end;
Run;
/* Get at_risk details for most recent read date according to priority. */
Proc sql;
create table non_overlapping_dates as
select id, date, at_risk
from overlapping_dates
group by id, date
having priority eq max (priority)
order by id, date
;
Quit;
/* Rebuild longitudinal dataset . */
Data longitudinal_dataset
(keep= id start end at_risk)
;
format id 16. Start End Date9. at_risk $3.;
set non_overlapping_dates;
by id at_risk notsorted;
retain start;
if first.at_risk
then start = date;
/* output a row to longitudinal dataset if at_risk is about to change or last row for id. */
if last.at_risk
then do;
end = date;
output;
end;
Run;
Such tasks are exercises in debugging of program logic and fighting data assumptions, playing with old/new values...
Below my initial code for the exact example you provided, will surely need some adjustment on real data.
In case there's time overlap on more than current-next record I'm not sure it's doable this way (with reasonable effort). For such cases you'd probably be more effective with splitting original start - end intervals to day level and then summarize details to new intervals.
data orig;
format Id 16. Start End Date9.;
Id = 123;Start='1jan1999'd; End='31dec1999'd; At_risk="Yes";output;
Id = 123;Start='1feb1999'd; End='15feb1999'd; At_risk="No";output;
run;
proc sort data = orig;
by ID Start;
run;
data modified;
format pStart oStart pEnd oEnd Date9.;
set orig;
length pStart pEnd 8 pAt_risk $3;
by ID descending End ;
retain pStart pEnd pAt_risk;
/* keep original values */
oStart = Start;
oEnd = End;
oAt_risk = At_risk;
if first.id then do;
pStart = Start;
pEnd = End;
pAt_risk = At_risk;
/* no output */
end;
else do;
if pAt_risk ne At_risk then do;
if Start > pStart then do;
put _all_;
Start = pStart;
End = oStart;
At_risk = pAt_risk;
output;/* first part of time span */
Start = oStart;
End = oEnd;
At_risk = oAt_risk;
output;/* second part of time span */
if (End < pEnd ) then do;
Start = End;
End = pEnd;
At_risk = pAt_risk;
output; /*third part of time span */
/* keep current values as previous record values */
pStart = max(oStart, Start);
pEnd = End;
pAt_risk = At_risk;
end;
end;
end;
end;
run;
proc print;run;