Parsing date periods and summarising days in SAS

I need help with something I can do in R but not in SAS, and I have to do it in SAS.
Suppose I have the following dataset:
ID Start End Label
subjectA 01/01/2020 15/01/2020 holidays
subjectA 16/01/2020 20/01/2020 holidays
subjectB 01/05/2020 30/05/2020 holidays
subjectB 01/06/2020 07/06/2020 holidays
subjectC 01/02/2020 01/02/2020 work_permit
subjectD 01/03/2020 01/09/2020 maternity
subjectE 03/01/2020 09/01/2020 disease
subjectE 11/01/2020 13/01/2020 disease
subjectF 12/02/2020 12/02/2020 work_permit
subjectG 11/09/2020 20/09/2020 course
....... ........ ........ ..........
I need the following.
For repeated entries (same ID), after sorting by Start and End so that each Start-End period comes before the next one:
if the difference between the Start and End is 1 day for the same repeated entry (ID), then sum the number of days; otherwise (> 1 day), count without summing. This applies to holidays, maternity and disease.
if the difference between the Start and End is 0 days (consecutive) for the same repeated entry (ID), then sum the number of days; otherwise, if > 1 day, sum without counting. This applies to holidays, maternity and disease.
For non-repeated entries:
count the days;
count 1 day if Start and End are the same.
Desired output:
ID Days Label Flag
subjectA 20 holidays summarised
subjectB 37 holidays summarised
subjectC 1 work_permit single_day
subjectD 184 maternity consecutive_not_summarized
subjectE 19 disease summarised
subjectF 1 work_permit single_day
subjectG 10 course consecutive_not_summarized
....... ........ ........ ..........
For ranges involving February, note that it had 29 days (2020 was a leap year). Moreover, there might be more than two periods per repeated record.
Sorry if this seems complex. I have no idea how to start writing this in SAS, so I need some support and guidance.
Thank you in advance.

To calculate the days in the period just subtract start from end and add one. To calculate the gap between periods use the LAG() of end. Make sure to reset the calculated gap when starting a new ID.
data have;
input ID :$20. Start :ddmmyy. End :ddmmyy. Label :$20.;
format start end yymmdd10.;
cards;
subjectA 01/01/2020 15/01/2020 holidays
subjectA 16/01/2020 20/01/2020 holidays
subjectB 01/05/2020 30/05/2020 holidays
subjectB 01/06/2020 07/06/2020 holidays
subjectC 01/02/2020 01/02/2020 work_permit
subjectD 01/03/2020 01/09/2020 maternity
subjectE 03/01/2020 09/01/2020 disease
subjectE 11/01/2020 13/01/2020 disease
subjectF 12/02/2020 12/02/2020 work_permit
subjectG 11/09/2020 20/09/2020 course
;
data days;
set have;
by id;
days = end - start + 1 ;
gap = start - lag(end);
if first.id then gap=. ;
run;
Result
Obs ID Start End Label days gap
1 subjectA 2020-01-01 2020-01-15 holidays 15 .
2 subjectA 2020-01-16 2020-01-20 holidays 5 1
3 subjectB 2020-05-01 2020-05-30 holidays 30 .
4 subjectB 2020-06-01 2020-06-07 holidays 7 2
5 subjectC 2020-02-01 2020-02-01 work_permit 1 .
6 subjectD 2020-03-01 2020-09-01 maternity 185 .
7 subjectE 2020-01-03 2020-01-09 disease 7 .
8 subjectE 2020-01-11 2020-01-13 disease 3 2
9 subjectF 2020-02-12 2020-02-12 work_permit 1 .
10 subjectG 2020-09-11 2020-09-20 course 10 .
But I cannot figure out what you want to do when the gap is larger than 1, and your example data and results do not provide much guidance. For most cases you seem to just want the sum of the days, whether or not there is a large gap.
proc summary data=days;
by id label;
var days;
output out=want sum= ;
run;
Result:
Obs ID Label _TYPE_ _FREQ_ days
1 subjectA holidays 0 2 20
2 subjectB holidays 0 2 37
3 subjectC work_permit 0 1 1
4 subjectD maternity 0 1 185
5 subjectE disease 0 2 10
6 subjectF work_permit 0 1 1
7 subjectG course 0 1 10
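If you want to exclude periods that start more than 1 day after the previous period ended, you could just add a WHERE clause. The first record of each ID is still kept, because SAS treats a missing gap as smaller than any number.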
proc summary data=days;
where gap < 2;
by id label;
var days;
output out=want sum= ;
run;
Results:
Obs ID Label _TYPE_ _FREQ_ days
1 subjectA holidays 0 2 20
2 subjectB holidays 0 1 30
3 subjectC work_permit 0 1 1
4 subjectD maternity 0 1 185
5 subjectE disease 0 1 7
6 subjectF work_permit 0 1 1
7 subjectG course 0 1 10
If the goal is to collapse the intervals into periods without gaps, then make a new variable to indicate when a new period starts.
data days;
set have;
by id;
days = end - start + 1 ;
gap = start - lag(end);
period + (gap > 1);
if first.id then do;
gap=. ;
period=1;
end;
run;
proc summary data=days ;
by id period label ;
var days;
output out=want sum=;
run;
Now subjects B and E have two periods and the other examples only one.
Results
Obs ID period Label _TYPE_ _FREQ_ days
1 subjectA 1 holidays 0 2 20
2 subjectB 1 holidays 0 1 30
3 subjectB 2 holidays 0 1 7
4 subjectC 1 work_permit 0 1 1
5 subjectD 1 maternity 0 1 185
6 subjectE 1 disease 0 1 7
7 subjectE 2 disease 0 1 3
8 subjectF 1 work_permit 0 1 1
9 subjectG 1 course 0 1 10
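For readers who want to cross-check the period logic outside SAS, here is a minimal sketch in Python (standard library only). It is not part of the SAS solution. The function name collapse_periods is hypothetical, the rows are assumed to be sorted by ID and Start, and each collapsed period simply keeps the label of its first interval.

```python
from datetime import date
from itertools import groupby


def collapse_periods(rows):
    """rows: (id, start, end, label) tuples, sorted by id and then start.

    Returns (id, period, label, days) tuples. A new period starts whenever
    the gap to the previous end date is more than 1 day.
    """
    out = []
    for _, grp in groupby(rows, key=lambda r: r[0]):
        period = 0
        prev_end = None  # reset for every new ID, like "if first.id"
        for id_, start, end, label in grp:
            days = (end - start).days + 1  # inclusive day count
            gap = None if prev_end is None else (start - prev_end).days
            if gap is None or gap > 1:
                period += 1
                out.append([id_, period, label, days])
            else:
                out[-1][3] += days  # contiguous: add to the current period
            prev_end = end
    return [tuple(r) for r in out]


rows = [
    ("subjectA", date(2020, 1, 1), date(2020, 1, 15), "holidays"),
    ("subjectA", date(2020, 1, 16), date(2020, 1, 20), "holidays"),
    ("subjectB", date(2020, 5, 1), date(2020, 5, 30), "holidays"),
    ("subjectB", date(2020, 6, 1), date(2020, 6, 7), "holidays"),
    ("subjectD", date(2020, 3, 1), date(2020, 9, 1), "maternity"),
]
result = collapse_periods(rows)
for row in result:
    print(row)
```

On this subset of the sample data the sketch gives one 20-day period for subjectA, two periods (30 and 7 days) for subjectB and one 185-day period for subjectD, which matches the PROC SUMMARY output above.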

I strongly suspect this won't generalise, because you're not accounting for weekends or other days off allowed at work (statutory holidays, office closures), and the consecutive-days rule doesn't seem to be applied consistently.
But here's a start. You can modify as needed or update your question with more details. Please include data as a data step in future questions (like what I have in the first step of my answer below).
data have;
input ID : $14. Start : ddmmyy10. End : ddmmyy10. Label : $20.;
format start end date9.;
cards;
subjectA 01/01/2020 15/01/2020 holidays
subjectA 16/01/2020 20/01/2020 holidays
subjectB 01/05/2020 30/05/2020 holidays
subjectB 01/06/2020 07/06/2020 holidays
subjectC 01/02/2020 01/02/2020 work_permit
subjectD 01/03/2020 01/09/2020 maternity
subjectE 03/01/2020 09/01/2020 disease
subjectE 11/01/2020 13/01/2020 disease
subjectF 12/02/2020 12/02/2020 work_permit
subjectG 11/09/2020 20/09/2020 course
;
run;
data groups;
set have;
by id label notsorted;
prev_end=lag(end);
if first.id then
do;
group=0;
call missing(prev_end);
end;
if first.label or prev_end+1 ne start then
group+1;
duration=end-start+1;
run;
proc means data=groups noprint nway;
class id label group;
var duration;
output out=summarized N=Number_Events Sum=Total_Duration;
run;
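data want;
* read the per-group summary, which holds Number_Events and Total_Duration;
set summarized;
* declare the length first so the longest flag value is not truncated;
length flag $26;
* assumption: two or more merged records count as summarized;
if Number_Events > 1 and Label not in ('work_permit', 'maternity') then flag = 'Summarized';
else if Total_Duration = 1 then flag = 'Single Day';
else flag = 'consecutive_not_summarized';
run;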

Related

Add and update flags based on dates

I have the following data set:
ID Start Stop
001 01JAN2013 31JAN2013
001 01FEB2013 31DEC2013
002 01MAR2013 31DEC2013
003 01JAN2013 31DEC2013
I need the following output:
ID Start Stop Start_flag End_flag
001 01JAN2013 31JAN2013 1 2
001 01FEB2013 31DEC2013 2 3
002 01MAR2013 31DEC2013 1 2
003 01JAN2013 31DEC2013 1 2
In other words I need to add a flag for the start and end with the exception that for consecutive periods the end flag of the previous period will become the start flag of the subsequent period and the remaining end flag will be increased by 1.
Can anyone help me please?
Thank you in advance
Use the LAG() function
proc sort data=have; by id start; run;
data want(drop=lag_stop);
set have;
by id start notsorted;
lag_stop = lag(stop);
if first.id then do;
start_flag=1;
end_flag=start_flag+1;
end;
else if lag_stop+1 = start then do;
start_flag+1;
end_flag+1;
end;
run;
want
id start stop start_flag end_flag
001 01JAN2013 31JAN2013 1 2
001 01FEB2013 31DEC2013 2 3
002 01MAR2013 31DEC2013 1 2
003 01JAN2013 31DEC2013 1 2

How to fetch a value based on another column

Please find the dataset below:
ID amt order_type no_of_order
1 200 em 6
1 300 on 5
2 600 em 10
Output desired:
ID amt order_type no_of_order
1 500 on 11
2 600 em 10
Based on the highest amount, I need to pick the order_type.
How can this be achieved in SAS code?
Sounds like you want to get the sum of the two numeric variables for each value of ID and also select one value for ORDER_TYPE. You appear to want the value of ORDER_TYPE from the observation with the largest AMT. Here is a simple way using PROC SUMMARY.
data have;
input ID amt order_type $ no_of_order;
cards;
1 200 em 6
1 300 on 5
2 600 em 10
;
proc summary data=have ;
by id;
var amt no_of_order;
output out=want sum= idgroup(max(amt) out[1] (order_type)=);
run;
Results:
Obs ID _TYPE_ _FREQ_ amt no_of_order order_type
1 1 0 2 500 11 on
2 2 0 1 600 10 em
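As a cross-check of what the IDGROUP option is doing, here is a minimal sketch in Python (standard library only), not a SAS replacement. The function name summarise_orders is hypothetical, and the tie-breaking rule (keep the first row that reaches the largest amt) is an assumption of this sketch.

```python
def summarise_orders(rows):
    """rows: (id, amt, order_type, no_of_order) tuples.

    Returns {id: (total_amt, order_type_of_largest_amt, total_orders)}.
    """
    totals = {}
    for id_, amt, order_type, n_orders in rows:
        t = totals.setdefault(
            id_, {"amt": 0, "orders": 0, "max_amt": None, "order_type": None}
        )
        t["amt"] += amt
        t["orders"] += n_orders
        # Remember the order_type of the row with the largest amt so far;
        # on ties the first such row wins (assumption of this sketch).
        if t["max_amt"] is None or amt > t["max_amt"]:
            t["max_amt"] = amt
            t["order_type"] = order_type
    return {
        id_: (t["amt"], t["order_type"], t["orders"]) for id_, t in totals.items()
    }


rows = [(1, 200, "em", 6), (1, 300, "on", 5), (2, 600, "em", 10)]
result = summarise_orders(rows)
print(result)
```

For the sample rows this yields a total of 500 with order type on for ID 1, and 600 with em for ID 2, which agrees with the PROC SUMMARY result.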

PROC TRANSPOSE value column while retaining dates and Hour End

I have data structured like this:
Meter_ID Date HourEnd Value
100 12/01/2007 1 986
100 12/01/2007 2 992
100 12/01/2007 3 1002
200 12/01/2007 1 47
200 12/01/2007 2 45
200 12/01/2007 3 50
300 12/01/2007 1 32
300 12/01/2007 2 37
300 12/01/2007 3 40
And would like to transpose the information so that I end up with this:
Date HourEnd Meter100 Meter200 Meter300
12/01/2007 1 986 47 32
12/01/2007 2 992 45 37
12/01/2007 3 1002 50 40
I have tried numerous PROC TRANSPOSE options and variations and am confusing myself. Any help would be greatly appreciated!
You need to sort by the BY variables (date and hourend) before running PROC TRANSPOSE.
data have;
infile cards firstobs=2;
input Meter_ID Date:mmddyy. HourEnd Value;
format date mmddyy10.;
cards;
Meter_ID Date HourEnd Value
100 12/01/2007 1 986
100 12/01/2007 2 992
100 12/01/2007 3 1002
200 12/01/2007 1 47
200 12/01/2007 2 45
200 12/01/2007 3 50
300 12/01/2007 1 32
300 12/01/2007 2 37
300 12/01/2007 3 40
;;;;
run;
proc print;
proc sort data=have;
by date hourend meter_id;
run;
proc print;
run;
proc transpose prefix="Meter"n;
by date hourend;
id meter_id;
var value;
run;
proc print;
run;

USING retain statement

Assume you have a temporary SAS data set called EPISODES that contains information about hospital episodes. The data set contains the variables ID_NO (patient ID), ADMIT_DATE (date of admission), DISC_DATE (date of discharge), and TOTAL_COST.
Using this data set, create a new data set in which you will create a separate observation for each day of each hospital episode. If a patient had a hospital episode that was 3 days long, they would have three observations in the new data set from that episode -- one for each day.
Each observation in the new data set should have only three variables: the patient identifier ID_NO, the date for that particular day of hospitalization XDATE, and the cost for that day of hospitalization DAILY_COST = TOTAL_COST divided by the number of days in the episode.
My thought is to do this as a loop. Something like the following.
data new_data;
set input_data ;
do xdate = admit_date to disc_data;
daily_cost = .... ;
output new_data ( keep = xdate daily_cost id_no );
end;
run;
*This program block sets up our data set;
data episodes;
INPUT ID_NO $ ADMIT_DATE mmddyy10. TOTAL_COST DISC_DATE mmddyy10.;
DATALINES;
1 01/01/2017 3000 01/03/2017
2 01/01/2017 14000 01/14/2017
;
run;
data new_episodes (keep= ID_NO XDATE DAILY_COST);
set episodes;
NUM_DAYS= DISC_DATE-ADMIT_DATE;
DAILY_COST= TOTAL_COST/(DISC_DATE-ADMIT_DATE);
*Using the Do While loop to create a matrix of date observations;
XDATE=ADMIT_DATE;*initializing our variable;
do while(XDATE<DISC_DATE);
put XDATE=;
XDATE+1;
output;*outputting the date variable;
end;
format XDATE mmddyy10.;
run;
proc print data=new_episodes;
run;
Since SAS stores dates as number of days you can just use a DO loop to increment XDATE from ADMIT_DATE to DISC_DATE.
But you need to decide how to count dates. If you are admitted on Monday and discharged on Tuesday is that one day or two days? If it is one day then do you want XDATE records for Monday or Tuesday? Or both?
Let's make some test data:
data have;
input id_no $ admit_date :yymmdd. total_cost disc_date :yymmdd.;
format admit_date disc_date yymmdd10.;
put (_all_) (+0);
datalines;
1 2017-01-01 3000 2017-01-04
2 2017-01-01 5000 2017-01-06
3 2020-02-23 500 2020-02-23
;
Here is code that treats Monday to Tuesday as one day. So it doesn't output the discharge date (unless it is the same as the admission date).
data want;
set have ;
if admit_date=disc_date then daily_cost=total_cost;
else daily_cost = total_cost / (disc_date - admit_date);
do xdate=admit_date to max(admit_date,disc_date-1) ;
output;
end;
keep id_no xdate daily_cost;
format xdate yymmdd10.;
run;
Results:
Obs id_no daily_cost xdate
1 1 1000 2017-01-01
2 1 1000 2017-01-02
3 1 1000 2017-01-03
4 2 1000 2017-01-01
5 2 1000 2017-01-02
6 2 1000 2017-01-03
7 2 1000 2017-01-04
8 2 1000 2017-01-05
9 3 500 2020-02-23
If you want to treat a stay from Monday to Tuesday as 2 days then the code is easier.
data want;
set have ;
daily_cost = total_cost / (disc_date - admit_date + 1);
do xdate=admit_date to disc_date ;
output;
end;
keep id_no xdate daily_cost;
format xdate yymmdd10.;
run;
Results:
Obs id_no daily_cost xdate
1 1 750.000 2017-01-01
2 1 750.000 2017-01-02
3 1 750.000 2017-01-03
4 1 750.000 2017-01-04
5 2 833.333 2017-01-01
6 2 833.333 2017-01-02
7 2 833.333 2017-01-03
8 2 833.333 2017-01-04
9 2 833.333 2017-01-05
10 2 833.333 2017-01-06
11 3 500.000 2020-02-23
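The two counting conventions can be compared side by side with a small sketch in Python (standard library only). This is an illustration of the arithmetic, not SAS code. The function name expand_episode and its inclusive parameter are hypothetical.

```python
from datetime import date, timedelta


def expand_episode(id_no, admit, disc, total_cost, inclusive=True):
    """Return one (id_no, xdate, daily_cost) tuple per day of the stay.

    inclusive=True counts the discharge date as a day of stay;
    inclusive=False leaves it out, except for same-day stays,
    which always count as one day.
    """
    n_days = (disc - admit).days + (1 if inclusive else 0)
    n_days = max(n_days, 1)  # admit == disc still produces one record
    daily_cost = total_cost / n_days
    return [(id_no, admit + timedelta(days=i), daily_cost) for i in range(n_days)]


# Patient 1 from the test data: admitted 2017-01-01, discharged 2017-01-04.
exclusive_rows = expand_episode("1", date(2017, 1, 1), date(2017, 1, 4), 3000, inclusive=False)
inclusive_rows = expand_episode("1", date(2017, 1, 1), date(2017, 1, 4), 3000, inclusive=True)
print(len(exclusive_rows), exclusive_rows[0][2])
print(len(inclusive_rows), inclusive_rows[0][2])
```

With the discharge date excluded, patient 1 gets 3 records at 1000 each; with it included, 4 records at 750 each, in line with the two result tables above.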

Converting daily data in to weekly in Pandas

I have a dataframe as given below:
Index Date Country Occurence
0 2013-12-30 US 1
1 2013-12-30 India 3
2 2014-01-10 US 1
3 2014-01-15 India 1
4 2014-02-05 UK 5
I want to convert the daily data into weekly data, grouped by Country, with sum as the aggregation method.
I tried resampling, but the output was a MultiIndex data frame from which I was not able to access the "Country" and "Date" columns (please refer to the data above).
The desired output is given below:
Date Country Occurence
Week1 India 4
Week2
Week1 US 2
Week2
Week5 Germany 5
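You can group by country and resample by week. Note that the how='sum' argument used below comes from older pandas versions and has been removed from current pandas, where the same call is written as .resample('W').sum().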
In [63]: df
Out[63]:
Date Country Occurence
0 2013-12-30 US 1
1 2013-12-30 India 3
2 2014-01-10 US 1
3 2014-01-15 India 1
4 2014-02-05 UK 5
In [64]: df.set_index('Date').groupby('Country').resample('W', how='sum')
Out[64]:
Occurence
Country Date
India 2014-01-05 3
2014-01-12 NaN
2014-01-19 1
UK 2014-02-09 5
US 2014-01-05 1
2014-01-12 1
And you can use reset_index() to turn the index levels back into ordinary Country and Date columns.
In [65]: df.set_index('Date').groupby('Country').resample('W', how='sum').reset_index()
Out[65]:
Country Date Occurence
0 India 2014-01-05 3
1 India 2014-01-12 NaN
2 India 2014-01-19 1
3 UK 2014-02-09 5
4 US 2014-01-05 1
5 US 2014-01-12 1
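To see what the weekly binning does without depending on a particular pandas version, here is a minimal sketch using only the Python standard library. It assumes weeks that end on Sunday, which is what the 'W' frequency means in pandas. The helper names week_ending_sunday and weekly_sum are hypothetical, and unlike resample this sketch does not emit rows for empty weeks.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_ending_sunday(d):
    """Return the Sunday that closes the week containing d (Monday=0 ... Sunday=6)."""
    return d + timedelta(days=6 - d.weekday())


def weekly_sum(rows):
    """rows: (date, country, occurrence) tuples.

    Returns {(country, week_end): total occurrences in that week}.
    """
    totals = defaultdict(int)
    for d, country, n in rows:
        totals[(country, week_ending_sunday(d))] += n
    return dict(totals)


rows = [
    (date(2013, 12, 30), "US", 1),
    (date(2013, 12, 30), "India", 3),
    (date(2014, 1, 10), "US", 1),
    (date(2014, 1, 15), "India", 1),
    (date(2014, 2, 5), "UK", 5),
]
weekly = weekly_sum(rows)
for (country, week_end), total in sorted(weekly.items()):
    print(country, week_end, total)
```

The week-ending dates produced for the sample data (2014-01-05, 2014-01-12, 2014-01-19, 2014-02-09) are the same labels that appear in the resample output above.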