I'm a beginner in SAS.
I have 4 data sets that look like this:
Year_2013
ID Start End Role Hours
0001 01JAN2013 30APR2013 53100 300
0001 01MAY2013 31DEC2013 50100 3
0002 01JAN2013 31DEC2013 56100 60
0003 01JAN2013 31DEC2013 52100 123
.... ......... ......... ..... ....
Year_2014
ID Start End Role Hours
0001 01JAN2014 31DEC2014 59100 56
0002 01JAN2014 31NOV2014 56100 12
0002 01DEC2014 31DEC2014 58100 0
0003 01JAN2014 31DEC2014 52100 1,34
0004 01JAN2014 31JUL2014 51100 300
0004 01AUG2014 31DEC2014 50100 90
.... ......... ......... .... ....
Year_2015
ID Start End Role Hours
0001 01JAN2015 31MAR2015 59100 4
0001 01APR2015 01MAY2015 58100 0
0001 02MAY2015 31DEC2015 51100 34
0002 01JAN2015 01APR2015 55101 54
0002 01MAY2015 01JUN2015 56101 0
0002 01JUL2015 31DEC2015 56100 0
0003 01JAN2015 31DEC2015 55107 8
0004 01JAN2015 31DEC2015 50100 69
.... ......... ......... .....
Year_2016
ID Start End Role Hours
0001 01JAN2016 30SEP2016 51100 67
0001 01OCT2016 31DEC2016 52100 0
0002 01JAN2016 31DEC2015 56101 98
0003 01JAN2016 31DEC2016 50115 9
0004 01JAN2016 31JAN2016 51101 7
0004 01FEB2016 31DEC2016 51106 234
.... .......... ......... ......
I need to sum the values in the "Hours" column only when rows are consecutive across years and have the same Role, and I need to add a flag when the Role changes. The desired output should be:
Years_all
ID Start End Role Hours
0001 01JAN2013 30APR2013 53100 300
0001 01MAY2013 31DEC2013 50100 3
0001 01JAN2014 31MAR2015 59100 60
0001 01APR2015 01MAY2015 58100 0
0001 02MAY2015 30SEP2016 51100 101
0001 01OCT2016 31DEC2016 52100 0
0002 01JAN2013 31NOV2014 56100 72
0002 01DEC2014 31DEC2014 58100 0
0002 01JAN2015 01APR2015 55101 54
0002 01MAY2015 01JUN2015 56101 0
0002 01JUL2015 31DEC2015 56100 0
0002 01JAN2016 31DEC2015 56101 98
0003 01JAN2013 31DEC2014 52100 124,34
0003 01JAN2015 31DEC2015 55107 8
0003 01JAN2016 31DEC2016 50115 9
0004 01JAN2014 31JUL2014 51100 300
0004 01AUG2014 31DEC2015 50100 159
0004 01JAN2016 31JAN2016 51101 7
0004 01FEB2016 31DEC2016 51106 234
So, for each ID, if the Role remains the same in the next consecutive year (the first row of the following year for that ID), sum "Hours" and adjust the End date; otherwise leave the rows as they are. The sum MUST NOT be applied within a single year, only across consecutive years.
For example, for ID 0004 in the output I computed 90 + 69, i.e., the hours of the row ending 31DEC2014 (Role 50100) plus the hours of the row ending 31DEC2015 (Role 50100), and set the End date of the output row to 31DEC2015, because the ID has the same Role in the next year (in this case for the full year).
A few things:
I changed 31NOV2014 to 30NOV2014 for obvious reasons.
I assume that your posted data is representative of the structure of your actual data.
Code:
data year_2013;
input ID (Start End)(:date9.) Role Hours;
format Start End date9.;
datalines;
0001 01JAN2013 30APR2013 53100 300
0001 01MAY2013 31DEC2013 50100 3
0002 01JAN2013 31DEC2013 56100 60
0003 01JAN2013 31DEC2013 52100 123
;
data year_2014;
input ID (Start End)(:date9.) Role Hours;
format Start End date9.;
datalines;
0001 01JAN2014 31DEC2014 59100 56
0002 01JAN2014 30NOV2014 56100 12
0002 01DEC2014 31DEC2014 58100 0
0003 01JAN2014 31DEC2014 52100 1.34
0004 01JAN2014 31JUL2014 51100 300
0004 01AUG2014 31DEC2014 50100 90
;
data year_2015;
input ID (Start End)(:date9.) Role Hours;
format Start End date9.;
datalines;
0001 01JAN2015 31MAR2015 59100 4
0001 01APR2015 01MAY2015 58100 0
0001 02MAY2015 31DEC2015 51100 34
0002 01JAN2015 01APR2015 55101 54
0002 01MAY2015 01JUN2015 56101 0
0002 01JUL2015 31DEC2015 56100 0
0003 01JAN2015 31DEC2015 55107 8
0004 01JAN2015 31DEC2015 50100 69
;
data year_2016;
input ID (Start End)(:date9.) Role Hours;
format Start End date9.;
datalines;
0001 01JAN2016 30SEP2016 51100 67
0001 01OCT2016 31DEC2016 52100 0
0002 01JAN2016 31DEC2015 56101 98
0003 01JAN2016 31DEC2016 50115 9
0004 01JAN2016 31JAN2016 51101 7
0004 01FEB2016 31DEC2016 51106 234
;
data temp;
set year_:;
run;
proc sort data = temp;
by ID start;
run;
data want(drop = _:);
do until (last.role);
set temp;
by ID role notsorted;
_start = min(start, _start);
_end = max(end, _end);
_hours = sum(hours, _hours);
end;
start = _start;
end = _end;
hours = _hours;
run;
Result:
ID Start End Role Hours
1 01JAN2013 30APR2013 53100 300.00
1 01MAY2013 31DEC2013 50100 3.00
1 01JAN2014 31MAR2015 59100 60.00
1 01APR2015 01MAY2015 58100 0.00
1 02MAY2015 30SEP2016 51100 101.00
1 01OCT2016 31DEC2016 52100 0.00
2 01JAN2013 30NOV2014 56100 72.00
2 01DEC2014 31DEC2014 58100 0.00
2 01JAN2015 01APR2015 55101 54.00
2 01MAY2015 01JUN2015 56101 0.00
2 01JUL2015 31DEC2015 56100 0.00
2 01JAN2016 31DEC2015 56101 98.00
3 01JAN2013 31DEC2014 52100 124.34
3 01JAN2015 31DEC2015 55107 8.00
3 01JAN2016 31DEC2016 50115 9.00
4 01JAN2014 31JUL2014 51100 300.00
4 01AUG2014 31DEC2015 50100 159.00
4 01JAN2016 31JAN2016 51101 7.00
4 01FEB2016 31DEC2016 51106 234.00
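The DOW loop above merges any adjacent rows that share an ID and Role, without checking that the dates actually touch. If that matters, a variant along the following lines could add an explicit contiguity test. This is only a sketch, assuming `temp` is sorted by ID and Start as above; the names `grouped`, `want_contig`, and `group` are made up for illustration.

```sas
/* Number the runs: a new run starts at a new ID, a new Role,
   or when Start is not the day after the previous End. */
data grouped;
  set temp;
  by ID;
  _prev_end  = lag(end);   /* LAG is called on every row, unconditionally */
  _prev_role = lag(role);
  if first.ID or role ne _prev_role or start - 1 ne _prev_end then group + 1;
  drop _:;
run;

/* Collapse each run into one row, in the same way as the DOW loop above. */
data want_contig(drop = group _:);
  do until (last.group);
    set grouped;
    by group;
    _start = min(start, _start);
    _end   = max(end, _end);
    _hours = sum(hours, _hours);
  end;
  start = _start;
  end   = _end;
  hours = _hours;
run;
```

Splitting the work into a numbering step and a collapsing step keeps the contiguity rule in one place, so it can be tightened or relaxed without touching the aggregation.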
Let me start by saying that November only has 30 days.
Next time, please try to provide the tables in a SAS-friendly format:
data year_2013;
infile datalines delimiter='|';
input id $ start : date9. end : date9. role $ hours 8.;
format start date9. end date9.;
datalines;
0001|01JAN2013|30APR2013|53100|300
0001|01MAY2013|31DEC2013|50100|3
0002|01JAN2013|31DEC2013|56100|60
0003|01JAN2013|31DEC2013|52100|123
;
data year_2014;
infile datalines delimiter='|';
input id $ start : date9. end : date9. role $ hours 8.;
format start date9. end date9.;
datalines;
0001|01JAN2014|31DEC2014|59100|56
0002|01JAN2014|30NOV2014|56100|12
0002|01DEC2014|31DEC2014|58100|0
0003|01JAN2014|31DEC2014|52100|1.34
0004|01JAN2014|31JUL2014|51100|300
0004|01AUG2014|31DEC2014|50100|90
;
data year_2015;
infile datalines delimiter='|';
input id $ start : date9. end : date9. role $ hours 8.;
format start date9. end date9.;
datalines;
0001|01JAN2015|31MAR2015|59100|4
0001|01APR2015|01MAY2015|58100|0
0001|02MAY2015|31DEC2015|51100|34
0002|01JAN2015|01APR2015|55101|54
0002|01MAY2015|01JUN2015|56101|0
0002|01JUL2015|31DEC2015|56100|0
0003|01JAN2015|31DEC2015|55107|8
0004|01JAN2015|31DEC2015|50100|69
;
data year_2016;
infile datalines delimiter='|';
input id $ start : date9. end : date9. role $ hours 8.;
format start date9. end date9.;
datalines;
0001|01JAN2016|30SEP2016|51100|67
0001|01OCT2016|31DEC2016|52100|0
0002|01JAN2016|31DEC2015|56101|98
0003|01JAN2016|31DEC2016|50115|9
0004|01JAN2016|31JAN2016|51101|7
0004|01FEB2016|31DEC2016|51106|234
;
I think the code below should do the trick. Basically, it computes groups and then collapses the observations by summing the hours for each group.
data have;
set year_2013 year_2014 year_2015 year_2016;
run;
proc sort data=have;
by id role descending start;
run;
data stage1;
set have;
retain group 0;
format lag_start date9.;
by id role start notsorted;
lag_start = lag(start); /* call LAG on every row so its queue stays in step with the data */
if first.id then
lag_start = .;
if first.id then
group+1;
else if lag_start-1 ne end or first.role then
group+1;
run;
data want(drop=hours group lag_start);
set stage1;
by group;
if first.group then
tot=0;
tot + hours;
if last.group then
output;
rename tot = hours;
run;
proc sort data=want;
by id start role;
run;
Results:
id start end role hours
0001 01JAN2013 30APR2013 53100 300
0001 01MAY2013 31DEC2013 50100 3
0001 01JAN2014 31DEC2014 59100 60
0001 01APR2015 01MAY2015 58100 0
0001 02MAY2015 31DEC2015 51100 101
0001 01OCT2016 31DEC2016 52100 0
0002 01JAN2013 31DEC2013 56100 72
0002 01DEC2014 31DEC2014 58100 0
0002 01JAN2015 01APR2015 55101 54
0002 01MAY2015 01JUN2015 56101 0
0002 01JUL2015 31DEC2015 56100 0
0002 01JAN2016 31DEC2015 56101 98
0003 01JAN2013 31DEC2013 52100 124.34
0003 01JAN2015 31DEC2015 55107 8
0003 01JAN2016 31DEC2016 50115 9
0004 01JAN2014 31JUL2014 51100 300
0004 01AUG2014 31DEC2014 50100 159
0004 01JAN2016 31JAN2016 51101 7
0004 01FEB2016 31DEC2016 51106 234
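In the results above, each collapsed row keeps the End of its earliest period, whereas the desired output extends End to the last period of the group. One possible way to carry the latest End along is sketched below as an alternative to the final collapse step. It assumes `stage1` is still ordered by descending start within each group; `want_end` and `group_end` are illustrative names.

```sas
data want_end(drop=hours group lag_start end
              rename=(tot=hours group_end=end));
  set stage1;
  by group;
  retain group_end;
  format group_end date9.;
  if first.group then do;
    tot = 0;
    group_end = end;  /* rows run latest-first, so this is the group's latest End */
  end;
  tot + hours;
  if last.group then output;  /* the last row of the group carries the earliest Start */
run;

proc sort data=want_end;
  by id start role;
run;
```

The only visible side effect should be column order: the renamed `end` and `hours` columns land at the end of the data set.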
Related
Suppose we have the following data set:
ID Label
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0002 0002_1
0002 0002_1
0002 0002_2
0002 0002_2
0002 0002_3
0002 0002_3
and another one:
ID Label
0001 0001_1
0001 0001_1
0001 0001_2
0001 0001_2
0001 0001_3
0001 0001_3
0002 0002_1
0002 0002_1
0002 0002_2
0002 0002_2
0002 0002_3
0002 0002_3
The goal is the following:
if an ID in the first data set has only one distinct Label (e.g., 0001_1), that ID in the second data set should take that Label. If the ID has multiple labels in the first data set, nothing should change. The desired output should be:
ID Label
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0002 0002_1
0002 0002_1
0002 0002_2
0002 0002_2
0002 0002_3
0002 0002_3
Thank you in advance
Best
You will want to compute the groups in the first table that have a single label in aggregate and apply that label to the groups in the second table.
Example:
Computation with PROC FREQ and application via MERGE.
data have1;
call streaminit(20231);
do id = 1 to 10;
do seq = 1 to rand('integer', 10) + 2;
if mod(id,2) = 0
then label = 'AAA';
else label = repeat(byte(64+rand('integer', 26)),2);
output;
end;
end;
run;
data have2;
call streaminit(20232);
do id = 1 to 10;
do seq = 1 to rand('integer', 12) + 2;
label = repeat(byte(64+rand('integer', 26)),2);
output;
end;
end;
run;
proc freq noprint data=have1;
by id;
table label / out=one_label(where=(percent=100));
run;
data want2;
merge
have2
one_label(keep=id label rename=(label=have1label) in=reassign)
;
by id;
if reassign then label = have1label;
drop have1label;
run;
Same result achieved with SQL code, performing computation in a sub-select and using COALESCE for application.
proc sql;
create table want2 as
select
have2.id
, coalesce(singular.onelabel, have2.label) as label
from
have2
left join
( select unique id, label as onelabel
from have1
group by id
having count(distinct label) = 1
) as singular
on
have2.id = singular.id
;
quit;
Suppose we have the following:
ID Start
0001 31JAN2015
0001 31JAN2015
0003 16FEB2016
0006 01FEB2018
0004 31DEC2016
0004 31DEC2016
Is there a way to retrieve the IDs that have an identical Start date, i.e. that are duplicated?
Desired output:
ID Start
0001 31JAN2015
0001 31JAN2015
0004 31DEC2016
0004 31DEC2016
Thank you in advance
Use proc sort to remove duplicates and create a list of IDs with duplicates.
proc sort data=have nodupkey dupout=dupes;
by id start; /* the key is ID plus Start; the sample data has no variable named date */
run;
In your example, ID 4 does not have a duplicate start date although the start day itself is the same (31st).
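Note that DUPOUT= only receives the observations that NODUPKEY discards, so the first row of each duplicated key is not in it. To list every row whose ID and Start combination occurs more than once, as in the desired output, a sketch using FIRST. and LAST. variables might look like this (assuming the input is called `have` with variables `id` and `start`; `sorted` and `dup_rows` are illustrative names):

```sas
proc sort data=have out=sorted;
  by id start;
run;

data dup_rows;
  set sorted;
  by id start;
  /* a key that occurs once is both first and last in its BY group */
  if not (first.start and last.start);
run;
```

Writing to `out=sorted` leaves the original `have` untouched.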
Is there a way to remove time periods (for the same variable, in my case "absence_reason") when they are sub-periods of larger ones?
Suppose we have the following:
data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason :$20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation
0001 02JAN2015 02JAN2015 vacation
0001 13APR2015 31DEC2015 sick leave
0002 01JAN2017 12JUL2017 vacation
0002 12JUN2017 18JUN2017 vacation
...;
Desired output:
data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason :$20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation
0001 13APR2015 31DEC2015 sick leave
0002 01JAN2017 12JUL2017 vacation
...;
Sub-periods are always completely overlapping (considering Start and End).
Thank you in advance
Although I agree with Dirk that this is not a very reliable practice, this code might help you get the idea:
proc sort data=DB1;
by Id Absence_reason Start;
run;
data will;
set DB1;
by Id Absence_reason Start;
lastEnd = lag(End);
if First.Absence_reason then
output;
else do;
if lastEnd < Start then
output;
end;
drop lastEnd ;
run;
Output:
0001 13APR2015 31DEC2015 sick
0001 01JAN2015 06FEB2015 vacation
0002 01JAN2017 12JUL2017 vacation
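The step above compares each Start only with the End of the immediately preceding row, so a second sub-period nested inside the same long period could slip through. A sketch that instead tracks the latest End seen so far within each ID and Absence_reason is shown below. `sorted`, `want_nested`, and `maxEnd` are illustrative names, and it assumes, as stated in the question, that sub-periods are always fully contained in a larger period.

```sas
proc sort data=DB1 out=sorted;
  /* for equal Start dates, the longest period comes first */
  by ID Absence_reason Start descending End;
run;

data want_nested;
  set sorted;
  by ID Absence_reason;
  retain maxEnd;
  /* keep a row only if it reaches beyond everything kept so far */
  if first.Absence_reason or End > maxEnd then do;
    output;
    maxEnd = End;
  end;
  drop maxEnd;
run;
```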
Is there a way to add a suffix to IDs based on a column? Suppose we have, for example:
ID Suffix
0001 1
0001 1
0001 1
0001 1
0001 2
0001 2
0002 1
0002 2
0002 2
0002 1
..... ....
Desired output
ID
0001_1
0001_1
0001_1
0001_1
0001_2
0001_2
0002_1
0002_2
0002_2
0002_1
.....
Thank you in advance
Sure can! Concatenate them together using cats(). This is assuming id is a character. If it is numeric, you need to create a new column.
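data want;
length id $ 20; /* assumption: widen id before SET so the suffixed value is not truncated to the original length */
set have;
id = cats(id, '_', suffix);
run;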
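For the numeric case mentioned above, a new character column is needed. A minimal sketch, assuming `id` is numeric and should display as four zero-padded digits, with `want_num` and `id_suffix` as illustrative names:

```sas
data want_num;
  set have;
  length id_suffix $ 20;
  /* PUT with Z4. turns 1 into '0001'; CATX joins the parts with '_' */
  id_suffix = catx('_', put(id, z4.), suffix);
run;
```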
Suppose we have the following:
ID Start End Place
0001 13JAN2015 20JAN2015 HospA
0001 21JAN2015 31DEC2015 HospA
0001 01JAN2018 31DEC2018 HospB
0001 01JAN2019 31DEC2019 HospA
0002 01JAN2015 31DEC2015 HospA
0002 01JAN2019 31OCT2019 HospA
0002 01NOV2019 31DEC2020 HospA
..... ........ ......... .....
I would like to set a flag for the start and a flag for the end as follows:
For each ID, when periods are consecutive and the Place is the same, put "1" in the StartFlag column on the row with the first Start date (first two rows of the desired output) and fill the remaining StartFlag and EndFlag values with 0;
If there is a jump of years for the same ID and the Place changes, put "1" in the StartFlag column on the row where the new Place starts and 0 elsewhere (including EndFlag). This refers to rows 3 and 4 of the desired output. The idea is to record the change in Place;
If there is a jump of years but the Place does not change, put "1" in the StartFlag column for the first Start date, "9" in the EndFlag column on the row before the jump, and "9" in the StartFlag column on the first row of the next non-consecutive period.
I tried with IF statements, but I don't know how to refer to the first date of each consecutive or non-consecutive period of each Place.
Thank you in advance
Desired output:
ID Start End Place StartFlag EndFlag
0001 13JAN2015 20JAN2015 HospA 1 0
0001 21JAN2015 31DEC2015 HospA 0 0
0001 01JAN2018 31DEC2018 HospB 1 0
0001 01JAN2019 31DEC2019 HospA 1 0
0002 01JAN2015 30SEP2015 HospA 1 0
0002 01OCT2015 31DEC2015 HospA 0 9
0002 01JAN2019 31OCT2019 HospA 9 0
0002 01NOV2019 31DEC2020 HospA 0 0
..... ........ ......... .....
You can use BY group processing to detect the first or last observation for an ID, and you can extend it to include PLACE by using the NOTSORTED option. To compare the dates, however, you need to look both back and ahead. Looking back is easy with the LAG() function. Looking ahead takes a little more work; here is a simple method that uses dataset options to read the same data starting from the second observation.
First make sure the data is sorted by ID and START date.
data have;
input ID $ (Start End) (:date.) Place $;
format start end date9.;
cards;
0001 13JAN2015 20JAN2015 HospA
0001 21JAN2015 31DEC2015 HospA
0001 01JAN2018 31DEC2018 HospB
0001 01JAN2019 31DEC2019 HospA
0002 01JAN2015 31DEC2015 HospA
0002 01JAN2019 31OCT2019 HospA
0002 01NOV2019 31DEC2020 HospA
;
proc sort data=have;
by id start ;
run;
If you tell the data step the data is grouped by ID and PLACE you can use the FIRST.PLACE and LAST.PLACE flags. You just need to add some logic to test the date intervals.
data want;
set have ;
by id place notsorted;
lag_end=lag(end);
format lag_end date9.;
set have(firstobs=2 keep=start rename=(start=next_start))
have(obs=1 drop=_all_)
;
if first.place then startflag = 1;
else if lag_end+1 < start then startflag=1;
else startflag=0;
if last.place then endflag=1;
else if (end+1 < next_start) then endflag=1;
else endflag=0;
run;
The lag function is quite powerful, but here it is sufficient to use the retain statement to get the results. Normally, a data step sees only one row at a time in the program data vector, but with retain you can carry values over to the next row.
At first, we generate the sample data and sort it by id and start:
data have;
input ID $ (Start End) (:date.) Place $;
format start end date9.;
cards;
0001 13JAN2015 20JAN2015 HospA
0001 21JAN2015 31DEC2015 HospA
0001 01JAN2018 31DEC2018 HospB
0001 01JAN2019 31DEC2019 HospA
0002 01JAN2015 30SEP2015 HospA
0002 01OCT2015 31DEC2015 HospA
0002 01JAN2019 31OCT2019 HospA
0002 01NOV2019 31DEC2020 HospA
;
run;
proc sort data=have;
by id start ;
run;
Now we use the retain statement to keep the values of id, start, end and place of the previous row.
(Actually, we don't need the by statement here, because the FIRST. and LAST. variables are not used.)
In this way, we can put the right values into start_flag:
data want (drop=prev:);
set have;
by id start;
format prev_start prev_end date9.;
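length prev_id prev_place $ 8; /* declare as character (ID and Place are $8 from list input); otherwise they would default to numeric */
retain prev_id prev_start prev_end prev_place;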
if prev_id eq id AND prev_place eq place
AND prev_end+1 eq start
then start_flag=0;
else start_flag=1;
if prev_id eq id AND prev_place eq place
AND prev_end+1 ne start
then start_flag=9;
output;
prev_id=id; prev_start=start; prev_end=end; prev_place=place;
run;
End_flag is a little more tricky, because we need the next row, not the previous one.
Hence, we sort our data in the reverse order and use another retain (if the data amount is huge, consider using an index or two...).
proc sort data=want;
by descending id descending start;
run;
data want_final (drop=next_startflag);
set want;
by descending id descending start;
retain next_startflag;
if next_startflag eq 9 then end_flag=9;
else end_flag=0;
output;
next_startflag=start_flag;
run;
Finally we sort the data back in the original order:
proc sort data=want_final;
by id start;
run;