I have the following data set:
ID Start Stop
001 01JAN2013 31JAN2013
001 01FEB2013 31DEC2013
002 01MAR2013 31DEC2013
003 01JAN2013 31DEC2013
I need the following output:
ID Start Stop Start_flag End_flag
001 01JAN2013 31JAN2013 1 2
001 01FEB2013 31DEC2013 2 3
002 01MAR2013 31DEC2013 1 2
003 01JAN2013 31DEC2013 1 2
In other words, I need to add a flag for the start and the end of each period. The exception is consecutive periods: there, the end flag of the previous period becomes the start flag of the following period, and that period's end flag is increased by 1.
Can anyone help me please?
Thank you in advance
Use the LAG() function
proc sort data=have; by id start; run;
data want(drop=lag_stop);
set have;
by id start notsorted;
lag_stop = lag(stop);
if first.id then do;
start_flag=1;
end_flag=start_flag+1;
end;
else if lag_stop+1 = start then do;
start_flag+1;
end_flag+1;
end;
run;
want
id start stop start_flag end_flag
001 01JAN2013 31JAN2013 1 2
001 01FEB2013 31DEC2013 2 3
002 01MAR2013 31DEC2013 1 2
003 01JAN2013 31DEC2013 1 2
Related
Is there a way in SAS to order the columns (variables) of a data set based on the column order of another data set? The names are exactly the same.
And is there also a way to append them (vertically) based on the same column names?
Thank you in advance
ID YEAR DAYS WORK DATASET
0001 2020 32 234 1
0002 2019 31 232 1
0003 2015 3 22 1
0004 2003 15 60 1
0005 2021 32 98 1
0006 2000 31 56 1
DATASET DAYS WORK ID YEAR
2 56 23 0001 2010
2 34 123 0002 2011
2 432 3 0003 2013
2 45 543 0004 2022
2 76 765 0005 2000
2 43 8 0006 1999
I just need to sort the second data set based on the first and append the second to the first.
Can anyone help me please?
This should work:
data have1;
input ID YEAR DAYS WORK DATASET;
format ID z4.;
datalines;
0001 2020 32 234 1
0002 2019 31 232 1
0003 2015 3 22 1
0004 2003 15 60 1
0005 2021 32 98 1
0006 2000 31 56 1
;
run;
data have2;
input DATASET DAYS WORK ID YEAR;
format ID z4.;
datalines;
2 56 23 0001 2010
2 34 123 0002 2011
2 432 3 0003 2013
2 45 543 0004 2022
2 76 765 0005 2000
2 43 8 0006 1999
;
run;
First we create a new table by copying our first table. Then we insert the rows from the second table into it, listing the columns by name. No need to change the column order of the original second table.
proc sql;
create table want as
select *
from have1
;
insert into want(ID, YEAR, DAYS, WORK, DATASET)
select ID, YEAR, DAYS, WORK, DATASET
from have2
;
quit;
I have no idea how you could sort based on something that is not there.
But appending is trivial. You can just set them together.
data want;
set one two;
run;
And if both datasets are already sorted by some key variables (year perhaps in your example?) then you could interleave the observations instead. Just add a BY statement.
data want;
set one two;
by year;
run;
And if you want to make a new version of the second dataset with the variable order modified to match the variable order in the first dataset (something that really has nothing to do with sorting the data) you could use the OBS= dataset option. Code like this will order the variables based on the order they have in ONE but not actually use any of the data from that dataset.
data want;
set one(obs=0) two;
run;
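Applied to the sample tables from the question, this could look like the sketch below. It assumes the data sets are named have1 and have2 as in the other answer; have2_reordered and want are hypothetical names. Both SET and PROC APPEND match variables by name, so the column order of have2 does not affect the appended result.

```sas
/* Sketch only: have1/have2 come from the other answer's sample data;  */
/* have2_reordered and want are hypothetical names.                    */

/* Copy of have2 with its columns in have1's order (no rows from have1) */
data have2_reordered;
  set have1(obs=0) have2;
run;

/* Append have2 to a copy of have1; variables are matched by name */
data want;
  set have1;
run;

proc append base=want data=have2;
run;
```

PROC APPEND adds the rows to the existing BASE= data set instead of rewriting it, which is why the sketch appends to a copy rather than to have1 itself.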
Is there a way to remove time periods (for the same variable, in my case "absence_reason") when they are sub-periods of larger ones?
Suppose I have the following:
data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason :$20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation
0001 02JAN2015 02JAN2015 vacation
0001 13APR2015 31DEC2015 sick leave
0002 01JAN2017 12JUL2017 vacation
0002 12JUN2017 18JUN2017 vacation
...;
Desired output:
data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason :$20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation
0001 13APR2015 31DEC2015 sick leave
0002 01JAN2017 12JUL2017 vacation
...;
Sub-periods are always completely overlapping (considering Start and End).
Thank you in advance
Although I agree with Dirk that this is not a very reliable practice, this code might help you get the idea:
proc sort data=DB1;
by Id Absence_reason Start;
run;
data will;
set DB1;
by Id Absence_reason Start;
lastEnd = lag(End);
if First.Absence_reason then
output;
else do;
if lastEnd < Start then
output;
end;
drop lastEnd ;
run;
Output:
0001 13APR2015 31DEC2015 sick
0001 01JAN2015 06FEB2015 vacation
0002 01JAN2017 12JUL2017 vacation
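Note that the code above compares each row only with the row directly before it, so a second sub-period nested in the same long period would be kept. A possible variation, as a sketch only, retains the latest End seen so far in each group and keeps a row only when it reaches past that date. The names sorted, want and max_end are assumptions for illustration, and the sketch relies on the question's statement that sub-periods are always completely contained.

```sas
/* Sketch: drop periods nested inside an earlier, longer period.      */
/* Assumes DB1 as defined in the question; sorted, want and max_end    */
/* are hypothetical names.                                             */
proc sort data=DB1 out=sorted;
  /* for equal Start dates, the longest period comes first */
  by ID Absence_reason Start descending End;
run;

data want;
  set sorted;
  by ID Absence_reason;
  retain max_end;
  /* reset the running maximum at the start of each group */
  if first.Absence_reason then max_end = .;
  /* a missing value sorts below every date, so the first row is kept */
  if End > max_end then do;
    output;
    max_end = End;
  end;
  drop max_end;
run;
```

Partially overlapping periods would still be kept by this sketch, since they extend past the running maximum.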
Suppose I have the following:
ID Start End Place
0001 13JAN2015 20JAN2015 HospA
0001 21JAN2015 31DEC2015 HospA
0001 01JAN2018 31DEC2018 HospB
0001 01JAN2019 31DEC2019 HospA
0002 01JAN2015 31DEC2015 HospA
0002 01JAN2019 31OCT2019 HospA
0002 01NOV2019 31DEC2020 HospA
..... ........ ......... .....
I would like to set a flag for the start and a flag for the end as follows:
for each ID, for consecutive periods with the same Place, put "1" in the StartFlag column for the first date in the Start column (first two rows of the desired output); fill the remaining StartFlag and EndFlag values with 0;
if there is a jump of years for the same ID and the Place changes, put "1" in the StartFlag column for that Start and 0 in the EndFlag column. This refers to rows 3 and 4 of the desired output. The idea is to record the change in the Place;
if there are jumps of years but the Place does not change, then put "1" in the StartFlag column for the first date in the Start column, "9" in the EndFlag column before the jump, and "9" in the StartFlag column at the start of the next non-consecutive period.
I tried with an if statement, but I don't know how to refer to the first date of each consecutive or non-consecutive period of each Place.
Thank you in advance
Desired output:
ID Start End Place StartFlag EndFlag
0001 13JAN2015 20JAN2015 HospA 1 0
0001 21JAN2015 31DEC2015 HospA 0 0
0001 01JAN2018 31DEC2018 HospB 1 0
0001 01JAN2019 31DEC2019 HospA 1 0
0002 01JAN2015 30SEP2015 HospA 1 0
0002 01OCT2015 31DEC2015 HospA 0 9
0002 01JAN2019 31OCT2019 HospA 9 0
0002 01NOV2019 31DEC2020 HospA 0 0
..... ........ ......... .....
You can use BY group processing to detect the first or last observation for an ID. And you can extend it to include PLACE by using the NOTSORTED option. But to compare the dates you need to look back and look ahead. Looking back is easy with the LAG() function. Looking ahead takes a little work. Here is a simple method that uses dataset options to read the same data again, starting from the second observation.
First make sure the data is sorted by ID and START date.
data have;
input ID $ (Start End) (:date.) Place $;
format start end date9.;
cards;
0001 13JAN2015 20JAN2015 HospA
0001 21JAN2015 31DEC2015 HospA
0001 01JAN2018 31DEC2018 HospB
0001 01JAN2019 31DEC2019 HospA
0002 01JAN2015 31DEC2015 HospA
0002 01JAN2019 31OCT2019 HospA
0002 01NOV2019 31DEC2020 HospA
;
proc sort data=have;
by id start ;
run;
If you tell the data step the data is grouped by ID and PLACE you can use the FIRST.PLACE and LAST.PLACE flags. You just need to add some logic to test the date intervals.
data want;
set have ;
by id place notsorted;
lag_end=lag(end);
format lag_end date9.;
set have(firstobs=2 keep=start rename=(start=next_start))
have(obs=1 drop=_all_)
;
if first.place then startflag = 1;
else if lag_end+1 < start then startflag=1;
else startflag=0;
if last.place then endflag=1;
else if (end+1 < next_start) then endflag=1;
else endflag=0;
run;
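The step above sets both flags to 1 wherever a period starts or ends. To get the 9 values the question asks for when there is a gap but the Place stays the same, the same look-back and look-ahead logic could be adapted as in the sketch below. It assumes the sorted have data set from above; want9 is a hypothetical name.

```sas
/* Sketch only: same technique as above, with 9 marking a gap inside */
/* one ID/Place run. want9 is a hypothetical data set name.          */
data want9;
  set have;
  by id place notsorted;
  lag_end = lag(end);
  /* look ahead: read START of the next observation as NEXT_START */
  set have(firstobs=2 keep=start rename=(start=next_start))
      have(obs=1 drop=_all_);
  if first.place then startflag = 1;             /* new ID or new Place */
  else if lag_end + 1 < start then startflag = 9; /* gap, same Place     */
  else startflag = 0;
  if not last.place and end + 1 < next_start then endflag = 9;
  else endflag = 0;
  drop lag_end next_start;
run;
```

The first.place and last.place checks keep lag_end and next_start from being compared across different IDs or Places.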
The lag function is quite powerful, but here the retain statement is sufficient to get the results. Normally a SAS data step holds only one row at a time in the program data vector, but with retain you can carry values over to the next row.
At first, we generate the sample data and sort it by id and start:
data have;
input ID $ (Start End) (:date.) Place $;
format start end date9.;
cards;
0001 13JAN2015 20JAN2015 HospA
0001 21JAN2015 31DEC2015 HospA
0001 01JAN2018 31DEC2018 HospB
0001 01JAN2019 31DEC2019 HospA
0002 01JAN2015 30SEP2015 HospA
0002 01OCT2015 31DEC2015 HospA
0002 01JAN2019 31OCT2019 HospA
0002 01NOV2019 31DEC2020 HospA
;
run;
proc sort data=have;
by id start ;
run;
Now we use the retain statement to keep the values of id, start, end and place of the previous row.
(Actually, we don't need the by statement here, because the FIRST. and LAST. variables are not used.)
In this way, we can put the right values into start_flag:
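data want (drop=prev:);
set have;
by id start;
/* declare the character copies before first use; otherwise SAS creates them as numeric */
/* ($ 8 matches the lengths of id and place in the sample data above) */
length prev_id prev_place $ 8;
format prev_start prev_end date9.;
retain prev_id prev_start prev_end prev_place;
/* same id and place, directly following the previous period: no new start */
if prev_id eq id AND prev_place eq place
AND prev_end+1 eq start
then start_flag=0;
else start_flag=1;
/* same id and place, but with a gap: mark with 9 */
if prev_id eq id AND prev_place eq place
AND prev_end+1 ne start
then start_flag=9;
output;
/* remember this row for the next iteration */
prev_id=id; prev_start=start; prev_end=end; prev_place=place;
run;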
End_flag is a little more tricky, because we need the next row, not the previous one.
Hence, we sort our data in the reverse order and use another retain (if the data amount is huge, consider using an index or two...).
proc sort data=want;
by descending id descending start;
run;
data want_final (drop=next_startflag);
set want;
by descending id descending start;
retain next_startflag;
if next_startflag eq 9 then end_flag=9;
else end_flag=0;
output;
next_startflag=start_flag;
run;
Finally we sort the data back in the original order:
proc sort data=want_final;
by id start;
run;
I have data that looks something like this:
ID Test Date
001 A 9/1/2011
001 A 10/2/2011
001 A 9/12/2012
001 A 10/10/2013
001 B 10/1/2011
001 B 1/1/2012
002 A 10/12/2014
002 A 10/13/2014
002 A 2/2/2015
002 A 11/15/2015
What I would like to do is read in the first record of ID/Test, and then compare it to the next record of the same ID/Test. If that test date is NOT at least 365 days later then delete it. And then re-test the next record. If it is at least 365 days later, then I will keep it, and use it as the new comparison date within that ID/Test group for the next records. But each ID/Test combination will have a varying number of records and dates.
I would like it to end up like this:
ID Test Date
001 A 9/1/2011
001 A 9/12/2012
001 A 10/10/2013
001 B 10/1/2011
002 A 10/12/2014
002 A 11/15/2015
Thanks for any help -
ETA: Code I have tried:
data want; set have;
lagid=lag(id); lagtest=lag(test); lagdate=lag(date);
if id=lagid AND test=lagtest then days=date-lagdate;
if 1 le days le 365 then delete;
run;
This code only works for pairs that are next to each other. In my sample data it would give me the incorrect results of -
ID Test Date
001 A 9/1/2011
001 A 10/10/2013
001 B 10/1/2011
002 A 10/12/2014
ETA: I found a solution using RETAIN and set by ID and Test.
data begin;
input ID Test $ date mmddyy10.;
cards;
001 A 09/01/2011
001 A 10/02/2011
001 A 09/12/2012
001 A 10/10/2013
001 B 10/01/2011
001 B 01/01/2012
002 A 10/12/2014
002 A 10/13/2014
002 A 02/02/2015
002 A 11/15/2015
;
run;
proc sort data=begin; by id test date; run;
data processed;
retain days_since;
set begin;
by id test;
if first.test then do; /*Prime the flow variable and output the base values*/
days_since=date;
output;
end;
if (date-days_since)>=365 then do;
days_since = date;
output;
end;
format date yymmdd10.;
run;
I would like to pick values within an ID variable which are within 10% of each other.
For example, my data looks like this:
ID Var1
001 100
001 109
001 200
001 210
001 220
001 300
001 310
002 500
002 510
My desired output is some way to flag this so that I can separate this into groups:
ID Var1 Flag
001 100 1
001 109 1
001 200 2
001 210 2
001 220 2
001 300 3
001 310 3
002 500 1
002 510 1
I tried using a lag function and flagging data but it only flags the second row in a pair; I am not able to pull both the values in a pair that are within 10 percent of each other.
Here's how to flag whether consecutive records are within 10% of each other. You can get the relative difference by dividing the current value by the previous one, subtracting 1 and taking the absolute value. This assumes your data is sorted by ID and ascending var1 value.
data want;
set have;
by ID;
retain group;
lagv1=lag(var1);
if first.id then do;
lagv1=.;
group=1;
end;
else do;
diff = abs(var1/lagv1-1);
if diff >0.1 then group+1;
end;
run;