Conditional replacement of labels based on another data set - sas

Suppose I have the following data set:
ID Label
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0002 0002_1
0002 0002_1
0002 0002_2
0002 0002_2
0002 0002_3
0002 0002_3
and another one:
ID Label
0001 0001_1
0001 0001_1
0001 0001_2
0001 0001_2
0001 0001_3
0001 0001_3
0002 0002_1
0002 0002_1
0002 0002_2
0002 0002_2
0002 0002_3
0002 0002_3
I want the following:
if the first data set contains only one distinct Label for an ID (e.g., only 0001_1 for ID 0001), every row of that ID in the second data set should get that Label. Otherwise, if the ID has multiple labels in the first data set, nothing should change. The desired output is:
ID Label
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0002 0002_1
0002 0002_1
0002 0002_2
0002 0002_2
0002 0002_3
0002 0002_3
Thank you in advance
Best

You will want to compute the groups in the first table that have a single label in aggregate and apply that label to the groups in the second table.
Example:
Computation with PROC FREQ and application via MERGE.
data have1;
call streaminit(20231);
do id = 1 to 10;
do seq = 1 to rand('integer', 10) + 2;
if mod(id,2) = 0
then label = 'AAA';
else label = repeat(byte(64+rand('integer', 26)),2);
output;
end;
end;
run;
data have2;
call streaminit(20232);
do id = 1 to 10;
do seq = 1 to rand('integer', 12) + 2;
label = repeat(byte(64+rand('integer', 26)),2);
output;
end;
end;
run;
proc freq noprint data=have1;
by id;
table label / out=one_label(where=(percent=100));
run;
data want2;
merge
have2
one_label(keep=id label rename=(label=have1label) in=reassign)
;
by id;
if reassign then label = have1label;
drop have1label;
run;
Same result achieved with SQL code, performing computation in a sub-select and using COALESCE for application.
proc sql;
create table want2 as
select
have2.id
, coalesce(singular.onelabel, have2.label) as label
from
have2
left join
( select unique id, label as onelabel
from have1
group by id
having count(distinct label) = 1
) as singular
on
have2.id = singular.id
;
quit;

Related

Subset repeated elements based on a list

Is there a way to subset a data set to certain IDs based on an external list?
In other words, I have a data set:
ID Value
0001 0.3
0001 0.6
0002 0.7
0002 0.71
0002 0.43
0003 0.01
0003 0.2
0005 12
0005 11
and a list:
ID
0001
0002
0005
The desired output would be:
ID Value
0001 0.3
0001 0.6
0002 0.7
0002 0.71
0002 0.43
0005 12
0005 11
My point is that in the data set, IDs are repeated.
Thank you in advance
SQL is the easiest for sure. Assuming ID list is in a data set, you can do something like this:
HAVE: Input data set with all records
ID_LIST: input data set with ids to be selected
proc sql;
create table want as
select * from have
where ID in
(select id from id_list);
quit;
There are several other ways to solve this including a data step merge (requires pre-sorting) and/or a hash table.
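The hash table alternative mentioned above might look roughly like this. It is a minimal sketch that assumes the same HAVE and ID_LIST data sets as the SQL example, and that ID has the same type and length in both; no pre-sorting is needed.

```sas
data want;
  /* load the ID list into an in-memory hash table once */
  if _n_ = 1 then do;
    declare hash ids(dataset: 'id_list');
    ids.defineKey('id');
    ids.defineDone();
  end;
  set have;
  /* check() returns 0 when the current ID is found in the list */
  if ids.check() = 0;
run;
```

Because HAVE is read sequentially, the rows keep their original order.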
One way of doing it is to save the IDs from the external list into a macro variable and then use that variable in a select statement.
data have;
input ID Value;
format ID z4.;
datalines;
0001 0.3
0001 0.6
0002 0.7
0002 0.71
0002 0.43
0003 0.01
0003 0.2
0005 12
0005 11
;
run;
The ID list can be imported from an external file using proc import (in this example I used an external xlsx file located at &path.).
proc import
datafile="&path."
out=list
dbms=xlsx
replace;
run;
Or can be a separate existing table.
data list;
input ID;
datalines;
0001
0002
0005
;
run;
Here we save IDs from your list into a macro variable x.
proc sql noprint;
select ID into :x separated by ','
from list
;
quit;
When you have your selected IDs in a macro variable you can use that for a simple select statement.
proc sql;
create table want as
select *
from have
where ID in (&x.)
;
quit;
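If ID is stored as a character variable rather than a number, the values in the macro variable need quotes. A possible variant, assuming a character ID in both HAVE and LIST (the macro variable name x_quoted is made up for this sketch):

```sas
proc sql noprint;
  /* quote() wraps each value in double quotes, e.g. "0001" */
  select quote(strip(ID)) into :x_quoted separated by ','
  from list
  ;
quit;

proc sql;
  create table want as
  select *
  from have
  where ID in (&x_quoted.)
  ;
quit;
```

Keep in mind that a macro variable can hold at most 65,534 characters, so for very long lists the sub-query or a join is the safer choice.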

Get duplicated by date

Suppose I have the following:
ID Start
0001 31JAN2015
0001 31JAN2015
0003 16FEB2016
0006 01FEB2018
0004 31DEC2016
0004 31DEC2016
Is there a way to retrieve the IDs that have an identical Start date, i.e. that are duplicated?
Desired output:
ID Start
0001 31JAN2015
0001 31JAN2015
0004 31DEC2016
0004 31DEC2016
Thank you in advance
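Use proc sort with NODUPKEY to remove duplicates. The DUPOUT= data set receives the removed rows, which gives a list of the ID and Start combinations that occur more than once (it holds only the second and later copies of each).
proc sort data=have out=nodupes nodupkey dupout=dupes;
by id start;
run;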
Note that the whole Start value must match for a row to count as a duplicate; sharing only the day of the month (for example, the 31st) is not enough.
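If every row of a duplicated ID and Start combination is needed, as in the desired output, one option is to count the rows per group in PROC SQL. This is a sketch that assumes the input data set is called HAVE and the date variable is named Start (dupes_all is a name chosen for illustration). SAS remerges the count with the detail rows and writes a note about that to the log.

```sas
proc sql;
  create table dupes_all as
  select *
  from have
  group by ID, Start
  /* keep only the groups that contain more than one row */
  having count(*) > 1
  order by ID, Start
  ;
quit;
```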

Remove overlapping time-periods

Is there a way to remove time periods (for the same variable, in my case "absence_reason") when they are sub-periods of larger ones?
Suppose I have the following:
data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason :$20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation
0001 02JAN2015 02JAN2015 vacation
0001 13APR2015 31DEC2015 sick leave
0002 01JAN2017 12JUL2017 vacation
0002 12JUN2017 18JUN2017 vacation
...;
Desired output:
data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason :$20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation
0001 13APR2015 31DEC2015 sick leave
0002 01JAN2017 12JUL2017 vacation
...;
Sub-periods are always completely overlapping (considering Start and End).
Thank you in advance
Although I agree with Dirk that this is not a very reliable practice, this code might help you get the idea:
proc sort data=DB1;
by Id Absence_reason Start;
run;
data will;
set DB1;
by Id Absence_reason Start;
lastEnd = lag(End);
if First.Absence_reason then
output;
else do;
if lastEnd < Start then
output;
end;
drop lastEnd ;
run;
Output:
0001 13APR2015 31DEC2015 sick
0001 01JAN2015 06FEB2015 vacation
0002 01JAN2017 12JUL2017 vacation
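Comparing against the previous row's End only works when a sub-period directly follows the period that contains it. A possible refinement, assuming the same DB1 data set, keeps the latest End seen so far within each ID and Absence_reason in a retained variable (maxEnd and the output name want are names chosen for this sketch):

```sas
proc sort data=DB1;
  /* for equal Start dates the longest period comes first */
  by ID Absence_reason Start descending End;
run;

data want;
  set DB1;
  by ID Absence_reason;
  retain maxEnd;
  /* reset the running maximum at the start of each group */
  if first.Absence_reason then maxEnd = .;
  /* a missing value is smaller than any date, so the first row is always kept */
  if End > maxEnd then do;
    output;
    maxEnd = End;
  end;
  drop maxEnd;
run;
```

A row is dropped when its End does not extend past the latest End already seen, which is the case for fully contained sub-periods. Periods that only partially overlap are kept unchanged.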

Add a suffix based on the values of a variable

Is there a way to add a suffix to IDs based on a column? Suppose I have, for example:
ID Suffix
0001 1
0001 1
0001 1
0001 1
0001 2
0001 2
0002 1
0002 2
0002 2
0002 1
..... ....
Desired output
ID
0001_1
0001_1
0001_1
0001_1
0001_2
0001_2
0002_1
0002_2
0002_2
0002_1
.....
Thank you in advance
Sure can! Concatenate them together using cats(). This assumes id is a character variable whose length is large enough to hold the suffixed value (otherwise the result is truncated). If id is numeric, you need to create a new column.
data want;
set have;
id = cats(id, '_', suffix);
run;
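For the numeric case, a new character column might be built as follows. This is only a sketch: it assumes id is numeric and should be shown with four digits and leading zeros, and id_suffixed is a hypothetical name for the new column.

```sas
data want;
  set have;
  /* reserve enough room for the ID, the underscore and the suffix */
  length id_suffixed $12;
  /* put() with z4. turns 1 into '0001'; catx() joins the parts with '_' */
  id_suffixed = catx('_', put(id, z4.), suffix);
run;
```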

Conditionally set flags based on dates and a character variable

Suppose I have the following:
ID Start End Place
0001 13JAN2015 20JAN2015 HospA
0001 21JAN2015 31DEC2015 HospA
0001 01JAN2018 31DEC2018 HospB
0001 01JAN2019 31DEC2019 HospA
0002 01JAN2015 31DEC2015 HospA
0002 01JAN2019 31OCT2019 HospA
0002 01NOV2019 31DEC2020 HospA
..... ........ ......... .....
I would like to set a flag for the start and a flag for the end as follows:
For each ID, for consecutive periods with the same Place, put "1" in the StartFlag column for the first date in the Start column (first two rows of the desired output) and fill the remaining StartFlag and EndFlag values with 0.
If there is a jump of years for the same ID and the Place changes, put "1" in the StartFlag column of the new period and 0 in the EndFlag column. This refers to rows 3 and 4 of the desired output. The idea is to record the change in Place.
If there is a jump of years but the Place does not change, put "1" in the StartFlag column for the first date in the Start column, "9" in the EndFlag column of the period before the jump, and "9" in the StartFlag column of the next non-consecutive period.
I tried with IF statements, but I don't know how to refer to the first date of each consecutive or non-consecutive period of each Place.
Thank you in advance
Desired output:
ID Start End Place StartFlag EndFlag
0001 13JAN2015 20JAN2015 HospA 1 0
0001 21JAN2015 31DEC2015 HospA 0 0
0001 01JAN2018 31DEC2018 HospB 1 0
0001 01JAN2019 31DEC2019 HospA 1 0
0002 01JAN2015 30SEP2015 HospA 1 0
0002 01OCT2015 31DEC2015 HospA 0 9
0002 01JAN2019 31OCT2019 HospA 9 0
0002 01NOV2019 31DEC2020 HospA 0 0
..... ........ ......... .....
You can use BY group processing to detect the first or last observation for an ID, and you can extend it to include PLACE by using the NOTSORTED option. To compare the dates, however, you need to look both back and ahead. Looking back is easy with the LAG() function. Looking ahead takes a little more work; a simple method is a second SET statement that uses data set options to start reading from the second observation.
First make sure the data is sorted by ID and START date.
data have;
input ID $ (Start End) (:date.) Place $;
format start end date9.;
cards;
0001 13JAN2015 20JAN2015 HospA
0001 21JAN2015 31DEC2015 HospA
0001 01JAN2018 31DEC2018 HospB
0001 01JAN2019 31DEC2019 HospA
0002 01JAN2015 31DEC2015 HospA
0002 01JAN2019 31OCT2019 HospA
0002 01NOV2019 31DEC2020 HospA
;
proc sort data=have;
by id start ;
run;
If you tell the data step the data is grouped by ID and PLACE you can use the FIRST.PLACE and LAST.PLACE flags. You just need to add some logic to test the date intervals.
data want;
set have ;
by id place notsorted;
lag_end=lag(end);
format lag_end date9.;
set have(firstobs=2 keep=start rename=(start=next_start))
have(obs=1 drop=_all_)
;
if first.place then startflag = 1;
else if lag_end+1 < start then startflag=1;
else startflag=0;
if last.place then endflag=1;
else if (end+1 < next_start) then endflag=1;
else endflag=0;
run;
The lag function is quite powerful, but here the retain statement is sufficient to get the results. Normally a data step sees one row at a time in the program data vector, but retain carries values over to the next row.
At first, we generate the sample data and sort it by id and start:
data have;
input ID $ (Start End) (:date.) Place $;
format start end date9.;
cards;
0001 13JAN2015 20JAN2015 HospA
0001 21JAN2015 31DEC2015 HospA
0001 01JAN2018 31DEC2018 HospB
0001 01JAN2019 31DEC2019 HospA
0002 01JAN2015 30SEP2015 HospA
0002 01OCT2015 31DEC2015 HospA
0002 01JAN2019 31OCT2019 HospA
0002 01NOV2019 31DEC2020 HospA
;
run;
proc sort data=have;
by id start ;
run;
Now we use the retain statement to keep the values of id, start, end and place of the previous row.
(Actually, we don't need the by statement here, because the FIRST. and LAST. variables are not used.)
In this way, we can put the right values into start_flag:
data want (drop=prev:);
set have;
by id start;
format prev_start prev_end date9.;
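/* declare the character carry-over variables; otherwise SAS would create them as numeric */
length prev_id prev_place $8;
retain prev_id prev_start prev_end prev_place;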
if prev_id eq id AND prev_place eq place
AND prev_end+1 eq start
then start_flag=0;
else start_flag=1;
if prev_id eq id AND prev_place eq place
AND prev_end+1 ne start
then start_flag=9;
output;
prev_id=id; prev_start=start; prev_end=end; prev_place=place;
run;
End_flag is a little more tricky, because we need the next row, not the previous one.
Hence, we sort our data in reverse order and use another retain (if the data set is large, consider using an index or two...).
proc sort data=want;
by descending id descending start;
run;
data want_final (drop=next_startflag);
set want;
by descending id descending start;
retain next_startflag;
if next_startflag eq 9 then end_flag=9;
else end_flag=0;
output;
next_startflag=start_flag;
run;
Finally we sort the data back in the original order:
proc sort data=want_final;
by id start;
run;