Run a code while stratifying by two variables


Suppose I have the following simple case (for explanatory purposes only; the original data are too complicated to show):
data have;
input ID :$20. Label1 :$20. Label2 :$20. Hours :$20.;
cards;
0001 rep1 w 345
0001 rep1 f 985
0001 rep1 w 367
0001 rep2 w 65
0001 rep2 w 123
0001 rep2 f 120
0002 rep6 f 45
0002 rep6 w 657
0002 rep6 w 45
0002 rep1 w 567
0002 rep1 f 78
0002 rep1 w 9
..... .... ... ...
;
For each ID, I would like to sum the hours corresponding to "w", while also stratifying by Label1, i.e. rep*. I used:
data want;
set have;
by ID Label1;
if first.ID
..........
if last.ID
.........
run;
Although I was able to stratify by ID, I was not able to stratify by Label1.
Is it possible to write something like: if first.ID and first.Label1 then....?
While experimenting, SAS also gave me the following error:
"by variables are not properly sorted on data set have". Input data are sorted by ID.
Thank you in advance

As you said, the input data is sorted by ID, so you can use first.ID. But the data is not sorted by Label1 within ID, so you cannot use first.Label1. To use both, you have to sort by both variables:
proc sort data=have;
by ID Label1;
run;
Keep in mind that first.Label1 is set at the start of every Label1 group within each ID, so in your sample data first.Label1=1 occurs twice for rep1, once per ID:
ID Label1 Label2 Hours first.ID first.Label1
0001 rep1 w 345 1 1
0001 rep1 f 985 0 0
0001 rep1 w 367 0 0
0001 rep2 w 65 0 1
0001 rep2 w 123 0 0
0001 rep2 f 120 0 0
0002 rep1 w 567 1 1
0002 rep1 f 78 0 0
0002 rep1 w 9 0 0
0002 rep6 f 45 0 1
0002 rep6 w 657 0 0
0002 rep6 w 45 0 0
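
Building on the sort above, the sum of "w" hours per ID and Label1 could be sketched as follows. This is only one possible approach, not necessarily what the asker ended up using. Two assumptions are made: Hours is read as a numeric variable (the question reads it as character, which cannot be summed directly), and Hours_w is a hypothetical name for the running total.

```sas
/* Sample rows from the question; Hours is read as numeric here (assumption) */
data have;
input ID :$20. Label1 :$20. Label2 :$20. Hours;
cards;
0001 rep1 w 345
0001 rep1 f 985
0001 rep1 w 367
0001 rep2 w 65
0001 rep2 w 123
0001 rep2 f 120
0002 rep6 f 45
0002 rep6 w 657
0002 rep6 w 45
0002 rep1 w 567
0002 rep1 f 78
0002 rep1 w 9
;

/* BY-group processing needs the data ordered by every BY variable */
proc sort data=have;
by ID Label1;
run;

data want;
set have;
by ID Label1;
/* first.Label1 is also 1 whenever first.ID is 1, so one reset is enough */
if first.Label1 then Hours_w = 0;
/* sum statement: Hours_w (hypothetical name) is retained across rows */
if Label2 = 'w' then Hours_w + Hours;
/* write one row per ID/Label1 group */
if last.Label1 then output;
keep ID Label1 Hours_w;
run;
```

For the twelve sample rows this gives 712 for 0001/rep1, 188 for 0001/rep2, 576 for 0002/rep1 and 702 for 0002/rep6.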

Related

Assign labels based on multiple conditions

Suppose I have the following:
ID Start_date End_date Hospital Work
00001 01JAN2015 15JAN2015 006 w
00001 16JAN2015 16JAN2015 006 p
00001 17JAN2015 20JAN2015 006 w
00001 21JAN2015 29JAN2015 006 f
00001 30JAN2015 02FEB2015 004 w
00001 03FEB2015 03FEB2015 004 s
00001 04FEB2015 08FEB2015 004 w
00001 09FEB2015 13FEB2015 004 f
00001 14FEB2015 16FEB2015 006 f
00001 17FEB2015 28DEC2016 006 w
00001 29DEC2016 31DEC2016 006 w
.... ..... ...... ... ...
Desired output:
ID Start_date End_date Hospital Work Flag1 Flag2
00001 01JAN2015 15JAN2015 006 w 1 4
00001 16JAN2015 16JAN2015 006 p 4 9
00001 17JAN2015 20JAN2015 006 w 9 4
00001 21JAN2015 29JAN2015 006 f 4 9
00001 30JAN2015 02FEB2015 004 w 9 2
00001 03FEB2015 03FEB2015 004 s 2 9
00001 04FEB2015 08FEB2015 004 w 9 4
00001 09FEB2015 13FEB2015 004 f 4 9
00001 14FEB2015 16FEB2015 006 f 9 2
00001 17FEB2015 28DEC2016 006 w 2 4
00001 29DEC2016 31DEC2016 006 w 4 Stop
.... ..... ...... ... ...
In other words, I need to add two columns, Flag1 and Flag2, containing indices according to the following criteria.
For the first Start_date of an ID, Flag1 must always be 1. Flag2 then takes one of four values: 4 if Work is "w"; 9 if Work is not "w" (f, s or other); 2 if Hospital changes (here from 006 to 004 and then back to 006); and Stop at the end of the period, here 31DEC2016, although it could be 31DEC2019 or 31DEC2020 depending on the ID. In total I have 350 IDs, each repeated because there are many periods per ID.
On every later row, Flag1 takes the previous row's Flag2 value.
Can anyone help me please? Thank you in advance.

data source_data;
input ID :$5. Start_date :date9. End_flag :date9. Hospital :$3. Work :$1.;
format Start_date End_flag date9.;
datalines;
00001 01JAN2015 15JAN2015 006 w
00001 16JAN2015 16JAN2015 006 p
00001 17JAN2015 20JAN2015 006 w
00001 21JAN2015 29JAN2015 006 f
00001 30JAN2015 02FEB2015 004 w
00001 03FEB2015 03FEB2015 004 s
00001 04FEB2015 08FEB2015 004 w
00001 09FEB2015 13FEB2015 004 f
00001 14FEB2015 16FEB2015 006 f
00001 17FEB2015 28DEC2016 006 w
00001 29DEC2016 31DEC2016 006 w
;
proc sort data=source_data;
by ID start_date hospital;
run;
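data destination_data;
/* listing the variables first fixes the column order of the output */
retain ID Start_date End_flag Hospital Work Flag1 Flag2;
attrib Flag1 length=$8 Flag2 length=$8;
set source_data;
by id start_date hospital;
/* call LAG on every row so its queue always holds the previous row's value */
prev_hospital=lag(hospital);
if work='w' then Flag2='4';
else Flag2='9';
if not first.ID and prev_hospital NE hospital then Flag2='2';
if last.ID then Flag2='Stop';
/* Flag1 carries the previous row's Flag2, except on the first row of an ID */
Flag2R=lag(Flag2);
if first.ID then flag1='1';
else flag1=Flag2R;
drop Flag2R prev_hospital;
run;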
proc print data=destination_data noobs;
run;

Loop over time periods

Suppose I have the following data set:
ID Date_Start Date_End Flag1 Flag2
001 13JAN2015 01JUN2018 1 0
001 02JUN2018 02JUL2018 1 0
001 03JUL2018 31DEC2020 1 0
002 01JAN2015 31DEC2020 1 0
003 01JAN2017 31DEC2019 1 0
003 01JAN2020 31DEC2021 1 0
004 01JAN2011 31DEC2021 1 2
..... ......... ......... ..... ......
Desired output:
ID Date_Start Date_End Flag1 Flag2
001 13JAN2015 01JUN2018 1 0
001 02JUN2018 02JUL2018 1 0
001 03JUL2018 31DEC2020 1 10
002 01JAN2015 31DEC2020 1 10
003 01JAN2017 31DEC2019 1 0
003 01JAN2020 31DEC2021 1 10
004 01JAN2011 31DEC2021 1 2
..... ......... ......... ..... ......
In other words: if Flag2 == 0 and Flag1 == 1 replace the flag in Flag2 column with 10 for each ID as follows:
for replicated IDs take the last interval of time;
for unique IDs take the interval you have.
I'm a newbie in SAS programming. I know that what I have to do is:
data mydata;
set input;
if Flag2 = 0 AND Flag1 = 1 then Flag2 = 10;
run;
but I don't know how to manage periods and replicated IDs. Can anyone help me please?

I'm not entirely sure here, but I think this is what you want.
data have;
input ID $ (Date_Start Date_End)(:date9.) Flag1 Flag2;
format Date_Start Date_End date9.;
datalines;
001 13JAN2015 01JUN2018 1 0
001 02JUN2018 02JUL2018 1 0
001 03JUL2018 31DEC2020 1 0
002 01JAN2015 31DEC2020 1 0
003 01JAN2017 31DEC2019 1 0
003 01JAN2020 31DEC2021 1 0
004 01JAN2011 31DEC2021 1 2
;
data want;
set have;
by ID;
if last.ID and flag1 = 1 and flag2 = 0 then flag2 = 10;
run;
Result
ID Date_Start Date_End Flag1 Flag2
001 13JAN2015 01JUN2018 1 0
001 02JUN2018 02JUL2018 1 0
001 03JUL2018 31DEC2020 1 10
002 01JAN2015 31DEC2020 1 10
003 01JAN2017 31DEC2019 1 0
003 01JAN2020 31DEC2021 1 10
004 01JAN2011 31DEC2021 1 2
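
The step above relies on the rows already being in chronological order within each ID, since last.ID simply marks the final row of each BY group. If that order is not guaranteed, a sort could be added first. The following is a minimal sketch under that assumption; have_sorted and want_sorted are hypothetical names.

```sas
/* Assumes the data set HAVE created in the answer above */
proc sort data=have out=have_sorted;
by ID Date_Start;
run;

data want_sorted;
set have_sorted;
by ID;
/* last.ID now marks the latest interval of each ID;
   an ID with a single row is both first.ID and last.ID */
if last.ID and Flag1 = 1 and Flag2 = 0 then Flag2 = 10;
run;
```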

Merge and update one file based on another one

Suppose I have two data sets (files). File1 consists of time periods, each with a label, and File2 contains sub-periods without labels. I need to add labels to File2 based on the time intervals in File1: if a sub-period is contained in a period of File1 that has Label "x", the sub-period takes that label.
Can anyone help me please?
data have1;
input ID :$20. Start :date9. End :date9. Label :$20. Role :$20.;
format start end yymmdd10.;
cards;
0001 01JAN2015 30APR2015 HospitalA ex005
0001 01MAY2015 31MAY2015 HospitalA ex004
0001 01JUN2015 31DEC2015 HospitalC ex005
0002 06FEB2018 08FEB2018 HospitalA ex004
0002 09FEB2018 31AUG2018 HospitalC ex005
0002 01SEP2018 31DEC2019 HospitalC ex004
0003 01JAN2019 30SEP2019 HospitalD ex008
0003 01OCT2019 31DEC2020 HospitalD ex004
;
File2:
data have2;
input ID :$20. Start :date9. End :date9.;
format start end yymmdd10.;
cards;
0001 01JAN2015 30JAN2015
0001 31JAN2015 15FEB2015
0001 15FEB2015 30APR2015
0001 01MAY2015 15MAY2015
0001 16MAY2015 31MAY2015
0001 01JUN2015 15SEP2015
0001 16SEP2015 31DEC2015
......
;
File3 desired output:
data output;
input ID :$20. Start :date9. End :date9. Label :$20. Role :$20.;
format start end yymmdd10.;
cards;
0001 01JAN2015 30JAN2015 HospitalA ex005
0001 31JAN2015 15FEB2015 HospitalA ex005
0001 15FEB2015 30APR2015 HospitalA ex005
0001 01MAY2015 15MAY2015 HospitalA ex004
0001 16MAY2015 31MAY2015 HospitalA ex004
0001 01JUN2015 15SEP2015 HospitalC ex005
0001 16SEP2015 31DEC2015 HospitalC ex005
......
;

Try this
data have1;
input ID :$20. Start :date9. End :date9. Label :$20. Role :$20.;
format start end yymmdd10.;
cards;
0001 01JAN2015 30APR2015 HospitalA ex005
0001 01MAY2015 31MAY2015 HospitalA ex004
0001 01JUN2015 31DEC2015 HospitalC ex005
0002 06FEB2018 08FEB2018 HospitalA ex004
0002 09FEB2018 31AUG2018 HospitalC ex005
0002 01SEP2018 31DEC2019 HospitalC ex004
0003 01JAN2019 30SEP2019 HospitalD ex008
0003 01OCT2019 31DEC2020 HospitalD ex004
;
data have2;
input ID :$20. Start :date9. End :date9.;
format start end yymmdd10.;
cards;
0001 01JAN2015 30JAN2015
0001 31JAN2015 15FEB2015
0001 15FEB2015 30APR2015
0001 01MAY2015 15MAY2015
0001 16MAY2015 31MAY2015
0001 01JUN2015 15SEP2015
0001 16SEP2015 31DEC2015
;
data want(drop = s e);
if _N_ = 1 then do;
dcl hash h(dataset : 'have1(rename = (Start = s End = e))', multidata : 'Y');
h.definekey('ID');
h.definedata('s', 'e', 'Label', 'Role');
h.definedone();
dcl hiter i('h');
end;
set have2;
if 0 then set have1(rename = (Start = s End = e));
call missing(s, e, Label, Role);
do while (i.next() = 0);
if Start >= s and End <= e then leave;
else call missing(Label, Role);
end;
run;
Result:
ID Start End Label Role
0001 2015-01-01 2015-01-30 HospitalA ex005
0001 2015-01-31 2015-02-15
0001 2015-02-15 2015-04-30 HospitalA ex005
0001 2015-05-01 2015-05-15 HospitalA ex004
0001 2015-05-16 2015-05-31
0001 2015-06-01 2015-09-15 HospitalC ex005
0001 2015-09-16 2015-12-31
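
Some rows in the result above have blank labels, which does not match the desired output. One likely reason is that the iterator walks the whole hash table regardless of ID and keeps its position from one row of have2 to the next. A possible alternative, sketched here as an assumption rather than as the answerer's method, is to look up only the periods of the current ID with the hash FIND and FIND_NEXT methods. The names want_by_id and rc are hypothetical.

```sas
/* Assumes HAVE1 and HAVE2 exist as created above */
data want_by_id(drop = s e rc);
if _N_ = 1 then do;
/* load all periods of have1, keyed by ID; multidata keeps every period per ID */
dcl hash h(dataset : 'have1(rename = (Start = s End = e))', multidata : 'Y');
h.definekey('ID');
h.definedata('s', 'e', 'Label', 'Role');
h.definedone();
end;
set have2;
/* never executed; only defines s, e, Label and Role for the compiler */
if 0 then set have1(rename = (Start = s End = e));
call missing(s, e, Label, Role);
/* visit the periods stored for the current ID, one at a time */
rc = h.find();
do while (rc = 0);
/* stop at the first period that fully contains the sub-period */
if Start >= s and End <= e then leave;
rc = h.find_next();
end;
/* no containing period was found: leave the labels blank */
if rc ne 0 then call missing(Label, Role);
run;
```

Because every sub-period in the sample lies inside one period of the same ID, this should fill in Label and Role on all seven rows.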

Keeping data rows with same Unique ID after an Indicator variable changes from 1 to 0 for the first time in SAS

I need help with SAS code that keeps the rows for each unique ID after a certain condition is met. For example, suppose I have a dataset called BASE, as shown below:
Account_Number Default_Indicator
1010 0
1010 0
1010 1
1010 1
1010 1
1010 0
1010 0
1010 0
1010 1
1010 1
1020 0
1020 0
1020 0
1020 1
1020 1
1020 1
1020 0
1020 0
1020 1
1020 1
I would like the final dataset to keep rows after the Default_Indicator changes from 1 to 0 for the first time as shown below;
Account_Number Default_Indicator
1010 0
1010 0
1010 0
1010 1
1010 1
1020 0
1020 0
1020 1
1020 1
Help with this will be greatly appreciated.

You can use BY group processing; just add the NOTSORTED keyword to the BY statement. Use the LAG() function to access the value from the previous data step iteration. Retain a flag variable indicating that you have found a 1 -> 0 transition, and make sure to reset it when starting a new account.
data have;
row+1;
input Account_Number Default_Indicator @@ ;
cards;
1010 0 1010 0
1010 1 1010 1 1010 1
1010 0 1010 0 1010 0
1010 1 1010 1
1020 0 1020 0 1020 0
1020 1 1020 1 1020 1
1020 0 1020 0
1020 1 1020 1
;
data want ;
set have;
by account_number default_indicator notsorted;
lag_indicator=lag(default_indicator);
if first.account_number then call missing(found,lag_indicator);
if first.default_indicator and default_indicator=0 and lag_indicator=1 then found+1;
if found then output;
drop lag_indicator found;
run;
Results (without the DROP statement)
Obs row Account_Number Default_Indicator lag_indicator found
1 6 1010 0 1 1
2 7 1010 0 0 1
3 8 1010 0 0 1
4 9 1010 1 0 1
5 10 1010 1 1 1
6 17 1020 0 1 1
7 18 1020 0 0 1
8 19 1020 1 0 1
9 20 1020 1 1 1

How to mark a sequence of weights so that, once a weight of 30 or more occurs, the rest are marked with 1?

Here I have a list of weights for 2 subjects.
data weight_test;
format subject $3. weight 4.;
infile datalines dlm=" " dsd;
input subject weight ;
datalines;
001 27
001 27.5
001 28
001 30
001 29
001 29
002 29
002 30
002 31
002 29
;
run;
I want to mark the weight with 0 and 1:
If the weight < 30 then mark with 0;
Once the weight >= 30 occurs then mark the rest of weight within the same subject with 1.
As the following lists:
subject weight mark
001 27 0
001 27 0
001 28 0
001 30 1
001 29 1
001 29 1
002 29 0
002 30 1
002 31 1
002 29 1
I tried the following code, but it doesn't work properly. Please help me. Thank you.
data weight;
set weight_test;
by subject;
i=0;
retain i;
if weight < 30 then mark=i;
else if weight >= 30 then do;
i = 1;
mark = i;
end;
run;

You have overcomplicated it. Just set MARK to zero when you start a new subject and set it to one when the target weight is seen.
data weight_test;
input subject $ weight @@ ;
datalines;
001 27 001 27.5 001 28 001 30 001 29 001 29
002 29 002 30 002 31 002 29
;
data weight;
set weight_test;
by subject;
if first.subject then mark=0;
if weight >= 30 then mark=1;
retain mark;
run;
Results:
Obs subject weight mark
1 001 27.0 0
2 001 27.5 0
3 001 28.0 0
4 001 30.0 1
5 001 29.0 1
6 001 29.0 1
7 002 29.0 0
8 002 30.0 1
9 002 31.0 1
10 002 29.0 1
Make sure the variable MARK does not already exist in the input dataset.

Try this
data want;
mark=0; _iorc_=0;
do until (last.subject);
set weight_test;
by subject;
if weight >= 30 & _iorc_=0 then do;
_iorc_=1;
mark=1;
end;
output;
end;
run;
