Suppose I have the following:
ID Start_date End_date Hospital Work
00001 01JAN2015 15JAN2015 006 w
00001 16JAN2015 16JAN2015 006 p
00001 17JAN2015 20JAN2015 006 w
00001 21JAN2015 29JAN2015 006 f
00001 30JAN2015 02FEB2015 004 w
00001 03FEB2015 03FEB2015 004 s
00001 04FEB2015 08FEB2015 004 w
00001 09FEB2015 13FEB2015 004 f
00001 14FEB2015 16FEB2015 006 f
00001 17FEB2015 28DEC2016 006 w
00001 29DEC2016 31DEC2016 006 w
.... ..... ...... ... ...
Desired output:
ID Start_date End_date Hospital Work Flag1 Flag2
00001 01JAN2015 15JAN2015 006 w 1 4
00001 16JAN2015 16JAN2015 006 p 4 9
00001 17JAN2015 20JAN2015 006 w 9 4
00001 21JAN2015 29JAN2015 006 f 4 9
00001 30JAN2015 02FEB2015 004 w 9 2
00001 03FEB2015 03FEB2015 004 s 2 9
00001 04FEB2015 08FEB2015 004 w 9 4
00001 09FEB2015 13FEB2015 004 f 4 9
00001 14FEB2015 16FEB2015 006 f 9 2
00001 17FEB2015 28DEC2016 006 w 2 4
00001 29DEC2016 31DEC2016 006 w 4 Stop
.... ..... ...... ... ...
In other words, I need to add two columns, Flag1 and Flag2, containing indices that follow these criteria:
If it is the first Start_date for the ID, Flag1 must always be 1. Flag2 takes one of four values: 4 if Work is "w"; 9 if Work is not "w" (f, s or other); 2 if Hospital changes (here from 006 to 004 and then back to 006); and Stop at the end of the period, here 31DEC2016, though it could be 31DEC2019 or 31DEC2020 depending on the ID. In total I have 350 IDs, each repeated because there are many periods per ID.
Column Flag1 takes the Flag2 value of the previous row.
Can anyone help me please? Thank you in advance.
data source_data;
input ID :$5. Start_date :date9. End_date :date9. Hospital :$3. Work :$1.;
format Start_date End_date date9.;
datalines;
00001 01JAN2015 15JAN2015 006 w
00001 16JAN2015 16JAN2015 006 p
00001 17JAN2015 20JAN2015 006 w
00001 21JAN2015 29JAN2015 006 f
00001 30JAN2015 02FEB2015 004 w
00001 03FEB2015 03FEB2015 004 s
00001 04FEB2015 08FEB2015 004 w
00001 09FEB2015 13FEB2015 004 f
00001 14FEB2015 16FEB2015 006 f
00001 17FEB2015 28DEC2016 006 w
00001 29DEC2016 31DEC2016 006 w
;
proc sort data=source_data;
by ID start_date hospital;
run;
data destination_data;
/* RETAIN before SET only fixes the column order of the output */
retain ID Start_date End_date Hospital Work Flag1 Flag2;
attrib Flag1 length=$8 Flag2 length=$8;
set source_data;
by id start_date hospital;
/* Call LAG on every row: inside a condition it may be skipped and then return stale values */
prev_hospital=lag(hospital);
if work='w' then Flag2='4';
else Flag2='9';
if not first.ID and prev_hospital NE hospital then Flag2='2';
if last.ID then Flag2='Stop';
prev_flag2=lag(Flag2);
if first.ID then Flag1='1';
else Flag1=prev_flag2;
drop prev_hospital prev_flag2;
run;
proc print data=destination_data noobs;
run;
Suppose I have the following simple case (for explanatory purposes only; the original data are too complicated to show):
data have;
input ID :$20. Label1 :$20. Label2 :$20. Hours :$20.;
cards;
0001 rep1 w 345
0001 rep1 f 985
0001 rep1 w 367
0001 rep2 w 65
0001 rep2 w 123
0001 rep2 f 120
0002 rep6 f 45
0002 rep6 w 657
0002 rep6 w 45
0002 rep1 w 567
0002 rep1 f 78
0002 rep1 w 9
..... .... ... ...
;
I would like to sum, for each ID, the hours corresponding to "w", but also stratified by Label1, i.e. rep*. I used:
data want;
set have;
by ID Label1;
if first.ID
..........
if last.ID
.........
run;
Although I was able to stratify by ID, I was not able to stratify by Label1.
Is it possible to write: if first.ID and first.Label1 then ...?
During some attempts, SAS also gave me the following error:
"by variables are not properly sorted on data set have". The input data are sorted by ID.
Thank you in advance
As you said, the input data is sorted by ID, so you can use first.ID. But the data is not sorted by Label1, so you cannot use first.Label1. To use both, sort by both variables:
proc sort data=have;
by ID Label1;
run;
Keep in mind that after sorting, first.Label1=1 occurs once per ID for each label, so in your sample data rep1 gets it twice (once for 0001 and once for 0002):
ID Label1 Label2 Hours first.ID first.Label1
0001 rep1 w 345 1 1
0001 rep1 f 985 0 0
0001 rep1 w 367 0 0
0001 rep2 w 65 0 1
0001 rep2 w 123 0 0
0001 rep2 f 120 0 0
0002 rep1 w 567 1 1
0002 rep1 f 78 0 0
0002 rep1 w 9 0 0
0002 rep6 f 45 0 1
0002 rep6 w 657 0 0
0002 rep6 w 45 0 0
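With the data sorted by both variables, first.Label1 and last.Label1 mark the boundaries of each ID/Label1 group (first.Label1 is also 1 whenever ID changes), so there is no need to test first.ID and first.Label1 together. Below is a minimal sketch of the per-group sum of "w" hours. It assumes Hours is read as a numeric variable (the question reads it as character), and the names have_num and w_hours are hypothetical:

```sas
/* Hypothetical sample: Hours is read as numeric so it can be summed */
data have_num;
  input ID :$20. Label1 :$20. Label2 :$20. Hours;
  cards;
0001 rep1 w 345
0001 rep1 f 985
0001 rep1 w 367
0001 rep2 w 65
0001 rep2 w 123
0001 rep2 f 120
;

proc sort data=have_num;
  by ID Label1;
run;

data want;
  set have_num;
  by ID Label1;
  if first.Label1 then w_hours = 0;      /* reset at the start of each ID/Label1 group */
  if Label2 = 'w' then w_hours + Hours;  /* sum statement: w_hours is retained automatically */
  if last.Label1 then output;            /* keep one row per ID/Label1 group */
  keep ID Label1 w_hours;
run;
/* Expected rows: 0001 rep1 712 and 0001 rep2 188 */
```

The same totals could probably also be obtained with PROC MEANS or PROC SQL; the DATA step version is shown because it uses the first./last. variables discussed above.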
Suppose I have the following data set:
ID Date_Start Date_End Flag1 Flag2
001 13JAN2015 01JUN2018 1 0
001 02JUN2018 02JUL2018 1 0
001 03JUL2018 31DEC2020 1 0
002 01JAN2015 31DEC2020 1 0
003 01JAN2017 31DEC2019 1 0
003 01JAN2020 31DEC2021 1 0
004 01JAN2011 31DEC2021 1 2
..... ......... ......... ..... ......
Desired output:
ID Date_Start Date_End Flag1 Flag2
001 13JAN2015 01JUN2018 1 0
001 02JUN2018 02JUL2018 1 0
001 03JUL2018 31DEC2020 1 10
002 01JAN2015 31DEC2020 1 10
003 01JAN2017 31DEC2019 1 0
003 01JAN2020 31DEC2021 1 10
004 01JAN2011 31DEC2021 1 2
..... ......... ......... ..... ......
In other words: if Flag2 = 0 and Flag1 = 1, replace the value in the Flag2 column with 10 for each ID as follows:
for replicated IDs, use the last time interval;
for unique IDs, use the single interval available.
I'm a newbie in SAS programming. I know that what I have to do is something like:
data my_data;
set input;
if Flag2 = 0 AND Flag1 = 1 then Flag2 = 10;
run;
but I don't know how to handle periods and replicated IDs. Can anyone help me please?
I'm not entirely sure here, but I think this is what you want.
data have;
input ID $ (Date_Start Date_End)(:date9.) Flag1 Flag2;
format Date_Start Date_End date9.;
datalines;
001 13JAN2015 01JUN2018 1 0
001 02JUN2018 02JUL2018 1 0
001 03JUL2018 31DEC2020 1 0
002 01JAN2015 31DEC2020 1 0
003 01JAN2017 31DEC2019 1 0
003 01JAN2020 31DEC2021 1 0
004 01JAN2011 31DEC2021 1 2
;
data want;
set have;
by ID;
if last.ID and flag1 = 1 and flag2 = 0 then flag2 = 10;
run;
Result
ID Date_Start Date_End Flag1 Flag2
001 13JAN2015 01JUN2018 1 0
001 02JUN2018 02JUL2018 1 0
001 03JUL2018 31DEC2020 1 10
002 01JAN2015 31DEC2020 1 10
003 01JAN2017 31DEC2019 1 0
003 01JAN2020 31DEC2021 1 10
004 01JAN2011 31DEC2021 1 2
Suppose I have two data sets (files). File1 consists of time periods, each with a label, and File2 contains sub-periods without labels. I need to add labels to File2 based on the time intervals in File1: if a sub-period in File2 is contained in a period of File1 with label "x", the sub-period takes that label.
Can anyone help me please?
data have1;
input ID :$20. Start :date9. End :date9. Label :$20. Role :$20.;
format start end yymmdd10.;
cards;
0001 01JAN2015 30APR2015 HospitalA ex005
0001 01MAY2015 31MAY2015 HospitalA ex004
0001 01JUN2015 31DEC2015 HospitalC ex005
0002 06FEB2018 08FEB2018 HospitalA ex004
0002 09FEB2018 31AUG2018 HospitalC ex005
0002 01SEP2018 31DEC2019 HospitalC ex004
0003 01JAN2019 30SEP2019 HospitalD ex008
0003 01OCT2019 31DEC2020 HospitalD ex004
;
File2:
data have2;
input ID :$20. Start :date9. End :date9.;
format start end yymmdd10.;
cards;
0001 01JAN2015 30JAN2015
0001 31JAN2015 15FEB2015
0001 15FEB2015 30APR2015
0001 01MAY2015 15MAY2015
0001 16MAY2015 31MAY2015
0001 01JUN2015 15SEP2015
0001 16SEP2015 31DEC2015
......
;
File3 desired output:
data output;
input ID :$20. Start :date9. End :date9. Label :$20. Role :$20.;
format start end yymmdd10.;
cards;
0001 01JAN2015 30JAN2015 HospitalA ex005
0001 31JAN2015 15FEB2015 HospitalA ex005
0001 15FEB2015 30APR2015 HospitalA ex005
0001 01MAY2015 15MAY2015 HospitalA ex004
0001 16MAY2015 31MAY2015 HospitalA ex004
0001 01JUN2015 15SEP2015 HospitalC ex005
0001 16SEP2015 31DEC2015 HospitalC ex005
......
;
Try this
data have1;
input ID :$20. Start :date9. End :date9. Label :$20. Role :$20.;
format start end yymmdd10.;
cards;
0001 01JAN2015 30APR2015 HospitalA ex005
0001 01MAY2015 31MAY2015 HospitalA ex004
0001 01JUN2015 31DEC2015 HospitalC ex005
0002 06FEB2018 08FEB2018 HospitalA ex004
0002 09FEB2018 31AUG2018 HospitalC ex005
0002 01SEP2018 31DEC2019 HospitalC ex004
0003 01JAN2019 30SEP2019 HospitalD ex008
0003 01OCT2019 31DEC2020 HospitalD ex004
;
data have2;
input ID :$20. Start :date9. End :date9.;
format start end yymmdd10.;
cards;
0001 01JAN2015 30JAN2015
0001 31JAN2015 15FEB2015
0001 15FEB2015 30APR2015
0001 01MAY2015 15MAY2015
0001 16MAY2015 31MAY2015
0001 01JUN2015 15SEP2015
0001 16SEP2015 31DEC2015
;
data want(drop = s e rc);
if _N_ = 1 then do;
dcl hash h(dataset : 'have1(rename = (Start = s End = e))', multidata : 'Y');
h.definekey('ID');
h.definedata('s', 'e', 'Label', 'Role');
h.definedone();
end;
set have2;
if 0 then set have1(rename = (Start = s End = e));
call missing(s, e, Label, Role);
/* Walk only the periods stored under the current ID, starting from the first one */
rc = h.find();
do while (rc = 0);
if Start >= s and End <= e then leave;
rc = h.find_next();
end;
/* No containing period was found for this sub-period */
if rc ne 0 then call missing(Label, Role);
run;
Result:
ID Start End Label Role
0001 2015-01-01 2015-01-30 HospitalA ex005
0001 2015-01-31 2015-02-15 HospitalA ex005
0001 2015-02-15 2015-04-30 HospitalA ex005
0001 2015-05-01 2015-05-15 HospitalA ex004
0001 2015-05-16 2015-05-31 HospitalA ex004
0001 2015-06-01 2015-09-15 HospitalC ex005
0001 2015-09-16 2015-12-31 HospitalC ex005
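The containment rule (the sub-period starts on or after the period start and ends on or before the period end) can also be written as a join. This is only a sketch of an alternative, not the approach used in the answer above. It assumes the have1 and have2 data sets created earlier, and the output name want_sql is hypothetical:

```sas
proc sql;
  create table want_sql as
  select b.ID, b.Start, b.End, a.Label, a.Role
  from have2 as b
  left join have1 as a
    on  a.ID = b.ID
    and b.Start >= a.Start   /* sub-period begins inside the period */
    and b.End   <= a.End     /* sub-period ends inside the period   */
  order by b.ID, b.Start;
quit;
```

The left join keeps sub-periods that match no period (Label and Role stay missing). If the periods in have1 overlapped, a sub-period could match more than one of them and would then appear on several rows.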
I would like to enter the name of my variables as numbers e.g. '1950-1959' and I'm using the INPUT statement, but the output is not appearing correctly.
DATA data1;
INPUT AgeGroup$ 1950-1959 1960-1969 1970-1979 1980-1989 1990-1992 Total;
DATALINES;
20-29 1919 1808 1990 2175 154 8046
30-39 2616 4585 6580 6843 1921 22545
40-49 705 2661 5027 6597 1812 16802
50-59 38 680 2562 4836 2127 10243
60-69 0 35 606 2314 831 3786
70-79 0 0 23 467 494 984
80-89 0 0 0 12 31 43
Total 5278 9769 16788 23244 7370 62449
;
RUN;
Could you please tell me if I need to use any special characters to specify that '1950-1959' etc. are names of the numeric variable?
Thanks!
You can use name literals to specify names that don't follow the normal rules, for example '1950-1959'n. Make sure that the VALIDVARNAME option is set to ANY so that SAS will allow the non-standard names. Alternatively, you could use standard names for the variables and store the description in the label.
input AgeGroup :$5. period1-period6 ;
label period1 = '1950-1959' period2 = '1960-1969' ....
It would probably be more useful to store the time period into a variable instead.
data data1;
length AgeGroup $5 Period $9 count 8;
input AgeGroup @;
do period='1950-1959','1960-1969','1970-1979','1980-1989','1990-1992','Total';
input count @;
output;
end;
datalines;
20-29 1919 1808 1990 2175 154 8046
30-39 2616 4585 6580 6843 1921 22545
40-49 705 2661 5027 6597 1812 16802
50-59 38 680 2562 4836 2127 10243
60-69 0 35 606 2314 831 3786
70-79 0 0 23 467 494 984
80-89 0 0 0 12 31 43
Total 5278 9769 16788 23244 7370 62449
;
In that structure you can more easily filter the data for a subset of the time periods, and you can still easily create a report that displays the data in the tabular layout.
proc report data=data1;
columns agegroup count,period ;
define agegroup / group ;
define period / across ' ';
define count / ' ';
run;
Results:
AgeGr
oup 1950-1959 1960-1969 1970-1979 1980-1989 1990-1992 Total
20-29 1919 1808 1990 2175 154 8046
30-39 2616 4585 6580 6843 1921 22545
40-49 705 2661 5027 6597 1812 16802
50-59 38 680 2562 4836 2127 10243
60-69 0 35 606 2314 831 3786
70-79 0 0 23 467 494 984
80-89 0 0 0 12 31 43
Total 5278 9769 16788 23244 7370 62449
Enable extended character names with options validvarname=any, then specify each as a name literal like 'this'n:
options validvarname=any;
DATA data1;
INPUT AgeGroup$ '1950-1959'n '1960-1969'n '1970-1979'n '1980-1989'n '1990-1992'n Total;
DATALINES;
20-29 1919 1808 1990 2175 154 8046
30-39 2616 4585 6580 6843 1921 22545
40-49 705 2661 5027 6597 1812 16802
50-59 38 680 2562 4836 2127 10243
60-69 0 35 606 2314 831 3786
70-79 0 0 23 467 494 984
80-89 0 0 0 12 31 43
Total 5278 9769 16788 23244 7370 62449
;
RUN;
Most modern SAS applications automatically specify this option, but occasionally you'll run into systems that still have v7 names.
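Once the variables exist under these names, every later reference needs the same name-literal syntax. A small sketch, assuming the data1 data set created by the step above:

```sas
options validvarname=any;

/* Name literals are required wherever the non-standard names are referenced */
proc print data=data1 noobs;
  var AgeGroup '1950-1959'n '1990-1992'n Total;
run;
```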
Here I have a list of weights for 2 subjects.
data weight_test;
format subject $3. weight 4.;
infile datalines dlm=" " dsd;
input subject weight ;
datalines;
001 27
001 27.5
001 28
001 30
001 29
001 29
002 29
002 30
002 31
002 29
;
run;
I want to mark each weight with 0 or 1:
If the weight < 30, mark it with 0;
Once a weight >= 30 occurs, mark all remaining weights within the same subject with 1.
The desired result is:
subject weight mark
001 27 0
001 27 0
001 28 0
001 30 1
001 29 1
001 29 1
002 29 0
002 30 1
002 31 1
002 29 1
I tried the following code, but it doesn't work properly. Please help me. Thank you.
data weight;
set weight_test;
by subject;
i=0;
retain i;
if weight < 30 then mark=i;
else if weight >= 30 then do;
i = 1;
mark = i;
end;
run;
You have overcomplicated it. Just set MARK to zero when you start a new subject and set it to one when the target weight is seen.
data weight_test;
input subject $ weight @@ ;
datalines;
001 27 001 27.5 001 28 001 30 001 29 001 29
002 29 002 30 002 31 002 29
;
data weight;
set weight_test;
by subject;
if first.subject then mark=0;
if weight >= 30 then mark=1;
retain mark;
run;
Results:
Obs subject weight mark
1 001 27.0 0
2 001 27.5 0
3 001 28.0 0
4 001 30.0 1
5 001 29.0 1
6 001 29.0 1
7 002 29.0 0
8 002 30.0 1
9 002 31.0 1
10 002 29.0 1
Make sure the variable MARK does not already exist in the input dataset.
Try this
data want;
mark=0; _iorc_=0;
do until (last.subject);
set weight_test;
by subject;
if weight >= 30 & _iorc_=0 then do;
_iorc_=1;
mark=1;
end;
output;
end;
run;