Suppose I have the following data set.
ID Hired Start_date End_date Flag_Start Flag_End
0001 1-1900 01JAN2018 21DEC2018 1 2
0001 1-1900 01JAN2019 01DEC2020 2 2
0002 10-2020 26MAR2020 03MAY2020 1 2
0003 03-2021 18DEC2020 31DEC2020 1 2
..... ....... ......... ......... ........... ...........
I would like to get the output shown below. Sorry for asking, but I'm new to SAS and this seems like a very difficult task in it. I'm familiar with R.
Desired output:
ID Hired Start_date End_date Flag_Start Flag_End
0001 1-1900 01JAN2018 21DEC2018 1 2
0001 1-1900 01JAN2019 01DEC2020 2 3
0002 03-2020 26MAR2020 03MAY2020 1 0
0003 03-2021 18DEC2020 31DEC2020 1 3
..... ....... ......... ......... ........... ...........
So, for each ID, after sorting, take the row with the last End_date. If Hired is 1-1900, set Flag_End to 3. Otherwise, if Hired is earlier than End_date, set Flag_End to 0. Otherwise (Hired is later than End_date and is not 1-1900), set Flag_End to 3.
Thank you in advance
I think this is what you want.
The Hired Date does not match between your two posted data sets. I chose the second one (03-2020).
data have;
input ID $ Hired :anydtdte. (Start_date End_date)(:date9.) Flag_Start Flag_End;
format Hired Start_date End_date date9.;
datalines;
0001 1-1900 01JAN2018 21DEC2018 1 2
0001 1-1900 01JAN2019 01DEC2020 2 2
0002 03-2020 26MAR2020 03MAY2020 1 2
0003 03-2021 18DEC2020 31DEC2020 1 2
;
data want;
   set have;
   by ID;
   /* Only the last row of each ID gets a new Flag_End */
   if last.ID then do;
      if Hired = '01jan1900'd then flag_end = 3;
      else if Hired < End_date then flag_end = 0;
      else if Hired >= End_date then flag_end = 3;
   end;
run;
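The BY statement and last.ID in the step above rely on HAVE being grouped by ID, with the row carrying the latest End_date coming last. A minimal sketch of a sort that could precede it, assuming (the answer does not state this) that End_date is the right ordering variable within each ID:

```sas
/* Assumption: order rows so that last.ID is the row with the latest End_date */
proc sort data=have;
   by ID End_date;
run;
```

The example data is already in this order, so the sort only matters for unsorted input.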
Is there a way to count how many days pass between the end of one interval and the start of the next one? Let's say:
ID Start End
0001 22JAN2022 23JAN2022
0001 26JAN2022 30JAN2022
0001 03FEB2022 08MAR2022
0001 09MAR2022 15MAR2022
0001 17MAR2022 30MAR2022
desired output:
ID Start End days
0001 22JAN2022 23JAN2022 3
0001 26JAN2022 30JAN2022 4
0001 03FEB2022 08MAR2022 1
0001 09MAR2022 15MAR2022 2
0001 17MAR2022 30MAR2022 .......
I believe I demonstrated this in another thread but there you go
data have;
input ID $ (Start End)(:date9.);
format Start End date9.;
datalines;
0001 22JAN2022 23JAN2022
0001 26JAN2022 30JAN2022
0001 03FEB2022 08MAR2022
0001 09MAR2022 15MAR2022
0001 17MAR2022 30MAR2022
;
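data want;
   set have;
   by ID;
   /* Look-ahead: a second SET reads the same data one row ahead, keeping only */
   /* the next row's Start (renamed to s). The OBS=1 DROP=_ALL_ copy adds one  */
   /* empty read so the step does not stop before the last row is written.     */
   set have(firstobs = 2 keep = start rename = (start = s))
       have(obs = 1 drop = _all_);
   if last.ID then s = .;   /* no next interval within this ID */
   days = s - end;
   drop s;
run;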
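The subtraction works because SAS stores a date as a number of days, so the difference between two date values is a day count. A small check using the first two rows of the example (the variable name gap is only for illustration):

```sas
data _null_;
   /* Next Start minus current End for the first row of the example */
   gap = '26JAN2022'd - '23JAN2022'd;
   put gap=;   /* writes the day difference to the log */
run;
```

The value written to the log should agree with the 3 shown in the first row of the desired output.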
Suppose I have the following:
data have;
input ID :$20. Label :$20. Hours :$20. Days :$20.;
cards;
0001 w 3144 3
0001 w 23 54
0001 p 12 1
0002 m 456 34
0002 w 2 1
0002 s 231 45
0002 w 98 23
0003 w 12 6
0003 w 98 76
;
Is there a way, for each ID, to sum the Hours to get a total and then split that total across the rows in proportion to Days, but only for rows where the label is w? If the label is not w, Hours should be missing.
Desired output:
data have;
input ID :$20. Label :$20. Hours :$20. Days :$20.;
cards;
0001 w 167.3158 3
0001 w 3011.684 54
0001 p . 1
0002 m . 34
0002 w 32.79167 1
0002 s . 45
0002 w 754.2083 23
0003 w 8.048780 6
0003 w 101.9512 76
;
In other words, for 0001 in the desired output I added 3144 + 23 + 12 = 3179 (all hours) and 54 + 3 = 57 (the days where the label is "w"). Then I divided 3179 by 57 and multiplied the result by 3 and by 54 for the two "w" rows, but not by 1 for the "p" row.
Thank you in advance
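The arithmetic described above for ID 0001 can be checked with a short step. This is only a sketch of the calculation for that one ID, not a general solution, and the variable names are made up for the illustration:

```sas
data _null_;
   total_hours = 3144 + 23 + 12;          /* all hours for ID 0001 */
   days_w      = 3 + 54;                  /* days on rows where label = 'w' */
   per_day     = total_hours / days_w;    /* hours per 'w' day */
   hours_row1  = per_day * 3;             /* first 'w' row */
   hours_row2  = per_day * 54;            /* second 'w' row */
   put total_hours= days_w= hours_row1= hours_row2=;
run;
```

The two results should match the 167.3158 and 3011.684 in the desired output, up to rounding.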
Same idea as @Stu Sztukowski's answer, but using a DOW loop:
data want;
   /* First pass over the ID group: accumulate the totals */
   do until(last.id);
      set have;
      by id notsorted;
      sum_of_hours = sum(sum_of_hours, input(hours, best.));
      sum_of_days_w = sum(sum_of_days_w, (label = 'w') * input(days, best.));
   end;
   /* Second pass over the same group: rewrite Hours and output each row */
   do until(last.id);
      set have;
      by id notsorted;
      if label = 'w' then hours = cats(sum_of_hours * (input(days, best.) / sum_of_days_w));
      else hours = '';
      output;
   end;
run;
The calculation you need looks like this in code form:
if label = 'w' then hours = sum_of_hours / sum_of_days_w * days;
else hours = .;
All we need to do is get, for each id, the sum of all hours and the sum of days where label = 'w', then merge those totals back with the original table by id. The table for this calculation would look like this:
id label hours days sum_of_hours sum_of_days_w
0001 w 3144 3 3179 57
0001 w 23 54 3179 57
You can accomplish this all in a single SQL step.
proc sql;
   create table want as
   select t1.id
        , t1.label
        , case t1.label
             when 'w' then t2.sum_hours / t2.sum_days_w * input(t1.days, 32.)
             else .
          end as hours
        , t1.days
   from have as t1
   /* Per id: the sum of all hours and the sum of days where label = 'w'. */
   /* Hours and Days are character in HAVE, so convert them with INPUT(). */
   left join
      (select id
            , sum((label = 'w') * input(days, 32.)) as sum_days_w
            , sum(input(hours, 32.)) as sum_hours
       from have
       group by id
      ) as t2
   on t1.id = t2.id
   ;
quit;
I have data that looks like -
data abc;
input ID $ drug $ episode start_date :date9. end_date :date9.;
format start_date end_date date9.;
informat start_date end_date date9.;
datalines ;
1 A 1 01Jan2012 30Mar2012
1 A 2 01May2012 03Jul2012
1 A 3 28Sep2012 28Oct2012
1 A 4 01Nov2012 30Dec2012
1 B 1 01Apr2012 10May2012
1 B 2 02Nov2012 28Dec2012
1 B 3 01Jan2012 30Mar2012
1 C 1 01Jul2012 02Aug2012
;
run;
Here we have subjects and the drugs they take. A new episode of a drug means that the person had discontinued it.
If the start date of the second drug (the start date of its 1st episode) falls between the episodes of the first drug, then we ignore all further episodes of the first drug.
E.g., here 1 April (the start date of drug B) falls after the first episode of drug A, so episodes 2, 3 and 4 of drug A would be deleted.
Similarly, the start date of drug C falls after the end date of episode 1 of drug B, so episode 2 onward of drug B would be deleted.
The maximum number of episodes a subject can have is 15.
The resultant dataset should look like -
ID Drug Episode start_date end_date
1 A 1 1-Jan 30-Mar
1 B 1 1-Apr 10-May
1 C 1 1-Jul 2-Aug
How about this? I added another ID to the example data for demonstration.
data abc;
input ID $ drug $ episode start_date :date9. end_date :date9.;
format start_date end_date date9.;
datalines ;
1 A 1 01Jan2012 30Mar2012
1 A 2 01May2012 03Jul2012
1 A 3 28Sep2012 28Oct2012
1 A 4 01Nov2012 30Dec2012
1 B 1 01Apr2012 10May2012
1 B 2 02Nov2012 28Dec2012
1 B 3 01Jan2012 30Mar2012
1 C 1 01Jul2012 02Aug2012
2 A 1 01Jan2012 30Mar2012
2 A 2 01May2012 03Jul2012
2 A 3 28Sep2012 28Oct2012
2 A 4 01Nov2012 30Dec2012
2 B 1 01Apr2012 10May2012
2 B 2 02Nov2012 28Dec2012
2 B 3 01Jan2012 30Mar2012
2 C 1 01Jul2012 02Aug2012
;
run;
data want;
format ID drug episode start_date end_date;
keep ID drug episode start_date end_date;
declare hash h ();
h.definekey ('ID', 'd');
h.definedata ('_start_date');
h.definedone ();
do until (lr1);
set abc (rename= (start_date = _start_date)) end=lr1;
by ID drug;
if first.ID then d = 0;
if first.drug then d + 1;
if episode = 1 then h.add();
end;
do until (lr2);
set abc end=lr2;
by ID drug;
if first.ID then d = 0;
if first.drug then do;
d + 1; flag = 0;
end;
/* Look up the first start date of the next drug; rc ne 0 means there is none */
rc = h.find(key : ID, key : d+1);
if rc = 0 and start_date > _start_date then flag=1;
if flag = 0 then output;
end;
retain flag;
run;
Result:
ID drug episode start_date end_date
1 A 1 01JAN2012 30MAR2012
1 B 1 01APR2012 10MAY2012
1 C 1 01JUL2012 02AUG2012
2 A 1 01JAN2012 30MAR2012
2 B 1 01APR2012 10MAY2012
2 C 1 01JUL2012 02AUG2012
I am trying to create a column that returns 1 for the Max Rev number. My table looks like this:
Serial# Enrollment# Rev#
1234 0001 0
1234 0001 1
2225 0002 0
9999 0003 0
9999 0003 1
9999 0003 2
9999 0004 0
I want my result to look like this:
Serial# Enrollment# Rev# MaxRev
1234 0001 0 0
1234 0001 1 1
2225 0002 0 1
9999 0003 0 0
9999 0003 1 0
9999 0003 2 1
9999 0004 0 1
Here is what I tried:
MaxRev = IF(CVA[Rev#] = CALCULATE(MAX([Rev#]),CVA[Enrollment#] = EARLIER(CVA[Enrollment#])),1,0)
But it is not working. Thanks in advance
Use ALLEXCEPT to remove the filter context on all columns except [Enrollment#]. Note, because of CALCULATE (context transition), this may be slow if the table is large (millions of rows).
MaxRev = IF(CVA[Rev#] = CALCULATE(MAX(CVA[Rev#]), ALLEXCEPT(CVA, CVA[Enrollment#])), 1, 0)
I tried various date formats, but the output does not show any date. What could be the issue?
data c;
input age gender income color$ doj$;
format doj date9.;
datalines;
19 1 14000 W 14/07/1988
45 2 45000 b 15/09/1956
34 2 56000 y 14/09/1967
33 1 45000 b 14/02/1956
;
run;
You are mixing things up a bit.
The date formats are to be applied on numeric data, not on text data.
So you should not read in doj as $ (text), but as a date (so a date informat).
Try DDMMYY10. for doj on your input statement:
data c;
input age gender income color$ doj :ddmmyy10.;
format doj date9.;
datalines;
19 1 14000 W 14/07/1988
45 2 45000 b 15/09/1956
34 2 56000 y 14/09/1967
33 1 45000 b 14/02/1956
;
run;
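To see the difference between an informat (how text is read) and a format (how a number is displayed), here is a short sketch using one value from the data. The variable name d is only illustrative:

```sas
data _null_;
   /* The informat DDMMYY10. converts the text into a numeric SAS date */
   d = input('14/07/1988', ddmmyy10.);
   put d=;          /* raw value: days since 01JAN1960 */
   put d= date9.;   /* same value displayed with the DATE9. format */
run;
```

The first PUT writes a plain number to the log and the second writes the same value as a readable date, which is why doj has to be numeric before a date format can be applied to it.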