I’m trying to find patients who were lost to follow-up (either permanently or temporarily) after their first event. We have an event date (edate) and the total number of days registered in each year (regdays_09 to regdays_12). I want to flag patients who were registered for fewer than 365 days in a year, or who have a missing registration year. I would much appreciate any help with this.
data Want;
informat edate date7.;
format edate date7.;
input ID edate regdays_09 regdays_10 regdays_11 regdays_12 flag;
CARDS;
100 06jan09 365 365 365 365 0
101 10APR09 365 365 . . 1
102 23Mar09 180 . . . 1
103 03Sep09 365 . 365 365 1
104 20Aug09 300 . . 365 1
run;
If I understand you correctly:
data have;
informat edate date7.;
format edate date7.;
input ID edate regdays_09 regdays_10 regdays_11 regdays_12;
CARDS;
100 06jan09 365 365 365 365
101 10APR09 365 365 . .
102 23Mar09 180 . . .
103 03Sep09 365 . 365 365
104 20Aug09 300 . . 365
run;
data want;
set have;
flag = (sum(of regdays:) ne 4*365);
run;
First thing: 2012 was a leap year with 366 days, so all of your patients fail a full-year check for 2012. You could write a decision tree, implemented as a set of nested IF-THEN-ELSE statements.
If the first year has one or more days, we have to check that all the following years are filled completely. If the first year has no days recorded, do the same check starting from the next year, and so on.
IF regdays_09>0 THEN DO;
  IF regdays_10=365 AND regdays_11=365 AND regdays_12=366 THEN flag=0;
  ELSE flag=1;
END;
ELSE IF regdays_10>0 THEN DO;
  IF regdays_11=365 AND regdays_12=366 THEN flag=0;
  ELSE flag=1;
END;
ELSE IF regdays_11>0 THEN DO;
  IF regdays_12=366 THEN flag=0;
  ELSE flag=1;
END;
ELSE flag=0;
Now when you become more advanced in programming you can try the same with nested loops.
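The loop version of that decision tree could look roughly like the sketch below. It is only an illustration: the `full` array of year lengths and the `started` helper variable are names made up here, and the sample rows are the ones posted in the question. A patient is flagged once any year after their first registered year is not a complete year.

```sas
/* Sample rows copied from the question */
data have;
    informat edate date7.;
    format edate date7.;
    input ID edate regdays_09 regdays_10 regdays_11 regdays_12;
    datalines;
100 06jan09 365 365 365 365
101 10APR09 365 365 . .
102 23Mar09 180 . . .
103 03Sep09 365 . 365 365
104 20Aug09 300 . . 365
;
run;

data want;
    set have;
    array reg{4} regdays_09 regdays_10 regdays_11 regdays_12;
    /* days in 2009, 2010, 2011 and 2012 (leap year) */
    array full{4} _temporary_ (365 365 365 366);
    flag = 0;
    started = 0;                 /* hypothetical helper: 1 once a year with days is seen */
    do i = 1 to dim(reg);
        if started then do;
            /* every year after the first registered year must be complete; */
            /* a missing value also counts as incomplete                    */
            if reg{i} ne full{i} then flag = 1;
        end;
        else if reg{i} > 0 then started = 1;
    end;
    drop i started;
run;
```

Adding another year then only means extending the two arrays, rather than adding another level of IF-THEN-ELSE.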
Thank you all! Based on the comments I received here and also a comment from ShiroAmanda on another question, I got the following code that is what I wanted:
data have;
informat edate date7.;
format edate date7.;
input ID edate regdays_09 regdays_10 regdays_11 regdays_12;
CARDS;
100 06jan09 365 365 365 365
101 10APR09 365 365 . .
102 23Mar09 180 . . .
103 03Sep09 365 . 365 365
104 20Aug10 300 . . 365
105 10Sep12 . 365 365 365
106 12sep11 . 360 365 360
run;
data want (drop=x i mval: start);
  set have;
  ARRAY Mval {4} (2009,2010,2011,2012);
  ARRAY Mvar {4} regdays_09 regdays_10 regdays_11 regdays_12;
  do i=1 to dim(mval);
    if year(edate)=mval(i) then
      do x=i to dim(mval);
        start=i;
        totalsum=sum(totalsum,mvar(x));
      end;
    if totalsum<(dim(mval)-start+1)*365 then flag=1;
    else flag=0;
  end;
run;
Piggybacking on a similar question I asked (Summing a Column By Group In a Dataset With Macros)...
I have the following dataset:
Month Cost_Center Account Actual Annual_Budget
May 53410 Postage 23 134
May 53420 Postage 7 238
May 53430 Postage 98 743
May 53440 Postage 0 417
May 53710 Postage 102 562
May 53410 Phone 63 137
May 53420 Phone 103 909
May 53430 Phone 90 763
June 53410 Postage 13 134
June 53420 Postage 0 238
June 53430 Postage 48 743
June 53440 Postage 0 417
June 53710 Postage 92 562
June 53410 Phone 73 137
June 53420 Phone 103 909
June 53430 Phone 90 763
I would like to "splice" it so each month has its own respective column for Actual while summing the numeric values by Account.
So for example, I want the output to look like the following:
Account May_Actual_Sum June_Actual_Sum Annual_Budget
Postage 14562 37960 255251
Phone 4564 2660 32241
The code below, provided by a fellow user, works great when I don't need to disaggregate further by month; however, I'm not sure whether that is possible (I tried adding a 'by month' statement, which didn't work).
proc means data=Test N SUM NWAY STACKODS;
class Account_Description;
var Actual annual_budget;
by month;
ods output summary = summary_stats1;
output out = summary_stats2 N = SUM= / AUTONAME;
data want;
set summary_stats2;
run;
Use PROC MEANS to get the summaries, same as last time. Please read the documentation on PROC MEANS to understand how the CLASS statement works and how you can control the different levels of output.
Use PROC TRANSPOSE to flip the data wide. Since the budget amount is consistent across rows, you'll be fine.
I'm guessing your next set of questions will be how to sort the columns correctly (because your month names won't sort chronologically) and how to reference them dynamically to calculate the month-to-date changes. Those are some of the reasons why this data structure is not recommended.
data have;
input Month $ Cost_Center $ Account $ Actual Annual_Budget;
cards;
May 53410 Postage 23 134
May 53420 Postage 7 238
May 53430 Postage 98 743
May 53440 Postage 0 417
May 53710 Postage 102 562
May 53410 Phone 63 137
May 53420 Phone 103 909
May 53430 Phone 90 763
June 53410 Postage 13 134
June 53420 Postage 0 238
June 53430 Postage 48 743
June 53440 Postage 0 417
June 53710 Postage 92 562
June 53410 Phone 73 137
June 53420 Phone 103 909
June 53430 Phone 90 763
;
run;
*summarize;
proc means data=have noprint nway;
class account month;
var actual annual_budget;
output out=temp sum=actual_total budget_total;
run;
*transpose;
proc transpose data=temp out=want prefix=Month_;
by account budget_total;
var actual_total;
id month;
run;
I cannot think of a way to generate this report using just one PROC. You will need to do some post processing of PROC MEANS or PROC SUMMARY results to get to this:
proc means data=have SUM ;
class Account month;
var Actual annual_budget;
output out = summary_stats SUM=;
run;
/* Look at summary_stats to understand its structure here. */
/* Otherwise you will not understand the following code. */
proc sort data = summary_stats;
where _type_ in (2,3);
by account;
run;
data want;
set summary_stats;
by account ;
retain May_Actual_Sum June_Actual_Sum Annual_Budget_sum;
if first.account then Annual_Budget_sum = Annual_Budget;
else do;
select(month);
when ('May') May_Actual_Sum = actual;
when ('June') June_Actual_Sum = actual;
/* List other months also here. Can use some macros here to make the code compact and expandable for future enhancements */
end;
end;
if last.account then output;
keep account May_Actual_Sum June_Actual_Sum Annual_Budget_sum;
run;
The following is a sample of my data:
stnd_y person_id recu_day date
2002 100 20020929 02-09-29
2002 100 20020930 02-09-30
2002 100 20021002 02-10-02
2002 101 20020927 02-09-27
2002 101 20020928 02-09-28
2002 102 20021001 02-10-01
2002 103 20021003 02-10-03
2002 104 20021108 02-11-08
2002 104 20021112 02-11-12
And I want to turn it into the following:
stnd_y person_id recu_day date Admission
2002 100 20020929 02-09-29 1
2002 100 20020930 02-09-30 2
2002 100 20021002 02-10-02 3
2002 101 20020927 02-09-27 1
2002 101 20020928 02-09-28 2
2002 102 20021001 02-10-01 1
2002 103 20021003 02-10-03 1
2002 104 20021108 02-11-08 1
2002 104 20021112 02-11-12 2
I mean, I want to create a variable that counts each person's admissions, based on recu_day and date (both variables hold the date of hospitalization).
I tried the following in SAS:
proc sort data=old out=new;
by person_id recu_day;
data new1;
set new;
retain admission 0;
by person_id recu_day;
if recu_day^=lag(recu_day) and(or) person_id^=lag(person_id) then
admission+1;
run;
And also,
data new1;
set new ;
by person_id recu_day;
retain adm 0;
if first.person_id and(or) first.recu_day then admission=admission+1;
run;
But neither of these works.
How can I solve this? Please let me know.
You're pretty close with the 2nd attempt, but your main problem is that you don't reset admission each time person_id changes.
It's also not necessary to use first.recu_day, as this is 1 for every record in your sample data. first.person_id is sufficient, as you want to increment the number by 1 while the person_id hasn't changed from the previous row.
Including recu_day in the by statement is useful, however, as this will force an error if the data isn't sorted properly.
data have;
input stnd_y person_id recu_day date :yymmdd8.;
format date yymmdd8.;
datalines;
2002 100 20020929 02-09-29
2002 100 20020930 02-09-30
2002 100 20021002 02-10-02
2002 101 20020927 02-09-27
2002 101 20020928 02-09-28
2002 102 20021001 02-10-01
2002 103 20021003 02-10-03
2002 104 20021108 02-11-08
2002 104 20021112 02-11-12
;
run;
data want;
set have;
by person_id recu_day;
if first.person_id then admission=0;
admission+1;
run;
I need to delete duplicates from a data set. My issue is that once I sort the data and flag the duplicates (using lag function), some information across variables is present within the duplicate observation and some within the original observation. I need to retain information across all variables while also deleting the duplicates.
My thought was to first fill in all the information between both the original and duplicate before deleting the duplicate.
Example of observations after sorting data and flagging duplicates (fake data values):
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . . 0
AB 36 1980 . . 2135 1
ON 26 1990 . . 8868 0
ON 26 1990 . 35464 8868 1
What I want:
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . 2135 0
AB 36 1980 45654 . 2135 1
ON 26 1990 . 35464 8868 0
ON 26 1990 . 35464 8868 1
So I can delete duplicates and eventually have this:
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . 2135 0
ON 26 1990 . 35464 8868 0
I created lag and lead variables to attempt to fill in information but it only seems to be working on some of the data set.
Here is the code for the lead variables:
data uncleaned_data;
merge uncleaned_data
uncleaned_data(
firstobs=2
keep= TRANS_ID MORB_ID Varx
rename=(TRANS_ID=lead_TRANS_ID MORB_ID=lead_MORB_ID Varx=lead_Varx ));
if lag(flag_duplicate=1) then do;
if TRANS_ID=. then do;
TRANS_ID= lead_TRANS_ID;
end;
if MORB_ID=. then do;
MORB_ID= lead_MORB_ID;
end;
if Varx=. then do;
Varx= lead_Varx;
end;
end;
run;
I did the same kind of thing for lag variables except my initial if statement is 'if flag_duplicate=1 then do;'
This method does not seem to work for many duplicate pairs in my data set.
Is there a better way to approach my problem overall? possibly through proc SQL?
Thanks for reading and any advice offered!
I'm assuming that you don't have different values of Trans_id, for example, for the same Province. If that is the case, then you can flatten the original data in one go using an update statement with a by statement. In my code, the first reference to the dataset, with obs=0, just creates the variables; the second reference populates the values; and the by statement ensures that only one row is output per Province.
Using this method means you don't need to identify the duplicate values beforehand.
data have;
input Province $ AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate;
datalines;
AB 36 1980 45654 . . 0
AB 36 1980 . . 2135 1
ON 26 1990 . . 8868 0
ON 26 1990 . 35464 8868 1
;
run;
data want;
update have(obs=0) have;
by province;
run;
Something like this should work...
proc sort data=uncleaned_data; by Province AGE BRTHYEAR; run;
data cleaned_data (DROP=TRANS_ID RENAME=(KEEP_TRANS_ID=TRANS_ID) ...);
set uncleaned_data;
by Province AGE BRTHYEAR;
if first.BRTHYEAR then do;
keep_TRANS_ID=TRANS_ID;
...
end;
else do;
if keep_TRANS_ID=. then keep_TRANS_ID=TRANS_ID;
...
end;
if last.BRTHYEAR then output;
run;
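Since the question also asks about PROC SQL, here is a minimal sketch of that route. It rests on the same assumption as the answers above: within a Province/AGE/BRTHYEAR group, the non-missing values of a variable never conflict. Under that assumption `max()` simply picks the one non-missing value, because SQL summary functions ignore missing values. The dataset names `have` and `want_sql` are placeholders, and the rows are the sample rows from the question.

```sas
/* Sample rows copied from the question */
data have;
    input Province $ AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate;
    datalines;
AB 36 1980 45654 . . 0
AB 36 1980 . . 2135 1
ON 26 1990 . . 8868 0
ON 26 1990 . 35464 8868 1
;
run;

proc sql;
    /* one output row per group; max() skips missing values, */
    /* and stays missing only if every row in the group is missing */
    create table want_sql as
    select Province, AGE, BRTHYEAR,
           max(Trans_id) as Trans_id,
           max(Morb_id)  as Morb_id,
           max(VarX)     as VarX
    from have
    group by Province, AGE, BRTHYEAR;
quit;
```

If two rows in a group can hold different non-missing values, this silently keeps the larger one, so it is worth checking for such conflicts first.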
I have an example table as below
id term subj prof hour
20 2016 COM James 4
20 2016 COM Henrey 4
30 2016 HUM Nelly 3
30 2016 HUM John 3
30 2016 HUM Jimmy 3
45 2016 CGS Tim 3
I need to divide hour when id, term and subj are the same. There are 2 different profs with the same id (20), term and subj, so I divided hour by 2.
There are 3 different profs with the same id (30), term and subj, so I divided hour by 3.
So the output should be like this;
id term subj prof hour
20 2016 COM James 2
20 2016 COM Henrey 2
30 2016 HUM Nelly 1
30 2016 HUM John 1
30 2016 HUM Jimmy 1
45 2016 CGS Tim 3
In SAS you can use a double DOW loop to achieve this, once the data has been sorted in the correct order. The first loop counts how many profs there are with the same id, term and subj. The second loop divides hour by the number of profs. The loops are performed at each change of id, term or subj.
I've created a new_hour variable and kept the temporary _counter variable so you can see the code working. You can overwrite the hour variable and drop _counter if you wish.
/* create initial dataset */
data have;
input id term subj $ prof $ hour;
datalines;
20 2016 COM James 4
20 2016 COM Henrey 4
30 2016 HUM Nelly 3
30 2016 HUM John 3
30 2016 HUM Jimmy 3
45 2016 CGS Tim 3
;
run;
/* sort data */
proc sort data=have;
by id term subj prof;
run;
/* create output dataset */
data want;
do until(last.subj); /* 1st loop*/
set have;
by id term subj prof;
if first.subj then _counter=0; /* reset counter when id, term or subj change */
_counter+first.prof; /* count number of times prof changes */
end;
do until(last.subj); /* 2nd loop */
set have;
by id term subj;
new_hour=hour / _counter; /* divide hour by number of profs from 1st loop */
output; /* output record */
end;
run;
Assuming your problem is as simple as the one you gave as an example, one proc sql should suffice. If it is more complicated, please explain how so we can be more helpful!
data have;
input id term subj $ prof $ hour;
datalines;
20 2016 COM James 4
20 2016 COM Henrey 4
30 2016 HUM Nelly 3
30 2016 HUM John 3
30 2016 HUM Jimmy 3
45 2016 CGS Tim 3
;
run;
proc sql;
create table want as select
*, hour / count(prof) as hour_adj
from have
group by id, term, subj;
quit;
So I have the following dataset with three variables: account, balance, and time.
account balance time
1 110 01/2006
1 111 02/2006
1 88 03/2006
1 61 04/2006
1 1203 05/2006
2 112 01/2006
2 111 02/2006
2 665 03/2006
2 61 04/2006
2 1243 05/2006
3 110 01/2006
3 111 02/2006
3 88 03/2006
3 61 04/2006
3 1203 05/2006
Each account has more records than shown, so the starting time might be earlier and the ending time later than what I wrote.
So my question is:
I am trying to find the maximum balance for each account over the previous 12 months. For example, for account 3 on 05/2006, I am trying to find max(account 3 balance at 04/2006, account 3 balance at 03/2006, account 3 balance at 02/2006, ..., account 3 balance at 05/2005).
What do you have in mind? What I did was use the lag function with an array. However, it is NOT very efficient, because I will be in trouble if I need the previous 120 months.
Thank you.
Best
Xintong
Try this:
PROC SQL;
create table max_balance as
select *
from your_table
group by account
having balance=max(balance)
;
QUIT;
options mprint;
%macro lag;
data temp (drop=count i);
set balance;
by account time;
array x(*) lag_bal1 - lag_bal3;
/* generate the lag1(), lag2() and lag3() assignments */
%do i=1 %to 3;
lag_bal&i=lag&i.(balance);
%end;
/* blank out lags that reach back into the previous account */
if first.account then count=1;
do i=count to dim(x);
call missing(x(i));
end;
count+1;
max_ball=max(balance, of lag_bal1-lag_bal3);
run;
%mend;
%lag;
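For a longer look-back such as 12 or 120 months, one alternative to generating many lag variables is a temporary array used as a circular buffer. The sketch below is only one possible approach and makes several assumptions: the data is already in chronological order within each account, there is exactly one row per month with no gaps, and the maximum should cover the previous rows only (not the current one). The names `balances`, `rolling_max`, `prev_max`, `buf`, `n` and the macro variable `window` are invented for the example; the rows are account 1 from the question.

```sas
%let window = 12;   /* hypothetical setting: number of previous months to scan */

/* Sample rows for account 1 copied from the question */
data balances;
    input account balance time $;
    datalines;
1 110 01/2006
1 111 02/2006
1 88 03/2006
1 61 04/2006
1 1203 05/2006
;
run;

data rolling_max;
    set balances;
    by account;                        /* assumes rows are grouped by account */
    array buf{&window} _temporary_;    /* holds the last &window balances; retained automatically */
    if first.account then do;
        call missing(of buf{*});       /* do not carry values over from the previous account */
        n = 0;
    end;
    prev_max = max(of buf{*});         /* max of earlier rows only; missing on the first row */
    n + 1;                             /* row counter within the account */
    buf{mod(n - 1, &window) + 1} = balance;   /* overwrite the oldest slot */
    drop n;
run;
```

Changing the look-back to 120 months is then a matter of changing `window`. If months can be missing for an account, the buffer would hold the last 12 rows rather than the last 12 calendar months, and a date-based join would be the safer choice.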