Observations from "other_claims" data set are to summed with the observations in the "event_claims" data set under the following conditions:
"Other_claims" occur within a 90-day window of the event_claim, "stay_discharge_dt," are to be summed with the event cost("cost_event").
If the "other_claim" partially overlaps with the 90-day period, only overlapping days are to be included.
The included fraction: (# of overlapping days)/(total # of days of the other_claim)
Here's the sql solution I am considering. I'm curious if this could be more efficient?
data event_claims;
input patient_id stay_admission_dt mmddyy10. #14stay_discharge_dt mmddyy10. doctor cost_event;
format stay_admission_dt stay_discharge_dt mmddyy10.;
datalines;
1 06/10/2019 06/15/2019 45 20000
2 10/18/2018 10/22/2018 78 30000
;
data other_claims;
length patient_id 3. type $19;
input patient_id Type$ service_start_date :mmddyy10. service_end_date :mmddyy10. service_cost dollar7.0;
format service_start_date service_end_date mmddyy10.;
datalines;
1 skilled_nursing 06/15/2019 06/25/2019 $7,000
1 home-health 06/25/2019 08/25/2019 $24,000
1 office_visit 07/1/2019 07/1/2019 $200
1 home_health 08/26/2019 09/26/2019 $12,000
2 er_visit 10/15/2018 10/16/2018 $1,500
2 home_health 10/23/2018 11/23/2018 $8,000
2 outpatient_services 01/18/2019 1/22/2019 $5,000
;
proc sql;
create table events_others as
select a.person_id
,a.stay_admission_dt
,a.stay_discharge_dt
,a.stay_discharge_dt+90 as service_deadline format mmddyy10.
,b.service_start_date
,b.service_end_date
,case when b.service_start_date > calculated service_deadline
or b.service_start_date < a.stay_admission_dt
then "service not payable"
else "payable" end as payable
,case when calculated payable = "payable"
and b.service_end_date > calculated service_deadline
then intck("days",b.service_end_date, calculated service_deadline )
else 0 end as overlap /* When the other claim event exceeds the 90-day window of the*/
,a.service_cost
,b.service_cost as service_cost_other
,case when calculated overlap ne 0
then (intck("days",b.service_start_date,b.service_end_date) + calculated overlap)/intck("days",b.service_start_date,b.service_end_date)
else 0 end as partial_factor
,calculated partial_factor * b.service_cost as final_other_cost format=dollar9.2
from event_claims a
left join other_claims b
on a.person_id=b.person_id
group by a.person_id
,a.stay_admission_dt
,a.stay_discharge_dt
order by a.person_id
,a.stay_admission_dt
;quit;
proc sql;
create table total_cost_of_care as
select a.*
,b.final_other_cost format=dollar9.2
,a.service_cost + final_other_cost as total_episode_cost format=dollar12.2
from events_others a
inner join
(select person_id
,stay_admission_dt
,sum(final_other_cost) as final_other_cost
from events_others
group by person_id
,stay_admission_dt
) b
on (a.person_id=b.person_id
and a.stay_admission_dt=b.stay_admission_dt)
;quit;
Related
Since I am new to SAS I need some help to understand how to combine the overlap date ranges into one row.I want to combine the overlap date ranges when they have matching Id. If the dates don’t overlap then I want to keep them as it is. IF they over lap by Matching Id and drug code Then it should combine into one line. Please look at the same ple data set which I have below and the expected results:
Current Data set:
ID Drug Code BEG_Date End_Date
1 100 1/1/2018 1/1/2019
1 100 1/1/2018 3/1/2018
1 100 2/1/2018 04/30/2018
1 90 4/1/2018 04/30/2018
1 100 5/1/2018 6/1/2018
1 98 6/1/2018 8/31/2018
1 100 9/1/2018 5/4/2019
Expected results:
ID Drug Code BEG_Date End_Date
1 100 1/1/2018 3/31/2018
1 90 4/1/2018 04/30/2018
1 100 5/1/2018 6/1/2018
1 98 6/2/2018 8/31/2018
1 100 9/1/2018 5/4/2019
I wrote some SAS code but I am combining the dates even when there is no overlap. I want to write some code which should work in SAS.
PROC SORT DATA=Want OUT=ONE;
BY PERSON_ID BEG_DATE DRUG_CODE END_DATE;
RUN;
data TWO (DROP=PERSON_ID2 DRUG_CODE2 BEG_DATE END_DATE
RENAME=(BEG2=BEG_DOS
END2=END_DOS));
SET ONE;
RETAIN BEG2 END2;
PERSON_ID2=LAG1(PERSON_ID);
DRUG_CODE2=LAG1(DRUG_CODE);
IF PERSON_ID2=PERSON_ID AND DRUG_CODE2=DRUG_CODE AND BEG_DATE LE(END2+1) THEN
DO;
BEG2=MIN(BEG_DATE,BEG2);
END2=MAX(END_DATE,END2);
END;
ELSE
DO;
SEG+1;
BEG2=BEG_DATE;
END2=END_DATE;
END;
FORMAT BEG2 END2 MMDDYY10.;
RUN;
DATA THREE(DROP=BEG_DOS END_DOS SEG);
RETAIN BEG_DATE END_DATE;
SET TWO;
BY PERSON_ID SEG;
FORMAT BEG_DATE END_DATE MMDDYY10.;
IF FIRST.SEG THEN
DO;
BEG_DATE=BEG_DOS;
END;
IF LAST.SEG THEN
DO;
END_DATE = END_DOS;
OUTPUT;
END;
RUN;
This is how I would do it. Create an obs for each ID DRUG and DATE. Flag the gaps and summarize by RUN.
data have;
input ID Drug_Code (BEG End)(:mmddyy.);
format BEG End mmddyyd10.;
cards;
1 100 1/1/2018 3/1/2018
1 100 2/1/2018 04/30/2018
1 90 4/1/2018 04/30/2018
1 90 6/1/2018 8/15/2018
1 100 5/1/2018 6/1/2018
1 98 6/1/2018 8/31/2018
1 100 9/1/2018 5/4/2019
;;;;
run;
proc print;
run;
/*1 100 1/1/2018 1/1/2019*/
data exv/ view=exv;
set have;
do date = beg to end;
output;
end;
drop beg end;
format date mmddyyd10.;
run;
proc sort data=exv out=ex nodupkey;
by id drug_code date;
run;
data breaksV / view=BreaksV;
set ex;
by id drug_code;
dif = dif(date);
if first.drug_code then do; dif=1; run=1; end;
if dif ne 1 then run+1;
run;
proc summary data=breaksV nway missing;
class id drug_code run;
var date;
output out=want(drop=_type_) min=Begin max=End;
run;
Proc print;
run;
Computing the extent range composed of overlapping segment ranges requires a good understanding of the range conditions (cases).
Consider the scenarios when sorted by start date (within any larger grouping set, G, such as id and drug)
Let [ and ] be endpoints of a range
# be date values (integers) within
Extent be the combined range that grows
Segment be the range in the current row
Case 1 - Growth. Within G Segment start before Extent end
Segment will either not contribute to Extent or extend it.
[####] Extent
+ [#] Segment range DOES NOT contribute
--------
[####] Extent (do not output a row, still growing)
or
[####] Extent
+ [#####] Segment range DOES contribute
--------
[#######] Extent (do not output a row, still growing)
Case 2 - Terminus. 3 possibilities:
Within G Segment start after Extent end,
Next G reached (different id/drug combination),
End of data reached.
#2 and #3 can be tested by checking the appropriate last. flag.
[####] Extent
+ ..[#] Segment beyond Extent (gap is 2)
--------
[####] output Extent
[#] reset Extent to Segment
You can adjust your rules for Segment being adjacent (gap=0) or close enough (gap < threshold) to mean an Extent is either expanded, or, output and reset to Segment.
Note: The situation is a little more (not shown) complicated for the real world cases of:
missing start means the Segment has an unknown start date (presume it to be epoch (0=01JAN1960, or some date that pre-dates all dates in the data or study)
missing end means the Segment is active today (end date is date when processing data)
Sample code:
data have;
call streaminit(42);
do id = 1 to 10;
do _n_ = 1 to 50;
drug = ceil(rand('UNIFORM', 10));
beg_date = intnx ('MONTH', '01JAN2008'D, rand('UNIFORM',20));
end_date = intnx ('DAY', beg_date, rand('UNIFORM',75));
OUTPUT;
end;
end;
format beg_date end_date yymmdd10.;
run;
proc sort data=have out=segments;
by id drug beg_date end_date;
run;
data want;
set segments;
by id drug beg_date end_date; * will error if incoming data is NOT sorted;
retain ext_beg ext_end;
retain gap_allowed 0; * set to 1 for contiguously adjacent segment ;
if first.drug then do;
ext_beg = beg_date;
ext_end = end_date;
segment_count = 0;
end;
if beg_date <= ext_end + gap_allowed then do;
ext_end = max (ext_end, end_date);
segment_count + 1;
end;
else do;
extent_id + 1;
OUTPUT;
ext_beg = beg_date;
ext_end = end_date;
segment_count = 1;
end;
if last.drug then do;
extent_id + 1;
OUTPUT;
* reset occurs implicitly;
* it will happen at first. logic when control returns to top of step;
end;
format ext_: yymmdd10.;
keep id drug ext_beg ext_end segment_count extent_id;
run;
I have a dataset of patient data with each diagnosis on a different line.
This is an example of what it looks like:
patientID diabetes cancer age gender
1 1 0 65 M
1 0 1 65 M
2 1 1 23 M
2 0 0 23 M
3 0 0 50 F
3 0 0 50 F
I need to isolate the patients who have a diagnosis of both diabetes and cancer; their unique patient identifier is patientID. Sometimes they are both on the same line, sometimes they aren't. I am not sure how to do this because the information is on multiple lines.
How would I go about doing this?
This is what I have so far:
PROC SQL;
create table want as
select patientID
, max(diabetes) as diabetes
, max(cancer) as cancer
, min(DOB) as DOB
from diab_dx
group by patientID;
quit;
data final; set want;
if diabetes GE 1 AND cancer GE 1 THEN both = 1;
else both =0;
run;
proc freq data=final;
tables both;
run;
Is this correct?
If you want to learn about data steps lookup how this works.
data pat;
input patientID diabetes cancer age gender:$1.;
cards;
1 1 0 65 M
1 0 1 65 M
2 1 1 23 M
2 0 0 23 M
3 0 0 50 F
3 0 0 50 F
;;;;
run;
data both;
do until(last.patientid);
set pat; by patientid;
_diabetes = max(diabetes,_diabetes);
_cancer = max(cancer,_cancer);
end;
both = _diabetes and _cancer;
run;
proc print;
run;
add a having statement at the end of sql query should do.
PROC SQL;
create table want as
select patientID
, max(diabetes) as diabetes
, max(cancer) as cancer
, min(age) as DOB
from PAT
group by patientID
having calculated diabetes ge 1 and calculated cancer ge 1;
quit;
You might find some coders, especially those coming from statistical backgrounds, are more likely to use Proc MEANS instead of SQL or DATA step to compute the diagnostic flag maximums.
proc means noprint data=have;
by patientID;
output out=want
max(diabetes) = diabetes
max(cancer) = cancer
min(age) = age
;
run;
or for the case of all the same aggregation function
proc means noprint data=have;
by patientID;
var diabetes cancer;
output out=want max= ;
run;
or
proc means noprint data=have;
by patientID;
var diabetes cancer age;
output out=want max= / autoname;
run;
Sample Data
I was wondering if it is possible to use data instead of proc to count the number of categorical variables on a row as shown in 'count' example above. This will allow me to further use the data e.g COUNT=1 or COUNT > 1 to check morbidity.
Also will it be possible to then count the number of each diagnosis in the entire data set per patient while accounting for duplicates if there is any? For example there are 3 CB's and 2 AA's in this data set but CB should be 2 because patient 2 had it recorded twice.
Thank you for your time and have a lovely new year.
Your question is not clear but your could manage your diag using union all and count distinct
selec patient count(distinct diag )
from (
select patient, diag1 as diag
from my_table
uniona all
select patient, diag2
from my_table
uniona all
select patient, diag3
from my_table
uniona all
select patient, diag4
from my_table
) t
group by patient
or simply union and count
selec patient count(diag )
from (
select patient, diag1 as diag
from my_table
uniona
select patient, diag2
from my_table
uniona
select patient, diag3
from my_table
uniona
select patient, diag4
from my_table
) t
group by patient
The image indicates that for each row you want a count of the number of columns with non-missing values. Additionally, you apparently have some way to do this using a PROC step, but would like to know how using a DATA step.
In DATA step you can count the number of non-missing values indirectly using CMISS, or directly using COUNTC against a constructed value:
data have;
attrib pid length=8 diag1-diag4 length=$5;
input pid & diag1-diag4;
datalines;
1 AA J9 HH6 .
2 CB . . CB
3 J10 AA CB J10
4 B B . F90 .
5 J10 . . .
6 . . . .
run;
data have_with_count;
set have;
count = 4 - cmiss (of diag1-diag4);
count_way2 = countc(catx('~', of diag1-diag4, 'SENTINEL'), '~');
run;
In order to work again MySQL data source you will also need a libref that connects you to that remote data server.
Added
Counting distinct values across a row can be accomplished using a hash or sortc. Consider this example that sorts a copy of the row data (as an array) and counts the unique values within:
data want;
set have;
array diag diag1-diag4;
array v(4) $5 _temporary_;
do _n_ = 1 to dim(diag);
v(_n_) = diag(_n_);
end;
call sortc(of v(*));
uniq = 0;
do _n_ = 1 to dim(v);
if missing(v(_n_)) then continue;
if uniq = 0 then
uniq + 1;
else
uniq + ( v(_n_) ne v(_n_-1) );
end;
run;
With Richard's dummy data to count number of diagnosis and unique number of diagnosis:
data want;
set have;
array var diag:;
length temp $30.;
call missing(diag_num);
do over var;
if not missing(var) then do;
diag_num+1;
temp=ifc(whichc(var, temp),temp,catx(' ',temp,var));
end;
end;
unique_diag=countw(temp);
drop temp;
run;
I am trying to match max daily data within a month to a monthly data.
data daily;
input permno $ date ret;
datalines;
1000 19860101 88
1000 19860102 90
1000 19860201 70
1000 19860202 55
1001 19860201 97
1001 19860202 74
1001 19860203 79
1002 19860301 55
1002 19860302 100
1002 19860301 10
;
run;
data monthly;
input permno $ date ret;
datalines;
1000 19860131 1
1000 19860228 2
1000 19860331 5
1001 19860331 3
1002 19860430 4
;
run;
The result I want is the following; (I want to match daily max data to one month lag monthly data. )
1000 19860102 90 1000 19860228 2
1000 19860201 70 1000 19860331 5
1001 19860201 97 1001 19860331 3
1002 19860302 100 1002 19860430 4
Below is what I have tried so far.
I want to have maximum ret value within a month so I have created yrmon to assign same yyyymm data for the same month daily data
data a1; set daily;
yrmon=year(date)*100 + month(date);
run;
In order to choose the maximum value(here, ret) within same yrmon group for the same permno, I used code below
proc means data=a1 noprint;
class permno yrmon ;
var ret;
output out= a2 max=maxret;
run;
However, it only got me permno yrmon ret data, leaving the original date data away.
data a3;
set a2;
new=intnx('month',yrmon,1);
format date new yymmn6.;
run;
But it won't work since yrmon is no longer date format.
Thank you in advance.
Hello
I am trying to match two different sets by permno(same company) but with one month lag (eg. daily9 dataset yrmon=198601 and monthly2 dataset yrmon=198602)
it is pretty difficult to handle for me because if I just add +1 in yrmon, 198612 +1 will not be 198701 and I am confused with handling these issues.
Can anyone help?
1) informat date1/date2 yymmn6. is used to read the date in yyyymm format
2) format date1/date2 yymmn6. is used to view the date in yyyymm format
3) intnx("months",b.date2,-1) is used to join the dates with lag of 1 month
data data1;
input date1 value1;
informat date1 yymmn6.;
format date1 yymmn6.;
cards;
200101 200
200212 300
200211 400
;
run;
data data2;
input date2 value2;
informat date2 yymmn6.;
format date2 yymmn6.;
cards;
200101 3000000
200102 4000000
200301 2000000
200212 2000000
;
run;
proc sql;
create table result as
select a.*,b.date2,b.value2 from
data1 a
left join
data2 b
on a.date1 = intnx("months",b.date2,-1);
quit;
My Output:
date1 |value1 |date2 |value2
200101 |200 |200102 |4000000
200211 |400 |200212 |2000000
200212 |300 |200301 |2000000
Let me know in case of any queries.
I have a data set with daily data in SAS. I would like to convert this to monthly form by taking differences from the previous month's value by id. For example:
thedate, id, val
2012-01-01, 1, 10
2012-01-01, 2, 14
2012-01-02, 1, 11
2012-01-02, 2, 12
...
2012-02-01, 1, 20
2012-02-01, 2, 15
I would like to output:
thedate, id, val
2012-02-01, 1, 10
2012-02-01, 2, 1
Here is one way. If you license SAS-ETS, there might be a better way to do it with PROC EXPAND.
*Setting up the dataset initially;
data have;
informat thedate YYMMDD10.;
input thedate id val;
datalines;
2012-01-01 1 10
2012-01-01 2 14
2012-01-02 1 11
2012-01-02 2 12
2012-02-01 1 20
2012-02-01 2 15
;;;;
run;
*Sorting by ID and DATE so it is in the right order;
proc sort data=have;
by id thedate;
run;
data want;
set have;
retain lastval; *This is retained from record to record, so the value carries down;
by id thedate;
if (first.id) or (last.id) or (day(thedate)=1); *The only records of interest - the first record, the last record, and any record that is the first of a month.;
* To do END: if (first.id) or (last.id) or (thedate=intnx('MONTH',thedate,0,'E'));
if first.id then call missing(lastval); *Each time ID changes, reset lastval to missing;
if missing(lastval) then output; *This will be true for the first record of each ID only - put that record out without changes;
else do;
val = val-lastval; *set val to the new value (current value minus retained value);
output; *put the record out;
end;
lastval=sum(val,lastval); *this value is for the next record;
run;
You could achieve this using a PROC SQL, and the intnx function to bring last months date forward a month...
proc sql ;
create table lag as
select b.thedate, b.id, (b.val - a.val) as val
from mydata b
left join
mydata a on b.date = intnx('month',a.date,1,'s')
and b.id = a.id
order by b.date, b.id ;
quit ;
This may need tweaking to handle scenarios where the previous month doesn't exist or months which have a different number of days to the previous month.