I have a dataset that looks like this:
ID start_date end_date
1 01/01/2022 01/02/2022
1 01/02/2022 01/05/2022
1 01/06/2022 01/07/2022
2 01/09/2019 01/22/2022
2 06/07/2014 09/10/2015
3 11/10/2012 02/01/2013
I am trying to create a dummy indicator to show events that are back-to-back. So far, I have been able to do the following:
data df_1;
set df_2;
by ID end_date;
lag_epi_e = lag(end_date);
if not (first.ID) then do;
date_diff= start_date- lag(end_date);
end;
format lag_epi_e date9.;
run;
The issue with this code is that it creates an indicator showing that events are back-to-back, but it does not flag the first event, only the follow-up events. Here is an example of how it looks:
ID start_date end_date b2b_ind
1 01/01/2022 01/02/2022 0
1 01/02/2022 01/05/2022 1
1 01/06/2022 01/07/2022 1
How can I rewrite the code so that all events take on an indicator of 1 when they are back-to-back?
Do you want 1 at first record as well?
If so you can set that, but what happens if the next record set is not back to back?
May help to show your expected output.
Note that you should also call LAG() outside the IF block and use the resulting variable inside it;
otherwise you'll get unexpected results, because LAG() only updates its queue when it is executed.
data df_1;
set df_2;
by ID end_date;
lag_epi_e = lag(end_date);
if not (first.ID) then do;
date_diff= start_date- lag_epi_e;
end;
else if first.id then date_diff=1;
format lag_epi_e date9.;
run;
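To see why the placement of LAG() matters, here is a minimal sketch (the dataset lag_demo and its variables are made up for illustration). LAG() returns the value stored the last time that same call executed, not the value from the previous observation, so a call that only runs under a condition falls out of step:

```sas
data lag_demo;
  input x;
  /* Executed on every row: always returns the previous row's x */
  lag_always = lag(x);
  /* Executed on even rows only: returns x from the previous EVEN row */
  if mod(_n_, 2) = 0 then lag_in_if = lag(x);
  datalines;
10
20
30
40
;
run;

proc print data=lag_demo;
run;
```

On the fourth row, lag_always is 30 while lag_in_if is 20, because the conditional call last ran on row 2.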
In your case, you'll want to check whether either the leading or the lagging event butts up against the current one. SAS has no lead function, but there are many ways to get the same result. My favorite is from this SGF paper: Calculating Leads (and Lags) in SAS®: One Problem, Many Solutions
Let's add a lead to your data. This code is doing three things:
Opening up your dataset df_1 in the "background"
Fetching the n + 1th observation of start_date and saving it to a variable
Setting it to missing if we're on the last id
Code:
data want;
set df_1;
by ID end_date;
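retain _dsid_;
/* Open a second, independent read of the same dataset */
if(_N_ = 1) then _dsid_ = open("df_1");
/* Read ahead: fetch observation n+1 and pick up its start_date */
_lead_rc_ = fetchobs(_dsid_, _N_+1);
lead_start_date = getvarn(_dsid_, varnum(_dsid_, "start_date"));
lag_end_date = lag(end_date);
/* Do not compare across different IDs */
if(first.id) then call missing(lag_end_date);
if(last.id) then call missing(lead_start_date);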
b2b_ind = ( (0 LE (lead_start_date - end_date) LE 1)
OR (0 LE (start_date - lag_end_date) LE 1)
);
drop _lead_rc_ _dsid_;
format lead_start_date lag_end_date mmddyy10.;
run;
Output:
id start_date end_date lead_start_date lag_end_date b2b_ind
1 01/01/2022 01/02/2022 01/02/2022 . 1
1 01/02/2022 01/05/2022 01/06/2022 01/02/2022 1
1 01/06/2022 01/07/2022 . 01/05/2022 1
2 06/07/2014 09/10/2015 01/09/2019 . 0
2 01/09/2019 01/22/2022 . 09/10/2015 0
3 11/10/2012 02/01/2013 . . 0
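Another common way to get a lead is to merge the dataset with itself, offset by one observation. The following is only a sketch: it assumes df_1 is already sorted by ID and end_date, and the names want_merge, next_ID and lag_ID are illustrative rather than taken from the original post.

```sas
data want_merge;
  /* One-to-one merge (no BY): row n is paired with row n+1 of the same data */
  merge df_1
        df_1(firstobs=2 keep=ID start_date
             rename=(ID=next_ID start_date=lead_start_date));

  lag_end_date = lag(end_date);
  lag_ID = lag(ID);

  /* Discard neighbours that belong to a different ID */
  if ID ne lag_ID then call missing(lag_end_date);
  if ID ne next_ID then call missing(lead_start_date);

  b2b_ind = ( (0 LE (lead_start_date - end_date) LE 1)
           OR (0 LE (start_date - lag_end_date) LE 1) );

  drop next_ID lag_ID;
  format lead_start_date lag_end_date mmddyy10.;
run;
```

This avoids the OPEN/FETCHOBS calls at the cost of carrying the neighbouring ID along so that group boundaries can be detected without a BY statement.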
You can optionally do this in two passes if you have SAS/ETS:
proc expand data=df_1 out=df1_lead(drop=time);
by id;
convert start_date = lead_start_date / transform=(lead 1);
run;
data df_2;
input ID $ start_date : mmddyy10. end_date : mmddyy10.;
format start_date end_date date9.;
pk=_n_;
cards;
1 01/01/2022 01/02/2022
1 01/02/2022 01/05/2022
1 01/06/2022 01/07/2022
2 01/09/2019 01/22/2022
2 06/07/2014 09/10/2015
3 11/10/2012 02/01/2013
;
run;
proc sql;
/* Self-join within the same ID; keeps only the rows that touch another event */
create table df_1(drop=pk) as select distinct d1.*,
abs(start_date2-end_date)<=1 or abs(start_date-end_date2)<=1 as b2b_ind
from df_2 d1 inner join df_2(rename=(start_date=start_date2 end_date=end_date2
pk=pk2)) d2
on d1.ID = d2.ID
having b2b_ind=1 and pk^=pk2
order by ID,start_date,end_date;
quit;
Related
What I am trying to do is the following:
Have this table:
Table1
Item Date1 Date2
1 6/1/2021 7/31/2021
2 7/4/2021 7/30/2021
3 6/20/2021 7/28/2021
....
My want table is the following:
Item Date
1 6/1/2021
1 6/6/2021
1 6/11/2021
1 6/16/2021
...
Basically I am trying to create a date by incrementing 5 days from the start date until the last date.
Something like this should get you started:
data want;
set have;
format date date1 date2 date9.;
do Date=date1 to date2 by 5;
Date = MIN(Date, Date2);
output;
end;
*keep Item Date;
run;
I have a data set with id and date. I want a data set with two records per id, with the condition that one is dated before 8th June and the other after 8th June.
To take the first date and the first date after 2021-06-08 you can sort by ID and DATE and use LAG() to detect when you cross the date boundary.
data have ;
input id date :date. ;
format date date9.;
cards;
1 01jun2021
1 07jun2021
1 08jun2021
1 09jun2021
;
data want;
set have ;
by id date;
if first.id or ( (date<='08JUN2021'd) ne lag(date<='08JUN2021'd));
run;
results
Obs id date
1 1 01JUN2021
2 1 09JUN2021
I am having difficulty coming up with clean code for counting the number of times date intervals include the 15th of a month between 01.01.2019 and 31.12.2020. As a simple example, consider the two intervals:
Obs | DateStart | DateEnd
1 14Oct2019 20Mar2020
2 13Nov2018 29Jan2020
I want to determine how many overlaps these have with the 15th of the months between 01Jan2019-31Dec2020. In the end I would like to produce a crosstable showing year in rows and months in columns with a count for each time the above intervals included the 15th of the month. From the above two date intervals, I would like an output like the following:
Months: 1 2 3 4 5 6 7 8 9 10 11 12
2019 1 1 1 1 1 1 1 1 1 2 2 2
2020 2 1 1 . . . . . . . . .
I am currently trying to set up a dataset with columns 1 through 24, which I will then reformat and later cross in a proc freq. This seems like the long way around, and I am having trouble identifying when the date intervals include the 15th of any month.
I have ~100 observations to do this for. Any help would be appreciated.
You can iterate over the date range and output to a second data set (or view) when the date is a 15th. The second data set can be tabulated for frequency counts.
Example:
data have;
call streaminit(2021);
do _n_ = 1 to 1000;
date1 = today() - rand('integer', 2500);
date2 = date1 + rand('integer', 720);
output;
end;
run;
data _15ths(keep=year month);
set have;
do date = date1 to date2;
if day(date) ne 15 then continue;
year = year(date);
month = month(date);
output;
end;
run;
proc tabulate data=_15ths;
title 'Number of date ranges with a 15th';
class year month;
table year='',month*n=''/nocellmerge box='year';
run;
This produces a table of counts with years in rows and months in columns.
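Looping over every day works, but with long intervals it may be cheaper to step from one 15th to the next. The following is a sketch using the two intervals from the question; the dataset names have_q and _15ths_q are made up here, and the restriction to 2019-2020 follows the question's wording.

```sas
/* The two intervals from the question */
data have_q;
  input DateStart :date9. DateEnd :date9.;
  format DateStart DateEnd date9.;
  datalines;
14Oct2019 20Mar2020
13Nov2018 29Jan2020
;
run;

data _15ths_q(keep=year month);
  set have_q;
  /* Start at the 15th of the interval's first month */
  date = mdy(month(DateStart), 15, year(DateStart));
  do while (date <= DateEnd);
    /* Skip a 15th that precedes the interval or lies outside 2019-2020 */
    if date >= DateStart and '01JAN2019'd <= date <= '31DEC2020'd then do;
      year = year(date);
      month = month(date);
      output;
    end;
    /* Jump to the 15th of the next month */
    date = intnx('month', date, 1, 'same');
  end;
run;

proc tabulate data=_15ths_q;
  class year month;
  table year='', month*n='' / nocellmerge box='year';
run;
```

For these two intervals the counts should match the table in the question: 1 for January through September 2019, 2 for October 2019 through January 2020, and 1 for February and March 2020.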
Hi, I am trying to calculate how much a customer paid in a given month by subtracting their balance from the next month's balance.
For example, I want to calculate PaidAmount for A111 in Jun-20 as the Balance in Jul-20 minus the Balance in Jun-20. Can anyone help, please? Thank you.
For this situation there is no need to look ahead as you can create the output you want just by looking back.
data have;
input id date balance ;
informat date yymmdd10.;
format date yymmdd10.;
cards;
1 2020-06-01 10000
1 2020-07-01 8000
1 2020-08-01 5000
2 2020-06-01 10000
2 2020-07-01 8000
3 2020-08-01 5000
;
data want;
set have ;
by id date;
lag_date=lag(date);
format lag_date yymmdd10.;
lag_balance=lag(balance);
payment = lag_balance - balance ;
if not first.id then output;
if last.id then do;
payment=.;
lag_balance=balance;
lag_date=date;
output;
end;
drop date balance;
rename lag_date = date lag_balance=balance;
run;
proc print;
run;
Result:
Obs id date balance payment
1 1 2020-06-01 10000 2000
2 1 2020-07-01 8000 3000
3 1 2020-08-01 5000 .
4 2 2020-06-01 10000 2000
5 2 2020-07-01 8000 .
6 3 2020-08-01 5000 .
This calls for a LEAD calculation, which is typically done with PROC EXPAND, but that requires a SAS/ETS license, which not many users have. Another option is to merge the data with itself, offsetting the records by one so that the next month's record is on the same line.
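data want;
  /* Assumes have is sorted by clientID and month.
     No BY statement: pair each record with the one that follows it. */
  merge have
        have(firstobs=2 keep=clientID balance
             rename=(clientID=next_clientID balance=next_balance));
  /* If the next record belongs to another client, there is no next month */
  if clientID ne next_clientID then call missing(next_balance);
  PaidAmount = Balance - next_balance;
  drop next_clientID;
run;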
If months can be missing from your series, this is not a good approach, because the following record is not necessarily the following month. In that case, do an explicit join in SQL instead. This assumes month is stored as a SAS date.
proc sql;
create table want as
select t1.*, t1.balance - t2.balance as paidAmount
from have as t1
left join have as t2
on t1.clientID = t2.ClientID
/* joins the current month (t1) with the next month (t2) */
and intnx('month', t1.month, 1, 'b') = intnx('month', t2.month, 0, 'b');
quit;
Code is untested as no test data was provided (I won't type out your data to test code).
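Since no test data was posted, here is a small made-up dataset for trying out the SQL approach. The names have_demo and want_demo and the balances are hypothetical, and the August record is left out on purpose to show how a gap behaves. The join condition is written so that t2 is the record exactly one calendar month after t1.

```sas
/* Hypothetical test data: client A111 has no record for Aug-20 */
data have_demo;
  input clientID $ month :yymmdd10. balance;
  format month yymmdd10.;
  datalines;
A111 2020-06-01 10000
A111 2020-07-01 8000
A111 2020-09-01 5000
;
run;

proc sql;
  create table want_demo as
  select t1.*, t1.balance - t2.balance as paidAmount
  from have_demo as t1
  left join have_demo as t2
    on t1.clientID = t2.clientID
    /* t2 is the record one calendar month after t1 */
    and intnx('month', t1.month, 1, 'b') = intnx('month', t2.month, 0, 'b');
quit;
```

With this data, June gets a paidAmount of 2000, while July and September get a missing value because neither has a record for the following month.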
I want to ask a complicated (for me) question about SAS programming. I think I can explain it better with a simple example. I have the following dataset:
Group Category
A 1
A 1
A 2
A 1
A 2
A 3
B 1
B 2
B 2
B 1
B 3
B 2
I want to count each category within each group. I can do it with PROC FREQ, but that does not seem like a good approach for my dataset: it is very large and has a huge number of groups, so it would be time-consuming. With PROC FREQ, I would first need to create a new dataset for each group and then run PROC FREQ on each one. In sum, I need to create the following dataset:
CATEGORIES
Group 1 (first category) 2 3
A 3 2 1
B 2 3 1
So the count of the first category in group A is 3, the count of the first category in group B is 2, and so on. I hope that explains it. Thanks for your help.
There is more than one way to do this in SAS. My bias is proc sql, so:
proc sql;
select grp,
sum(case when category = 1 then 1 else 0 end) as cat_1,
sum(case when category = 2 then 1 else 0 end) as cat_2,
sum(case when category = 3 then 1 else 0 end) as cat_3
from t
group by grp;
quit;
Either proc freq or proc summary will do the job of producing frequency counts:
data example;
length group category $1;
input group category;
cards;
A 1
A 1
A 2
A 1
A 2
A 3
B 1
B 2
B 2
B 1
B 3
B 2
;
run;
proc freq data=example;
table group*category;
run;
proc summary data=example nway;
class group category;
output out=example_frequency (drop=_type_);
run;
proc summary will produce a dataset in a 'long' format. If you need to transpose it (I'd suggest not doing so: you'll probably find working with the long format easier in most circumstances) you can use proc transpose:
proc transpose data=example_frequency out=example_matrix (drop=_name_);
by group;
id category;
var _freq_;
run;
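PROC FREQ does not need a separate dataset per group either: a single two-way TABLES request covers every group in one pass and can write the counts to a dataset. A minimal sketch, reusing the example dataset above (the output name example_counts is made up here):

```sas
proc freq data=example noprint;
  /* One pass over all groups; SPARSE keeps zero-count combinations */
  tables group*category / out=example_counts(drop=percent) sparse;
run;
```

The output dataset is in the same long format as the PROC SUMMARY result, with the frequency in a variable named COUNT, so it can be transposed the same way if a wide layout is really needed.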