Reducing database with minimum entries - sas

I have a dataset like this:
Coverage_Start Termination_Date Member_Id
24-Jul-19 1-Jun-21 42968701
24-Jul-19 1-Mar-21 42968701
29-Feb-20 1-Mar-20 42968701
16-Feb-19 1-Mar-19 42968701
1-Mar-17 1-Mar-18 42968701
1-Mar-16 1-Mar-17 42968701
1-Dec-15 31-Dec-16 42968701
I want to reduce this dataset by merging rows that form continuous coverage. For example, the last three rows overlap, with a minimum coverage_start of 1-Dec-15 and a maximum termination_date of 1-Mar-18, so I want to combine all three into one row.
As a result, the bottom three rows will be reduced to "1-Dec-15 1-Mar-18 42968701".
The reduced dataset should look like this:
Coverage_Start Termination_Date Member_Id
24-Jun-19 1-Jun-21 42968701
16-Feb-19 1-Mar-19 42968701
1-Dec-15 1-Mar-18 42968701
I want to achieve this using SAS programming.
Can anyone please help me with this? I have been trying for a very long time but couldn't achieve it.

Try this:
Data have;
infile datalines delimiter=',';
/* The colon modifier makes each informat stop at the comma delimiter */
input Coverage_Start :date9. Termination_Date :date9. Member_Id $;
format Coverage_Start Termination_Date ddmmyyp10.;
datalines;
24-Jul-19,1-Jun-21,42968701
24-Jul-19,1-Mar-21,42968701
29-Feb-20,1-Mar-20,42968701
16-Feb-19,1-Mar-19,42968701
1-Mar-17,1-Mar-18,42968701
1-Mar-16,1-Mar-17,42968701
1-Dec-15,31-Dec-16,42968701
;
Run;
Proc sort data=have;
By member_id Coverage_Start Termination_Date;
Run;
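Data _temp;
Set have;
By member_id;
Retain max_term_date;
/* count_id numbers each run of overlapping coverage periods within a member */
If first.member_id then do;
count_id = 1;
max_term_date = .;
End;
/* A start date later than every termination date seen so far begins a new run */
Else if Coverage_Start > max_term_date then count_id + 1;
max_term_date = max(Termination_Date,max_term_date);
Run;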
Proc sql;
Create table want(drop=count_id) as
Select member_id,count_id
,min(Coverage_Start) as Coverage_Start format=date9.
,max(Termination_Date) as Termination_Date format=date9.
From _temp
Group by 1,2;
Quit;
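The grouping logic above can also be collapsed into one DATA step that writes a row each time a gap is found. This is only a sketch, assuming HAVE is already sorted by member_id and Coverage_Start as in the PROC SORT above; the names want_single, span_start and span_end are made up for illustration.

```sas
/* Sketch only: want_single, span_start and span_end are hypothetical names. */
/* Assumes HAVE is sorted by member_id Coverage_Start.                       */
data want_single (keep=member_id span_start span_end);
  set have;
  by member_id;
  retain span_start span_end;
  format span_start span_end date9.;
  if first.member_id then do;
    /* First period for this member opens the first span */
    span_start = Coverage_Start;
    span_end   = Termination_Date;
  end;
  else if Coverage_Start > span_end then do;
    /* Gap found: write the finished span, then open a new one */
    output;
    span_start = Coverage_Start;
    span_end   = Termination_Date;
  end;
  else span_end = max(span_end, Termination_Date); /* overlap: extend the span */
  /* The last span of each member is still pending, so write it */
  if last.member_id then output;
run;
```

Changing the gap test to `Coverage_Start > span_end + 1` would also treat back-to-back periods (ending one day, starting the next) as continuous, if that is the rule you need.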

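/* Look-ahead merge: each row is paired with Member_Id and Termination_Date of the next row. */
/* Assumes HAVE is in the order shown in the question (latest coverage first within each member). */
data want (keep=Member_Id Coverage_Start Termination_Date);
merge have (rename=(Termination_Date=Expiration_Date))
have (firstobs=2 keep=Member_Id Termination_Date
rename=(Member_Id = Next_Id Termination_Date=Expiration_Next));
format Termination_Date date9.;
retain Termination_Date;
if Member_Id ne lag(Member_Id)
or lag(Coverage_Start) gt Expiration_Date + 1
then Termination_Date = Expiration_Date;
/* Subsetting IF: keep only the last row of each continuous run */
if Next_Id ne Member_Id
or Expiration_Next lt Coverage_Start - 1;
run;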

Related

Reducing Dataset Using SAS Programming

This is how I would do it. I do not get the same answer as you for rng 2; I'm not sure whether I have misunderstood or your expected output is wrong.
proc datasets kill nolist; quit;
/*
Coverage_Start Termination_Date Member_Id
29-Feb-20 1-Jun-21 42968701
16-Feb-19 1-Mar-19 42968701
1-Dec-15 1-Mar-18 42968701
*/
filename FT15F001 temp;
data cover0(keep=id date) / view=cover0;
infile FT15F001 firstobs=2;
input (start term) (:date.) id:$10.;
do date=start to term;
output;
end;
format date date11.;
parmcards;
Coverage_Start Termination_Date Member_Id
24-Jul-19 1-Jun-21 42968701
24-Jul-19 1-Mar-21 42968701
29-Feb-20 1-Mar-20 42968701
16-Feb-19 1-Mar-19 42968701
1-Mar-17 1-Mar-18 42968701
1-Mar-16 1-Mar-17 42968701
1-Dec-15 31-Dec-16 42968701
;;;;
run;
proc sort nodupkey data=cover0 out=cover1;
by id date;
run;
data cover2/ view=cover2;
set cover1;
by id;
dif = dif(date);
if first.id then do;
dif=1;
rng = 0;
end;
if dif ne 1 then rng + 1;
run;
proc summary data=cover2 nway missing;
class id;
class rng / descend;
output out=reduce(drop=_type_) min(date)=Start max(date)=Term;
run;
proc print;
run;

SAS cumulative count by unique ID and date

I have a dataset like below
Customer_ID Vistited_Date
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
I am trying to find the cumulative unique count of customers by date. I expect my output to look like this:
Cust_count Vistited_Date
3 7-Feb-20
2 14-Feb-20
7-Feb-2020 has 3 unique customers, whereas 14-Feb-2020 has only 2 because customer 1234 has already visited.
Does anyone know how I could build a dataset under these conditions?
Sorry if my question is not clear enough, and I am available to give more details if necessary.
Thanks!
NOTE: @draycut's answer has the same logic but is faster, and I will explain why.
@draycut's code uses a single hash method, add(), and tests its return code to decide whether to increment. My code uses check() to test for the conditional increment and then add() (which will never fail at that point) to track the ID. The one-method approach can be anywhere from 15% to 40% faster (depending on the number of groups, the size of the groups and the id reuse rate).
You will need to track the IDs that have occurred in all prior groups, and exclude the tracked IDs from the current group count.
Tracking can be done with a hash, and conditional counting can be performed in a DOW loop over each group. A DOW loop places the SET statement inside an explicit DO.
Example:
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
;
data counts(keep=date count);
if _n_ = 1 then do;
declare hash tracker();
tracker.defineKey('id');
tracker.defineDone();
end;
do until (last.date);
set have;
by date;
if tracker.check() ne 0 then do;
count = sum(count, 1);
tracker.add();
end;
end;
run;
Raw performance benchmark: there is no disk I/O, and the CPU time needed to fill the array is spent in the same step as the hashing, so those performance components are combined.
The root performance question is how fast new items can be added to the hash.
Simulate 3,000,000 'records', 1,000 groups of 3,000 dates, 10% id reuse (so the distinct ids will be ~2.7M).
%macro array_fill (top=3000000, n_group = 1000, overlap_factor=0.10);
%local group_size n_overlap index P Q;
%let group_size = %eval (&top / &n_group);
%if (&group_size < 1) %then %let group_size = 1;
%let n_overlap = %sysevalf (&group_size * &overlap_factor, floor);
%if &n_overlap < 0 %then %let n_overlap = 0;
%let top = %sysevalf (&group_size * &n_group);
P = 1;
Q = &group_size;
array ids(&top) _temporary_;
_n_ = 0;
do i = 1 to &n_group;
do j = P to Q;
_n_+1;
ids(_n_) = j;
end;
P = Q - &n_overlap;
Q = P + &group_size - 1;
end;
%mend;
options nomprint;
data _null_ (label='check then add');
length id 8;
declare hash h();
h.defineKey('id');
h.defineDone();
%array_fill;
do index = 1 to dim(ids);
id = ids(index);
if h.check() ne 0 then do;
count = sum(count,1);
h.add();
end;
end;
_n_ = h.num_items;
put 'num_items=' _n_ comma12.;
put index= comma12.;
stop;
run;
data _null_ (label='just add');
length id 8;
declare hash h();
h.defineKey('id');
h.defineDone();
%array_fill;
do index = 1 to dim(ids);
id = ids(index);
if h.add() = 0 then
count = sum(count,1);
end;
_n_ = h.num_items;
put 'num_items=' _n_ comma12.;
put index= comma12.;
stop;
run;
data have;
input Customer_ID Vistited_Date :anydtdte12.;
format Vistited_Date date9.;
datalines;
1234 7-Feb-2020
4567 7-Feb-2020
9870 7-Feb-2020
1234 14-Feb-2020
7654 14-Feb-2020
3421 14-Feb-2020
;
data want (drop=Customer_ID);
if _N_=1 then do;
declare hash h ();
h.definekey ('Customer_ID');
h.definedone ();
end;
do until (last.Vistited_Date);
set have;
by Vistited_Date;
if h.add() = 0 then Count = sum(Count, 1);
end;
run;
If your data is not sorted and you like SQL, this solution may work just as well for you, and it is very simple:
/* your example plus 3 more rows */
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
1234 15-Feb-20
7654 15-Feb-20
1111 15-Feb-20
;
run;
/* Simple set theory. The FINAL dataset contains the results shown
below. */
proc sql;
create table temp(where =(mindate=date)) as select
ID, date,min(date) as mindate from have
group by id;
create table final as select count(*) as customer_count,date from temp
group by date;
quit;
/* results:
customer_count Date
3 07.febr.20
2 14.febr.20
1 15.febr.20
*/
Another method, because I don't know hashes so well. >_<
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
;
data want;
length Used $200.;
retain Used;
set have;
by Date;
if first.Date then count = .;
if not find(Used,cats(ID)) then do;
count + 1;
Used = catx(',',Used,ID);
end;
if last.Date;
put Date= count=;
run;
If you are not overly concerned with processing speed and want something simple:
proc sort data=have;
by id date;
** Get date of each customer's first unique visit **;
proc sort data=have out=first_visit nodupkey;
by id;
proc freq data=first_visit noprint;
tables date /out=want (keep=date count);
run;

SAS Fast forward a date until a limit using INTNX/INTCK

I'm looking to take an observation's date and keep rolling it forward by its specified repricing parameter until a target date.
The dataset being used is:
data have;
input repricing_frequency date_of_last_repricing end_date;
datalines;
3 15399 21367
10 12265 21367
15 13879 21367
;
format date_of_last_repricing end_date date9.;
informat date_of_last_repricing end_date date9.;
run;
So the idea is that I'd keep applying the repricing frequency of 3, 10 or 15 months to date_of_last_repricing until it is as close as it can be to the date "31DEC2017". Thanks in advance.
EDIT: including my recent workings:
data want;
set have;
repricing_N = intck('Month',date_of_last_repricing,'31DEC2017'd,'continuous');
dateoflastrepricing = intnx('Month',date_of_last_repricing,repricing_N,'E');
format dateoflastrepricing date9.;
informat dateoflastrepricing date9.;
run;
The INTNX function computes an incremented date value and lets you specify the alignment within the resulting interval (in your case the 'end' of the month n months hence).
data have;
format date_of_last_repricing end_date date9.;
informat date_of_last_repricing end_date date9.;
* use 12. to read the raw date values in the datalines;
input repricing_frequency date_of_last_repricing: 12. end_date: 12.;
datalines;
3 15399 21367
10 12265 21367
15 13879 21367
;
run;
data want;
set have;
status = 'Original';
output;
* increment and iterate;
date_of_last_repricing = intnx('month',
date_of_last_repricing, repricing_frequency, 'end'
);
do while (date_of_last_repricing <= end_date);
status = 'Computed';
output;
date_of_last_repricing = intnx('month',
date_of_last_repricing, repricing_frequency, 'end'
);
end;
run;
If you only want the nearest end date that iterating by the repricing frequency would reach, you do not have to iterate. You can divide the months apart by the frequency to get the number of iterations that would have occurred.
data want2;
set have;
nearest_end_month = intnx('month', end_date, 0, 'end');
if nearest_end_month > end_date then nearest_end_month = intnx('month', nearest_end_month, -1, 'end');
months_apart = intck('month', date_of_last_repricing, nearest_end_month);
iterations_apart = floor(months_apart / repricing_frequency);
iteration_months = iterations_apart * repricing_frequency;
nearest_end_date = intnx('month', date_of_last_repricing, iteration_months, 'end');
format nearest: date9.;
run;
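/* The sample data has no id column; repricing_frequency is unique per row here, so it is used as the key (an assumption) */
proc sql;
select repricing_frequency, max(date_of_last_repricing) as nearest_end_date format=date9. from want group by repricing_frequency;
select repricing_frequency, nearest_end_date from want2;
quit;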

Calculating correlation and covariance for a event window in SAS

I have to calculate the correlation and covariance of my daily sales values over an event window. The event window is a 45-day period and my data looks like this:
store_id date sales
5927 12-Jan-07 3,714.00
5927 12-Jan-07 3,259.00
5927 14-Jan-07 3,787.00
5927 14-Jan-07 3,480.00
5927 17-Jan-07 3,646.00
5927 17-Jan-07 3,316.00
4978 18-Jan-07 3,530.00
4978 18-Jan-07 3,103.00
4978 18-Jan-07 3,026.00
4978 21-Jan-07 3,448.00
Now, for every store_id, date combination, I need to go back 45 days (there is more data for each combination in my original data set) calculate the correlation between sales and lag(sales) i.e. autocorrelation of degree one. As you can see, the date column is not continuous. So something like (date - 45) is not going to work.
I have gotten as far as this:
data ds1;
set ds;
by store_id;
LAG_SALE = lag(sales);
IF FIRST.store_id THEN DO;
LAG_SALE = .;
END;
run;
For calculating correlation and covariances -
proc corr data=ds1 outp=Corr cov; /** COV option: include covariances **/
by store_id date;
var sales lag_sale;
run;
But how do I insert the event window for each date, store_id combination? My final output should look something like this -
id date corr cov
5927 12-Jan-07 ... ...
5927 14-Jan-07 ... ...
Here is what I've come up with:
First I convert the date to a SAS date, which is the number of days since Jan. 1 1960:
data ds;
set ds (rename=(date=old_date));
date = input(old_date, date11.);
drop old_date;
run;
Then compute lag_sale (I am using the same calculation you used in the question, but make sure this is what you want to do. For some observations the lag sale is the previous recorded date, but for some it is the same store_id and date, just a different observation.):
proc sort data=ds; by store_id; run;
data ds;
set ds;
by store_id;
lag_sale = lag(sales);
if first.store_id then lag_sale = .;
run;
Then set up the final data set:
data final;
length store_id 8 date 8 cov 8 corr 8;
if _n_ = 0;
run;
Then create a macro which takes a store_id and date and runs proc corr. The first part of the macro selects only the data with that store_id and within the past 45 days of the date. Then it runs proc corr. Then it formats proc corr how you want it and appends the results to the "final" data set.
%macro corr(store_id, date);
data ds2;
set ds;
where store_id = &store_id and %eval(&date-45) <= date <=&date
and lag_sale ne .;
run;
proc corr noprint data=ds2 cov outp=corr;
by store_id;
var sales lag_sale;
run;
data corr2;
set corr;
where _type_ in ('CORR', 'COV') and _name_ = 'sales';
retain cov;
date = &date;
if _type_ = 'COV' then cov = lag_sale;
else do;
corr = lag_sale;
output;
end;
keep store_id date corr cov;
run;
proc append base=final data=corr2 force; run;
%mend corr;
Finally run the macro for each store_id/date combination.
proc sort data=ds out=ds3 nodupkey;
by store_id date;
run;
data _null_;
set ds3;
call execute('%corr('||store_id||','||date||');');
run;
proc sort data=final;
by store_id date;
run;
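If launching one PROC CORR per store_id/date pair turns out to be slow, the windows can instead be materialised with a self-join and processed in a single BY-group run. This is a sketch rather than a tested replacement: it assumes the ds data set with the numeric date and lag_sale built above, and the names windows, event_date and corr_all are made up for illustration.

```sas
/* Sketch only: windows, event_date and corr_all are hypothetical names. */
/* Assumes DS holds store_id, a numeric SAS date, sales and lag_sale.    */
proc sql;
  create table windows as
  select a.store_id,
         a.date as event_date format=date9.,
         b.date,
         b.sales,
         b.lag_sale
  from (select distinct store_id, date from ds) as a
       inner join ds as b
       on  a.store_id = b.store_id
       and b.date between a.date - 45 and a.date  /* 45-day look-back window */
  where b.lag_sale is not missing
  order by a.store_id, a.date, b.date;
quit;

/* One pass: each store_id/event_date window is a BY group */
proc corr data=windows cov outp=corr_all noprint;
  by store_id event_date;
  var sales lag_sale;
run;
```

The corr_all output has the same _TYPE_/_NAME_ layout as the per-call output above, so the CORR and COV rows can be picked out the same way, keyed by store_id and event_date. The trade-off is that the windows table repeats each sales row once for every event date whose window contains it.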

Changing date format in SAS9.3

Does anyone know how to change a date variable from Date9 to MMDDYY10 format in SAS9.3? I've tried using the put and input functions, but the result is null
Formats are nothing but instructions on how to display a value. Dates are numeric values stored as the number of days since 1JAN1960.
data x;
format formated1 date9. formated2 mmddyy10.;
noformated = "01JAN1960"d;
formated1 = noformated;
formated2 = noformated;
run;
proc print data=x;
run;
Obs formated1 formated2 noformated
1 01JAN1960 01/01/1960 0
In short, just change the format on the dataset and the date will be displayed with the new format.
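As a minimal sketch of that point for a dataset that already exists, a FORMAT statement in a later DATA step swaps the display format without touching the stored value. The dataset names dates and dates_mmddyy and the variables olddate and as_text are placeholders, not names from the question.

```sas
/* Sketch only: dataset and variable names are placeholders. */
data dates;
  olddate = '15MAR2021'd;
  format olddate date9.;      /* displayed as 15MAR2021 */
run;

data dates_mmddyy;
  set dates;
  format olddate mmddyy10.;   /* same stored number, now displayed as 03/15/2021 */
  as_text = put(olddate, mmddyy10.); /* character copy, only if text is really needed */
run;

proc print data=dates_mmddyy;
run;
```

PUT is only needed when you want a character string; for display purposes the FORMAT statement alone is enough.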
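If you need to go through text, use PUT with the target format. INPUT has to use the informat that matches the text it reads; reading a DATE9. string with MMDDYY10. returns missing:
tmpdate = put(olddate,MMDDYY10.);
newdate = input(tmpdate,MMDDYY10.);
Or in one step:
newdate = input(put(olddate,MMDDYY10.),MMDDYY10.);
format newdate MMDDYY10.;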
To change the format of a variable in an existing table, use PROC SQL or PROC DATASETS:
data WORK.TABLE1;
format DATE1 DATE2 date9.;
DATE1 = today();
DATE2 = DATE1;
run;
proc contents;
run;
proc datasets lib=WORK nodetails nolist;
modify TABLE1;
format DATE1 mmddyy10.;
quit;
proc sql;
alter table WORK.TABLE1
modify DATE2 format=mmddyy10.
;
quit;
proc contents;
run;