Calculate a moving sum/average depending on date - sas

My data has the following structure
data have;
infile datalines delimiter=' ';
input customerid : 8.
date : date9.
opens : 3.
;
datalines;
2123780 11APR2017 0
2123780 13APR2017 0
2123780 16APR2017 1
2123780 18APR2017 0
2123780 19APR2017 2
2123780 20APR2017 0
2123780 21APR2017 0
2123780 23APR2017 0
2123780 25APR2017 0
2123780 26APR2017 0
2123780 28APR2017 0
2123780 29APR2017 3
2123780 01MAY2017 3
2123780 03MAY2017 2
2123780 04MAY2017 5
2123780 05MAY2017 1
2123780 07MAY2017 2
2123780 09MAY2017 2
2123780 11MAY2017 3
2123780 13MAY2017 3
2123780 14MAY2017 0
2123780 16MAY2017 2
2123780 17MAY2017 2
;
run;
What I would like to achieve is a moving total, average, standard deviation, etc. for the opens variable (and many more), computed over the values that fall within the last 7, 14, 30, etc. days PRIOR to the current observation, by customerid.
As you can see, the observations occur irregularly. Sometimes there are large gaps between them, sometimes there are several on the same day. Therefore, I cannot use PROC EXPAND (correct me if I'm wrong). Furthermore, I do not want to compress my data into e.g. one observation per week, but keep the observations as they are.
The solution I came up with is an ugly piece of LAG()-coding and if-clauses. An example for one variable and 7 days:
%macro loop;
data want(drop= lag_kdnr_num -- lag_mahnung min7 -- min365 minimum);
set have;
week_opens=0;
%do i=1 %to 500;
lag_customerID=lag&i.(customerID);
date_7=lag&i.(date)+7;
lag_opens=lag&i(opens);
if ((customerID=lag_customerID) and (date < date_7)) then
do;
week_opens=sum(week_opens + lag_opens);
end;
%end;
min7=minimum + 7;
if date < min7 then
do;
week_opens=.;
end;
run;
%MEND;
%loop;
which is giving me this:
data want2;
infile datalines delimiter=' ';
input customerid : 8.
date : date9.
opens : 3.
week_opens : 3.
;
datalines;
2123780 11APR2017 0 .
2123780 13APR2017 0 .
2123780 16APR2017 1 .
2123780 18APR2017 0 1
2123780 19APR2017 2 1
2123780 20APR2017 0 3
2123780 21APR2017 0 3
2123780 23APR2017 0 2
2123780 25APR2017 0 2
2123780 26APR2017 0 0
2123780 28APR2017 0 0
2123780 29APR2017 3 0
2123780 01MAY2017 3 3
2123780 03MAY2017 2 6
2123780 04MAY2017 5 8
2123780 05MAY2017 1 13
2123780 07MAY2017 2 11
2123780 09MAY2017 2 10
2123780 11MAY2017 3 5
2123780 13MAY2017 3 7
2123780 14MAY2017 0 8
2123780 16MAY2017 2 6
2123780 17MAY2017 2 8
;
run;
Due to the huge amount of data, this is really slow and creates a lot of unused variables which I drop at the end.
Is there a faster, more elegant way to get this result e.g. via a temporary array or SAS/ETS?
Thank you in advance!
Final remark: I want to use this information as covariates in a survival analysis.
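One way to avoid the 500 LAG() calls is a single DATA step pass that keeps each customer's earlier rows in temporary arrays and walks backwards only as far as the window reaches. The following is a minimal sketch rather than a tested production solution. It assumes that no customer has more than 1000 rows (the array size is an arbitrary bound chosen here), and that the window should match the macro above: earlier rows less than 7 days back are summed, and the result is set to missing until 7 days of history exist. The names d, o, n and first_date are made up for illustration.

```sas
proc sort data=have;
  by customerid date;
run;

data want;
  /* Buffers for the dates and opens values of the current customer's earlier rows. */
  /* 1000 is an assumed upper bound on rows per customer; raise it if needed.       */
  array d[1000] _temporary_;
  array o[1000] _temporary_;
  retain n first_date;

  set have;
  by customerid;

  /* Start a fresh buffer for each customer. */
  if first.customerid then do;
    n = 0;
    first_date = date;
  end;

  /* Walk backwards through earlier rows and stop once they leave the 7-day window. */
  week_opens = 0;
  do i = n to 1 by -1;
    if date - d[i] >= 7 then leave;
    week_opens = sum(week_opens, o[i]);
  end;

  /* Mirror the macro: no result until 7 days of history are available. */
  if date - first_date < 7 then week_opens = .;

  /* Store the current row so that later rows can see it. */
  n = n + 1;
  d[n] = date;
  o[n] = opens;

  format date date9.;
  drop i n first_date;
run;
```

Because the rows are sorted by date, the backward loop touches only the rows inside the window, so the cost per observation depends on the window length and not on the customer's full history. Longer windows (14, 30 days) could be accumulated in the same loop by leaving at the largest window and testing the date difference for each total.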

Related

SAS LOOP - create columns from the records which are having a value

Suppose I have random diagnostic codes, such as 001, v58, ..., 142, ... How can I construct one column per code, with a 1 for the records where that code was found?
Input:
id found code
1 1 001
2 0 v58
3 1 v58
4 1 003
5 0 v58
......
......
15000 0 v58
Output:
id code_001 code_v58 code_003 .......
1 1 0 0
2 0 0 0
3 0 1 0
4 1 0 0
5 0 0 0
.........
.........
You will want to TRANSPOSE the values and name the pivoted columns according to data (value of code) with an ID statement.
Example:
In real world data it is often the case that missing diagnoses will be flagged zero, and that has to be done in a subsequent step.
data have;
input id found code $;
datalines;
1 1 001
2 0 v58
2 1 003 /* second diagnosis result for patient 2 */
3 1 v58
4 1 003
5 0 v58
;
proc transpose data=have out=want(drop=_name_) prefix=code_;
by id;
id code; * column name becomes <prefix><code>;
var found;
run;
* missing occurs when an id was not diagnosed with a code;
* if that means the flag should be zero (for logistic modeling perhaps)
* the missings need to be changed to zeroes;
data want;
set want;
array codes code_:;
do _n_ = 1 to dim(codes); /* repurpose automatic variable _n_ for loop index */
if missing(codes(_n_)) then codes(_n_) = 0;
end;
run;

Count the number of unique ids for every subset of variables

I want to find the number of unique ids for every subset combination of the variables. For example
data have;
input id var1 var2 var3;
datalines;
5 1 0 0
5 1 1 1
5 1 0 1
5 0 0 0
6 1 0 0
7 1 1 1
8 1 0 1
9 0 0 0
10 1 0 0
11 1 0 0
12 1 . 1
13 0 0 1
;
run;
I want the result to be
var1 var2 var3 count
. . 0 5
. . 1 5
. 0 . 7
. 0 0 5
. 0 1 3
. 1 . 2
. 1 1 2
0 . . 3
0 . 0 2
0 . 1 1
0 0 . 3
0 0 0 2
0 0 1 1
1 . . 7
1 . 0 4
1 . 1 4
1 0 . 5
1 0 0 4
1 0 1 2
1 1 . 2
1 1 1 2
which is the result of appending the results of all the possible proc sql group by queries (the one for var1 is shown below):
proc sql;
create table sub1 as
select var1, count(distinct id) as count
from have
where not missing(var1)
group by var1
;
quit;
I don't care about the case where all variables are missing or when any of the variables in the group by are missing. Is there a more efficient way of doing this?
You can use Proc SUMMARY to compute the combinations of var1-var3 values for each id by group. From the SUMMARY output a SQL query can count the distinct ids per combination.
Example:
data have;
input id var1 var2 var3;
datalines;
5 1 0 0
5 1 1 1
5 1 0 1
5 0 0 0
6 1 0 0
7 1 1 1
8 1 0 1
9 0 0 0
10 1 0 0
11 1 0 0
12 1 . 1
13 0 0 1
;
proc summary noprint missing data=have;
by id;
class var1-var3;
output out=combos;
run;
proc sql;
create table want as
select var1, var2, var3, count(distinct id) as count
from combos
group by var1, var2, var3
;
quit;

Create values for group - SAS

data test;
input Index Indicator value FinalValue;
datalines;
1 0 5 21
1 1 21 21
2 1 0 0
3 0 4 7
3 1 7 7
3 0 8 7
3 0 2 7
4 1 1 1
4 0 4 1
;
run;
I have a data set with the first 3 columns. How do I get the 4th column based on the indicator? For example, for index 1, the value is 21 when the indicator = 1, so I put 21 as the final value in all rows for index 1.
Use the SAS RETAIN statement.
You can do this in a data step by retaining the value where indicator = 1.
Steps:
Sort your data by Index and Indicator
Group by the Index & Retain the Value where Indicator=1
Code:
/*Sort data by Index and Indicator and remove the hardcoded FinalValue*/
proc sort data=test (keep= Index Indicator value);
by index descending indicator ;
run;
/*Retain the FinalValue*/
data want;
set test;
by index;
retain FinalValue;
keep Index Indicator value FinalValue;
if indicator = 1 then FinalValue = value;
/*The IF statement below assigns missing (.) to groups that have no record with an indicator value of 1*/
if indicator ne 1 and first.index then FinalValue = .;
run;
Output:
Index=1 Indicator=1 value=21 FinalValue=21
Index=1 Indicator=0 value=5 FinalValue=21
Index=2 Indicator=1 value=0 FinalValue=0
Index=3 Indicator=1 value=7 FinalValue=7
Index=3 Indicator=0 value=4 FinalValue=7
Index=3 Indicator=0 value=8 FinalValue=7
Index=3 Indicator=0 value=2 FinalValue=7
Index=4 Indicator=1 value=1 FinalValue=1
Index=4 Indicator=0 value=4 FinalValue=1
Use proc sql with a left join. Select the value where indicator=1 for each index, then left join the result with the original dataset. It seems that the first row of index=3 should be 7, not 0.
proc sql;
select a.*, b.finalvalue
from test a
left join (select index, value as finalvalue from test where indicator=1) b
on a.index=b.index;
quit;
This is rather old school but should be adequate. I reckon you call it a self merge or something.
data test;
input Index Indicator value;* FinalValue;
datalines;
1 0 5 21
1 1 21 21
2 1 0 0
3 0 4 7
3 1 7 7
3 0 8 7
3 0 2 7
4 1 1 1
4 0 4 1
;;;;
run;
data final;
if 0 then set test;
merge test(where=(indicator eq 1) rename=(value=FinalValue)) test;
by index;
run;
proc print;
run;
Obs Index Indicator value FinalValue
1 1 0 5 21
2 1 1 21 21
3 2 1 0 0
4 3 0 4 7
5 3 1 7 7
6 3 0 8 7
7 3 0 2 7
8 4 1 1 1
9 4 0 4 1

SAS: How to select obs with condition

My data look like this:
rep model x Reject
1 1 1.36 1
1 2 -0.76 0
1 3 3.74 1
1 4 -0.42 0
2 1 -0.56 0
2 2 -5.78 0
2 3 -2.00 0
2 4 -3.67 0
and I want the output to look like this:
rep model x Reject
1 1 1.36 1
2 1 -0.56 0
For each rep I want just one of the 4 models where Reject=1, but if there is none, any observation will do.
Thanks!
Sort your data by REP and REJECT and take first record per REP.
Proc sort data=have;
By rep descending reject model;
Run;
Data select;
Set have;
By rep descending reject model;
If first.rep;
Run;

SAS : proc freq on 3 categorical variables

I have data with
One binary variable, poor
Two socio-demographic variables var1 and var2
I would like to have the poverty rate for each possible combination of var1 and var2 values.
But with three variables in a proc freq, I get multiple outputs, one for each value of the first variable in my table request:
proc freq data=test;
table var1*var2*poor;
run;
How can I get something close to what I would like ?
Try this
data test;
input var1 var2 poor;
cards;
1 1 1
2 3 0
3 2 1
4 1 1
1 2 1
2 3 0
4 1 0
4 2 0
3 1 1
1 2 0
3 2 0
1 3 1
3 3 0
3 3 0
3 3 1
1 1 0
2 2 0
2 2 1
2 2 1
2 1 1
2 1 1
2 1 1
;
run;
proc tabulate data=test;
class var1 var2 poor;
tables var1,
var2*poor*pctn<poor>={label="%"};
run;