Consecutive Months Data Counting in SAS - sas

I have typical banking data and need some help.
There are 3 columns: Account ID, Month Key (yyyymm format) and Payment Type. Payment Type can take the values IO, IOA, PIF, PI, P, NFD, or Null.
I have around 250,000 accounts, and the objective is to find the accounts that have Payment_Type in ("IO","IOA") for 60 or more consecutive months. Accounts whose 60 months in IO are not consecutive should not be flagged.
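data have;
/* truncover keeps Payment_Type missing on short lines instead of reading it from the next line */
infile datalines truncover;
length Account_ID $2
Month_Key 8
Payment_Type $3
;
format Month_Key date9.;
input Account_ID $ Month_Key :yymmn6. Payment_Type $;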
datalines;
A1 201001
A1 201002 IO
A1 201003 PIF
A1 201004 PI
A1 201005 P
A1 201006
A1 201007 IOA
A1 201008 IO
A1 201009 IOA
A1 201010 IOA
A1 201011 IO
A1 201012 IO
A1 201101 IO
A1 201102 IO
A1 201103 IO
A1 201104 IO
A1 201105 IO
A1 201106 IO
A1 201107 IO
A1 201108 IO
A1 201109 IO
A1 201110 IO
A1 201111 IO
A1 201112 IO
A1 201201 IO
A1 201202 IO
A1 201203 IO
A1 201204 IO
A1 201205 IO
A1 201206 IO
A1 201207 IO
A1 201208 IO
A1 201209 IO
A1 201210 IO
A1 201211 IO
A1 201212 IO
A1 201301 IO
A1 201302 IO
A1 201303 IO
A1 201304 IO
A1 201305 IO
A1 201306 IO
A1 201307 IO
A1 201308 IO
A1 201309 IO
A1 201310 IO
A1 201311 IO
A1 201312 IO
A1 201401 IO
A1 201402 IO
A1 201403 IO
A1 201404 IO
A1 201405 IO
A1 201406 IO
A1 201407 IO
A1 201408 IO
A1 201409 IO
A1 201410 IO
A1 201411 IO
A1 201412 IO
A1 201501 IO
A1 201502 IO
A1 201503 IO
A1 201504 IO
A1 201505 IO
A1 201506 IO
A1 201507 IO
A1 201508 IO
A1 201509 PIF
A1 201510 PIF
A1 201511 PIF
A1 201512 PIF
A1 201601 PIF
A1 201602 PIF
A1 201603 PIF
;
run;
This account is in IO for a period of 62 consecutive months, starting at 201007 and ending at 201508.
My final output should have the Account ID and an indicator stating whether the account has been in IO for 60+ consecutive months: 1 if so, else 0, as below.
Account_ID IO_GT_60_Mths_Ind
A1         1
Can someone please help me? Much appreciated!

Welcome to Stack Overflow! You can accomplish this using a data step, by-group processing, and the sum statement. The code below keeps a running counter n of consecutive months. The counter restarts if:
We reach a new account
The payment type is not IO or IOA
The number of months between the current month and previous month is > 1
Code:
proc sort data=have;
by account_id month_key;
run;
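data want;
set have;
by account_id month_key;
lag_month = lag(month_key);
if first.account_id then call missing(lag_month);
/* A non-IO month breaks the run, so the count drops to 0 */
if payment_type not in ('IO', 'IOA') then n = 0;
/* A new account or a gap in months starts a new run at 1 */
else if first.account_id
or intck('month', lag_month, month_key) > 1
then n = 1;
/* Otherwise extend the current run (sum statement, so n is retained) */
else n + 1;
IO_GT_60_Months_Ind = (n GE 60);
format lag_month date9.;
run;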
Your question is clear, but normally we would ask for sample code showing what you have tried. Next time, please post your data as datalines or a downloadable CSV, along with your attempt.
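The data step above writes one row per account and month, while the question asks for one row per account. A minimal sketch of one way to roll it up, assuming the want dataset and the IO_GT_60_Months_Ind variable from the step above (the output name want_by_account is made up for illustration):

```sas
/* Collapse the row-level flag to one row per account:          */
/* the account-level indicator is 1 if any row reached 60+.     */
proc sql;
  create table want_by_account as
  select account_id,
         max(IO_GT_60_Months_Ind) as IO_GT_60_Mths_Ind
  from want
  group by account_id;
quit;
```

Taking the maximum works because the row-level flag is either 0 or 1, so a single 1 anywhere in the account's history is enough.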

My 2 cents
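data want(keep=Account_ID IO_GT_60_Mths_Ind);
IO_GT_60_Mths_Ind = 0;
c = 0; /* start each account at 0; otherwise c is missing and c+1 stays missing */
do until (last.Account_ID);
set have;
by Account_ID notsorted;
/* assumes one row per month with no gaps, in month order */
c = ifn(Payment_Type in ("IO", "IOA"), c + 1, 0);
if c = 60 then IO_GT_60_Mths_Ind = 1;
end;
output;
run;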

Related

Create semi-cumulative columns based off several other columns. SAS

I've got some data which is essentially lots of columns of information/data and dates, and then two columns of numbers and a column which is a flag (i.e. it's either a 1 or a 0). Each row is information on an individual at a particular month.
For the two columns of numbers I want to create two new columns which are the cumulative numbers for each individual over time. And for the flag I want it to be 1 for all future dates for that individual once it has first become 1 for that individual.
I'm struggling to word this (and so also to google what I want to do!) so I've put what I have and what I want below. In this example: A1, B1, C1 would be one individual and A1, B2, C3 would be another individual.
I've got this:
Col1 Col2 Col3 Date      Value_1 Value_2 Flag
A1   B1   C1   01Jan2021 0       100     0
A1   B1   C1   01Feb2021 0       0       0
A1   B1   C1   01Mar2021 10      100     0
A1   B1   C1   01Apr2021 50      0       0
A1   B1   C1   01May2021 0       10      1
A1   B1   C1   01Jun2021 10      0       0
A1   B1   C1   01Jul2021 0       0       0
A1   B2   C3   01Jan2021 0       0       0
A1   B2   C3   01Feb2021 0       20      1
A1   B2   C3   01Mar2021 10      20      0
A1   B2   C3   01Apr2021 40      20      0
A1   B2   C3   01May2021 0       0       0
A1   B2   C3   01Jun2021 30      0       0
A1   B2   C3   01Jul2021 0       0       0
And I want this:
Col1 Col2 Col3 Date      Value_1_full Value_2_full Flag
A1   B1   C1   01Jan2021 0            100          0
A1   B1   C1   01Feb2021 0            100          0
A1   B1   C1   01Mar2021 10           200          0
A1   B1   C1   01Apr2021 60           200          0
A1   B1   C1   01May2021 60           210          1
A1   B1   C1   01Jun2021 70           210          1
A1   B1   C1   01Jul2021 70           210          1
A1   B2   C3   01Jan2021 0            0            0
A1   B2   C3   01Feb2021 0            20           1
A1   B2   C3   01Mar2021 10           40           1
A1   B2   C3   01Apr2021 50           60           1
A1   B2   C3   01May2021 50           60           1
A1   B2   C3   01Jun2021 80           60           1
A1   B2   C3   01Jul2021 80           60           1
I could do this if the only data I had was for a single individual, but there's lots of them. The code I've written is just giving me the total cumulative of the column - I can't figure out how to calculate them separately for each individual. I'm also struggling to write the code for the flag column for a similar reason. I've put the code below and would be very appreciative of any help/advice.
Note: I'm really new to SAS and to write this question I've struggled to get the date field in correctly by just typing out the data for this example (I've used this "Ignore" bit of the code below as a work around to get it into SAS) so if you could let me know what I've done wrong here that would also be greatly appreciated for the future!
data data_1;
input Col1 $ Col2 $ Col3 $ Date date8. Ignore Value_1 Value_2 Flag;
format Date date8.;
datalines;
A1 B1 C1 "'01Jan2021'd" 0 100 0
A1 B1 C1 "'01Feb2021'd" 0 0 0
A1 B1 C1 "'01Mar2021'd" 10 100 0
A1 B1 C1 "'01Apr2021'd" 50 0 0
A1 B1 C1 "'01May2021'd" 0 10 1
A1 B1 C1 "'01Jun2021'd" 10 0 0
A1 B1 C1 "'01Jul2021'd" 0 0 0
A1 B2 C3 "'01Jan2021'd" 0 0 0
A1 B2 C3 "'01Feb2021'd" 0 20 1
A1 B2 C3 "'01Mar2021'd" 10 20 0
A1 B2 C3 "'01Apr2021'd" 40 20 0
A1 B2 C3 "'01May2021'd" 0 0 0
A1 B2 C3 "'01Jun2021'd" 30 0 0
A1 B2 C3 "'01Jul2021'd" 0 0 0
;
run;
Data data_2;
set data_1;
drop Ignore;
run;
proc sort data=data_2
out=data_3;
by Col1 Col2 Col3 Date;
run;
data data_4;
set data_3;
by Col1 Col2 Col3 Date;
retain Col1 Col2 Col3 Date Value_1 Value_2 Flag Value_1_full Value_2_full;
if first.Col1 AND first.Col2 AND first.Col3 AND first.Date then Value_1_full = Value_1;
else Value_1_full = Value_1_full + Value_1;
run;
So you're pretty close! I think this gets there...
proc sort data=data_1(drop=ignore)
out=data_3;
by Col1 Col2 Col3 Date;
run;
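data data_4;
/* rename the incoming flag so the carried-forward Flag is a separate, retained variable */
set data_3(rename=(Flag=Flag_early));
by Col1 Col2 Col3 Date;
retain Value_1_full Value_2_full Flag;
if first.Col3 then Value_1_full = Value_1;
else Value_1_full = Value_1_full + Value_1;
if first.Col3 then Value_2_full = Value_2;
else Value_2_full = Value_2_full + Value_2;
if first.Col3 then Flag = 0;
Flag = max(Flag, Flag_early);
drop Flag_early;
run;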
Only a few small changes. I removed one pointless data step (the drop can be done anywhere else you use the data) and changed the if first. condition to first.Col3.
You don't need Col1 and Col2: first.Col3 is what you care about, because a change in either of the other two also makes first.Col3 true.
You also don't want first.Date there. first.Date is true every time the date (or any variable before it in the BY statement) changes, and that happens on every row, so it is always true.
Finally, for the flag you need to make a new variable. Variables read from a dataset are in fact always retained, but they are also overwritten with new values on every iteration. So we rename the incoming one to flag_early (or whatever you like) and use the max function to set flag to 1 whenever flag_early is 1, or to keep the 1 that flag already holds from an earlier row, resetting it whenever first.Col3 is true.
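On the side question about getting the date in: a minimal sketch, assuming the dates are typed as plain 01Jan2021 text rather than as quoted date literals. The '01Jan2021'd form is only meaningful inside SAS code, not inside datalines. Only the first two rows are shown here.

```sas
data data_1;
  /* The colon modifier reads the whole blank-delimited token and     */
  /* applies the date9. informat, so no Ignore variable is needed.    */
  input Col1 $ Col2 $ Col3 $ Date :date9. Value_1 Value_2 Flag;
  format Date date9.;
  datalines;
A1 B1 C1 01Jan2021 0 100 0
A1 B1 C1 01Feb2021 0 0 0
;
run;
```

With the data read this way there is no Ignore column, so the drop=ignore option in the sort step would have to be removed.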

Combine data from multiple rows in SAS if multiple ids including adding some vars

Hope you can help with a solution, either SQL or a data step.
I need to combine multiple rows when the customer id is the same, and append some variables holding codes as well.
I have the following static macro variables:
%let FirstColSuffix=<SomeCode1>;
%let SecondColSuffix=#<SomeCode2>;
%let ThirdColSuffix=#<SomeCode3>;
Data have;
Customerid Firstcol Secondcol Thirdcol
1 A1 A2 A3
2 B1 B2 B3
2 C1 C2 C3
2 D1 D2 D3
3 E1 E2 E3
3 F1 F2 F3
3 G1 G2 G3
3 H1 H2 H3
Data want;
Customerid Firstcol Secondcol Thirdcol Result
1 A1 A2 A3 A1<SomeCode1>A2#<SomeCode2>A3#<SomeCode3>
2 B1 B2 B3 B1<SomeCode1>B2#<SomeCode2>B3#<SomeCode3>
2 C1 C2 C3 B1<SomeCode1>B2#<SomeCode2>B3#<SomeCode3>C1<SomeCode1>C2#<SomeCode2>C3#<SomeCode3>
2 D1 D2 D3 B1<SomeCode1>B2#<SomeCode2>B3#<SomeCode3>C1<SomeCode1>C2#<SomeCode2>C3#<SomeCode3>D1<SomeCode1>D2#<SomeCode2>D3#<SomeCode3>
3 E1 E2 E3 E1<SomeCode1>E2#<SomeCode2>E3#<SomeCode3>
3 F1 F2 F3 E1<SomeCode1>E2#<SomeCode2>E3#<SomeCode3>F1<SomeCode1>F2#<SomeCode2>F3#<SomeCode3>
3 G1 G2 G3 E1<SomeCode1>E2#<SomeCode2>E3#<SomeCode3>F1<SomeCode1>F2#<SomeCode2>F3#<SomeCode3>G1<SomeCode1>G2#<SomeCode2>G3#<SomeCode3>
3 H1 H2 H3 E1<SomeCode1>E2#<SomeCode2>E3#<SomeCode3>F1<SomeCode1>F2#<SomeCode2>F3#<SomeCode3>G1<SomeCode1>G2#<SomeCode2>G3#<SomeCode3>H1<SomeCode1>H2#<SomeCode2>H3#<SomeCode3>
I only need the output row for the last occurrence of each customer id, with the data from all matching rows combined in the column "result".
So in this example I need lines 1, 4 and 8.
Can anyone help? :-)
Use retain and by-group processing. We'll continually concatenate result to itself for each row we read and carry that value forward. At the last customer ID, we'll output. At the first customer ID, result is reset.
data want;
set have;
by Customerid;
length Result $500.;
retain Result;
if(first.Customerid) then call missing(Result);
Result = cats(Result, FirstCol, "&FirstColSuffix", SecondCol, "&SecondColSuffix", ThirdCol, "&ThirdColSuffix");
if(last.Customerid);
run;
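To try the step above, the question's pseudo-data first has to become a real dataset. A minimal sketch, assuming Customerid is numeric and the other three columns are short character values; the suffix values are the placeholders from the question:

```sas
/* Placeholder suffixes copied from the question */
%let FirstColSuffix=<SomeCode1>;
%let SecondColSuffix=#<SomeCode2>;
%let ThirdColSuffix=#<SomeCode3>;

/* Sample rows from the question, already in Customerid order */
data have;
  input Customerid Firstcol $ Secondcol $ Thirdcol $;
  datalines;
1 A1 A2 A3
2 B1 B2 B3
2 C1 C2 C3
2 D1 D2 D3
3 E1 E2 E3
3 F1 F2 F3
3 G1 G2 G3
3 H1 H2 H3
;
run;
```

The BY statement in the answer expects have to be sorted by Customerid, so unsorted data would need a proc sort first.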

Sorting variables and proc report

I have a dataset of the following form:
GROUP 1 GROUP 2 TOTAL
A 400
A a1 100
A a2 100
A a3 300
B 300
B b1 400
B b2 200
C 350
C c1 100
C c2 500
GROUP 1 and GROUP 2 are character variables and TOTAL is a numeric variable. Character variables are sorted alphabetically but not by the variable TOTAL.
I would like to have it sorted within groups (GROUP 1 first) by decreasing frequency (TOTAL variable). If the same groups have the same frequency, then alphabetical order applies. So the output should look like this:
GROUP 1 GROUP 2 TOTAL
A 400
A a3 300
A a1 100
A a2 100
C 350
C c2 500
C c1 100
B 300
B b1 400
B b2 200
Is there a quick way of doing this inside the proc report procedure without changing the initial dataset? Or, if this is not possible, is there an efficient way to sort it appropriately? The only way that comes to mind is to sort each group separately and then merge the sorted datasets, but that takes too much time.
You just need to make sure every row carries everything you want to sort by. In this case you need to add two things: copy the total from the row where group2=' ' down onto every other row for that Group1, and flag those top rows so they stay on top. Then you can sort it properly.
PROC REPORT might be able to do this as well with the same want dataset, but without code showing what you're doing it's hard to provide that - but the concept is basically identical.
data have;
input GROUP1 $ GROUP2 $ TOTAL;
datalines;
A . 400
A a1 100
A a2 100
A a3 300
B . 300
B b1 400
B b2 200
C . 350
C c1 100
C c2 500
;;;;
run;
data for_sort;
set have;
retain total_group;
if missing(group2) then total_group=total;
if missing(group2) then topgroup = 1;
else topgroup = 2;
run;
proc sort data=for_sort out=want;
by descending total_group group1 topgroup descending total group2;
run;
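The same idea can also be sketched in proc sql, which avoids the helper columns by joining each row to its group's top-row total. This is an alternative to the data step and sort above, not part of the original answer, and the output name want_sql is made up for illustration:

```sas
proc sql;
  create table want_sql as
  select d.*
  from have as d
  inner join
       /* one row per GROUP1: the total on the row with no GROUP2 */
       (select group1, total as total_group
        from have
        where missing(group2)) as t
  on d.group1 = t.group1
  /* groups by descending group total, top row first, then details */
  order by t.total_group desc, d.group1,
           missing(d.group2) desc, d.total desc, d.group2;
quit;
```

SAS may write a note to the log that the query orders by items that are not in the SELECT clause; that is expected here.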

How to use SAS to count the frequency of each observation in a column as in R

I simply want to know the frequency of each value of a variable, as with table() in R. Can I do that in SAS?
I have a SAS dataset as following.
data level_score;
infile datalines;
input ID $ Level $ SCORE;
return;
datalines;
1 A2 0.2
2 A3 0.8
3 A4 0.3
4 A5 0.2
5 A6 0.2
6 A3 0.6
7 A4 0.2
8 A5 0.6
9 A6 0.2
;
run;
proc print data=level_score;
run;
I want to use SAS to know the frequency of Level and SCORE as in R table()
For variable 'Level'
A2 A3 A4 A5 A6
1 2 2 2 2
For variable 'SCORE'
0.2 0.3 0.6 0.8
5 1 2 1
The easiest way is to use proc freq as you found out.
proc freq data=level_score;
table Level;
run;
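Since the question asks for both Level and SCORE, the two can be requested in one call. A small sketch; the nocum and nopercent options are an optional addition that trims the output to plain counts, closer to what R's table() shows:

```sas
proc freq data=level_score;
  /* one one-way frequency table per listed variable */
  tables Level SCORE / nocum nopercent;
run;
```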
There are however several other ways to count frequencies. Here are just two of them.
Showing frequencies using proc sql
proc sql;
select Level,
count(*) as Freq
from level_score
group by Level;
quit;
Results:
Level Freq
A2 1
A3 2
A4 2
A5 2
A6 2
Show frequencies on the log using a data step
* First we need to sort the data by the variable of interest;
proc sort data=level_score out=Level_sorted;
by Level;
run;
* Then we use the `by` statement with a retain variable, ;
* here called "count". ;
data _null_;
set Level_sorted;
by Level;
count + 1;
if last.Level then do;
put "Frequency for Level " Level ": " count;
count = 0;
end;
run;
The log shows:
Frequency for Level A2 : 1
Frequency for Level A3 : 2
Frequency for Level A4 : 2
Frequency for Level A5 : 2
Frequency for Level A6 : 2
The latter example can easily be modified to generate a dataset containing the frequencies:
data Level_freqa;
set level_sorted;
by Level;
count + 1;
if last.Level then do;
output;
count = 0;
end;
drop ID SCORE;
run;

how to beat computational complexity in call execute(catt(datastep))

In a dataset of this kind
ID VAR_1 VAR_2 VAR_3 ...
1 a1 b1 mv ...
2 a2 b2 mv ...
3 a3 b3 c3 ...
4 a4 mv mv ...
5 a5 b5 mv ...
6 a6 b6 mv ...
where the number of variables is not known (I want to generalize my code as much as possible), I want to obtain a dataset like this (something like an inverted proc transpose):
ID VAR
1 a1
1 b1
2 a2
2 b2
3 a3
3 b3
3 c3
....
So I'm splitting the dataset into a variable number of temporary datasets, each containing ID and only one column, discarding observations with missing values; then I merge all these temporary datasets to obtain my result. And this works.
But call execute is very slow here. If I do this operation on a dataset with only one column (dropping missing values), my garbage computer takes 0.1 secs, but using call execute on a dataset with 6 columns doesn't take 0.1*6=0.6 secs; it takes some minutes. This is because it works by row rather than by column; this is SAS and I have to live with it. But I'm asking myself (and now you) whether there is some other way to obtain my result without this computation time. Here is the relevant code:
data _null_;
set old;
array try[*] VAR: ;
do i=1 to DIM(try);
call execute(catt("data var",i,"; set old; if var_",i," = ' ' then delete; allvarnew= col",i,"; drop COL:; run;" ));
end;
run;
columns are char $1 (ID is char $4).
columns are the result of a proc transpose.
thanks.
I'm not sure of the efficiency of this, but it requires only one data-step as opposed to the multiple data-steps in the call execute approach described:
data new (drop=var_: i);
set test;
array try[*] VAR_: ;
do i=1 to DIM(try);
var=try[i];
/* write one row per non-missing value, as the question asks */
if not missing(var) then output;
end;
run;