SAS: binning data


data scores;
length variables $ 16;
input variables $ low high score;
datalines;
Debt -10000 1 55
Debt 1 10000 23
MAX_NA -1 1 500
MAX_NA 1 100 -240
;
data main_data;
input ID Debt MAX_NA;
datalines;
222554 7584 12
212552 20 0
883123 500 7
913464 -200 -78
;
data end_result;
input ID Debt MAX_NA score;
datalines;
222554 7584 12 -217
212552 20 0 523
883123 500 7 -185
913464 -200 -78 555
;
Above are three data sets.
The scores data set gives each variable's score for a range of values between the low and high columns.
The second data set, main_data, holds the actual values of Debt and MAX_NA.
The end_result table is what I would like to achieve.
Which step and statements should I use to calculate the score and produce the end_result table?

Another approach is to use a double left join, like so:
data scores;
length variables $ 16;
input variables $ low high score;
datalines;
Debt -10000 1 55
Debt 1 10000 23
MAX_NA -1 1 500
MAX_NA 1 100 -240
;
data main_data;
input ID Debt MAX_NA;
sortseq = _n_;
datalines;
222554 7584 12
212552 20 0
883123 500 7
913464 -200 -78
;
proc sql;
create table end_result as
select a.ID
,a.Debt
,a.MAX_NA
,coalesce(b.score,0) + coalesce(c.score,0) as score
from main_data as a
left join scores(where=(variables="Debt")) as b
on b.low < a.Debt <= b.high
left join scores(where=(variables="MAX_NA")) as c
on c.low < a.MAX_NA <= c.high
order by a.sortseq
;
quit;
Note that I have included a sortseq variable in main_data to preserve the original row order.
Like draycut, I get the same score for IDs 222554 and 883123. For ID 913464 the MAX_NA value is outside every range in the scores data set, so I have counted it as zero by using the coalesce function. I therefore get these results:
ID Debt MAX_NA score
222554 7584 12 -217
212552 20 0 523
883123 500 7 -217
913464 -200 -78 55

Simpler:
data end_result(keep=ID Debt MAX_NA score);
set main_data;
score = 0;
do i = 1 to n;
set scores(rename=score=s) point=i nobs=n;
if variables = "Debt" and low <= Debt <= high then score + s;
else if variables = "MAX_NA" and low <= MAX_NA <= high then score + s;
end;
run;
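
One detail to watch in the step above: the two ranges for each variable share the boundary value 1, so with `low <= x <= high` a value of exactly 1 would match both rows and collect both scores. The following is a minimal sketch, assuming the scores and main_data data sets defined earlier, that uses the same half-open ranges as the SQL join (`low < x <= high`). With the sample data no value sits on a boundary, so the totals should match the SQL result.

```sas
/* Sketch: same POINT= lookup, but with half-open ranges (low < x <= high) */
/* so that a value equal to a shared boundary matches only one score row.  */
data end_result(keep=ID Debt MAX_NA score);
    set main_data;
    score = 0;
    do i = 1 to n;
        /* read score row i directly; s is the renamed score column */
        set scores(rename=(score=s)) point=i nobs=n;
        if variables = "Debt" and low < Debt <= high then score + s;
        else if variables = "MAX_NA" and low < MAX_NA <= high then score + s;
    end;
run;
```

Whether the lower or the upper bound should be the inclusive one is a choice the question does not settle, so treat the direction used here as an assumption.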

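I don't understand why IDs 222554 and 883123 should not get the same score: both have Debt in the 1 to 10000 range (23) and MAX_NA in the 1 to 100 range (-240), which gives -217 for each.
Anyway, here is an approach you can use as a template.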
data end_result;
if _N_ = 1 then do;
dcl hash h(dataset : "scores(rename=score=s)", multidata : "Y");
h.definekey("variables");
h.definedata(all : "Y");
h.definedone();
dcl hiter hi("h");
end;
set main_data;
if 0 then set scores(rename=score=s);
score = 0;
do while (hi.next() = 0);
if variables = "Debt" and low <= Debt <= high then score + s;
else if variables = "MAX_NA" and low <= MAX_NA <= high then score + s;
end;
keep id Debt max_na score;
run;
Result:
ID Debt MAX_NA score
222554 7584 12 -217
212552 20 0 523
883123 500 7 -217
913464 -200 -78 55

Related

SAS problem: sum up rows and divide till it reach a specific value

I have the following problem: I would like to sum up a column and, on every row, divide the running sum by the sum of the whole column until a specific value is reached. In pseudocode it would look like this:
data;
set auto;
sum_of_whole_column = sum(price);
subtotal[i] = 0;
i =1;
do until (subtotal[i] = 70000)
subtotal[i] = (subtotal[i] + subtotal[i+1])/sum_of_whole_column
i = i+1
end;
run;
I get an error saying that I haven't defined an array. Can I use something else instead of subtotal[i]? And how can I put a column into an array? I tried the following, but it doesn't work (auto is the data set and price is the column I want to put into an array):
data invent_array;
set auto;
array price_array {1} price;
run;
EDIT: maybe the dataset I used is helpful :)
DATA auto ;
LENGTH make $ 20 ;
INPUT make $ 1-17 price mpg rep78 ;
CARDS;
AMC Concord 4099 22 3
AMC Pacer 4749 17 3
Audi 5000 9690 17 5
Audi Fox 6295 23 3
BMW 320i 9735 25 4
Buick Century 4816 20 3
Buick Electra 7827 15 4
Buick LeSabre 5788 18 3
Cad. Eldorado 14500 14 2
Olds Starfire 4195 24 1
Olds Toronado 10371 16 3
Plym. Volare 4060 18 2
Pont. Catalina 5798 18 4
Pont. Firebird 4934 18 1
Pont. Grand Prix 5222 19 3
Pont. Le Mans 4723 19 3
;
RUN;

Perhaps I am missing your point, but your subtotal will never equal 70,000 if you divide by the sum of its column: the maximum value will be 1. Your running sum, however, can reach or exceed 70,000.
data stage1;
retain _sum 0;
set auto;
_sum = sum(_sum, price);
if _sum < 70000 then output;
run;
proc sql;
create table want as
select t1.*, t1._sum/sum(price) as subtotal
from stage1 as t1;
quit;
subtotal
0.0607268256
0.1310834235
0.2746411058
0.3679017467
0.5121261056
0.5834753107
0.6994325842
0.7851820027
1
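
If the intent was instead to divide by the total of the whole column (all 16 rows, not only the rows kept), a single data step can hold both numbers side by side. This is a sketch under that assumption, using the auto data set from the question; the names shares, total, running_sum and share are made up for illustration.

```sas
data shares;
    /* First pass, executed once: total of the whole price column. */
    if _n_ = 1 then do until (eof);
        set auto(keep=price) end=eof;
        total + price;
    end;
    /* Second pass: running sum and its share of the full total. */
    set auto;
    running_sum + price;
    share = running_sum / total;
    /* Keep rows only while the running sum is below 70,000. */
    if running_sum < 70000;
run;
```

Here share stays below 1 on the kept rows because the denominator covers the full column, while running_sum is the quantity that is compared with 70,000.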

How to calculate a cumulative sum for each observation by id

My problem is about calculating the cumulative sum for each id and each date over a sliding window of the 15 previous days. If the cumulative sum exceeds 10,000, the variable top is incremented.
Only June is processed.
Here is an example of the desired result:
id app_date  price cum   top
1  29-Jun-20 4000  4000  .
1  13-Jun-20 5000  45000 1
1  13-Jun-20 6000  40000 2
1  11-Jun-20 7000  34000 3
1  10-Jun-20 8000  27000 4
1  01-Jun-20 9000  19000 5
1  30-May-20 10000 10000 .
proc sort data = tab out= tab1;
by id descending app_date;
run;
data tab2;
set tab1;
%let annee=2020;
%let month=06;
by  id;
retain last_date date_last_d CUM;
if first.id then do;
      last_date =app_date;
      date_last_dem = app_date;
      CUM=0;
end;
if month(date_last_d) =&month. then do ;
diff= date_last_d -app_date;
CUM= price+ CUM;
end;
if diff>15 then do;
      diff = .;
      CUM =.;
      last_date =app_date;
      date_last_d = app_date;
end;
if last.id and CUM>10000 then top= top+1 ;
output;
last_date=app_date;
format last_date DDMMYY10.;
format date_last_d DDMMYY10.;
format CUM 14.2;
run;
I can do it for the first iteration but I cannot do it for all the lines.

How about this?
data have;
input Cnt Price ID App_date :ddmmyy10.;
format App_date ddmmyy10.;
datalines;
1 2265 534 30/05/2020
2 2330 4594 27/06/2020
3 1360 723 14/05/2020
4 1393 723 14/05/2020
5 2400 101666 12/06/2020
6 2411 101666 12/06/2020
7 2400 101666 11/06/2020
8 2400 101666 11/06/2020
9 2527 101666 10/06/2020
10 2536 101666 10/06/2020
11 2458 101666 04/06/2020
12 2758 1088 30/05/2020
13 4412 1056 13/06/2020
14 1870 1255 30/06/2020
15 4198 1255 14/05/2020
;
data want(drop = c k p dt);
dcl hash h(ordered : "Y");
h.definekey("c");
h.definedata("c", "p", "dt");
h.definedone();
dcl hiter i("h");
do c = 1 by 1 until (last.ID);
set have(rename=(App_Date=dt Price=p));
by ID notsorted;
h.add();
end;
do k = 1 by 1 until (last.ID);
set have;
by ID notsorted;
cum = 0;
do while (i.next() = 0);
if App_Date - 15 <= dt <= App_Date & k <= c then cum + p;
end;
if cum > 10000 then top + 1;
else top = .;
output;
end;
h.clear();
run;
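
For comparison, the 15-day window can also be written as a self-join. This is only a sketch, assuming the have data set above with Cnt unique per row; the table name cum15 is made up. It differs from the hash step in one respect that should be checked against the requirements: rows with the same ID and the same date all count toward each other, whereas the hash step only adds rows at or after the current position within the ID group. It also leaves out the top counter.

```sas
proc sql;
    create table cum15 as
    select a.Cnt, a.ID, a.App_date, a.Price,
           sum(b.Price) as cum   /* total price within the 15-day window */
    from have as a
    left join have as b
      on  a.ID = b.ID
      and b.App_date between a.App_date - 15 and a.App_date
    group by a.Cnt, a.ID, a.App_date, a.Price
    order by a.Cnt;
quit;
```

The join compares every pair of rows within an ID, so on large groups the hash approach is likely to scale better.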

Create table with frequency buckets in Base SAS

Below is a sample of my dataset:
City Days
Atlanta 10
Tampa 95
Atlanta 100
Charlotte 20
Charlotte 31
Tampa 185
I would like to break down "Days" into buckets of 0-30, 30-90, 90-180, 180+, such that the "buckets" are along the x-axis of the table, and the cities are along the y-axis.
I tried using PROC FREQ, but I don't have SAS/STAT. Is there any way to do this in base SAS?

I believe this is what you want. This is most certainly a "brute force" approach, but I think it outlines the concept correctly.
data have;
length city $9;
input city dayscount;
cards;
Atlanta 10
Tampa 95
Atlanta 100
Charlotte 20
Charlotte 31
Tampa 185
;
run;
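options validvarname=any; /* name literals such as '0-30'n need VALIDVARNAME=ANY */
data want;
set have;
/* half-open ranges so that a boundary value such as 30 lands in exactly one bucket */
if 0 <= dayscount < 30 then '0-30'n = dayscount;
else if 30 <= dayscount < 90 then '30-90'n = dayscount;
else if 90 <= dayscount < 180 then '90-180'n = dayscount;
else if dayscount >= 180 then '180+'n = dayscount;
drop dayscount;
run;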

One of the ways for solving this problem is by using Proc Format for assigning the value bucket and then using Proc Transpose for the desired result:
data city_day_split;
length city $12.;
input city dayscount;
cards;
atlanta 10
tampa 95
atlanta 100
charlotte 20
charlotte 31
tampa 185
;
run;
/****Assigning the buckets****/
proc format;
value buckets
0 - <30 = '0-30'
30 - <90 = '30-90'
90 - <180 = '90-180'
180 - high = 'gte180'
;
run;
data city_day_split;
set city_day_split;
day_bucket = put(dayscount,buckets.);
run;
proc sort data=city_day_split out=city_day_split;
by city;
run;
/****Making the Buckets as columns, City as rows and daycount as Value****/
proc transpose data=city_day_split out=city_day_split_1(drop=_name_);
by city;
id day_bucket;
var dayscount;
run;
My Output:
city      | 0-30 | 90-180 | 30-90 | GTE180
Atlanta   | 10   | 100    | .     | .
Charlotte | 20   | .      | 31    | .
Tampa     | .    | 95     | .     | 185
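
If counts per bucket are what is needed, note that PROC FREQ is part of Base SAS, so it is available without SAS/STAT. Below is a minimal sketch that reuses the same bucket idea through a format; it assumes the have data set from the first answer (variables city and dayscount), and the format name daybkt is made up for illustration.

```sas
proc format;
    value daybkt
        0   -< 30  = '0-30'
        30  -< 90  = '30-90'
        90  -< 180 = '90-180'
        180 - high = '180+';
run;

/* Cities as rows, buckets as columns; the format groups the raw values. */
proc freq data=have;
    tables city * dayscount / norow nocol nopercent;
    format dayscount daybkt.;
run;
```

Because the format does the grouping, no extra bucket variable or transpose step is needed for a plain count table.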

SAS maximum value in preceding rows

I need to calculate max(Measure) over the last 3 months for each ID and month, without using PROC SQL. I was wondering whether I could do this with the RETAIN statement, but I have no idea how to implement the condition that compares the value of Measure in the current row with the preceding two.
I will also need to do the same for windows longer than 3 months, so any solution that does not require a separate step for each additional month would be much appreciated!
Here is the data I have:
data have;
input month ID $ measure;
cards;
201501 A 0
201502 A 30
201503 A 60
201504 A 90
201505 A 0
201506 A 0
201501 B 0
201502 B 30
201503 B 0
201504 B 30
201505 B 60
;
Here the one I need:
data want;
input month ID $ measure max_measure_3m;
cards;
201501 A 0 0
201502 A 30 30
201503 A 60 60
201504 A 90 90
201505 A 0 90
201506 A 0 90
201501 B 0 0
201502 B 30 30
201503 B 0 30
201504 B 30 30
201505 B 60 60
;
And here both tables: the one I have on the left and the one I need on the right

You can do this with an array that's sized to your moving window. I'm not sure what kind of dynamic code you need in terms of windows. If you need the max for a 4- or 5-month window on top of the 3-month one, then I would recommend using PROC EXPAND instead of these methods. The documentation for PROC EXPAND has a good example of how to do this.
data want;
set have;
by id;
array _prev(0:2) _temporary_;
if first.id then
do;
call missing (of _prev(*));
count=0;
end;
count+1;
_prev(mod(count, 3))=measure;
max=max(of _prev(*));
drop count;
run;
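/* Alternative with PROC EXPAND (requires SAS/ETS); MOVMAX n is the moving maximum over n periods */
proc expand data=have out=out method=none;
by id;
id month;
convert measure = max_measure_3m / transformout=(movmax 3);
convert measure = max_measure_4m / transformout=(movmax 4);
run;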
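
To cover longer windows without a separate step per month, the array approach can take its size from a macro variable. This is a sketch, assuming the have data set from the question is sorted by ID and month and has no missing months within an ID; the macro variable w and the output name max_measure are made up for illustration.

```sas
%let w = 3;  /* window length in rows (months); hypothetical parameter */

data want;
    set have;
    by id;
    /* circular buffer holding the last &w values of measure */
    array _prev(0:%eval(&w - 1)) _temporary_;
    if first.id then do;
        call missing(of _prev(*));
        count = 0;
    end;
    count + 1;
    _prev(mod(count, &w)) = measure;   /* overwrite the oldest slot */
    max_measure = max(of _prev(*));    /* MAX ignores missing slots */
    drop count;
run;
```

Changing the `%let` line to 4 or 5 is then the only edit needed for a wider window.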

Try this:
data want(drop=l1 l2 cnt tmp);
set have;
by id;
retain cnt max_measure_3m l1 l2;
if first.id then do;
max_measure_3m = 0;
cnt = 0;
l1 = .;
l2 = .;
end;
cnt = cnt + 1;
tmp = lag(measure);
if cnt > 1 then
l1 = tmp;
tmp = lag2(measure);
if cnt > 2 then
l2 = tmp;
/* take the max of the current and two previous values; MAX ignores missing */
max_measure_3m = max(measure, l1, l2);
run;

SAS: Compute value of column under an ACROSS variable (Nested/Derived/Pseudo-Column)

I can't seem to include a computed variable in a PROC REPORT. It works fine when the computed variable is a headline column, but when it forms part of an ACROSS group, I can't get it to work. I've only got as far as referencing the columns directly, which only gives me the result for a single ACROSS group, not both.
data have1;
input username $ betdate : datetime. stake winnings winner;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90 0
player1 04NOV2008:09:03:44 100 40 1
player2 07NOV2008:14:03:33 120 -120 0
player1 05NOV2008:09:00:00 50 15 1
player1 05NOV2008:09:05:00 30 5 1
player1 05NOV2008:09:00:05 20 10 1
player2 09NOV2008:10:05:10 10 -10 0
player2 09NOV2008:10:05:40 15 -15 0
player2 09NOV2008:10:05:45 15 -15 0
player2 09NOV2008:10:05:45 15 45 1
player2 15NOV2008:15:05:33 35 -35 0
player1 15NOV2008:15:05:33 35 15 1
player1 15NOV2008:15:05:33 35 15 1
run;
PROC PRINT; RUN;
Proc rank data=have1 ties=mean out=ranksout1 groups=2;
var stake winner;
ranks stakeRank winnerRank;
run;
PROC REPORT DATA=ranksout1 NOWINDOWS out=report;
COLUMN stakerank winnerrank, (N stake=stakemean discountedstake);
DEFINE stakerank / GROUP '' ORDER=INTERNAL;
DEFINE winnerrank / ACROSS '' ORDER=INTERNAL;
DEFINE stake / analysis sum noprint;
DEFINE stakemean / analysis sum;
DEFINE discountedstake / computed format=8.2 'discountedstake';
COMPUTE discountedstake;
_C4_ = _C3_ -1;
ENDCOMP;
RUN;
I don't understand how a variable connected to an across group can be calculated. This only calculates the value of 'discountedstake' for column 'C4' and it doesn't make sense to do it again for column 7.
How can I include the value of that computed variable in each group?

PROC REPORT DATA=ranksout1 NOWINDOWS out=report;
COLUMN stakerank winnerrank, (N stake=stakemean discountedstake);
DEFINE stakerank / GROUP '' ORDER=INTERNAL;
DEFINE winnerrank / ACROSS '' ORDER=INTERNAL;
DEFINE stake / analysis sum noprint;
DEFINE stakemean / analysis sum;
DEFINE discountedstake / computed format=8.2 'discountedstake';
COMPUTE discountedstake;
_C4_ = _C3_ -1;
_C7_ = _C6_ -1;
ENDCOMP;
RUN;
You just need to mention each column you want calculated. You might be able to do this with an array if you have many of them, or do it in a data step/view ahead of time.
