SAS: Conditional sum by category - sas

Does anyone have a good solution to the following? I need to accumulate 'total' down the observations within each ID and take the value of 'dist' on the row where the cumulative sum of 'total' reaches 1000:
data DATA ;
input ID $ dist total ;
cards ;
A 1.5 600
A 2.5 500
A 3.0 200
B 2.8 1050
B 6.8 100
C 0.8 900
C 1.2 150
C 3.5 300
; run;
Desired output with the third column being optional:
A 2.5 1100
B 2.8 1050
C 1.2 1050

By-processing with retained cumulative total and output flag :
data want ;
set have ;
by ID ; /* assumes your data is sorted already by ID */
retain cumtot _out . ;
if first.ID then call missing(cumtot,_out) ;
cumtot + total ;
if cumtot >= 1000 and not _out then do ;
_out = 1 ; /* set flag so we don't output further records for this ID */
output ;
end ;
drop _: ;
run ;
I'd also refrain from naming a dataset "data".
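To try the step above against the question's data, the input has to exist under the name the answer reads from and be sorted by ID. A minimal sketch, assuming the data set is called have (the name in the answer's SET statement; the question called it DATA):

```sas
/* Question's sample data, loaded as HAVE (name assumed from the answer) */
data have ;
  input ID $ dist total ;
  datalines ;
A 1.5 600
A 2.5 500
A 3.0 200
B 2.8 1050
B 6.8 100
C 0.8 900
C 1.2 150
C 3.5 300
;
run ;

/* BY-group processing needs the rows grouped by ID */
proc sort data=have ;
  by ID ;
run ;
```

With this input the step should write one row per ID (dist 2.5 for A, 2.8 for B, 1.2 for C), with cumtot holding 1100, 1050 and 1050.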

Inner product between a table and a column in sas

I would like to calculate an inner product between a table (matrix) and a column (vector). Below are the sample datasets.
DATA tempdf ;
input v1 v2 v3 ;
datalines ;
1 2 3
2 4 6
3 6 9
4 8 12
5 10 15
;
RUN ;
DATA testcoef ;
length Variable $3. ;
input Variable $ coef ;
datalines ;
v1 0.1
v2 0.2
v3 0.3
;
RUN ;
I want to calculate v1*0.1 + v2*0.2 + v3*0.3 for each row. The final result should look like:
1.4
2.8
4.2
5.6
7
as a column.
The values are calculated as follows:
1*0.1 + 2*0.2 + 3*0.3 = 1.4
2*0.1 + 4*0.2 + 6*0.3 = 2.8
3*0.1 + 6*0.2 + 9*0.3 = 4.2
4*0.1 + 8*0.2 + 12*0.3 = 5.6
5*0.1 + 10*0.2 + 15*0.3 = 7
THANKS.
I have tried using PROC TRANSPOSE on the tempdf dataset, merging in the coef column from the testcoef dataset, multiplying every column by coef with an array, and finally summing all the columns.
But this process is very slow when the dataset has many rows. Is there a smarter or faster way to do it?
PROC TRANSPOSE data = tempdf out = temptrans name = Variable;
var _all_ ;
RUN ;
PROC SQL ;
create table trans_coef as
select a.*, b.Coef
from temptrans as a
left join testcoef as b
on a.Variable = b.Variable ;
QUIT ;
DATA out1 ;
set trans_coef ;
array colarr COL: ;
do over colarr ;
colarr = colarr * coef ;
end ;
RUN ;
PROC MEANS data = out1 sum;
var col: ;
output out = out1_score(drop = _TYPE_ _FREQ_) sum = ;
RUN;
PROC TRANSPOSE data = out1_score out = final_out name = Cust;
var COL: ;
RUN ;
The final_out table will look like this:
| Cust | COL1
-----------------
1 | COL1 | 1.4
2 | COL2 | 2.8
3 | COL3 | 4.2
4 | COL4 | 5.6
5 | COL5 | 7
Use PROC SCORE. The coefficients go into the SCORE= dataset as a single row with one column per variable:
DATA tempdf ;
input v1 v2 v3 ;
datalines ;
1 2 3
2 4 6
3 6 9
4 8 12
5 10 15
;
RUN ;
DATA testcoef ;
retain _TYPE_ 'SCORE';
length Variable $32;
input Variable v1-v3;
rename variable=_NAME_;
datalines ;
New 0.1 0.2 0.3
;
RUN ;
proc print;
run;
proc score data=tempdf score=testcoef out=t3;
var V:;
run;
proc print;
run;
See if this works for you
proc sql noprint;
select count(*) into :d separated by ' '
from testcoef;
quit;
data want(keep = value);
if _N_ = 1 then do i = 1 by 1 until (z);
set testcoef end = z;
array c {&d.} _temporary_;
c{i} = coef;
end;
set tempdf;
array v v1 - v&d.;
do over v;
value = sum(value, v*c{_i_});
end;
run;
Result:
value
1.4
2.8
4.2
5.6
7.0

Is there a better way to segment a numeric column into uniform sets than Case/When?

I have a column for dollar-amount that I need to break apart into $1000 segments - so $0-$999, $1,000-$1,999, etc.
I could use Case/When, but there are an awful lot of groups I would have to make.
Is there a more efficient way to do this?
Thanks!
You could just use arithmetic. For example, you could convert each amount to the upper limit of its $1,000 range.
up_to = 1000*ceil(dollar/1000);
Let's make up some example data:
data test;
do dollar=0 to 5000 by 500 ;
up_to = 1000*ceil(dollar/1000);
output;
end;
run;
Results:
Obs dollar up_to
1 0 0
2 500 1000
3 1000 1000
4 1500 2000
5 2000 2000
6 2500 3000
7 3000 3000
8 3500 4000
9 4000 4000
10 4500 5000
11 5000 5000
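Note that ceil puts an amount of exactly 1000 in the same group as 500. If the segments should be $0-$999, $1,000-$1,999 and so on, as in the question, floor gives the lower bound of the segment instead. A sketch using the same made-up data; seg_start and seg_end are names chosen here for illustration:

```sas
/* Sketch: label each amount with the bounds of its $1,000 segment */
data test_floor;
  do dollar=0 to 5000 by 500 ;
    seg_start = 1000*floor(dollar/1000); /* 0, 1000, 2000, ... */
    seg_end   = seg_start + 999;         /* 999, 1999, 2999, ... */
    output;
  end;
run;

proc print data=test_floor;
run;
```

Here 0 and 500 get seg_start=0, while 1000 and 1500 get seg_start=1000.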
Absolutely. This is a great use case for user-defined formats.
proc format;
value segment
0-<1000 = '0-1000'
1000-<2000 = '1000s'
2000-<3000 = '2000s'
;
quit;
If there are too many ranges to write out by hand, generate them with code!
data segments;
retain
fmtname 'segment'
type 'n' /* numeric format */
eexcl 'Y' /* exclude the "end" match, so 0-1000 excluding 1000 itself */
;
do start = 0 to 1e6 by 1000;
end = start + 1000;
label = catx('- <',start,end); * what you want this to show up as;
output;
end;
run;
proc format cntlin=segments;
quit;
Then you can use segment = put(dollaramt,segment.); to assign the value of segment, or just apply the format format dollaramt segment.; if you're just using it in PROC SUMMARY or somesuch.
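As a rough illustration of applying a format without creating a new variable, the sketch below defines a small hand-written format and lets PROC FREQ group by the formatted value. The names segdemo and amounts are made up for this example, so the segment format built above is not overwritten:

```sas
/* Hypothetical three-range format, same idea as SEGMENT above */
proc format;
  value segdemo
    0-<1000    = '0 - <1000'
    1000-<2000 = '1000 - <2000'
    2000-<3000 = '2000 - <3000'
  ;
run;

/* Made-up amounts to bin */
data amounts;
  input dollaramt @@;
  datalines;
250 999 1000 1500 2999
;
run;

/* PROC FREQ groups by the formatted value, so no extra column is needed */
proc freq data=amounts;
  tables dollaramt;
  format dollaramt segdemo.;
run;
```

250 and 999 fall under the first label, 1000 and 1500 under the second, and 2999 under the third.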
And you can combine the two approaches above to generate a User Defined Format that will bin the amounts for you.
1. Create bins to set up a user defined format. One drawback of this method is that it requires you to know the range of the data ahead of time.
2. Use a user defined function via PROC FCMP.
3. Use a manual calculation.
I illustrate options 1 and 3 below. Option 2 requires PROC FCMP, and I think a plain data step is simpler.
data thousands_format;
fmtname = 'thousands_fmt';
type = 'N';
do Start = 0 to 10000 by 1000;
END = Start + 1000 - 1;
label = catx(" - ", put(start, dollar12.0), put(end, dollar12.0));
output;
end;
run;
proc format cntlin=thousands_format;
run;
data demo;
do i=100 to 10000 by 50;
custom_format = put(i, thousands_fmt.);
manual_format = catx(" - ", put(floor(i/1000)*1000, dollar12.0), put(floor(i/1000)*1000 + 999, dollar12.0)); /* floor for both bounds, so exact multiples of 1000 get a valid range */
output;
end;
run;

Multiple operations on a single value in SAS?

I'm trying to create a column that applies two different interest rates based on each customer's cumulative purchases. I was thinking I'd need a do while statement, but I'm not entirely sure. :S
This is what I've got so far, but I don't know how to get it to perform two operations on one value: apply one interest rate up to, say, 4000, and then apply the other interest rate to the amount above 4000.
data cards;
set sortedccards;
by Cust_ID;
if first.Cust_ID then cp=0;
cp+Purchase;
if cp<=4000 then cb=(cp*.2);
if cp>4000 then cb=(cp*.2)+(cp*.1);
format cp dollar10.2 cb dollar10.2;
run;
What I'd like my output to look like.
You will want to also track the prior cumulative purchase in order to detect when a purchase causes the cumulative to cross the threshold (or breakpoint) $4,000. Breakpoint crossing purchases would be split into pre and post portions for different bonus rates.
Example:
Program flow causes retained variable pcp to act like a LAGged variable.
data have;
input id $ p;
datalines;
C001 1000
C001 2300
C001 2000
C001 1500
C001 800
C002 6200
C002 800
C002 300
C003 2200
C003 1700
C003 2500
C003 600
;
data want;
set have;
by id;
if first.id then do;
cp = 0;
pcp = 0; retain pcp; /* prior cumulative purchase */
end;
cp + p; /* sum statement causes cp to be implicitly retained */
* break point is 4,000;
if (cp > 4000 and pcp > 4000) then do;
* entire purchase in post breakpoint territory;
b = 0.01 * p;
end;
else
if (cp > 4000) then do;
* split purchase into pre and post breakpoint portions;
b = 0.10 * (4000 - pcp) + 0.01 * (p - (4000 - pcp));
end;
else do;
* entire purchase in pre breakpoint territory;
b = 0.10 * p;
end;
* update prior for next implicit iteration;
pcp = cp;
run;
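To make the split branch concrete, here is the arithmetic for C001's third purchase (2000, arriving when the prior cumulative is 1000 + 2300 = 3300), written as a throwaway step. The variables pre and post are introduced only for this illustration:

```sas
/* Worked example of the breakpoint-crossing branch for one purchase */
data _null_;
  pcp  = 3300;          /* cumulative before the purchase */
  p    = 2000;          /* purchase that crosses 4,000 */
  pre  = 4000 - pcp;    /* 700 is still below the breakpoint */
  post = p - pre;       /* 1300 is above the breakpoint */
  b    = 0.10*pre + 0.01*post;  /* 70 + 13 */
  put pre= post= b=;
run;
```

The log line should show pre=700, post=1300 and b=83, which is the same value the split formula in the step above produces for that row.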
Here is a fairly straightforward solution which is not optimized but works. We calculate the cumulative purchases and cumulative bonus at each step (which can be done quite simply), and then calculate the current period bonus as cumulative bonus minus previous cumulative bonus.
This is assuming that the percentage is 20% up to $4000 and 30% over $4000.
data have;
input id $ period MMDDYY10. purchase;
datalines;
C001 01/25/2019 1000
C001 02/25/2019 2300
C001 03/25/2019 2000
C001 04/25/2019 1500
C001 05/25/2019 800
C002 03/25/2019 6200
C002 04/25/2019 800
C002 05/25/2019 300
C003 02/25/2019 2200
C003 03/25/2019 1700
C003 04/25/2019 2500
C003 05/25/2019 600
;
run;
data want (drop=cumul_bonus);
set have;
by id;
retain cumul_purchase cumul_bonus;
if first.id then call missing(cumul_purchase,cumul_bonus);
** Calculate total cumulative purchase including current purchase **;
cumul_purchase + purchase;
** Calculate total cumulative bonus including current purchase **;
cumul_bonus = (0.2 * cumul_purchase) + ifn(cumul_purchase > 4000, 0.1 * (cumul_purchase - 4000), 0);
** Bonus for current purchase = total cumulative bonus - previous cumulative bonus **;
bonus = ifn(first.id,cumul_bonus,dif(cumul_bonus));
format period MMDDYY10.
purchase cumul_purchase bonus DOLLAR10.2
;
run;
proc print data=want;
run;
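As a quick check of the cumulative-difference idea, the sketch below evaluates the same cumulative bonus expression for C002 after its first purchase (6200) and after its second (6200 + 800 = 7000). It is only an illustration of the arithmetic, not part of the solution:

```sas
/* Cumulative bonus at two points in time for C002 */
data _null_;
  do cumul_purchase = 6200, 7000;
    cumul_bonus = (0.2 * cumul_purchase)
                + ifn(cumul_purchase > 4000, 0.1 * (cumul_purchase - 4000), 0);
    put cumul_purchase= cumul_bonus=;
  end;
run;
```

The two cumulative bonuses are 1460 and 1700, so the bonus attributed to the 800 purchase is 240, i.e. 30% of 800, as expected for a purchase entirely above $4000.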

Is there a way to simply transpose data in SAS

I'm a newbie to SAS. I am trying to document the table structure of the 50+ data sets and so I want to just take the top 5 rows from each data set and output it on console. However, since many of these data sets have many columns I would like to transpose them. I tried to use proc transpose but apparently it doesn't just flip the results and keeps dropping columns.
For example, the following code only produce results with MSGID and LINENO only...
proc print data=sashelp.smemsg;
run;
proc transpose data=sashelp.smemsg out=work.test;
run;
proc print data=work.test;
run;
Update:
I think it didn't work because SAS doesn't know how to "normalize" the data types after the transformation. I would like to do something similar to this in R, where all numbers become strings.
> df <- data.frame(x=11:20, y=letters[1:10])
> df
x y
1 11 a
2 12 b
3 13 c
4 14 d
5 15 e
6 16 f
7 17 g
8 18 h
9 19 i
10 20 j
> t(df)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"
y "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
To quickly look at the data in a SAS dataset I normally just use a PUT statement and look at the log.
data _null_;
set have (obs=5);
put (_all_) (=/);
run;
If you just want to transpose the data then use PROC TRANSPOSE. You need to specify the variables or you will only get the numeric ones.
proc transpose data=have (obs=5) out=want ;
var _all_ ;
run;
proc print data=want ;
run;
Here is roughly how to do it.
Generate sample dataset having 25 rows, 6 numeric vars and 6 string vars
data sample;
array num_col_(6);
array str_col_(6) $;
do row_number = 1 to 25;
do col_number = 1 to 6;
num_col_(col_number) = round(ranuni(0),.01);
str_col_(col_number) = byte(ceil(ranuni(0)*10)+97);
end;
output;
end;
drop row_number col_number;
run;
Transpose data, keeping only 5 first rows
proc transpose data=sample(obs=5) prefix=row
out=sample_tr(rename=(_name_=column));
var num_col_: str_col_:;
/* You could also use keywords on the var statement */
* var _character_ _numeric_; * Lets you decide which type to show first;
* var _all_; * Keeps original order of variables;
run;
Show the results
proc print data=sample_tr noobs;
id column;
var row1-row5;
run;
Results
column row1 row2 row3 row4 row5
--------- ---- ---- ---- ---- ----
num_col_1 0.66 0.96 0.85 0.45 0.32
num_col_2 0.78 0.79 0.64 0.85 0.74
num_col_3 0.23 0.62 0.46 0.46 0.51
num_col_4 0.91 0.15 0.16 0.77 0.13
num_col_5 0.6 0.48 0.32 0.6 0.77
num_col_6 0.13 0.76 0.67 0.16 0.67
str_col_1 c i i i c
str_col_2 j k f f c
str_col_3 e g k h i
str_col_4 b h d k e
str_col_5 c h f e f
str_col_6 i b k i f

Ranking values based on another data set in SAS

Say I have two data sets A and B that have identical variables and want to rank values in B based on values in A, not B itself (as "PROC RANK data=B" does.)
Here's a simplified example of data sets A, B and want (the desired output):
A:
obs_A VAR1 VAR2 VAR3
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
B:
obs_B VAR1 VAR2 VAR3
1 15 150 2234
2 14 352 1555
3 36 251 1000
4 41 350 2011
5 60 553 5012
want:
obs VAR1 VAR2 VAR3
1 2 2 3
2 2 4 2
3 4 3 1
4 5 4 3
5 6 6 6
I came up with a macro loop that involves PROC RANK and PROC APPEND, like below:
%macro MyRank(A,B);
data AB; set &A &B; run;
%do i=1 %to 5;
proc rank data=AB(where=(obs_A ne . OR obs_B=&i)) out=tmp;
var VAR1-VAR3;
run;
proc append base=want data=tmp(where=(obs_B=&i) rename=(obs_B=obs)); run;
%end;
%mend;
This is ok when the number of observations in B is small. But when it comes to very large number, it takes so long and thus wouldn't be a good solution.
Thanks in advance for suggestions.
I would create formats to do this. What you're really doing is defining ranges via A that you want to apply to B. Formats are very fast - here assuming "A" is relatively small, "B" can be as big as you like and it's always going to take just as long as it takes to read and write out the B dataset once, plus a couple read/writes of A.
First, reading in the A dataset:
data ranking_vals;
input obs_A VAR1 VAR2 VAR3;
datalines;
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
;;;;
run;
Then transposing it to vertical, as this will be the easiest way to rank them (just plain old sorting, no need for proc rank).
data for_ranking;
set ranking_vals;
array var[3];
do _i = 1 to dim(var);
var_name = vname(var[_i]);
var_value = var[_i];
output;
end;
run;
proc sort data=for_ranking;
by var_name var_value;
run;
Then we create a format input dataset, and use the rank as the label. The range is (previous value -> current value), and label is the rank. I leave it to you how you want to handle ties.
data for_fmt;
set for_ranking;
by var_name var_value;
retain prev_value;
if first.var_name then do; *initialize things for a new varname;
rank=0;
prev_value=.;
hlo='l'; *first record has 'minimum' as starting point;
end;
rank+1;
fmtname=cats(var_name,'F');
start=prev_value;
end=var_value;
label=rank;
output;
if last.var_name then do; *For last record, some special stuff;
start=var_value;
end=.;
hlo='h';
label=rank+1;
output; * Output that 'high' record;
start=.;
end=.;
label=.;
hlo='o';
output; * And a "invalid" record, though this should never happen;
end;
prev_value=var_value; * Store the value for next row.;
run;
proc format cntlin=for_fmt;
quit;
And then we test it out.
data test_b;
input obs_B VAR1 VAR2 VAR3;
var1r=put(var1,var1f.);
var2r=put(var2,var2f.);
var3r=put(var3,var3f.);
datalines;
1 15 150 2234
2 14 352 1555
3 36 251 1000
4 41 350 2011
5 60 553 5012
;;;;
run;
One way that you can rank by a variable from a separate dataset is by using proc sql's correlated subqueries. Essentially, for each value in the data to be ranked, you count the distinct values in the lookup dataset that are less than or equal to it, and add one.
proc sql;
create table want as
select
B.obs_B,
(
select count(distinct A.Var1) + 1
from A
where A.var1 <= B.var1
) as var1
from B;
quit;
This can be wrapped in a macro. Below, a macro loop is used to write each of the subqueries. It loops through the list of variables and parametrises the subquery as required.
%macro rankBy(
inScore /*Dataset containing data to be ranked*/,
inLookup /*Dataset containing data against which to rank*/,
varID /*Variable by which to identify an observation*/,
varsRank /*Space separated list of variable names to be ranked*/,
outData /*Output dataset name*/);
/* Rank variables in one dataset by identically named variables in another */
proc sql;
create table &outData. as
select
scr.&varID.
/* Loop through each variable to be ranked */
%do i = 1 %to %sysfunc(countw(&varsRank., %str( )));
/* Store the variable name in a macro variable */
%let var = %scan(&varsRank., &i., %str( ));
/* Rank: count all the rows with lower value in lookup */
, (
select count(distinct lkp&i..&var.) + 1
from &inLookup. as lkp&i.
where lkp&i..&var. <= scr.&var.
) as &var.
%end;
from &inScore. as scr;
quit;
%mend rankBy;
%rankBy(
inScore = B,
inLookup = A,
varID = obs_B,
varsRank = VAR1 VAR2 VAR3,
outData = want);
Regarding speed, this will be slow if your A is large, but should be okay for large B and small A.
In rough testing on a slow PC I saw:
A: 1e1 B: 1e6 time: ~1s
A: 1e2 B: 1e6 time: ~2s
A: 1e3 B: 1e6 time: ~5s
A: 1e1 B: 1e7 time: ~10s
A: 1e2 B: 1e7 time: ~12s
A: 1e4 B: 1e6 time: ~30s
Edit:
As Joe points out below, the length of time the query takes depends not just on the number of observations in the dataset, but on how many unique values exist within the data. Apparently SAS performs optimisations to reduce the comparisons to only the distinct values in B, thereby reducing the number of times the elements in A need to be counted. This means that if dataset B contains a large number of unique values (in the ranking variables), the process will take significantly longer than the times shown. This is more likely to happen if your data is not integers, as Joe demonstrates.
Edit:
Runtime test rig:
data A;
input obs_A VAR1 VAR2 VAR3;
datalines;
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
;
run;
data B;
do obs_B = 1 to 1e7;
VAR1 = ceil(rand("uniform")* 60);
VAR2 = ceil(rand("uniform")* 500);
VAR3 = ceil(rand("uniform")* 6000);
output;
end;
run;
%let start = %sysfunc(time());
%rankBy(
inScore = B,
inLookup = A,
varID = obs_B,
varsRank = VAR1 VAR2 VAR3,
outData = want);
%let time = %sysfunc(putn(%sysevalf(%sysfunc(time()) - &start.), time12.2));
%put &time.;
Output:
0:00:12.41