SAS: Improve efficiency of a cross join

In my project I am combining three unique input sources to generate one score. Imagine this formula:
Integrated score = weight_1 * Score_1 + weight_2 * Score_2 + weight_3 * Score_3
To do this, I have utilised the following code:
DATA w_matrix_t;
    /*Create a row count to identify the model weight combination*/
    RETAIN model_combination;
    model_combination = 0;
    DO n_1 = 0 TO 100 BY 1;
        DO n_2 = 0 TO 100 BY 1;
            IF (100 - n_1 - n_2) ge 0 AND (100 - n_1 - n_2) le 100 THEN DO;
                n_3 = 100 - n_1 - n_2;
                model_combination + 1;
                OUTPUT;
            END;
        END;
    END;
RUN;

DATA w_matrix;
    SET w_matrix_t;
    w_1 = n_1/100;
    w_2 = n_2/100;
    w_3 = n_3/100;
    /*Drop the old variables*/
    DROP n_1 n_2 n_3;
RUN;
PROC SQL;
CREATE TABLE weights_added AS
SELECT
w.model_combination
, w.w_1
, w.w_2
, w.w_3
, fit.name
, fit.logsalary
, (
w.w_1*fit.crhits +
w.w_2*fit.natbat +
w.w_3*fit.nbb
) AS y_hat_int
FROM
work.w_matrix AS w
CROSS JOIN
sashelp.baseball AS fit
ORDER BY
model_combination;
QUIT;
My question is: is there a more efficient way of making this join? The purpose is to create a large table that contains the entire sashelp.baseball dataset duplicated for every combination of weights.
In my live data, I have three input sources of 46,000 observations each, and that cross join takes 1 hour. I also have three input sources of 465,000 observations each; I imagine this will take a very long time.
The reason I do it this way is that I calculate my Somers' D using PROC FREQ and BY-group processing (BY model_combination).
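For reference, the downstream step looks roughly like this (a sketch only; it assumes y_hat_int is crossed with the response logsalary, with the MEASURES option computing the rank statistics and SMDCR writing Somers' D C|R to the output data set):
proc freq data=weights_added noprint;
    by model_combination;
    tables y_hat_int * logsalary / measures;
    output out=somers_d smdcr;
run;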

5,000 copies of a 500,000-row table will be a pretty big table, with 2.5B rows.
Here is an example of data step stacking: one copy of the have data set per row of weights. The example features SET weights to process each weight (via the implicit loop) and SET have POINT= / OUTPUT inside an explicit inner loop. The inner loop copies the data while it computes the weighted sum.
data have;
    set sashelp.baseball (obs=200); * keep it small for demonstration;
run;

data weights (keep=comboId w1 w2 w3);
    do i = 0 to 100;
        do j = 0 to 100;
            if (i + j) <= 100 then do;
                comboId + 1;
                w1 = i / 100;
                w2 = j / 100;
                w3 = (100 - i - j) / 100;
                output;
            end;
        end;
    end;
run;

data want (keep=comboId w1-w3 name logsalary y_hat_int);
    do while (not endOfWeights);
        set weights end = endOfWeights;
        do row = 1 to RowsInHave;
            set have (keep=name logsalary crhits natbat nbb) nobs = RowsInHave point = row;
            y_hat_int = w1 * crhits + w2 * natbat + w3 * nbb;
            output;
        end;
    end;
    stop;
run;

proc freq data=want noprint;
    by comboId;
    table y_hat_int / out=freqout;
    format y_hat_int 4.;
run;

proc contents data=want;
run;
Off the cuff, a single table containing 5,151 copies of a 200-row extract from baseball is nominally 72.7 MB, so expect 5,151 copies of a 465K-row table to have ~2.4G rows and take ~170 GB of disk. On a 7,200 RPM disk achieving maximum performance throughout, you're looking at, at best, 20 minutes of just writing, and probably much more.
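To sanity-check those figures, here is the back-of-the-envelope arithmetic as a quick DATA _NULL_ step (using only the numbers quoted above):
data _null_;
    bytes_per_row = 72.7e6 / (5151 * 200);      * roughly 70.6 bytes per row in the demo table;
    rows          = 5151 * 465000;              * roughly 2.4 billion rows;
    gigabytes     = rows * bytes_per_row / 1e9; * roughly 169 GB of disk;
    put rows= comma20. gigabytes= comma12.1;
run;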

Related

How to replace a join with a datastep in SAS

I am trying to calculate some statistics for a given variable based on the client id and the time horizon. My current solution is shown below; however, I would like to know if there is a way to reformulate the code as a data step instead of an SQL join, because the join takes a very long time to execute on my real dataset.
data have1(drop=t);
    id = 1;
    dt = '31dec2020'd;
    do t=1 to 10;
        dt = dt + 1;
        var = rand('uniform');
        output;
    end;
    format dt ddmmyyp10.;
run;

data have2(drop=t);
    id = 2;
    dt = '31dec2020'd;
    do t=1 to 10;
        dt = dt + 1;
        var = rand('uniform');
        output;
    end;
    format dt ddmmyyp10.;
run;

data have_fin;
    set have1 have2;
run;
Proc sql;
create table want1 as
select a.id, a.dt,a.var, mean(b.var) as mean_var_3d
from have_fin as a
left join have_fin as b
on a.id = b.id and intnx('day',a.dt,-3,'S') < b.dt <= a.dt
group by 1,2,3;
Quit;
Proc sql;
create table want2 as
select a.id, a.dt,a.var, mean(b.var) as mean_var_3d
from have_fin as a
left join have_fin as b
on a.id = b.id and intnx('day',a.dt,-6,'S') < b.dt <= a.dt
group by 1,2,3;
Quit;
Use temporary arrays and a single data step instead. This does the same thing in a single pass:
1. Sort the data to ensure the order is correct.
2. Declare a temporary array for each moving average you want to calculate.
3. Ensure the arrays are empty at the start of each ID.
4. Assign values to the array at the correct index. MOD() lets you index the array cyclically without having to keep a separate counter variable.
5. Take the average of the array. If you want to ignore the first one or two values within an ID, because the array holds only 1 or 2 data points at that stage, you can calculate this conditionally as well.
*sort to ensure data is in correct order (Step 1);
proc sort data=have_fin;
by id dt;
run;
data want;
    * Step 2;
    array p3{0:2} _temporary_;
    array p6{0:5} _temporary_;
    set have_fin;
    by ID;
    * Step 3 - clear values at the start of each ID;
    if first.ID then call missing(of p3{*}, of p6{*});
    * Step 4 - assign the value to the array. MOD() indexes the array so it always holds the most recent 3/6 values;
    p3{mod(_n_, 3)} = var;
    p6{mod(_n_, 6)} = var;
    * Step 5 - calculate the statistic of interest, the average in this case;
    mean3d = mean(of p3{*});
    mean6d = mean(of p6{*});
run;
And if you have SAS/ETS licensed, this is super trivial.
*prints product to log - check if you have SAS/ETS licensed;
proc product_status;run;
*sorts data;
proc sort data=have_fin;
by id dt;
run;
*calculates moving average;
proc expand data=have_fin out=want_expand;
by ID;
id dt;
convert var = mean_3d / method=none transformout= (movave 3);
convert var = mean_6d / method=none transformout= (movave 6);
run;

In proc tabulate: access multiple variable values at once

I want to use proc tabulate to display the proportions/total count of some variables conditioned on their values. If I only want to access one variable at a time, I can achieve this by setting a specific format (please check the MWE). However, if I want to access two variables at once (like score_1 >= x OR score_2 >= y), I get nowhere.
MWE:
data have;
input score_1 score_2;
datalines;
2 7
4 4
7 2
;
proc format;
value cut_fmt
low - 2 = 'non-critical'
3 - high = 'critical'
;
run;
proc tabulate data = have missing;
format score_1 score_2 cut_fmt.;
class score_1 score_2;
keylabel N = ' ' ColPCTN = '%' all = 'Total';
table score_1 * N score_2 * N, all;
run;
I now would like to have an extra row where it says:
score_1 OR score_2
non-critical 0
critical 3
Is there a way to achieve this in proc tabulate (I guess this shouldn't be possible with a format statement), or would I need to create a new variable beforehand and just use that one?
Create a new variable and add it to the table statement.
You will also need classdata to ensure 0 counts are presented.
Example:
data have;
input score_1 score_2;
worst_case = max(score_1, score_2);
label worst_case = 'score_1 or score_2';
datalines;
2 7
4 4
7 2
;
proc format;
value cut_fmt
low - 2 = 'non-critical'
3 - high = 'critical'
;
data combinations;
score_1 = 2;
score_2 = 2;
worst_case = 2; output;
score_1 = 3;
score_2 = 3;
worst_case = 3; output;
run;
options missing='0';
proc tabulate data = have classdata=combinations;
class score_1 score_2 worst_case;
format score_1 score_2 worst_case cut_fmt.;
keylabel N = ' ' ColPCTN = '%' all = 'Total';
table score_1 score_2 worst_case, all;
run;
options missing='.';

Left join a bucket value based on a greater than clause

I am looking to create an optimal bucketing macro. My first obstacle is to create equidistant buckets. I am using the sashelp.baseball dataset as an example.
I take the range of logsalary and divide this by 100 to create the distance between each bucket. Then I would like to assign the logsalary column a bucket value if the logsalary is smaller than the bucket value
The code I have tried is attached. I am hoping to be able to join or merge on the bucket limit values and use a greater than or smaller than clause to append a bucket value
/*Sort the baseball dataset by smallest to largest, removing any missing data*/
PROC SORT
    DATA = sashelp.baseball
        (KEEP = logsalary
         WHERE = (NOT MISSING(logsalary)))
    OUT = baseball;
    BY logsalary;
RUN;

/*Identify the size of each bucket by splitting the range into 100 equidistant buckets*/
DATA _NULL_;
    RETAIN bin_size;
    SET baseball END = EOF;
    IF _N_ = 1 THEN DO;
        bin_size = logsalary;
        CALL SYMPUT("min_bin",logsalary);
    END;
    IF EOF THEN DO;
        bin_size = ((logsalary - bin_size) / 100);
        CALL SYMPUT("bin_size",bin_size);
    END;
RUN;

/*Create a vector to identify each bucket range*/
DATA bin_levels;
    DO bin = 1 TO 100;
        IF bin = 1 THEN DO;
            bin_level = &min_bin.;
            OUTPUT;
        END;
        ELSE DO;
            bin_level = &min_bin. + &bin_size. * bin;
            OUTPUT;
        END;
    END;
RUN;
/*Append a bucket number based on the logsalary being smaller than the next bucket value*/
PROC SQL;
CREATE TABLE binned_data AS
SELECT
a.*
, b.bin
, b.bin_level
FROM
baseball a
LEFT JOIN
bin_levels b ON b.bin_level > a.logsalary
;
QUIT;
I would like the first ten rows to look like this:
logSalary bin
4.2121275979 1
4.2195077052 1
4.248495242 1
4.248495242 1
4.248495242 1
4.248495242 1
4.248495242 1
4.3174881135 2
4.3174881135 2
4.3174881135 2
...
Thanks in advance
EDIT: for now, I am going to go with this solution:
DATA bucketed_data;
    RETAIN bin bin_limit;
    SET baseball;
    IF _n_ = 1 THEN DO;
        bin_limit = logsalary;
        bin = 1;
    END;
    IF logsalary > bin_limit THEN DO;
        bin_limit + &bin_size.;
        bin + 1;
    END;
RUN;
No need for macro variables; put the values into a dataset and combine that dataset with the one you want to bin. Let's use 10 bins instead of 100 to make it easier to examine the results.
First find the minimum and range:
proc means n min max data=sashelp.baseball;
var logsalary;
output out=stats(keep=min range) min=min range=range;
run;
Then use those to bin the data:
DATA bucketed_data;
    SET sashelp.baseball (keep=logsalary);
    if _n_=1 then set stats;
    if not missing(logsalary) then do bin=1 to 10 while(logsalary > min+bin*(range/10));
        * nothing to do here - the empty DO loop just advances BIN until logsalary fits;
    end;
run;
Let's use PROC MEANS to see how it worked.
proc means n min max ;
class bin / missing;
var logsalary;
run;

SAS summary statistic from a dataset

The dataset looks like this:
colx coly colz
0 1 0
0 1 1
0 1 0
Required output:
Colname value count
colx 0 3
coly 1 3
colz 0 2
colz 1 1
The following code works perfectly...
ods output onewayfreqs=outfreq;
proc freq data=final;
tables colx coly colz / nocum nofreq;
run;
data freq;
retain colname column_value;
set outfreq;
colname = scan(tables, 2, ' ');
column_Value = trim(left(vvaluex(colname)));
keep colname column_value frequency percent;
run;
... but I believe that's not efficient. Say I have 1,000 columns; running PROC FREQ on all 1,000 columns is not efficient. Is there any other, more efficient way to accomplish my desired output without using PROC FREQ?
One of the most efficient mechanisms for computing frequency counts is a hash object set up for reference counting via the suminc tag.
The SAS documentation topic "Hash Object - Maintaining Key Summaries" demonstrates the technique for a single variable. The following example goes one step further and computes counts for each variable specified in an array. suminc:'one' specifies that each use of the ref() method will add the value of the variable one to an internal reference sum. While iterating over the distinct keys for output, the frequency count is extracted via the sum method.
* one million data values;
data have;
    array v(1000);
    do row = 1 to 1000;
        do index = 1 to dim(v);
            v(index) = ceil(100*ranuni(123));
        end;
        output;
    end;
    keep v:;
    format v: 4.;
run;

* compute frequency counts via .ref();
data freak_out(keep=name value count);
    length name $32 value 8;
    declare hash bins(ordered:'a', suminc:'one');
    bins.defineKey('name', 'value');
    bins.defineData('name', 'value');
    bins.defineDone();
    one = 1;
    do until (end_of_data);
        set have end=end_of_data;
        array v v1-v1000;
        do index = 1 to dim(v);
            name = vname(v(index));
            value = v(index);
            bins.ref();
        end;
    end;
    declare hiter out('bins');
    do while (out.next() = 0);
        bins.sum(sum:count);
        output;
    end;
run;
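As a quick sanity check (a sketch; check_v1 is a hypothetical name), the counts for a single variable can be compared against plain PROC FREQ:
proc freq data=have noprint;
    tables v1 / out=check_v1(rename=(v1=value));
    * COUNT here should match the freak_out rows where name='v1';
run;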
Note: Proc FREQ uses standard grammars, variables can be a mix of character and numeric, and it has lots of additional features that are specified through options.
I think the most time-consuming part of your code is the generation of the ODS report. You can transpose the data before applying the freq. The example below does the task for 1,000 rows with 1,000 variables in a few seconds. If you do it using ODS it may take much longer.
data dummy;
    array colNames [1000] col1-col1000;
    do line = 1 to 1000;
        do j = 1 to dim(colNames);
            colNames[j] = int(rand("uniform")*100);
        end;
        output;
    end;
    drop j;
run;

proc transpose
    data = dummy
    out = dummyTransposed (drop = line rename = (_name_ = colName col1 = value));
    var col1-col1000;
    by line;
run;

proc freq data = dummyTransposed noprint;
    tables colName*value / out = result(drop = percent);
run;
Perhaps this statement from the comments is the real problem.
I felt like the odsoutput with proc freq is slowing down and creating
huge logs and outputs. think of 10,000 variables and million records.
I felt there should be another way of accomplishing this and arrays
seems to be a great fit
You can tell ODS not to produce the printed output if you don't want it.
ods exclude all ;
ods output onewayfreqs=outfreq;
proc freq data=final;
tables colx coly colz / nocum nofreq;
run;
ods exclude none ;

Calculate rolling sum for one column in time interval in SAS

I have one problem, and I think there is not much to correct to make it work right.
I have this table (with the desired output column 'sum_usage'):
id opt t_purchase t_spent bonus usage sum_usage
a 1 10NOV2017:12:02:00 10NOV2017:14:05:00 100 9 15
a 1 10NOV2017:12:02:00 10NOV2017:15:07:33 100 0 15
a 1 10NOV2017:12:02:00 10NOV2017:13:24:50 100 6 6
b 1 10NOV2017:13:54:00 10NOV2017:14:02:58 100 3 10
a 1 10NOV2017:12:02:00 10NOV2017:20:22:07 100 12 27
b 1 10NOV2017:13:54:00 10NOV2017:13:57:12 100 7 7
So, I need to sum all usage values from t_purchase until t_spent (for one id, opt combination there is just one unique t_purchase; group by id, opt).
Also, I have about a million rows, so a hash table would be the best solution. I've tried:
data want;
    if _n_=1 then do;
        if 0 then set have(rename=(usage=_usage));
        declare hash h(dataset:'have(rename=(usage=_usage))',hashexp:20);
        h.definekey('id','opt', 't_purchase', 't_spent');
        h.definedata('_usage');
        h.definedone();
    end;
    set have;
    sum_usage=0;
    do i=intck('second', t_purchase, t_spent) to t_spent;
        if h.find(key:user,key:id_option,key:i)=0 then sum_usage+_usage;
    end;
    drop _usage i;
run;
The fifth line from the bottom (do i=intck('second', t_purchase, t_spent)) is not correct for sure, but I have no idea how to approach this. So, the main problem is how to set up the time interval for this calculation. I already have one function in this hash table, func, with the same keys but without a time interval, so it would be pretty good to write this one too, but it's not necessary.
Personally, I would ditch the hash and use SQL.
Example Data:
data have;
    input id $ opt
          t_purchase :datetime20.
          t_spent :datetime20.
          bonus usage sum_usage;
    format
        t_purchase datetime20.
        t_spent datetime20.;
datalines;
a 1 10NOV2017:12:02:00 10NOV2017:14:05:00 100 9 15
a 1 10NOV2017:12:02:00 10NOV2017:15:07:33 100 0 15
a 1 10NOV2017:12:02:00 10NOV2017:13:24:50 100 6 6
b 1 10NOV2017:13:54:00 10NOV2017:14:02:58 100 3 10
a 1 10NOV2017:12:02:00 10NOV2017:20:22:07 100 12 27
b 1 10NOV2017:13:54:00 10NOV2017:13:57:12 100 7 7
;
I'm leaving your sum_usage column here for comparison.
Now, create a table of sums. The new value is sum_usage2.
proc sql noprint;
create table sums as
select a.id,
a.opt,
a.t_purchase,
a.t_spent,
sum(b.usage) as sum_usage2
from have as a,
have as b
where a.id = b.id
and a.opt = b.opt
and b.t_spent <= a.t_spent
and b.t_spent >= a.t_purchase
group by a.id,
a.opt,
a.t_purchase,
a.t_spent;
quit;
Now that you have the sums, join them back to the original table:
proc sql noprint;
create table want as
select a.*,
b.sum_usage2
from have as a
left join
sums as b
on a.id = b.id
and a.opt = b.opt
and a.t_spent = b.t_spent
and a.t_purchase = b.t_purchase;
quit;
This produces the table you want. Alternatively, you can use a hash to look up the values and add the sum in a Data Step (which can be faster given the size).
data want2;
set have;
format sum_usage2 best.;
if _n_=1 then do;
%create_hash(lk,id opt t_purchase t_spent, sum_usage2,"sums");
end;
rc = lk.find();
drop rc;
run;
%create_hash() macro available here https://github.com/FinancialRiskGroup/SASPerformanceAnalytics
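If you don't have that macro available, the equivalent inline hash setup looks roughly like this (a sketch; it assumes the sums table created by the PROC SQL step above):
data want2;
    if _n_ = 1 then do;
        if 0 then set sums(keep=sum_usage2); * define the host variable at compile time;
        declare hash lk(dataset:'sums');
        lk.defineKey('id','opt','t_purchase','t_spent');
        lk.defineData('sum_usage2');
        lk.defineDone();
    end;
    set have;
    if lk.find() ne 0 then call missing(sum_usage2);
run;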
I believe this question is a morph of one of your earlier ones, where you compute a rolling sum by doing a hash lookup for every second over a 3-hour period for each record in your data set. Hopefully you realized the simplicity of that approach has a large cost: 3*3600 hash lookups per record, as well as having to load the entire data vector into a hash.
The time log data presented has new records inserted at the top, so I presume the data to be monotonically descending in time.
A DATA step can, in a single pass of monotonic data, compute the rolling sum over a time window. The technique uses 'ring' arrays, wherein index advancement is adjusted by modulus. One array is for the time and the other is for the metric (usage). The required array size is the maximum number of items that could occur within the time window.
Consider some generated sample data with time steps of 1, 2, and one jump of 200 seconds:
%let have_count = 150; * sample size - the COMPARE log further below shows 150 observations;
data have;
    time = '12oct2017:11:22:32'dt;
    usage = 0;
    do _n_ = 1 to &have_count;
        time + 2; *ceil(25*ranuni(123));
        if _n_ > 30 then time + -1;
        if _n_ = 145 then time + 200;
        usage = floor(180*ranuni(123));
        delta = time-lag(time);
        output;
    end;
run;
Start with the case of computing a rolling sum from prior items when the data is sorted by ascending time (the descending case follows).
The example parameters are RING_SIZE 16 and TIME_WINDOW of 12 seconds.
%let RING_SIZE = 16;
%let TIME_WINDOW = '00:00:12't;

data want;
    array ring_usage [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);
    array ring_time  [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);
    retain ring_tail 0 ring_head -1 span 0 span_usage 0;

    set have;
    by time; * cause error if data not sorted per algorithm requirement;

    * unload from accumulated usage the tail items that fell out the window;
    do while (span and time - ring_time(ring_tail) > &TIME_WINDOW);
        span + -1;
        span_usage + -ring_usage(ring_tail);
        ring_tail = mod ( ring_tail + 1, &RING_SIZE );
    end;

    ring_head = mod ( ring_head + 1, &RING_SIZE );
    span + 1;
    if span > 1 and (ring_head = ring_tail) then do;
        _n_ = dim(ring_time);
        put 'ERROR: Ring array too small, size=' _n_;
        abort cancel;
    end;

    * update the ring array;
    ring_time(ring_head) = time;
    ring_usage(ring_head) = usage;
    span_usage + usage;

    drop ring_tail ring_head span;
run;
For the case of data sorted descending, you could jiggle things: sort ascending, compute the rolling sum, and re-sort descending.
What to do if such a jiggle can't be done, or you just want a single pass?
The items to be part of the rolling calculation have to come from 'lead' rows, or rows not yet read via SET. How is this possible? A second SET statement can be used to open a separate channel to the data set, and thus obtain lead values.
There is a little more bookkeeping for processing lead data: premature overwrite and the diminished window at the end of data need to be handled.
%let debug = *; * scaffolding for the debug PUT statements below - set to empty to enable them;

data want2;
    array ring_usage [-1:%eval(&RING_SIZE-1)] _temporary_;
    array ring_time  [-1:%eval(&RING_SIZE-1)] _temporary_;
    retain lead_index 0 ring_tail -1 ring_head -1 span 1 span_usage . guard_index .;

    set have;
    &debug put / _N_ ':' time= ring_head=;

    * unload ring_head slotted item from sum;
    span + -1;
    span_usage + -ring_usage(ring_head);

    * advance ring_head slot by 1, the vacated slot will be overwritten by lead;
    ring_head = mod ( ring_head + 1, &RING_SIZE );
    &debug put +2 ring_time(ring_head)= span= 'head';

    * load ring with lead values via a second SET of the same data;
    if not end2 then do;
        do until (_n_ > 1 or lead_index = 0 or end2);
            set have(keep=time usage rename=(time=t usage=u)) end=end2; * <--- the second SET;
            if end2 then guard_index = lead_index;
            &debug if end2 then put guard_index=;
            ring_time(lead_index) = t;
            ring_usage(lead_index) = u;
            &debug put +2 ring_time(lead_index)= 'lead';
            lead_index = mod ( lead_index + 1, &RING_SIZE);
        end;
    end;

    * advance ring_tail to cover the time window;
    if ring_tail ne guard_index then do;
        ring_tail_was = ring_tail;
        ring_tail = mod ( ring_tail + 1, &RING_SIZE );
        do while (time - ring_time(ring_tail) <= &TIME_WINDOW);
            span + 1;
            span_usage + ring_usage(ring_tail);
            &debug put +2 ring_time(ring_tail)= span= 'seek';
            ring_tail_was = ring_tail;
            ring_tail = mod ( ring_tail + 1, &RING_SIZE );
            if ring_tail_was = guard_index then leave;
            if span > 1 and (ring_head = ring_tail) then do;
                _n_ = dim(ring_time);
                put 'ERROR: Ring array too small, size=' _n_;
                abort cancel;
            end;
        end;
        * seek went beyond window, back tail off to prior index;
        ring_tail = ring_tail_was;
    end;
    &debug put +2 ring_time(ring_tail)= span= 'mark';

    drop lead_index t u ring_: guard_index span;
    format ring: span: usage 6.;
run;
options source;
Confirm both methods have the same computation:
proc sort data=want2; by time;
run;
proc compare noprint data=want compare=want2 out=diff outnoequal;
id time;
var span_usage;
run;
---------- LOG ----------
NOTE: There were 150 observations read from the data set WORK.WANT.
NOTE: There were 150 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.DIFF has 0 observations and 4 variables.
I have not benchmarked ring-array versus SQL versus Proc EXPAND versus Hash.
Caution: Dead reckoning rolling values using +in and -out operations can experience round-off errors when dealing with non-integer values.
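To make that caution concrete, here is a contrived demonstration: once a small value's contribution is absorbed into a much larger running sum, subtracting the large value back out cannot recover it.
data _null_;
    s = 0;
    s + 1e16;  * a large value enters the window;
    s + 1;     * a small value enters - its contribution is lost to rounding;
    s + -1e16; * the large value leaves the window;
    put s=;    * prints 0 rather than the correct rolling sum of 1;
run;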