Left join a bucket value based on a greater-than clause - SAS

I am looking to create an optimal bucketing macro. My first obstacle is to create equidistant buckets. I am using the sashelp.baseball dataset as an example.
I take the range of logsalary and divide it by 100 to get the width of each bucket. Then I would like to assign each logsalary a bucket number wherever the logsalary is smaller than that bucket's limit value.
The code I have tried is attached. I am hoping to be able to join or merge on the bucket limit values and use a greater-than or smaller-than clause to append a bucket value.
/*Sort the baseball dataset by smallest to largest, removing any missing data*/
PROC SORT
DATA = sashelp.baseball
(KEEP = logsalary
WHERE = (NOT MISSING(logsalary)))
OUT = baseball;
BY logsalary;
RUN;
/*Identify the size of each bucket by splitting the range into 100 equidistant buckets*/
DATA _NULL_;
RETAIN bin_size;
SET baseball END = EOF;
IF _N_ = 1 THEN DO;
bin_size = logsalary;
CALL SYMPUT("min_bin",logsalary);
END;
IF EOF THEN DO;
bin_size = ((logsalary - bin_size) / 100);
CALL SYMPUT("bin_size",bin_size);
END;
RUN;
/*Create a vector to identify each bucket range*/
DATA bin_levels;
DO bin = 1 TO 100;
IF bin = 1 THEN DO;
bin_level = &min_bin.;
OUTPUT;
END;
ELSE DO;
bin_level = &min_bin. + &bin_size. * bin;
OUTPUT;
END;
END;
RUN;
/*Append a bucket number based on the logsalary being smaller than the next bucket value*/
PROC SQL;
CREATE TABLE binned_data AS
SELECT
a.*
, b.bin
, b.bin_level
FROM
baseball a
LEFT JOIN
bin_levels b ON b.bin_level > a.logsalary
;
QUIT;
I would like the first ten rows to look like this:
logSalary bin
4.2121275979 1
4.2195077052 1
4.248495242 1
4.248495242 1
4.248495242 1
4.248495242 1
4.248495242 1
4.3174881135 2
4.3174881135 2
4.3174881135 2
...
Thanks in advance
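Two things keep the PROC SQL step above from producing one row per player: bin_levels as built starts the vector at the minimum itself and then jumps to min + 2*bin_size, and the join matches every bucket whose limit exceeds a given logsalary, so each row comes back once per qualifying bucket. Below is a minimal, untested sketch of one way around both: build the limit for every bucket as min + bin*bin_size, then keep only the smallest qualifying bucket per row (note the >= so the minimum value falls in bucket 1).
/* Upper limit of bucket b is min + b*bin_size (differs from the
   bin_levels step above, which starts the vector at the minimum) */
DATA bin_levels;
DO bin = 1 TO 100;
bin_level = &min_bin. + &bin_size. * bin;
OUTPUT;
END;
RUN;
/* Tag each row with an id so duplicate salaries survive the GROUP BY */
DATA baseball_id;
SET baseball;
row_id = _N_;
RUN;
/* Keep only the smallest qualifying bucket per row */
PROC SQL;
CREATE TABLE binned_data AS
SELECT
a.row_id
, a.logsalary
, MIN(b.bin) AS bin
FROM
baseball_id a
LEFT JOIN
bin_levels b ON b.bin_level >= a.logsalary
GROUP BY
a.row_id
, a.logsalary
;
QUIT;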
EDIT: For now, I am going to go with this solution:
DATA bucketed_data;
RETAIN bin bin_limit;
SET baseball;
IF _n_ = 1 THEN DO;
bin_limit = logsalary;
bin = 1;
END;
IF logsalary > bin_limit THEN DO;
bin_limit + &bin_size.;
bin + 1;
END;
RUN;
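One caveat with this data step: the IF advances at most one bucket per row, so if two adjacent sorted values are more than one bucket width apart, every later bucket boundary drifts. A sketch of a fix (untested) that replaces the IF with a DO WHILE and advances as many buckets as needed:
DATA bucketed_data;
RETAIN bin bin_limit;
SET baseball;
IF _n_ = 1 THEN DO;
bin_limit = logsalary;
bin = 1;
END;
/* Advance as many buckets as needed, not just one */
DO WHILE (logsalary > bin_limit);
bin_limit + &bin_size.;
bin + 1;
END;
RUN;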

No need for macro variables: put the values into a dataset and combine that dataset with the one you want to bin. Let's use 10 bins instead of 100 to make it easier to examine the results.
First find the minimum and range:
proc means n min max data=sashelp.baseball;
var logsalary;
output out=stats(keep=min range) min=min range=range;
run;
Then use those to bin the data:
DATA bucketed_data;
SET sashelp.baseball (keep=logsalary);
if _n_=1 then set stats;
if not missing(logsalary) then do bin=1 to 10 while(logsalary > min+bin*(range/10));
* nothing to do here ;
end;
run;
Let's use PROC MEANS to see how it worked.
proc means n min max data=bucketed_data;
class bin / missing;
var logsalary;
run;
Results: the N, Min, and Max of logsalary within each of the 10 bins.

SAS: proc hpbin function

The data I have is
Year Score
2020 100
2020 45
2020 82
.
.
.
2020 91
2020 14
2020 35
And the output I want is
Score_Ranking Count_Percent Cumulative_count_percent Sum
top100 x y z
101-200
.
.
.
800-900
900-989
The dataset has a total of 989 observations for the same year. I want to divide the whole dataset into 10 bins, but with the bin size set to 100. However, if I use the proc hpbin function, my results get divided into bins of 989/10 observations each. Is there a way I can set the bin size?
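(For reference, the attempt was presumably something along these lines; this is a sketch with the dataset name have assumed, and NUMBIN= controls only the number of bins, not their size:)
proc hpbin data=have numbin=10;
input score;
run;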
Also, I want additional columns that show the proportion, the cumulative proportion, and the sum of the scores. How can I print these next to the bins?
Thank you in advance.
1. Sort your data
2. Classify into bins
3. Use PROC FREQ for the count/cumulative count
4. Use PROC FREQ for the SUM by using WEIGHT
5. Merge the results
Or do 3-4 in the same data step.
I'm not actually sure what the first two columns will tell you as they will all be the same except for the last one.
First, generate some fake data to work with; the sort is important!
*generate fake data;
data have;
do score=1 to 998;
output;
end;
run;
proc sort data=have;
by score;
run;
Method #1
Note that I use a view here, not a data set, which can help if efficiency is an issue.
*create bins;
data binned / view=binned;
set have ;
if mod(_n_, 100) = 1 then bin+1;
run;
*calculate counts/percentages;
proc freq data=binned noprint;
table bin / out=binned_counts outcum;
run;
*calculate sums - note the addition of WEIGHT;
proc freq data=binned noprint;
table bin / out=binned_sum outcum;
weight score;
run;
*merge results together;
data want_merged;
merge binned_counts binned_sum (keep=bin count rename=(count=sum));
by bin;
run;
Method #2
And another method, which requires a single pass of your data rather than multiple as in the PROC FREQ approach:
*manual approach;
data want;
set have
nobs = _nobs /*Total number of observations in data set*/
end=last /*flag for last record*/;
*holds values across rows and sets initial value;
retain bin 1 count cum_count cum_sum 0 percent cum_percent ;
*increments bins and resets count at start of each 100;
if mod(_n_, 100) = 1 and _n_ ne 1 then do;
*output only when end of bin;
output;
bin+1;
count=0;
end;
*increment counters and calculate percents;
count+1;
percent = count / _nobs;
cum_count + 1;
cum_percent = cum_count / _nobs;
cum_sum + score;
*output last record/final stats;
if last then output;
*format percents;
format percent cum_percent percent12.1;
run;

Frequency of entire table in SAS

Is it possible to get the frequency of an entire table in SAS? For example, I want to count how many Yes's and No's are in an entire table. Thanks
A hash component object has keys and can track .FIND references in a key summary variable, specified with the keysum: tag attribute supplied at instantiation. When the keysum variable is incremented by 1 per lookup (via the suminc: variable), it yields a frequency count.
data have;
* Words array from Abstract;
* "How Do I Love Hash Tables? Let Me Count The Ways!";
* by Judy Loren, Health Dialog Analytic Solutions;
* SGF 2008 - Beyond the Basics;
* https://support.sas.com/resources/papers/proceedings/pdfs/sgf2008/029-2008.pdf;
array words(17) $10 _temporary_ (
'I' 'love' 'hash' 'tables'
'You' 'will' 'too' 'after' 'you' 'see'
'what' 'they' 'can' 'do' '--' 'Judy' 'Loren'
);
call streaminit(123);
do row = 1 to 127;
attrib RESPONSE1-RESPONSE20 length = $10;
array RESPONSE RESPONSE1-RESPONSE20;
do over RESPONSE;
RESPONSE = words(rand('integer', 1, dim(words)));
end;
output;
end;
run;
data _null_;
if _n_ = 1 then do;
length term $10;
call missing (term);
retain one 1;
retain count 0;
declare hash bins(suminc:'one', keysum:'count');
bins.defineKey('term');
bins.defineData('term');
bins.defineDone();
end;
set have end=lastrow;
array response response1-response20;
do over response;
if bins.find(key:response) ne 0 then do;
bins.add(key:response, data:response, data:1);
end;
end;
if lastrow;
bins.output(dataset:'all_freq');
run;
Original answer, which presumed only Yes and No values:
Yes. You can put the values in an array, compute a 0/1 flag for each No/Yes value, and then use SUM to count the 0's and 1's. The SUM doubles as a frequency count only because the values being summed are just 0's and 1's.
Example:
data have;
call streaminit(123);
do row = 1 to 100;
attrib ANSWER1-ANSWER20 length = $3;
array ANSWER ANSWER1-ANSWER20;
do over ANSWER; ANSWER = ifc(rand('uniform') > 0.15,'Yes','No'); end;
output;
end;
run;
data want(keep=freq_1 freq_0);
set have end=lastrow;
array ANSWER ANSWER1-ANSWER20;
array X(20) _temporary_;
do over ANSWER; x(_I_) = ANSWER = 'Yes'; end;
freq_1 + sum (of X(*));
freq_0 + dim(X) - sum (of X(*));
if lastrow;
run;
Transpose your main data and then do a proc freq. This is fully dynamic and scales whether the number of questions grows or the set of responses grows. You do need all the variables to be of the same type, though - either all character or all numeric.
*generate fake data;
data have;
call streaminit(99);
array q(30) q1-q30;
do i=1 to 100;
do j=1 to dim(q);
q(j) = rand('bernoulli', 0.8);
end;
output;
end;
run;
*flip it to a long format;
proc transpose data=have out=long;
by I;
var q1-q30;
run;
*get the summaries needed;
proc freq data=long;
table col1;
run;
You should get output as follows:
The FREQ Procedure

                                Cumulative    Cumulative
COL1    Frequency    Percent     Frequency       Percent
   0          581      19.37           581         19.37
   1         2419      80.63          3000        100.00

How do I create a SAS data view from an 'Out=' option?

I have a process flow in SAS Enterprise Guide which is composed mainly of data views rather than tables, to save storage in the work library.
The problem is that I need to calculate percentiles (using proc univariate) from one of the data views and left join these to the final table in the flow.
Is there any way that I can specify the outfile in the univariate procedure as being a data view, so that the procedure doesn't calculate everything prior to it in the flow? When the percentiles are left joined to the final table, the flow is calculated again so I'm effectively doubling my processing time.
Please find the code for the univariate procedure below
proc univariate data=WORK.QUERY_FOR_SGFIX noprint;
var CSA_Price;
by product_id;
output out= work.CSA_Percentiles_Prod
pctlpre= P
pctlpts= 40 to 60 by 10;
run;
In SAS, my understanding is that procs such as proc univariate cannot generally produce views as output. The only workaround I can think of would be for you to replicate the proc logic within a data step and produce a view from the data step. You could do this e.g. by transposing your variables into temporary arrays and using the pctl function.
Here's a simple example:
data example /view = example;
array _height[19]; /*Number of rows in sashelp.class dataset*/
/*Populate array*/
do _n_ = 1 by 1 until(eof);
set sashelp.class end = eof;
_height[_n_] = height;
end;
/*Calculate quantiles*/
array quantiles[3] q40 q50 q60;
array points[3] (40 50 60);
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/*Keep only the quantiles we calculated*/
keep q40--q60;
run;
With a bit more work, you could also make this approach return percentiles for individual by groups rather than for the whole dataset at once. You would need to write a double-DOW loop to do this. The step below reads a dataset named class that must be sorted by the BY variable; assuming it is a copy of sashelp.class, first sort it as shown next.
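proc sort data = sashelp.class out = class;
by sex;
run;
Then the by-group version looks like this: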
data example;
array _height[19];
array quantiles[3] q40 q50 q60;
array points[3] _temporary_ (40 50 60);
/*Clear heights array between by groups*/
call missing(of _height[*]);
/*Populate heights array*/
do _n_ = 1 by 1 until(last.sex);
set class end = eof;
by sex;
_height[_n_] = height;
end;
/*Calculate quantiles*/
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/* Output all rows from input dataset, with by-group quantiles attached*/
do _n_ = 1 to _n_;
set class;
output;
end;
keep name sex q40--q60;
run;

SAS summary statistic from a dataset

The dataset looks like this:
colx coly colz
0 1 0
0 1 1
0 1 0
Required output:
Colname value count
colx 0 3
coly 1 3
colz 0 2
colz 1 1
The following code works perfectly...
ods output onewayfreqs=outfreq;
proc freq data=final;
tables colx coly colz / nocum nofreq;
run;
data freq;
retain colname column_value;
set outfreq;
colname = scan(tables, 2, ' ');
column_Value = trim(left(vvaluex(colname)));
keep colname column_value frequency percent;
run;
... but I believe that's not efficient. Say I have 1000 columns; running proc freq on all 1000 columns is not efficient. Is there any other, more efficient way, without using proc freq, that accomplishes my desired output?
One of the most efficient mechanisms for computing frequency counts is through a hash object set up for reference counting via the suminc tag.
The SAS documentation for "Hash Object - Maintaining Key Summaries" demonstrates the technique for a single variable. The following example goes one step further and computes counts for each variable specified in an array. The suminc:'one' specification means that each use of the REF method adds the value of the variable one to an internal reference sum. While iterating over the distinct keys for output, the frequency count is extracted via the SUM method.
* one million data values;
data have;
array v(1000);
do row = 1 to 1000;
do index = 1 to dim(v);
v(index) = ceil(100*ranuni(123));
end;
output;
end;
keep v:;
format v: 4.;
run;
* compute frequency counts via .ref();
data freak_out(keep=name value count);
length name $32 value 8;
declare hash bins(ordered:'a', suminc:'one');
bins.defineKey('name', 'value');
bins.defineData('name', 'value');
bins.defineDone();
one = 1;
do until (end_of_data);
set have end=end_of_data;
array v v1-v1000;
do index = 1 to dim(v);
name = vname(v(index));
value = v(index);
bins.ref();
end;
end;
declare hiter out('bins');
do while (out.next() = 0);
bins.sum(sum:count);
output;
end;
run;
Note that Proc FREQ uses standard grammar, the variables can be a mix of character and numeric, and the procedure has lots of additional features that are specified through options.
I think the most time-consuming part of your code is the generation of the ODS report. You can transpose the data before applying the freq. The example below does the task for 1000 rows with 1000 variables in a few seconds. If you do it using ODS, it may take much longer.
data dummy;
array colNames [1000] col1-col1000;
do line = 1 to 1000;
do j = 1 to dim(colNames);
colNames[j] = int(rand("uniform")*100);
end;
output;
end;
drop j;
run;
proc transpose
data = dummy
out = dummyTransposed (drop = line rename = (_name_ = colName col1 = value))
;
var col1-col1000;
by line;
run;
proc freq data = dummyTransposed noprint;
tables colName*value / out = result(drop = percent);
run;
Perhaps this statement from the comments is the real problem.
I felt like the odsoutput with proc freq is slowing down and creating
huge logs and outputs. think of 10,000 variables and million records.
I felt there should be another way of accomplishing this and arrays
seems to be a great fit
You can tell ODS not to produce the printed output if you don't want it.
ods exclude all ;
ods output onewayfreqs=outfreq;
proc freq data=final;
tables colx coly colz / nocum nofreq;
run;
ods exclude none ;

Do loop to assign bucket

I have about 2,330,000 observations, and I want to assign them to 10,000 evenly spaced buckets. The bucket width would be (max(var)-min(var))/10,000. For example, my maximum value is 3000 and my minimum value is -200, so my bucket size will be (3000+200)/10,000 = 0.32. Any value between -200 and -200+0.32 should go to bucket 1, any value between -200+0.32 and -200+0.32*2 should go to bucket 2, and so on. The dataset will be something like this:
Var_value bucket
-200 1
-53 ?
-5 ?
-46 ?
5
8
4
56
7542
242
....
How should the code be written? I am thinking of a do loop but am not sure how to write it. Can anyone help?
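To make the rule concrete with the numbers above: with min = -200 and bucket width 0.32, a value of -53 would land in bucket ceil((-53 - (-200))/0.32) = ceil(459.375) = 460.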
Not sure what you would do with the proposed loop but this is what I'd do:
/* get some data to play with */
data a(keep=val);
do i=1 to 1000000;
val = 3200*ranuni(0)-200;
output;
end;
run;
/* groups=xxx specifies the number of buckets
var yyy is the name of the variable whose values we'd like to classify
ranks zzz specifies the name of the variable containing the assigned rank
*/
proc rank data=a out=b groups=10000;
var val;
ranks bucket;
run;
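Worth noting: GROUPS= forms equal-frequency (rank-based) groups, so each bucket holds roughly the same number of observations rather than spanning an equal slice of the value range, and the group numbers run from 0 to 9999. If 1-based bucket numbers are wanted, a small sketch (the output name b_shifted is arbitrary):
/* shift PROC RANK's 0-based group numbers to start at 1 */
data b_shifted;
set b;
bucket = bucket + 1;
run;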
Below is another approach you could use:
Generate random simulated data
data have;
do i=1 to 250000;
/*Seed is _N_, but only the first call's seed is used, so we'll see the same random values each run.*/
var_value = (ranuni(_N_)-0.5)*8000;
output;
end;
drop i;
run;
Solution(s)
/*Desired number of buckets.*/
%let num_buckets = 10000;
/*Determine bucket size and minimum var_value*/
proc sql noprint;
select (max(var_value)-min(var_value))/&num_buckets.,
min(var_value)
into : bucket_size,
: min_var_value
from have;
quit;
%put bucketsize: &bucket_size.;
%put min var_value: &min_var_value.;
/* 1 - Assign buckets using data step */
data want;
set have;
bucket = max(ceil((var_value-&min_var_value.)/&bucket_size.),1);
run;
proc sort data=want;
by bucket;
run;
/* or 2 - Assign buckets using proc sql*/
proc sql;
create table want as
select var_value,
max(ceil((var_value-&min_var_value.)/&bucket_size.),1) as bucket
from have
order by CALCULATED bucket;
quit;