SAS: Segmenting observations in different ranges/bins - sas

I have ~4M observations of points data, and would like to segment them into different bins like the following:
Point_Range Frequency
0-100 1000000
100-200 2000000
200-300 1000000
How would I be able to assign a range to each of the observations & output the above-like table without having to write manual "case when" or "if then" statements?

Use a custom format to map a point value to it's range representation.
Example:
data user_points;
call streaminit(20201230);
do user_id = 1 to 4e6;
points = rand('integer', 0, 300);
output;
end;
run;
proc format;
value point_range
0-<100 = ' 0 - 99'
100-<200 = '100 - 199'
200- 300 = '200 - 300'
;
run;
proc freq noprint data=user_points;
format points point_range.;
table points / out=bins;
run;
Creates data set

Related

Count the number of times record appears within a variable and apply count in a new variable

SAS - I'd like to count the number of times a record appears within a variable (Ref) and apply a count in a new variable (Count) for these.
Eg.
Ref
Count
1000
1
1001
1
2000
1
3000
1
1000
2
1000
3
What is the best way to do this?
That is what PROC FREQ is for. It will count the number of OBSERVATIONS for each value of a variable (or combination of variables).
proc freq data=have;
tables REF ;
run;
If you want the result in a dataset then use the OUT= option of the TABLES statement.
proc freq data=have;
tables REF / out=want;
run;
Managed to achive the results with the below code.
Please note - Data needs to be sorted with a PROC SORT before running this.
DATA want;
set have;
BY Variable;
IF FIRST.Variable then counter = 1
ELSE counter + 1;
RUN;
If you use the SAS hash object, you can do it with the original order intact
data have;
input Ref $;
datalines;
1000
1001
2000
3000
1000
1000
;
data want;
if _N_ = 1 then do;
dcl hash h();
h.definekey('Ref');
h.definedata('Count');
h.definedone();
end;
set have;
if h.find() ne 0 then Count = 1;
else Count + 1;
h.replace();
run;

SAS cap 1st and 99th percentiles of dataset with hundreds of variables

I have a SAS dataset with 700 columns (variables). For all 700 of them, I want to cap all values below the 1st percentile to the 1st percentile and all values above the 99th percentile to the 99th percentile. I want to do this iteratively for all 700 variables without having to specify their names explicitly.
How can I do this?
Perhaps slightly easier than the hash table - and somewhat faster, I believe - is using the horizontal output of proc means, and then using an array.
proc means data=sashelp.prdsale;
var _numeric_;
output out=quantiles p1= p99= /autoname;
run;
proc sql;
select name
into :numlist separated by ' '
from dictionary.columns
where libname='SASHELP' and memname='PRDSALE' and type='num';
quit;
data prdsale_capped;
set sashelp.prdsale;
if _n_ eq 1 then set quantiles;
array vars &numlist.;
array p1 actual_p1--month_p1;
array p99 actual_p99--month_p99;
do _i = 1 to dim(vars);
vars[_i] = max(min(vars[_i],p99[_i]),p1[_i]);
end;
run;
Basically it's just setting up three arrays - vars, p1, p99 - and then you have all 3 values for every numeric variable on the PDV and can just compare during a single array traversal.
For a production process I'd probably not use the -- but instead make 3 lists from proc sql and make 100% sure they're in the same order by using an order by.
You can do this with proc means and a hash table lookup. Let's create some test data with 100 variables pulled from a normal distribution. For testing, we'll change all the variables in the first and second rows to really big and really small numbers.
Our approach: create a lookup table where we can find the variable's name, pull its percentiles, and compare its value against those percentiles.
data have;
array var[100];
do i = 1 to 100;
do j = 1 to dim(var);
var[j] = rand('normal');
/* Test values */
if(i = 1) then var[j] = 99999;
if(i = 2) then var[j] = -99999;
end;
output;
end;
drop i j;
run;
Data:
var1 var2 var3 ...
99999 99999 99999 ...
-99999 -99999 -99999 ...
-0.149875111 0.4455504523 -0.783127138 ...
-0.731432437 -0.572508065 -1.044928486 ...
0.0108184539 1.0605591996 1.9132874927 ...
... ... ... ...
Let's get all the percentiles with proc means. You might be tempted to use output out=, but it does not create the data in a vertical lookup table that's easy for us to use in this manner; however, the stackODSOutput option on proc means does. More info on this from Rick Wicklin.
We'll use ods select none so we don't render a large table but still produce the dataset that drives the table.
/* Get a dataset of all 1st and 99th percentiles for each variable */
ods select none;
proc means data=have stackODSOutput p1 p99;
var var1-var100;
ods output summary = percentiles;
run;
ods select all;
Note that all the percentiles will be the same in this case. This is expected. We set all the variables in the first and second rows to the same big and small numbers for easy testing.
Data:
Variable P1 P99 ...
var1 -50001 50001 ...
var2 -50001 50001 ...
var3 -50001 50001 ...
var4 -50001 50001 ...
... ... ... ...
Now we'll use our lookup approach. We know our variable names and we can store them in an array. We can loop through that array, look up the variable in the hash table by name with vname(), and get its percentile.
data want;
set have;
array var[*] var1-var100;
/* Load a table of these values into memory and search for each percentile.
Think of this like a simple lookup table that floats out in memory.
*/
if(_N_ = 1) then do;
length variable $32.;
dcl hash pctiles(dataset: 'percentiles');
pctiles.defineKey('variable');
pctiles.defineData('p1', 'p99');
pctiles.defineDone();
call missing(p1, p99);
end;
/* Get the 1st and 99th percentile of each variable.
If the variable's name matches the variable name
in the hash table, check the variable's value
against the lookup percentile.
Cap it if it's above or below the percentile.
*/
do i = 1 to dim(var);
if(pctiles.Find(key:vname(var[i]) ) = 0) then do;
if(var[i] < p1) then var[i] = p1;
else if(var[i] > p99) then var[i] = p99;
end;
end;
drop i variable p1 p99;
run;
Output:
var1 var2 var3 ...
50000.532908 50000.721522 99999 ...
-50000.61447 -50000.92196 -50001.19549 ...
-0.149875111 0.4455504523 -0.783127138 ...
-0.731432437 -0.572508065 -1.044928486 ...
0.0108184539 1.0605591996 1.9132874927 ...
... ... ... ...
If your variables do not follow an easy sequential name, you can use the -- shortcut. For example, varA varB varC varD can be selected by varA--varD.

Is there a better way to segment a numeric column into uniform sets than Case/When?

I have a column for dollar-amount that I need to break apart into $1000 segments - so $0-$999, $1,000-$1,999, etc.
I could use Case/When, but there are an awful lot of groups I would have to make.
Is there a more efficient way to do this?
Thanks!
You could just use arithmetic. For example you could convert them to upper limit of the $1,000 range.
up_to = 1000*ceil(dollar/1000);
Let's make up some example data:
data test;
do dollar=0 to 5000 by 500 ;
up_to = 1000*ceil(dollar/1000);
output;
end;
run;
Results:
Obs dollar up_to
1 0 0
2 500 1000
3 1000 1000
4 1500 2000
5 2000 2000
6 2500 3000
7 3000 3000
8 3500 4000
9 4000 4000
10 4500 5000
11 5000 5000
Absolutely. This is a great use case for user-defined formats.
proc format;
value segment
0-<1000 = '0-1000'
1000-<2000 = '1000s'
2000-<3000 = '2000s'
;
quit;
If the number is too high to write out, do it with code!
data segments;
retain
fmtname 'segment'
type 'n' /* numeric format */
eexcl 'Y' /* exclude the "end" match, so 0-1000 excluding 1000 itself */
;
do start = 0 to 1e6 by 1000;
end = start + 1000;
label = catx('- <',start,end); * what you want this to show up as;
output;
end;
run;
proc format cntlin=segments;
quit;
Then you can use segment = put(dollaramt,segment.); to assign the value of segment, or just apply the format format dollaramt segment.; if you're just using it in PROC SUMMARY or somesuch.
And you can combine the two approaches above to generate a User Defined Format that will bin the amounts for you.
Create bins to set up a user defined format. One drawback of this method is that it requires you to know the range of data ahead of time.
Use a user defined function via PROC FCMP.
Use a manual calculation
I illustrate version of the solution for 1 & 3 below. #2 requires PROC FCMP but I think using it a plain data step can be simpler.
data thousands_format;
fmtname = 'thousands_fmt';
type = 'N';
do Start = 0 to 10000 by 1000;
END = Start + 1000 - 1;
label = catx(" - ", put(start, dollar12.0), put(end, dollar12.0));
output;
end;
run;
proc format cntlin=thousands_format;
run;
data demo;
do i=100 to 10000 by 50;
custom_format = put(i, thousands_fmt.);
manual_format = catx(" - ", put(floor(i/1000)*1000, dollar12.0), put((ceil(i/1000))*1000-1, dollar12.0));
output;
end;
run;

SAS summary statistic from a dataset

The dataset looks like this:
colx coly colz
0 1 0
0 1 1
0 1 0
Required output:
Colname value count
colx 0 3
coly 1 3
colz 0 2
colz 1 1
The following code works perfectly...
ods output onewayfreqs=outfreq;
proc freq data=final;
tables colx coly colz / nocum nofreq;
run;
data freq;
retain colname column_value;
set outfreq;
colname = scan(tables, 2, ' ');
column_Value = trim(left(vvaluex(colname)));
keep colname column_value frequency percent;
run;
... but I believe that's not efficient. Say I have 1000 columns, running prof freq on all 1000 columns is not efficient. Is there any other efficient way with out using the proc freq that accomplishes my desired output?
One of the most efficient mechanisms for computing frequency counts is through a hash object set up for reference counting via the suminc tag.
The SAS documentation for "Hash Object - Maintaining Key Summaries" demonstrates the technique for a single variable. The following example goes one step further and computes for each variable specified in an array. The suminc:'one' specifies that each use of ref will add the value of one to an internal reference sum. While iterating over the distinct keys for output, the frequency count is extracted via the sum method.
* one million data values;
data have;
array v(1000);
do row = 1 to 1000;
do index = 1 to dim(v);
v(index) = ceil(100*ranuni(123));
end;
output;
end;
keep v:;
format v: 4.;
run;
* compute frequency counts via .ref();
data freak_out(keep=name value count);
length name $32 value 8;
declare hash bins(ordered:'a', suminc:'one');
bins.defineKey('name', 'value');
bins.defineData('name', 'value');
bins.defineDone();
one = 1;
do until (end_of_data);
set have end=end_of_data;
array v v1-v1000;
do index = 1 to dim(v);
name = vname(v(index));
value = v(index);
bins.ref();
end;
end;
declare hiter out('bins');
do while (out.next() = 0);
bins.sum(sum:count);
output;
end;
run;
Note Proc FREQ uses standard grammars, variables can be a mixed of character and numeric, and has lots of additional features that are specified through options.
I think the most time consuming part in your code is generation of the ODS report. You can transpose the data before applying the freq. The below example does the task for 1000 rows with 1000 variables in few seconds. If you do it using ODS it may take much longer.
data dummy;
array colNames [1000] col1-col1000;
do line = 1 to 1000;
do j = 1 to dim(colNames);
colNames[j] = int(rand("uniform")*100);
end;
output;
end;
drop j;
run;
proc transpose
data = dummy
out = dummyTransposed (drop = line rename = (_name_ = colName col1 = value))
;
var col1-col1000;
by line;
run;
proc freq data = dummyTransposed noprint;
tables colName*value / out = result(drop = percent);
run;
Perhaps this statement from the comments is the real problem.
I felt like the odsoutput with proc freq is slowing down and creating
huge logs and outputs. think of 10,000 variables and million records.
I felt there should be another way of accomplishing this and arrays
seems to be a great fit
You can tell ODS not to produce the printed output if you don't want it.
ods exclude all ;
ods output onewayfreqs=outfreq;
proc freq data=final;
tables colx coly colz / nocum nofreq;
run;
ods exclude none ;

SAS Remove Outliers

I'm looking for a macro or something in SAS that can help me in isolating the outliers from a dataset. I define an outlier as: Upperbound: Q3+1.5(IQR) Lowerbound: Q1-1.5(IQR). I have the following SAS code:
title 'Fall 2015';
proc univariate data = fall2015 freq;
var enrollment_count;
histogram enrollment_count / vscale = percent vaxis = 0 to 50 by 5 midpoints = 0 to 300 by 5;
inset n mean std max min range / position = ne;
run;
I would like to get rid of the outliers from fall2015 dataset. I found some macros, but no luck in working the macro. Several have a class variable which I don't have. Any ideas how to separate my data?
Here's a macro I wrote a while ago to do this, under slightly different rules. I've modified it to meet your criteria (1.5).
Use proc means to calculate Q1/Q3 and IQR (QRANGE)
Build Macro to cap based on rules
Call macro using call execute and boundaries set, using the output from step 1.
*Calculate IQR and first/third quartiles;
proc means data=sashelp.class stackods n qrange p25 p75;
var weight height;
ods output summary=ranges;
run;
*create data with outliers to check;
data capped;
set sashelp.class;
if name='Alfred' then weight=220;
if name='Jane' then height=-30;
run;
*macro to cap outliers;
%macro cap(dset=,var=, lower=, upper=);
data &dset;
set &dset;
if &var>&upper then &var=&upper;
if &var<&lower then &var=&lower;
run;
%mend;
*create cutoffs and execute macro for each variable;
data cutoffs;
set ranges;
lower=p25-1.5*qrange;
upper=p75+1.5*qrange;
string = catt('%cap(dset=capped, var=', variable, ", lower=", lower, ", upper=", upper ,");");
call execute(string);
run;