I want to build another variable (it's fine if it lives in the same dataset) that is a categorization of an existing variable. I would like to choose the number of buckets (for example, using percentiles as cutoffs: p10, p20, p30, etc.).
Right now I do this by extracting the percentiles of the variable with proc univariate. But this gives me only the percentiles (my cutoffs), and then I have to build the new variable manually from them.
How can I create this new variable, giving the cutoffs and the number of buckets as input?
thanks in advance
Assuming you want equal-percentage-sized buckets, PROC RANK might get you what you are looking for.
data test;
do i=1 to 100;
output;
end;
run;
proc rank data=test out=test2 groups=5;
var i;
ranks grp;
run;
That will give you 5 groups (named 0 .. 4), which should be equivalent to P20, P40, ..., P80 cutoffs.
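To see which values actually land in each group, a quick check along these lines may help. This is only a sketch: it assumes the test2 data set and the variables i and grp produced by the step above.

```sas
/* Sketch: show the value range and count for each PROC RANK group.
   Assumes test2, i and grp from the PROC RANK step above. */
proc means data=test2 n min max nway;
  class grp;
  var i;
run;
```

The min and max per group are the empirical cutoffs, which can be compared against the percentiles reported by proc univariate.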
If you wanted non-equal buckets, e.g. P10, P40, P60, and P90, then you would have to rank at the finest granularity the cutoffs require and combine groups. Using the test data above:
%let groups=10;
proc rank data=test out=test2 groups=&groups;
var i;
ranks grp;
run;
/*
P = (grp+1)*100/&groups   (upper percentile of each ranked group)
Cutoffs 10, 40, 60, 90
implicit 5 new groups
*/
%let n_cutoff=4;
%let cutoffs=10, 40, 60, 90;
data test3(drop=_i cutoffs:);
set test2;
array cutoffs[&n_cutoff] (&cutoffs);
P = (grp+1)*100/&groups;
do _i=1 to &n_cutoff;
if P <= cutoffs[_i] then do;
new_grp = _i-1;
leave;
end;
if _i = &n_cutoff then
new_grp = _i;
end;
run;
10 is the greatest common divisor of the cutoff percentiles (10, 40, 60, 90). 100/10 = 10, so we need 10 groups from PROC RANK.
The Data Step at the end combines the groups using the cutoffs you are looking for.
I can't find a way to summarize the same variable using different weights.
I try to explain it with an example (of 3 records):
data pippo;
a=10;
wgt1=0.5;
wgt2=1;
wgt3=0;
output;
a=3;
wgt1=0;
wgt2=0;
wgt3=1;
output;
a=8.9;
wgt1=1.2;
wgt2=0.3;
wgt3=0.1;
output;
run;
I tried the following:
proc summary data=pippo missing nway;
var a /weight=wgt1;
var a /weight=wgt2;
var a /weight=wgt3;
output out=pluto (drop=_freq_ _type_) sum()=;
run;
Obviously it gives me a warning because I used the same variable "a" (I can't rename it!).
I have to store a huge amount of data with little physical space, and otherwise I would have to construct about 120 fields (a0-a6, b0-b6, etc.) that are the same variables with fixed weights (wgt0-wgt5) applied.
I want to store a dataset with 20 columns (a, b, c, ...) and 6 weights (wgt0-wgt5) and, on demand, run a "summary" without an intermediate data step that obliges me to create 120 fields.
Due to the huge amount of data (roughly 55 GB every month), I would also prefer not to use a proc sql statement such as:
proc sql;
create table pluto as
select sum(db.a * wgt1) as a0, sum(db.a * wgt2) as a1 /* , etc. */
from pippo as db;
quit;
Is there a "super proc summary" that can summarize the same field with different weights?
Thanks in advance,
Paolo
I think there are a few options. One is the data step view that data_null_ mentions. Another is just running the proc summary however many times you have weights, and either using ods output with the persist=proc or 20 output datasets and then setting them together.
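The first option, a data step view, could look roughly like the sketch below. It assumes the pippo data set from the question with weights wgt1-wgt3; the view name pippo_v and the product names a_w1-a_w3 are made up for illustration. The products are computed as the view is read, so they are never stored on disk.

```sas
/* Sketch: a DATA step view that exposes a*wgtN without storing it.
   pippo_v and a_w1-a_w3 are illustrative names, not from the question. */
data pippo_v / view=pippo_v;
  set pippo;
  array wgt[3] wgt1-wgt3;
  array a_w[3] a_w1-a_w3;
  do _i = 1 to dim(wgt);
    a_w[_i] = a * wgt[_i];
  end;
  drop _i;
run;

/* The view is evaluated here, in a single pass over pippo. */
proc summary data=pippo_v nway;
  var a_w1-a_w3;
  output out=pluto (drop=_freq_ _type_) sum=;
run;
```

For the three-record example this should give the same totals as a weighted sum of a under each weight, since a weighted SUM in proc summary is the sum of weight times value.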
A third option, though, is to roll your own summarization. This is advantageous in that it only sees the data once - so it's faster. It's disadvantageous in that there's a bit of work involved and it's more complicated.
Here's an example of doing this with sashelp.baseball. In your actual case you'll want to use code to generate the array reference for the variables, and possibly for the weights, if they're not easily creatable using a variable list or similar. This assumes you have no CLASS variable, but it's easy to add that into the key if you do have a single (set of) class variable(s) that you want NWAY combinations of only.
data test;
set sashelp.baseball;
array w[5];
do _i = 1 to dim(w);
w[_i] = rand('Uniform')*100+50;
end;
output;
run;
data want;
set test end=eof;
i = .;
length varname $32;
sumval = 0 ;
sum=0;
if _n_ eq 1 then do;
declare hash h_summary(suminc:'sumval',keysum:'sum',ordered:'a');
h_summary.defineKey('i','varname'); *also would use any CLASS variable in the key;
h_summary.defineData('i','varname'); *also would include any CLASS variable in the data;
h_summary.defineDone();
end;
array w[5]; *if weights are not named in easy fashion like this generate this with code;
array vars[*] nHits nHome nRuns; *generate this with code for the real dataset;
do i = 1 to dim(w);
do j = 1 to dim(vars);
varname = vname(vars[j]);
sumval = vars[j]*w[i];
rc = h_summary.ref();
if i=1 then put varname= sumval= vars[j]= w[i]=;
end;
end;
if eof then do;
rc = h_summary.output(dataset:'summary_output');
end;
run;
One other thing to mention though... if you're doing this because you're doing something like jackknife variance estimation or that sort of thing, or anything that uses replicate weights, consider using PROC SURVEYMEANS which can handle replicate weights for you.
You can SCORE your data set using a customized SCORE data set that you can generate with a data step.
options center=0;
data pippo;
retain a 10 b 1.75 c 5 d 3 e 32;
run;
data score;
if 0 then set pippo;
array v[*] _numeric_;
retain _TYPE_ 'SCORE';
length _name_ $32;
array wt[3] _temporary_ (.5 1 .333);
do i = 1 to dim(v);
call missing(of v[*]);
do j = 1 to dim(wt);
_name_ = catx('_',vname(v[i]),'WGT',j);
v[i] = wt[j];
output;
end;
end;
drop i j;
run;
proc print;
run;
proc score data=pippo score=score;
id a--e;
var a--e;
run;
proc print;
run;
proc means stackods sum;
ods exclude summary;
ods output summary=summary;
run;
proc print;
run;
I have a SAS dataset with 700 columns (variables). For all 700 of them, I want to cap all values below the 1st percentile to the 1st percentile and all values above the 99th percentile to the 99th percentile. I want to do this iteratively for all 700 variables without having to specify their names explicitly.
How can I do this?
Perhaps slightly easier than the hash table - and somewhat faster, I believe - is using the horizontal output of proc means, and then using an array.
proc means data=sashelp.prdsale;
var _numeric_;
output out=quantiles p1= p99= /autoname;
run;
proc sql;
select name
into :numlist separated by ' '
from dictionary.columns
where libname='SASHELP' and memname='PRDSALE' and type='num';
quit;
data prdsale_capped;
set sashelp.prdsale;
if _n_ eq 1 then set quantiles;
array vars &numlist.;
array p1 actual_p1--month_p1;
array p99 actual_p99--month_p99;
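do _i = 1 to dim(vars);
/* min() and max() ignore missing arguments, so leave missing values untouched */
if not missing(vars[_i]) then
vars[_i] = max(min(vars[_i],p99[_i]),p1[_i]);
end;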
run;
Basically it's just setting up three arrays - vars, p1, p99 - and then you have all 3 values for every numeric variable on the PDV and can just compare during a single array traversal.
For a production process I'd probably not use the -- but instead make 3 lists from proc sql and make 100% sure they're in the same order by using an order by.
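A sketch of that more defensive variant follows. It assumes the quantiles data set from the proc means step above, and that /autoname produced names of the form name_P1 and name_P99 (very long variable names may be truncated by autoname, which this sketch does not handle). The macro variables p1list and p99list and the array names lo_cut and hi_cut are illustrative.

```sas
/* Sketch: build all three lists in one query so their order is guaranteed to match. */
proc sql noprint;
  select name, cats(name,'_P1'), cats(name,'_P99')
    into :numlist separated by ' ',
         :p1list  separated by ' ',
         :p99list separated by ' '
  from dictionary.columns
  where libname='SASHELP' and memname='PRDSALE' and type='num'
  order by varnum;
quit;

data prdsale_capped;
  set sashelp.prdsale;
  if _n_ eq 1 then set quantiles;
  array vars    &numlist.;
  array lo_cut  &p1list.;
  array hi_cut  &p99list.;
  do _i = 1 to dim(vars);
    /* skip missing values: min() and max() would otherwise replace them */
    if not missing(vars[_i]) then
      vars[_i] = max(min(vars[_i], hi_cut[_i]), lo_cut[_i]);
  end;
  drop _i _type_ _freq_ &p1list. &p99list.;
run;
```

Because the three lists come from the same rows of the same query, element _i of each array refers to the same underlying variable regardless of column order in the quantiles data set.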
You can do this with proc means and a hash table lookup. Let's create some test data with 100 variables pulled from a normal distribution. For testing, we'll change all the variables in the first and second rows to really big and really small numbers.
Our approach: create a lookup table where we can find the variable's name, pull its percentiles, and compare its value against those percentiles.
data have;
array var[100];
do i = 1 to 100;
do j = 1 to dim(var);
var[j] = rand('normal');
/* Test values */
if(i = 1) then var[j] = 99999;
if(i = 2) then var[j] = -99999;
end;
output;
end;
drop i j;
run;
Data:
var1 var2 var3 ...
99999 99999 99999 ...
-99999 -99999 -99999 ...
-0.149875111 0.4455504523 -0.783127138 ...
-0.731432437 -0.572508065 -1.044928486 ...
0.0108184539 1.0605591996 1.9132874927 ...
... ... ... ...
Let's get all the percentiles with proc means. You might be tempted to use output out=, but it does not create the data in a vertical lookup table that's easy for us to use in this manner; however, the stackODSOutput option on proc means does. More info on this from Rick Wicklin.
We'll use ods select none so we don't render a large table but still produce the dataset that drives the table.
/* Get a dataset of all 1st and 99th percentiles for each variable */
ods select none;
proc means data=have stackODSOutput p1 p99;
var var1-var100;
ods output summary = percentiles;
run;
ods select all;
Note that all the percentiles will be the same in this case. This is expected. We set all the variables in the first and second rows to the same big and small numbers for easy testing.
Data:
Variable P1 P99 ...
var1 -50001 50001 ...
var2 -50001 50001 ...
var3 -50001 50001 ...
var4 -50001 50001 ...
... ... ... ...
Now we'll use our lookup approach. We know our variable names and we can store them in an array. We can loop through that array, look up the variable in the hash table by name with vname(), and get its percentile.
data want;
set have;
array var[*] var1-var100;
/* Load a table of these values into memory and search for each percentile.
Think of this like a simple lookup table that floats out in memory.
*/
if(_N_ = 1) then do;
length variable $32.;
dcl hash pctiles(dataset: 'percentiles');
pctiles.defineKey('variable');
pctiles.defineData('p1', 'p99');
pctiles.defineDone();
call missing(p1, p99);
end;
/* Get the 1st and 99th percentile of each variable.
If the variable's name matches the variable name
in the hash table, check the variable's value
against the lookup percentile.
Cap it if it's above or below the percentile.
*/
do i = 1 to dim(var);
if(pctiles.Find(key:vname(var[i]) ) = 0) then do;
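/* Missing values compare lower than any number in SAS, so exclude them explicitly */
if(not missing(var[i]) and var[i] < p1) then var[i] = p1;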
else if(var[i] > p99) then var[i] = p99;
end;
end;
drop i variable p1 p99;
run;
Output:
var1 var2 var3 ...
50000.532908 50000.721522 99999 ...
-50000.61447 -50000.92196 -50001.19549 ...
-0.149875111 0.4455504523 -0.783127138 ...
-0.731432437 -0.572508065 -1.044928486 ...
0.0108184539 1.0605591996 1.9132874927 ...
... ... ... ...
If your variables do not follow an easy sequential name, you can use the -- shortcut. For example, varA varB varC varD can be selected by varA--varD.
I've got 50 columns of data, with 4 different measurements in each, as well as designation tags (groups C, D, and E). I've averaged the 4 measurements, so every data point now has an average. Now I am supposed to take the average of all the data point averages within each specific group.
So I want all the data in group C to be averaged, and likewise for D and E, and I don't know how to do that.
avg1=(MEAS1+MEAS2+MEAS3+MEAS4)/4;
avg_score=round(avg1, .1);
run;
proc print;
run;
This is what I have so far.
There are several procedures, as well as SQL, that can average values over a group.
I'll guess you meant to say 50 rows of data.
Example using PROC MEANS:
data have;
call streaminit(314159);
do _n_ = 1 to 50;
group = substr('CDE', rand('integer',3),1);
array v meas1-meas4;
do _i_ = 1 to dim(v);
num + 2;
v(_i_) = num;
end;
output;
end;
drop num;
run;
data rowwise_means;
set have;
avg_meas = mean (of meas:);
run;
* group wise means of row means;
proc means noprint data=rowwise_means nway;
class group;
var avg_meas;
output out=want mean=meas_grandmean;
run;
I have a process flow in SAS Enterprise Guide which is comprised mainly of Data views rather than tables, for the sake of storage in the work library.
The problem is that I need to calculate percentiles (using proc univariate) from one of the data views and left join this to the final table (shown in the screenshot of my process flow).
Is there any way that I can specify the outfile in the univariate procedure as being a data view, so that the procedure doesn't calculate everything prior to it in the flow? When the percentiles are left joined to the final table, the flow is calculated again so I'm effectively doubling my processing time.
Please find the code for the univariate procedure below
proc univariate data=WORK.QUERY_FOR_SGFIX noprint;
var CSA_Price;
by product_id;
output out= work.CSA_Percentiles_Prod
pctlpre= P
pctlpts= 40 to 60 by 10;
run;
In SAS, my understanding is that procs such as proc univariate cannot generally produce views as output. The only workaround I can think of would be for you to replicate the proc logic within a data step and produce a view from the data step. You could do this e.g. by transposing your variables into temporary arrays and using the pctl function.
Here's a simple example:
data example /view = example;
array _height[19]; /*Number of rows in sashelp.class dataset*/
/*Populate array*/
do _n_ = 1 by 1 until(eof);
set sashelp.class end = eof;
_height[_n_] = height;
end;
/*Calculate quantiles*/
array quantiles[3] q40 q50 q60;
array points[3] (40 50 60);
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/*Keep only the quantiles we calculated*/
keep q40--q60;
run;
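With a bit more work, you could also make this approach return percentiles for individual by groups rather than for the whole dataset at once. You would need to write a double-DOW loop to do this. The example below reads from a dataset named class, assumed here to be a copy of sashelp.class sorted by sex, since by-group processing requires sorted input: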
data example;
array _height[19];
array quantiles[3] q40 q50 q60;
array points[3] _temporary_ (40 50 60);
/*Clear heights array between by groups*/
call missing(of _height[*]);
/*Populate heights array*/
do _n_ = 1 by 1 until(last.sex);
set class end = eof;
by sex;
_height[_n_] = height;
end;
/*Calculate quantiles*/
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/* Output all rows from input dataset, with by-group quantiles attached*/
do _n_ = 1 to _n_;
set class;
output;
end;
keep name sex q40--q60;
run;
I have about 2,330,000 observations and want to assign them to 10,000 evenly spaced buckets. The bucket size would be (max(var)-min(var))/10,000. For example, if my maximum value is 3000 and my minimum value is -200, my bucket size will be (3000+200)/10,000 = 0.32. So any value between -200 and (-200+0.32) should go to bucket 1, any value between (-200+0.32) and (-200+0.32*2) should go to bucket 2, and so on. The dataset will be something like this:
Var_value bucket
-200 1
-53 ?
-5 ?
-46 ?
5
8
4
56
7542
242
....
How should the code be written? I am thinking of a do loop but am not sure how to do it. Can anyone help?
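Not sure what you would do with the proposed loop, but this is what I'd do. Note that PROC RANK assigns groups by rank, so each bucket receives roughly the same number of observations; the buckets are not equal-width value intervals.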
/* get some data to play with */
data a(keep=val);
do i=1 to 1000000;
val = 3200*ranuni(0)-200;
output;
end;
run;
/* groups=xxx specifies the number of buckets
var yyy is the name of the variable whose values we'd like to classify
ranks zzz specifies the name of the variable containing the assigned rank
*/
proc rank data=a out=b groups=10000;
var val;
ranks bucket;
run;
Below is another approach you could use:
Generate random simulated data
data have;
do i=1 to 250000;
/*The seed is _N_ (always 1 here, because this step runs a single iteration), so the random values are reproducible.*/
var_value = (ranuni(_N_)-0.5)*8000;
output;
end;
drop i;
run;
Solution(s)
/*Desired number of buckets.*/
%let num_buckets = 10000;
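/*Determine bucket size and minimum var_value.
  format=best32. keeps full precision; INTO otherwise converts numbers with BEST8.*/
proc sql noprint;
select (max(var_value)-min(var_value))/&num_buckets. format=best32.,
min(var_value) format=best32.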
into : bucket_size,
: min_var_value
from have;
quit;
%put bucketsize: &bucket_size.;
%put min var_value: &min_var_value.;
/* 1 - Assign buckets using data step */
data want;
set have;
bucket = max(ceil((var_value-&min_var_value.)/&bucket_size.),1);
run;
proc sort data=want;
by bucket;
run;
/* or 2 - Assign buckets using proc sql*/
proc sql;
create table want as
select var_value,
max(ceil((var_value-&min_var_value.)/&bucket_size.),1) as bucket
from have
order by CALCULATED bucket;
quit;
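Either way, it may be worth confirming that the assigned bucket numbers stay within 1 to &num_buckets. A minimal check, assuming the want data set and the num_buckets macro variable from above (the column aliases are illustrative):

```sas
/* Sketch: sanity-check the range of assigned buckets. */
proc sql;
  select min(bucket)            as min_bucket,
         max(bucket)            as max_bucket,
         count(distinct bucket) as buckets_used
  from want;
quit;

%put Expected bucket range: 1 to &num_buckets.;
```

If max_bucket comes out above &num_buckets., the usual cause is rounding in the bucket size or minimum value stored in the macro variables.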