SAS: Replace rare levels in variable with new level "Other" - sas

I've got pretty big table where I want to replace rare values (for this example that have less than 10 occurancies but real case is more complicated- it might have 1000 levels while I want to have only 15). This list of possible levels might change so I don't want to hardcode anything.
My code is like:
%let var = Make;
proc sql;
create table stage1_ as
select &var.,
count(*) as count
from sashelp.cars
group by &var.
having count >= 10
order by count desc
;
quit;
/* Join table with table including only top obs to replace rare
values with "other" category */
proc sql;
create table stage2_ as
select t1.*,
case when t2.&var. is missing then "Other_&var." else t1.&var. end as &var._new
from sashelp.cars t1 left join
stage1_ t2 on t1.&var. = t2.&var.
;
quit;
/* Drop old variable and rename the new as old */
data result;
set stage2_(drop= &var.);
rename &var._new=&var.;
run;
It works, but unfortunately it is not very officient as it needs to make a join for each variable (in real case I am doing it in loop).
Is there a better way to do it? Maybe some smart replace function?
Thanks!!

You probably don't want to change the actual data values. Instead consider creating a custom format for each variable that will map the rare values to an 'Other' category.
The FREQ procedure ODS can be used to capture the counts and percentages of every variable listed into a single table. NOTE: Freq table/out= captures only the last listed variable. Those counts can be used to construct the format according to the 'othering' rules you want to implement.
data have;
do row = 1 to 1000;
array x x1-x10;
do over x;
if row < 600
then x = ceil(100*ranuni(123));
else x = ceil(150*ranuni(123));
end;
output;
end;
run;
ods output onewayfreqs=counts;
proc freq data=have ;
table x1-x10;
run;
data count_stack;
length name $32;
set counts;
array x x1-x10;
do over x;
name = vname(x);
value = x;
if value then output;
end;
keep name value frequency;
run;
proc sort data=count_stack;
by name descending frequency ;
run;
data cntlin;
do _n_ = 1 by 1 until (last.name);
set count_stack;
by name;
length fmtname $32;
fmtname = trim(name)||'top';
start = value;
label = cats(value);
if _n_ < 11 then output;
end;
hlo = 'O';
label = 'Other';
output;
run;
proc format cntlin=cntlin;
run;
ods html;
proc freq data=have;
table x1-x10;
format
x1 x1top.
x2 x2top.
x3 x3top.
x4 x4top.
x5 x5top.
x6 x6top.
x7 x7top.
x8 x8top.
x9 x9top.
x10 x10top.
;
run;

Related

SaS 9.4: How to use different weights on the same variable without datastep or proc sql

I can't find a way to summarize the same variable using different weights.
I try to explain it with an example (of 3 records):
data pippo;
a=10;
wgt1=0.5;
wgt2=1;
wgt3=0;
output;
a=3;
wgt1=0;
wgt2=0;
wgt3=1;
output;
a=8.9;
wgt1=1.2;
wgt2=0.3;
wgt3=0.1;
output;
run;
I tried the following:
proc summary data=pippo missing nway;
var a /weight=wgt1;
var a /weight=wgt2;
var a /weight=wgt3;
output out=pluto (drop=_freq_ _type_) sum()=;
run;
Obviously it gives me a warning because I used the same variable "a" (I can't rename it!).
I've to save a huge amount of data and not so much physical space and I should construct like 120 field (a0-a6,b0-b6 etc) that are the same variables just with fixed weight (wgt0-wgt5).
I want to store a dataset with 20 columns (a,b,c..) and 6 weight (wgt0-wgt5) and, on demand, processing a "summary" without an intermediate datastep that oblige me to create 120 fields.
Due to the huge amount of data (more or less 55Gb every month) I'd like also not to use proc sql statement:
proc sql;
create table pluto
as select sum(db.a * wgt1) as a0, sum(db.a * wgt1) as a1 , etc.
quit;
There is a "Super proc summary" that can summarize the same field with different weights?
Thanks in advance,
Paolo
I think there are a few options. One is the data step view that data_null_ mentions. Another is just running the proc summary however many times you have weights, and either using ods output with the persist=proc or 20 output datasets and then setting them together.
A third option, though, is to roll your own summarization. This is advantageous in that it only sees the data once - so it's faster. It's disadvantageous in that there's a bit of work involved and it's more complicated.
Here's an example of doing this with sashelp.baseball. In your actual case you'll want to use code to generate the array reference for the variables, and possibly for the weights, if they're not easily creatable using a variable list or similar. This assumes you have no CLASS variable, but it's easy to add that into the key if you do have a single (set of) class variable(s) that you want NWAY combinations of only.
data test;
set sashelp.baseball;
array w[5];
do _i = 1 to dim(w);
w[_i] = rand('Uniform')*100+50;
end;
output;
run;
data want;
set test end=eof;
i = .;
length varname $32;
sumval = 0 ;
sum=0;
if _n_ eq 1 then do;
declare hash h_summary(suminc:'sumval',keysum:'sum',ordered:'a');;
h_summary.defineKey('i','varname'); *also would use any CLASS variable in the key;
h_summary.defineData('i','varname'); *also would include any CLASS variable in the key;
h_summary.defineDone();
end;
array w[5]; *if weights are not named in easy fashion like this generate this with code;
array vars[*] nHits nHome nRuns; *generate this with code for the real dataset;
do i = 1 to dim(w);
do j = 1 to dim(vars);
varname = vname(vars[j]);
sumval = vars[j]*w[i];
rc = h_summary.ref();
if i=1 then put varname= sumval= vars[j]= w[i]=;
end;
end;
if eof then do;
rc = h_summary.output(dataset:'summary_output');
end;
run;
One other thing to mention though... if you're doing this because you're doing something like jackknife variance estimation or that sort of thing, or anything that uses replicate weights, consider using PROC SURVEYMEANS which can handle replicate weights for you.
You can SCORE your data set using a customized SCORE data set that you can generate
with a data step.
options center=0;
data pippo;
retain a 10 b 1.75 c 5 d 3 e 32;
run;
data score;
if 0 then set pippo;
array v[*] _numeric_;
retain _TYPE_ 'SCORE';
length _name_ $32;
array wt[3] _temporary_ (.5 1 .333);
do i = 1 to dim(v);
call missing(of v[*]);
do j = 1 to dim(wt);
_name_ = catx('_',vname(v[i]),'WGT',j);
v[i] = wt[j];
output;
end;
end;
drop i j;
run;
proc print;[enter image description here][1]
run;
proc score data=pippo score=score;
id a--e;
var a--e;
run;
proc print;
run;
proc means stackods sum;
ods exclude summary;
ods output summary=summary;
run;
proc print;
run;
enter image description here

SAS - creating indicator variables

I'm using SAS and I'd like to create an indicator variable.
The data I have is like this (DATA I HAVE):
and I want to change this to (DATA I WANT):
I have a fixed number of total time that I want to use, and the starttime has duplicate time value (in this example, c1 and c2 both started at time 3). Although the example I'm using is small with 5 names and 12 time values, the actual data is very large (about 40,000 names and 100,000 time values - so the outcome I want is a matrix with 100,000x40,000.)
Can someone please provide any tips/solution on how to handle this?
40k variables is a lot. It will be interesting to see how well this scales. How do you determine the stop time?
data have;
input starttime name :$32.;
retain one 1;
cards;
1 varx
3 c1
3 c2
5 c3x
10 c4
11 c5
;;;;
run;
proc print;
run;
proc transpose data=have out=have2(drop=_name_ rename=(starttime=time));
by starttime;
id name;
var one;
run;
data time;
if 0 then set have2(drop=time);
array _n[*] _all_;
retain _n 0;
do time=.,1 to 12;
output;
call missing(of _n[*]);
end;
run;
data want0 / view=want0;
merge time have2;
by time;
retain dummy '1';
run;
data want;
length time 8;
update want0(obs=0) want0;
by dummy;
if not missing(time);
output;
drop dummy;
run;
proc print;
run;
This will work. There may be a simpler solution that does it all in one data step. My data step creates a staggered results that has to be collapsed which I do by summing in the sort/means.
data have;
input starttime name $;
datalines;
3 c1
3 c2
5 c3
10 c4
11 c5
;
run;
data want(drop=starttime name);
set have;
array cols (*) c1-c5;
do time=1 to 100;
if starttime < time then cols(_N_)=1;
else cols(_N_)=0;
output;
end;
run;
proc sort data=want;
by time;
proc means data=want noprint;
by time;
var _numeric_;
output out=want2(drop=_type_ _freq_) sum=;
run;
I am not recommending you do it this way. You didn't provide enough information to let us know why you want a matrix of that size. You may have processing issues getting it to run.
In the line do time=1 to 100 you can change that to 100000 or whatever length.
I think the code below will work:
%macro answer_macro(data_in, data_out);
/* Deduplication of initial dataset just to assure that every variable has a unique starting time*/
proc sort data=&data_in. out=data_have_nodup; by name starttime; run;
proc sort data=data_have_nodup nodupkey; by name; run;
/*Getting min and max starttime values - here I am assuming that there is only integer values form starttime*/
proc sql noprint;
select min(starttime)
,max(starttime)
into :min_starttime /*not used. Use this (and change the loop on the next dataset) to start the time variable from the value where the first variable starts*/
,:max_starttime
from data_have_nodup
;quit;
/*Getting all pairs of name/starttime*/
proc sql noprint;
select name
,starttime
into :name1 - :name1000000
,:time1 - :time1000000
from data_have_nodup
;quit;
/*Getting total number of variables*/
proc sql noprint;
select count(*) into :nvars
from data_have_nodup
;quit;
/* Creating dataset with possible start values */
/*I'm not sure this step could be done with a single datastep, but I don't have SAS
on my PC to make tests, so I used the method below*/
data &data_out.;
do i = 1 to &max_starttime. + 1;
time = i; output;
end;
drop i;
run;
data &data_out.;
set &data_out.;
%do i = 1 %to &nvars.;
if time >= &&time&i then &&name&i = 1;
else &&name&i = 0;
%end;
run;
%mend answer_macro;
Unfortunately I don't have SAS on my machine right now, so I can't confirm that the code works. But even if it doesn't, you can use the logic in it.

How to write a concise list of variables in table of a freq when the variables are differentiated only by a suffix?

I have a dataset with some variables named sx for x = 1 to n.
Is it possible to write a freq which gives the same result as:
proc freq data=prova;
table s1 * s2 * s3 * ... * sn /list missing;
run;
but without listing all the names of the variables?
I would like an output like this:
S1 S2 S3 S4 Frequency
A 10
A E 100
A E J F 300
B 10
B E 100
B E J F 300
but with an istruction like this (which, of course, is invented):
proc freq data=prova;
table s1:sn /list missing;
run;
Why not just use PROC SUMMARY instead?
Here is an example using two variables from SASHELP.CARS.
So this is PROC FREQ code.
proc freq data=sashelp.cars;
where make in: ('A','B');
tables make*type / list;
run;
Here is way to get counts using PROC SUMMARY
proc summary missing nway data=sashelp.cars ;
where make in: ('A','B');
class make type ;
output out=want;
run;
proc print data=want ;
run;
If you need to calculate the percentages you can instead use the WAYS statement to get both the overall and the individual cell counts. And then add a data step to calculate the percentages.
proc summary missing data=sashelp.cars ;
where make in: ('A','B');
class make type ;
ways 0 2 ;
output out=want;
run;
data want ;
set want ;
retain total;
if _type_=0 then total=_freq_;
percent=100*_freq_/total;
run;
So if you have 10 variables you would use
ways 0 10 ;
class s1-s10 ;
If you just want to build up the string "S1*S2*..." then you could use a DO loop or a macro %DO loop and put the result into a macro variable.
data _null_;
length namelist $200;
do i=1 to 10;
namelist=catx('*',namelist,cats('S',i));
end;
call symputx('namelist',namelist);
run;
But here is an easy way to make such a macro variable from ANY variable list not just those with numeric suffixes.
First get the variables names into a dataset. PROC TRANSPOSE is a good way if you use the OBS=0 dataset option so that you only get the _NAME_ column.
proc transpose data=have(obs=0) ;
var s1-s10 ;
run;
Then use PROC SQL to stuff the names into a macro variable.
proc sql noprint;
select _name_
into :namelist separated by '*'
from &syslast
;
quit;
Then you can use the macro variable in your TABLES statement.
proc freq data=have ;
tables &namelist / list missing ;
run;
Car':
In short, no. There is no shortcut syntax for specifying a variable list that crosses dimension.
In long, yes -- if you create a surrogate variable that is an equivalent crossing.
Discussion
Sample data generator:
%macro have(top=5);
%local index;
data have;
%do index = 1 %to &top;
do s&index = 1 to 2+ceil(3*ranuni(123));
%end;
array V s:;
do _n_ = 1 to 5*ranuni(123);
x = ceil(100*ranuni(123));
if ranuni(123) < 0.1 then do;
ix = ceil(&top*ranuni(123));
h = V(ix);
V(ix) = .;
output;
V(ix) = h;
end;
else
output;
end;
%do index = 1 %to &top;
end;
%end;
run;
%mend;
%have;
As you probably noticed table s: created one freq per s* variable.
For example:
title "One table per variable";
proc freq data=have;
tables s: / list missing ;
run;
There is no shortcut syntax for specifying a variable list that crosses dimension.
NOTE: If you specify out=, the column names in the output data set will be the last variable in the level. So for above, the out= table will have a column "s5", but contain counts corresponding to combinations for each s1 through s5.
At each dimensional level you can use a variable list, as in level1 * (sublev:) * leaf. The same caveat for out= data applies.
Now, reconsider the original request discretely (no-shortcut) crossing all the s* variables:
title "1 table - 5 columns of crossings";
proc freq data=have;
tables s1*s2*s3*s4*s5 / list missing out=outEach;
run;
And, compare to what happens when a data step view uses a variable list to compute a surrogate value corresponding to the discrete combinations reported above.
data haveV / view=haveV;
set have;
crossing = catx(' * ', of s:); * concatenation of all the s variables;
keep crossing;
run;
title "1 table - 1 column of concatenated crossings";
proc freq data=haveV;
tables crossing / list missing out=outCat;
run;
Reality check with COMPARE, I don't trust eyeballs. If zero rows with differences (per noequal) then the out= data sets have identical counts.
proc compare noprint base=outEach compare=outCat out=diffs outnoequal;
var count;
run;
----- Log -----
NOTE: There were 31 observations read from the data set WORK.OUTEACH.
NOTE: There were 31 observations read from the data set WORK.OUTCAT.
NOTE: The data set WORK.DIFFS has 0 observations and 3 variables.
NOTE: PROCEDURE COMPARE used (Total process time)

Indicator variable for maximum value by groups

Is there any more elegant way than that presented below for the following task:
to create Indicator Variables (below "MAX_X1" and "MAX_X2") whithin each group (below "key1") of multiple observation (below "key2") with value 1 if this observation corresponds to the maximum value of the variable in eache group and 0 otherwise
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
proc means data=have noprint;
by key1;
var x1 x2;
output out=max
max= / autoname;
run;
data want;
merge have max;
by key1;
drop _:;
run;
proc sql;
title "MAX";
select name into :MAXvars separated by ' '
from dictionary.columns
WHERE LIBNAME="WORK" AND MEMNAME="WANT" AND NAME like "%_Max"
order by name;
quit;
title;
data want; set want;
array MAX (*) &MAXvars;
array XVars (*) x1 x2;
array Indicators (*) MAX_X1 MAX_X2;
do i=1 to dim(MAX);
if XVars[i]=MAX[i] then Indicators[i]=1; else Indicators[i]=0;
end;
drop i;
run;
Thanks for any suggestion of optimization
Proc sql can be used with a group by statement to allow summary functions across values of a variable.
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
proc sql;
create table want
as select
key1,
key2,
x1,
x2,
case
when x1 = max(x1) then 1
else 0 end as max_x1,
case
when x2 = max(x2) then 1
else 0 end as max_x2
from have
group by key1
order by key1, key2;
quit;
It is also possible to do this in a single data step, provided that you read the input dataset twice - this is an example of a double DOW-loop.
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
/*Sort by key1 (or generate index) if not already sorted*/
proc sort data = have;
by key1;
run;
data want;
if 0 then set have;
array xvars[3,2] x1 x2 x1_max_flag x2_max_flag t_x1_max t_x2_max;
/*1st DOW-loop*/
do _n_ = 1 by 1 until(last.key1);
set have;
by key1;
do i = 1 to 2;
xvars[3,i] = max(xvars[1,i],xvars[3,i]);
end;
end;
/*2nd DOW-loop*/
do _n_ = 1 to _n_;
set have;
do i = 1 to 2;
xvars[2,i] = (xvars[1,i] = xvars[3,i]);
end;
output;
end;
drop i t_:;
run;
This may be a bit complicated to understand, so here's a rough explanation of how it flows:
Read one by group with the first DOW-loop, updating rolling max variables as each row is read in. Don't output anything yet.
Now read the same by-group again using the second DOW-loop, checking to see whether each row is equal to the rolling max and outputting each row.
Go back to first DOW-loop, read the next by-group and repeat.

Crosstable displaying frequency combination of N variables in SAS

What I've got:
a table of 20 rows in SAS (originally 100k)
various binary attributes (columns)
What I'm looking to get:
A crosstable displaying the frequency of the attribute combinations
like this:
Attribute1 Attribute2 Attribute3 Attribute4
Attribute1 5 0 1 2
Attribute2 0 3 0 3
Attribute3 2 0 5 4
Attribute4 1 2 0 10
*The actual sum of combinations is made up and probably not 100% logical
The code I currently have:
/*create dummy data*/
data monthly_sales (drop=i);
do i=1 to 20;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
output;
end;
run;
I guess this can be done smarter, but this seem to work. First I created a table that should hold all the frequencies:
data crosstable;
Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;output;output;output;output;
run;
Then I loop through all the combinations, inserting the count into the crosstable:
%macro lup();
%do i=1 %to 4;
%do j=&i %to 4;
proc sql noprint;
select count(*) into :Antall&i&j
from monthly_sales (where=(Attribute&i and Attribute&j));
quit;
data crosstable;
set crosstable;
if _n_=&j then Attribute&i=&&Antall&i&j;
if _n_=&i then Attribute&j=&&Antall&i&j;
run;
%end;
%end;
%mend;
%lup;
Note that since the frequency count for (i,j)=(j,i) you do not need to do both.
I'd recommend using the built-in SAS tools for this sort of thing, and probably displaying your data slightly differently as well, unless you really want a diagonal table. e.g.
data monthly_sales (drop=i);
do i=1 to 20;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
count = 1;
output;
end;
run;
proc freq data = monthly_sales noprint;
table attribute1 * attribute2 * attribute3 * attribute4 / out = frequency_table;
run;
proc summary nway data = monthly_sales;
class attribute1 attribute2 attribute3 attribute4;
var count;
output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;
Either of these gives you a table with 1 row for each contribution of attributes in your data, which is slightly different from what you requested, but conveys the same information. You can force proc summary to include rows for combinations of class variables that don't exist in your data by using the completetypes option in the proc summary statement.
It's definitely worth taking the time to get familiar with proc summary if you're doing statistical analysis in SAS - you can include additional output statistics and process multiple variables with minimal additional code and processing overhead.
Update: it's possible to produce the desired table without resorting to macro logic, albeit a rather complex process:
proc summary data = monthly_sales completetypes;
ways 1 2; /*Calculate only 1 and 2-way summaries*/
class attribute1 attribute2 attribute3 attribute4;
var count;
output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;
/*Eliminate unnecessary output rows*/
data summary_table;
set summary_table;
array a{*} attribute:;
sum = sum(of a[*]);
missing = 0;
do i = 1 to dim(a);
missing + missing(a[i]);
a[i] = a[i] * count;
end;
/*We want rows where two attributes are both 1 (sum = 2),
or one attribute is 1 and the others are all missing*/
if sum = 2 or (sum = 1 and missing = dim(a) - 1);
drop i missing sum;
edge = _n_;
run;
/*Transpose into long format - 1 row per combination of vars*/
proc transpose data = summary_table out = tr_table(where = (not(missing(col1))));
by edge;
var attribute:;
run;
/*Use cartesian join to produce table containing desired frequencies (still not in the right shape)*/
option linesize = 150;
proc sql noprint _method _tree;
create table diagonal as
select a._name_ as aname,
b._name_ as bname,
a.col1 as count
from tr_table a, tr_table b
where a.edge = b.edge
group by a.edge
having (count(a.edge) = 4 and aname ne bname) or count(a.edge) = 1
order by aname, bname
;
quit;
/*Transpose the table into the right shape*/
proc transpose data = diagonal out = want(drop = _name_);
by aname;
id bname;
var count;
run;
/*Re-order variables and set missing values to zero*/
data want;
informat aname attribute1-attribute4;
set want;
array a{*} attribute:;
do i = 1 to dim(a);
a[i] = sum(a[i],0);
end;
drop i;
run;
Yeah, user667489 was right, I just added some extra code to get the cross-frequency table looking good. First, I created a table with 10 million rows and 10 variables:
data monthly_sales (drop=i);
do i=1 to 10000000;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
Attribute5=rand("Normal")>0.5;
Attribute6=rand("Normal")>0.5;
Attribute7=rand("Normal")>0.5;
Attribute8=rand("Normal")>0.5;
Attribute9=rand("Normal")>0.5;
Attribute10=rand("Normal")>0.5;
output;
end;
run;
Create an empty 10x10 crosstable:
data crosstable;
Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;Attribute5=.;Attribute6=.;Attribute7=.;Attribute8=.;Attribute9=.;Attribute10=.;
output;output;output;output;output;output;output;output;output;output;
run;
Create a frequency table using proc freq:
proc freq data = monthly_sales noprint;
table attribute1 * attribute2 * attribute3 * attribute4 * attribute5 * attribute6 * attribute7 * attribute8 * attribute9 * attribute10
/ out = frequency_table;
run;
Loop through all the combinations of Attributes and sum the "count" variable. Insert it into the crosstable:
%macro lup();
%do i=1 %to 10;
%do j=&i %to 10;
proc sql noprint;
select sum(count) into :Antall&i&j
from frequency_table (where=(Attribute&i and Attribute&j));
quit;
data crosstable;
set crosstable;
if _n_=&j then Attribute&i=&&Antall&i&j;
if _n_=&i then Attribute&j=&&Antall&i&j;
run;
%end;
%end;
%mend;
%lup;