I'm trying to reference values that are in a stats table, as such:
/* Calculate Median and IQR */
PROC UNIVARIATE DATA = kddcup98(drop=TARGET_B) OUTTABLE= boxStats(keep=_VAR_ _Q1_ _Q3_ _QRANGE_) NOPRINT;
RUN;
/* Calculate upper and lower bounds */
DATA boxStats;
SET boxStats;
upper_bound = _Q3_ + 1.5*_QRANGE_;
lower_bound = _Q3_ - 1.5*_QRANGE_;
RUN;
DATA kddcup98_continuous;
SET kddcup98_continuous;
ARRAY Num_Col[*] _NUMERIC_;
DO i = 1 to dim(Num_Col);
IF Num_Col[i] > boxStats[i, "upper_bound"] OR Num_Col[i] < boxStats[i, "lower_bound"] THEN Num_Col[i] = .;
END;
RUN;
I have the main data table and a table of stats from which I computed upper and lower bounds. I need to reference those values from the boxStats table. How I can I reference those values?
Use OUTTABLE PROC statement option.
OUTTABLE=SAS-data-set
creates an output data set that contains univariate statistics arranged in tabular form, with one observation per analysis variable. See the section OUTTABLE= Output Data Set for details.
Related
I have a SAS dataset with 700 columns (variables). For all 700 of them, I want to cap all values below the 1st percentile to the 1st percentile and all values above the 99th percentile to the 99th percentile. I want to do this iteratively for all 700 variables without having to specify their names explicitly.
How can I do this?
Perhaps slightly easier than the hash table - and somewhat faster, I believe - is using the horizontal output of proc means, and then using an array.
proc means data=sashelp.prdsale;
var _numeric_;
output out=quantiles p1= p99= /autoname;
run;
proc sql;
select name
into :numlist separated by ' '
from dictionary.columns
where libname='SASHELP' and memname='PRDSALE' and type='num';
quit;
data prdsale_capped;
set sashelp.prdsale;
if _n_ eq 1 then set quantiles;
array vars &numlist.;
array p1 actual_p1--month_p1;
array p99 actual_p99--month_p99;
do _i = 1 to dim(vars);
vars[_i] = max(min(vars[_i],p99[_i]),p1[_i]);
end;
run;
Basically it's just setting up three arrays - vars, p1, p99 - and then you have all 3 values for every numeric variable on the PDV and can just compare during a single array traversal.
For a production process I'd probably not use the -- but instead make 3 lists from proc sql and make 100% sure they're in the same order by using an order by.
You can do this with proc means and a hash table lookup. Let's create some test data with 100 variables pulled from a normal distribution. For testing, we'll change all the variables in the first and second rows to really big and really small numbers.
Our approach: create a lookup table where we can find the variable's name, pull its percentiles, and compare its value against those percentiles.
data have;
array var[100];
do i = 1 to 100;
do j = 1 to dim(var);
var[j] = rand('normal');
/* Test values */
if(i = 1) then var[j] = 99999;
if(i = 2) then var[j] = -99999;
end;
output;
end;
drop i j;
run;
Data:
var1 var2 var3 ...
99999 99999 99999 ...
-99999 -99999 -99999 ...
-0.149875111 0.4455504523 -0.783127138 ...
-0.731432437 -0.572508065 -1.044928486 ...
0.0108184539 1.0605591996 1.9132874927 ...
... ... ... ...
Let's get all the percentiles with proc means. You might be tempted to use output out=, but it does not create the data in a vertical lookup table that's easy for us to use in this manner; however, the stackODSOutput option on proc means does. More info on this from Rick Wicklin.
We'll use ods select none so we don't render a large table but still produce the dataset that drives the table.
/* Get a dataset of all 1st and 99th percentiles for each variable */
ods select none;
proc means data=have stackODSOutput p1 p99;
var var1-var100;
ods output summary = percentiles;
run;
ods select all;
Note that all the percentiles will be the same in this case. This is expected. We set all the variables in the first and second rows to the same big and small numbers for easy testing.
Data:
Variable P1 P99 ...
var1 -50001 50001 ...
var2 -50001 50001 ...
var3 -50001 50001 ...
var4 -50001 50001 ...
... ... ... ...
Now we'll use our lookup approach. We know our variable names and we can store them in an array. We can loop through that array, look up the variable in the hash table by name with vname(), and get its percentile.
data want;
set have;
array var[*] var1-var100;
/* Load a table of these values into memory and search for each percentile.
Think of this like a simple lookup table that floats out in memory.
*/
if(_N_ = 1) then do;
length variable $32.;
dcl hash pctiles(dataset: 'percentiles');
pctiles.defineKey('variable');
pctiles.defineData('p1', 'p99');
pctiles.defineDone();
call missing(p1, p99);
end;
/* Get the 1st and 99th percentile of each variable.
If the variable's name matches the variable name
in the hash table, check the variable's value
against the lookup percentile.
Cap it if it's above or below the percentile.
*/
do i = 1 to dim(var);
if(pctiles.Find(key:vname(var[i]) ) = 0) then do;
if(var[i] < p1) then var[i] = p1;
else if(var[i] > p99) then var[i] = p99;
end;
end;
drop i variable p1 p99;
run;
Output:
var1 var2 var3 ...
50000.532908 50000.721522 99999 ...
-50000.61447 -50000.92196 -50001.19549 ...
-0.149875111 0.4455504523 -0.783127138 ...
-0.731432437 -0.572508065 -1.044928486 ...
0.0108184539 1.0605591996 1.9132874927 ...
... ... ... ...
If your variables do not follow an easy sequential name, you can use the -- shortcut. For example, varA varB varC varD can be selected by varA--varD.
i am very new to sas and I have the following work table
I want to create a new table in which column Date and Z remain the same, but all values in column X are replaced with the minimum value in column X and all values in column Y are replaced with the minimum value in column y.
Sample output is as follows
You can use the fact that PROC SQL will automatically remerge aggregate statistics back onto detail observations.
proc sql;
create table want as
select date, x, min(y) as y, min(z) as z
from have
;
quit;
If you don't want to use proc sql statement you can modify this code found from https://blogs.sas.com/content/iml/2014/12/01/max-and-min-rows-and-cols.html
data MinMaxRows;
set sashelp.Iris;
array x {*} _numeric_; /* x[1] is 1st var,...,x[4] is 4th var */
min = min(of x[*]); /* min value for this observation */
max = max(of x[*]); /* max value for this observation */
run;
proc print data=MinMaxRows(obs=7);
var _numeric_;
run;
I have a process flow in SAS Enterprise Guide which is comprised mainly of Data views rather than tables, for the sake of storage in the work library.
The problem is that I need to calculate percentiles (using proc univariate) from one of the data views and left join this to the final table (shown in the screenshot of my process flow).
Is there any way that I can specify the outfile in the univariate procedure as being a data view, so that the procedure doesn't calculate everything prior to it in the flow? When the percentiles are left joined to the final table, the flow is calculated again so I'm effectively doubling my processing time.
Please find the code for the univariate procedure below
proc univariate data=WORK.QUERY_FOR_SGFIX noprint;
var CSA_Price;
by product_id;
output out= work.CSA_Percentiles_Prod
pctlpre= P
pctlpts= 40 to 60 by 10;
run;
In SAS, my understanding is that procs such as proc univariate cannot generally produce views as output. The only workaround I can think of would be for you to replicate the proc logic within a data step and produce a view from the data step. You could do this e.g. by transposing your variables into temporary arrays and using the pctl function.
Here's a simple example:
data example /view = example;
array _height[19]; /*Number of rows in sashelp.class dataset*/
/*Populate array*/
do _n_ = 1 by 1 until(eof);
set sashelp.class end = eof;
_height[_n_] = height;
end;
/*Calculate quantiles*/
array quantiles[3] q40 q50 q60;
array points[3] (40 50 60);
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/*Keep only the quantiles we calculated*/
keep q40--q60;
run;
With a bit more work, you could also make this approach return percentiles for individual by groups rather than for the whole dataset at once. You would need to write a double-DOW loop to do this, e.g.:
data example;
array _height[19];
array quantiles[3] q40 q50 q60;
array points[3] _temporary_ (40 50 60);
/*Clear heights array between by groups*/
call missing(of _height[*]);
/*Populate heights array*/
do _n_ = 1 by 1 until(last.sex);
set class end = eof;
by sex;
_height[_n_] = height;
end;
/*Calculate quantiles*/
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/* Output all rows from input dataset, with by-group quantiles attached*/
do _n_ = 1 to _n_;
set class;
output;
end;
keep name sex q40--q60;
run;
The dataset looks like this:
colx coly colz
0 1 0
0 1 1
0 1 0
Required output:
Colname value count
colx 0 3
coly 1 3
colz 0 2
colz 1 1
The following code works perfectly...
ods output onewayfreqs=outfreq;
proc freq data=final;
tables colx coly colz / nocum nofreq;
run;
data freq;
retain colname column_value;
set outfreq;
colname = scan(tables, 2, ' ');
column_Value = trim(left(vvaluex(colname)));
keep colname column_value frequency percent;
run;
... but I believe that's not efficient. Say I have 1000 columns, running prof freq on all 1000 columns is not efficient. Is there any other efficient way with out using the proc freq that accomplishes my desired output?
One of the most efficient mechanisms for computing frequency counts is through a hash object set up for reference counting via the suminc tag.
The SAS documentation for "Hash Object - Maintaining Key Summaries" demonstrates the technique for a single variable. The following example goes one step further and computes for each variable specified in an array. The suminc:'one' specifies that each use of ref will add the value of one to an internal reference sum. While iterating over the distinct keys for output, the frequency count is extracted via the sum method.
* one million data values;
data have;
array v(1000);
do row = 1 to 1000;
do index = 1 to dim(v);
v(index) = ceil(100*ranuni(123));
end;
output;
end;
keep v:;
format v: 4.;
run;
* compute frequency counts via .ref();
data freak_out(keep=name value count);
length name $32 value 8;
declare hash bins(ordered:'a', suminc:'one');
bins.defineKey('name', 'value');
bins.defineData('name', 'value');
bins.defineDone();
one = 1;
do until (end_of_data);
set have end=end_of_data;
array v v1-v1000;
do index = 1 to dim(v);
name = vname(v(index));
value = v(index);
bins.ref();
end;
end;
declare hiter out('bins');
do while (out.next() = 0);
bins.sum(sum:count);
output;
end;
run;
Note Proc FREQ uses standard grammars, variables can be a mixed of character and numeric, and has lots of additional features that are specified through options.
I think the most time consuming part in your code is generation of the ODS report. You can transpose the data before applying the freq. The below example does the task for 1000 rows with 1000 variables in few seconds. If you do it using ODS it may take much longer.
data dummy;
array colNames [1000] col1-col1000;
do line = 1 to 1000;
do j = 1 to dim(colNames);
colNames[j] = int(rand("uniform")*100);
end;
output;
end;
drop j;
run;
proc transpose
data = dummy
out = dummyTransposed (drop = line rename = (_name_ = colName col1 = value))
;
var col1-col1000;
by line;
run;
proc freq data = dummyTransposed noprint;
tables colName*value / out = result(drop = percent);
run;
Perhaps this statement from the comments is the real problem.
I felt like the odsoutput with proc freq is slowing down and creating
huge logs and outputs. think of 10,000 variables and million records.
I felt there should be another way of accomplishing this and arrays
seems to be a great fit
You can tell ODS not to produce the printed output if you don't want it.
ods exclude all ;
ods output onewayfreqs=outfreq;
proc freq data=final;
tables colx coly colz / nocum nofreq;
run;
ods exclude none ;
Data IV_SAS;
set IV;
Total_Loans=Goods+Bads;
Dist_Loans=Total_Loans/sum(Total_Loans));
Dist_Goods=Goods/Sum(Goods);
Dist_Bads=Bads/Sum(Bads);
Difference=Dist_Goods-Dist_Bads;
WOE=log10(Dist_goods/Dist_Bads);
IV=WOE*Difference;
run;
I am facing issues in calculating sum of (Total Loans),its calculating Row total instead of column total.
That's how Base SAS works - it operates on row level in the data step.
You would want to use PROC MEANS or PROC TABULATE or similar proc and find the column total there, then merge that on (or combine in another method).
For example:
proc means data=sashelp.class;
var age height weight;
output out=class_means sum(age)=age_sum sum(height)=height_sum sum(weight)=weight_Sum;
run;
data class;
if _n_=1 then set class_means;
set sashelp.class;
age_prop = age/age_sum;
height_prop = height/height_sum;
weight_prop = weight/weight_Sum;
run;
Alternately, use SAS/IML or PROC SQL, both of which will operate on the column level when asked inline (though I think the above solution is likely superior in speed to both due to lower overhead).
data a;
input goods bads;
datalines;
36945 33337
23820 21761
26990 24647
33195 30299
43755 39014
46100 41100
89765 79978
25940 23508
35940 32506
31840 28846
33430 30366
34480 31388
36640 33129
39640 35992
42490 38325
44240 40075
42840 38840
49690 44936
69190 64740
;
run;
proc sql;
create table b as
select goods,bads,
sum(goods,bads) as Total_Loans format=dollar10.,
sum(goods)as Column_goods_tot format=dollar10. ,
sum(bads) as Column_bads_tot format=dollar10. ,
sum(calculated Column_goods_tot, calculated Column_bads_tot) as Column_Total_Loans format=dollar10. ,
(calculated Total_Loans/calculated Column_Total_Loans) as Dist_Loans
/*add more code to calculate Dist_Goods, Dist_Bads, etc..*/
from a;
quit;
/*Column totals only*/
proc sql;
create table c as
select
sum(goods)as Column_goods_tot format=dollar10. ,
sum(bads) as Column_bads_tot format=dollar10. ,
sum(calculated Column_goods_tot, calculated Column_bads_tot) as Column_Total_Loans format=dollar12.
from a;
quit;