I'm using PROC REPORT to generate a report of weighted sums. There are 2 columns that need to be summarized, both with the MEAN statistic. On top of that, I want to output the total weight.
I have 2 issues.
I cannot seem to get the title on each sum to reflect the variable
being summed.
I need a different format for each column.
Here is some sample data:
data test;
format lev1-lev3 $3. weight percent10.2 duration 6.2 convexity 6.4;
informat weight percent10.2 duration 6.2 convexity 6.4;
input lev1 lev2 lev3 weight duration convexity;
datalines;
A C H 16.11% 3.21 0.6182
A C I 3.83% 9.06 1.2244
A D J 7.67% 2.21 3.4010
A D K 16.90% 3.98 0.0303
B E L 2.68% 1.88 1.9515
B E M 16.68% 4.36 3.1851
B F N 20.79% 2.64 0.1145
B F O 15.34% 5.55 2.4408
;
run;
I've tried a number of ways to define things in PROC REPORT. Here is one of many:
proc report data=test nowd out=report;
column lev1 lev2 lev3 duration,(SUMWGT MEAN) convexity,(Mean);
weight weight;
define lev1 / group;
define lev2 / group;
define lev3 / group;
define duration / 'Duration' ;
define sumwgt / 'Weight' format=percent10.2;
define mean / '' format=6.2;
define convexity / 'Convexity';
*define mean / 'Convexity' format=6.4;
break before lev1 / summarize ;
break before lev2 / summarize ;
rbreak before / summarize;
run;
My ultimate goal would be something like:
Lev1 Lev2 Lev3 Weight Duration Convextiy
100.00% 3.88 1.3943
A 44.51% 3.83 0.9267
...
I've also played with PROC TABULATE but I am less of a fan of the tables it presents.
Example TABULATE mess:
PROC TABULATE DATA=WORK.test;
VAR duration convexity;
CLASS LEV1 / ORDER=UNFORMATTED MISSING;
CLASS LEV2 / ORDER=UNFORMATTED MISSING;
CLASS LEV3 / ORDER=UNFORMATTED MISSING;
TABLE
/* Row Dimension */
ALL={LABEL="+"}
LEV1*(
ALL={LABEL="+"}
LEV2*(
ALL={LABEL="+"}
LEV3 ) )
,
/* Column Dimension */
duration={LABEL="Weight"}*SumWgt={LABEL=""}*f=percent10.2
duration={LABEL="Duration"}*Mean={LABEL=""}*f=6.2
convexity={LABEL="Convexity"}*Mean={LABEL=""}*f=6.4;
WEIGHT weight;
RUN;
I think you'll have challenges getting exactly what you want from PROC REPORT. Maybe Cynthia#SAS could figure it out, I don't know, but getting the row headers right in particular will be extremely challenging.
I would suggest pre-processing the means (using PROC MEANS or similar) and then REPORTing that result. Very easy to do.
This may be close to what you want, for example:
proc means data=test;
class lev1 lev2 lev3;
var duration convexity;
weight weight;
types () lev1 lev1*lev2 lev1*lev2*lev3;
output out=test_out
sumwgt(duration)=sumwgt mean(duration)= mean(convexity)=;
run;
proc report data=test_out;
columns lev1-lev3 sumwgt duration convexity;
define lev1/order missing;
define lev2/order missing;
define lev3/order missing;
define sumwgt/display format=percent9.2;
define duration/display format=6.2;
define convexity/display format=6.4;
run;
Related
I'm attempting to generate an automated report that combines counts, row percentages, and chi-squared p-values from two-way proc freq output for a series of variables.
I'd like the counts and percentages for Variable A to be displayed under separate headers for each category of Variable B.
I'm almost there, but running the following test code on the sashelp.cars dataset produces a report that has offset rows.
Is it possible to consolidate the rows by Cylinder values so I don't have so many empty cells in the table?
proc freq data=sashelp.cars;
tables origin*cylinders / chisq totpct outpct list out=freqpct(keep=cylinders origin count pct_row);
output out=chisq(keep=N P_PCHI) all;
run;
data freqpct;
set freqpct;
var=1;
run;
data chisq;
set chisq;
var=1;
run;
proc sql;
create table chisq_freqpct
as select *
from freqpct a
inner join
chisq b
on a.var=b.var;
quit;
proc report data=chisq_freqpct;
column cylinders origin,(count pct_row) N P_PCHI;
define cylinders / display;
define origin / across;
define count / display;
define pct_row / display;
define N / group;
define P_PCHI / group;
run;
You can use / group for cylinders.
Example:
data chisq_freqpct;
if _n_ = 1 then set chisq;
set freqpct;
run;
title "sashelp.cars";
proc format;
value blank low-high = ' ';
proc report data=chisq_freqpct split=' ';
column cylinders origin,(count pct_row) N p_pchi;
define cylinders / group ;
define origin / across;
define N / across;
define p_pchi / across;
compute n; call define (8, 'format', 'blank.'); endcomp;
compute p_pchi; call define (9, 'format', 'blank.'); endcomp;
run;
The across for N and P_PCHI places their values in the header.
You could instead have placed the values in macro variables and resolved those in a title statement or grouped header text.
Use GROUP for cylinder and MAX or MIN for N and P_PCHI.
Only attach the N and P_CHI values to the first observation. Which means you either need to exclude the missing values of CYLINDERS and ORIGIN in the PROC FREQ step or add the MISSING keyword to the PROC REPORT step.
proc freq data=sashelp.cars noprint;
* where 0=cmiss(origin,cylinders);
tables origin*cylinders / chisq outpct out=freqpct(keep=cylinders origin count pct_row);
output out=chisq(keep=N P_PCHI ) all;
run;
data chisq_freqpct;
if _n_ = 1 then set chisq;
else call missing(of _all_);
set freqpct;
run;
options missing=' ';
proc report data=chisq_freqpct split=' ' missing;
column cylinders origin,(count pct_row) n p_pchi;
define cylinders / group ;
define origin / across;
define n / max;
define p_pchi / max;
run;
options missing='.';
proc means data = data1 stackODSoutput MIN P10 P25 P50 P75 P90 MAX N NMISS SUM nolabels maxdec=3;
var var1 var2;
output out = output;
run;
From the generated report, I can get all percentile and SUM. but the output data just provide me basic statistics with N, MIN, MAX, MEAN and std.
How can I also output the percentile and sum?
For output datasets in proc means, you need to specify which statistics you'd like within the output statement. Think of the proc statement as only controlling the visual output. Try this instead:
proc means data=sashelp.cars;
var horsepower MPG_City MPG_Highway;
output out=output
sum=
mean=
median=
std=
min=
max=
p10=
p25=
p75=
p90=
/ autoname
;
run;
Note that none of the statistics have anything after the =. The autoname option is automatically naming the statistic variables.
To make it easier to read, we can change the format of the output table. The naming convention of all variables is <variable>_<statistic>. Knowing this, we can transpose the table, separate out the variable and statistics from the name, then re-transpose it into a nicer format.
proc transpose data=output out=output_transposed;
var _NUMERIC_;
run;
data _want(index=(variable) );
set output_transposed;
Stat = scan(_NAME_, -1, '_');
Variable = tranwrd(_NAME_, cats('_', Stat), '');
keep Variable Stat COL1;
rename COL1 = Value;
run;
proc transpose data=_want out=want(drop=_NAME_);
by variable;
id stat;
var Value;
run;
I attempt to crate histograms plot via proc univariate. The target is to crate the distribution with bins of 0.1 width from 0 to 1.5 and then all the remaining in one bin.
I applied the following code to identify the range from 0 to 1.5, while it cannot manage the rest. How can I correct the code?
proc univariate data=HAVE;
where pred between 0 and 1.5;
var pred;
histogram pred/ vscale=percent midpoints=0 to 2 by 0.1 normal (noprint);
run;
You can try something like the following code to combine two Histograms by creating two variables from one variable:
/*Temporary DS with values ranging from 01. to 2.0*/
data have;
do i=0.1 to 2.0 by 0.1;
output;
end;
rename i=pred;
run;
/*Creating two variables x(0.1-1.5) and y(1.6-2.0)*/
data have;
set have;
if pred<1.6 then x=pred;
else y=pred;
drop pred;
run;
/*Combine two Histograms*/
proc sgplot data=have;
histogram x / nbins=15 binwidth=0.1;
density x / type=normal;
histogram y / nbins=5 binwidth=1.0;
density y / type=normal;
keylegend / location=inside position=topright noborder across=2;
xaxis display=(nolabel) values=(0.1 to 2.5 by 0.1);
run;
Create your own groups
Create a format so it shows the way you'd like
Plot it with SGPLOT
*create your own groups for data, especially the last group;
data mileage;
set sashelp.cars;
mpg_group=floor(mpg_highway / 10);
if mpg_group in (5, 6, 7) then
mpg_group=5;
keep mpg_highway mpg_group;
run;
*format to control display;
proc format;
value mpg_fmt 1='0 to 10' 2='11 to 20' 3='21 to 30' 4='31 to 40' 5='40+';
run;
*plot the data;
proc sgplot data=mileage;
vbar mpg_group /stat=freq barwidth=1;
format mpg_group mpg_fmt.;
run;
I have the following proc report
proc report data=sashelp.class;
col
sex
age
weight
;
define sex / group;
define age / group;
define weight / analysis sum;
run;
However I do not want to show the sum of weight. Instead I would like to have the proportion of the grouped sum. So first row should be 6.23%. How can I achieve this?
Now I have found a workaround:
proc sql noprint;
CREATE TABLE class AS
SELECT a.*
,b.sumweight
FROM sashelp.class a
LEFT JOIN (SELECT sex, sum(weight) as sumweight
FROM sashelp.class
GROUP BY sex
) b
ON a.sex=b.sex
;
quit;
proc report data=class;
col
sex
age
weight
sumweight
perc
;
define sex / group;
define age / group;
define weight / analysis sum;
define sumweight / analysis mean noprint;
define perc / computed format=percent6.2;
compute perc;
perc = weight.sum/sumweight.mean;
endcomp;
run;
But maybe there is a solution without additional proc sql step...
I'm looking for a macro or something in SAS that can help me in isolating the outliers from a dataset. I define an outlier as: Upperbound: Q3+1.5(IQR) Lowerbound: Q1-1.5(IQR). I have the following SAS code:
title 'Fall 2015';
proc univariate data = fall2015 freq;
var enrollment_count;
histogram enrollment_count / vscale = percent vaxis = 0 to 50 by 5 midpoints = 0 to 300 by 5;
inset n mean std max min range / position = ne;
run;
I would like to get rid of the outliers from fall2015 dataset. I found some macros, but no luck in working the macro. Several have a class variable which I don't have. Any ideas how to separate my data?
Here's a macro I wrote a while ago to do this, under slightly different rules. I've modified it to meet your criteria (1.5).
Use proc means to calculate Q1/Q3 and IQR (QRANGE)
Build Macro to cap based on rules
Call macro using call execute and boundaries set, using the output from step 1.
*Calculate IQR and first/third quartiles;
proc means data=sashelp.class stackods n qrange p25 p75;
var weight height;
ods output summary=ranges;
run;
*create data with outliers to check;
data capped;
set sashelp.class;
if name='Alfred' then weight=220;
if name='Jane' then height=-30;
run;
*macro to cap outliers;
%macro cap(dset=,var=, lower=, upper=);
data &dset;
set &dset;
if &var>&upper then &var=&upper;
if &var<&lower then &var=&lower;
run;
%mend;
*create cutoffs and execute macro for each variable;
data cutoffs;
set ranges;
lower=p25-1.5*qrange;
upper=p75+1.5*qrange;
string = catt('%cap(dset=capped, var=', variable, ", lower=", lower, ", upper=", upper ,");");
call execute(string);
run;