I'm attempting to generate an automated report that combines counts, row percentages, and chi-squared p-values from two-way proc freq output for a series of variables.
I'd like the counts and percentages for Variable A to be displayed under separate headers for each category of Variable B.
I'm almost there, but running the following test code on the sashelp.cars dataset produces a report that has offset rows.
Is it possible to consolidate the rows by Cylinder values so I don't have so many empty cells in the table?
proc freq data=sashelp.cars;
tables origin*cylinders / chisq totpct outpct list out=freqpct(keep=cylinders origin count pct_row);
output out=chisq(keep=N P_PCHI) all;
run;
data freqpct;
set freqpct;
var=1;
run;
data chisq;
set chisq;
var=1;
run;
proc sql;
create table chisq_freqpct
as select *
from freqpct a
inner join
chisq b
on a.var=b.var;
quit;
proc report data=chisq_freqpct;
column cylinders origin,(count pct_row) N P_PCHI;
define cylinders / display;
define origin / across;
define count / display;
define pct_row / display;
define N / group;
define P_PCHI / group;
run;
You can use / group for cylinders.
Example:
data chisq_freqpct;
if _n_ = 1 then set chisq;
set freqpct;
run;
title "sashelp.cars";
proc format;
value blank low-high = ' ';
proc report data=chisq_freqpct split=' ';
column cylinders origin,(count pct_row) N p_pchi;
define cylinders / group ;
define origin / across;
define N / across;
define p_pchi / across;
compute n; call define (8, 'format', 'blank.'); endcomp;
compute p_pchi; call define (9, 'format', 'blank.'); endcomp;
run;
The across for N and P_PCHI places their values in the header.
You could instead have placed the values in macro variables and resolved those in a title statement or grouped header text.
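A minimal sketch of that macro-variable alternative, assuming the CHISQ and FREQPCT data sets created by the PROC FREQ step in the question. The macro variable names n_total and p_chisq are made up for illustration.

```sas
/* Copy the single row of CHISQ into two macro variables.        */
/* n_total and p_chisq are hypothetical names for this sketch.   */
data _null_;
  set chisq;
  call symputx('n_total', n);
  call symputx('p_chisq', put(p_pchi, pvalue6.4));
run;

/* Resolve the macro variables in a title instead of carrying    */
/* N and P_PCHI as report columns.                               */
title2 "N = &n_total, chi-square p-value = &p_chisq";
proc report data=freqpct;
  column cylinders origin,(count pct_row);
  define cylinders / group;
  define origin / across;
run;
title2;
```

With this approach the report data set no longer needs the N and P_PCHI columns, so the merge step can be dropped.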
Use GROUP for CYLINDERS and MAX or MIN for N and P_PCHI.
Attach the N and P_PCHI values to the first observation only. The first observation of FREQPCT has a missing value for CYLINDERS, and PROC REPORT drops missing group values by default, so you need to either exclude the missing values of CYLINDERS and ORIGIN in the PROC FREQ step or add the MISSING option to the PROC REPORT statement.
proc freq data=sashelp.cars noprint;
* where 0=cmiss(origin,cylinders);
tables origin*cylinders / chisq outpct out=freqpct(keep=cylinders origin count pct_row);
output out=chisq(keep=N P_PCHI ) all;
run;
data chisq_freqpct;
if _n_ = 1 then set chisq;
else call missing(of _all_);
set freqpct;
run;
options missing=' ';
proc report data=chisq_freqpct split=' ' missing;
column cylinders origin,(count pct_row) n p_pchi;
define cylinders / group ;
define origin / across;
define n / max;
define p_pchi / max;
run;
options missing='.';
I'm using tagsets.excelxp in SAS to output dozens of two-way tables to an .xml file. Is there syntax that will suppress rows (frequencies and percents) if the frequency in that row is less than 10? I need to apply that in order to de-identify the results, and it would be ideal if I could automate the process rather than use conditional formatting in each of the outputted tables. Below is the syntax I'm using to create the tables.
ETA: I need those suppressed values to be included in the computation of column frequencies and percents, but I need them to be invisible in the final table (examples of options I have considered: gray out the entire row, turn the font white so it doesn't show for those cells, replace those values with an asterisk).
Any suggestions would be greatly appreciated!!!
Thanks!
dr j
%include 'C:\Users\Me\Documents\excltags.tpl';
ods tagsets.excelxp file = "C:\Users\Me\Documents\Participation_rdg_LSS_3-8.xml"
style = MonoChromePrinter
options(
convert_percentages = 'yes'
embedded_titles = 'yes'
);
title1 'Participation';
title2 'LSS-Level';
title3 'Grades 3-8';
title4 'Reading';
ods noproctitle;
proc sort data = part_rdg_3to8;
by flag_accomm flag_participation lss_nm;
run;
proc freq data = part_rdg_3to8;
by flag_accomm flag_participation;
tables lss_nm*grade_p / crosslist nopercent;
run;
ods tagsets.excelxp close;
D.Jay: Proc FREQ does not have any options for conditionally masking cells of its output. You can use the output data capture capability of ODS, followed by a Proc REPORT step, to produce the desired masked output.
I am guessing at the roles of lss and grade_p: a skill level and a student grade level, respectively.
Generate some sample data
data have;
do student_id = 1 to 10000;
flag1 = ranuni(123) < 0.4;
flag2 = ranuni(123) < 0.6;
lss = byte(65+int(26*ranuni(123)));
grade = int(6*ranuni(123));
* at every third lss force data to have a low percent of grades < 3;
if mod(rank(lss),3)=0 then
do until (grade > 2 or _n_ < 0.15);
grade = int(6*ranuni(123));
_n_ = ranuni(123);
end;
else if mod(rank(lss),7)=0 then
do until (grade < 3 or _n_ < 0.15);
grade = int(6*ranuni(123));
_n_ = ranuni(123);
end;
output;
end;
run;
proc sort data=have;
by flag1 flag2;
*where lss in ('A' 'B') and flag1 and flag2; * remove comment to limit amount of output during 'learning the code' phase;
run;
Perform the Proc FREQ
Only capture the data corresponding to the output that would have been generated
ods _all_ close;
* ods trace on;
/* trace will log the Output names
* that a procedure creates, and thus can be captured
*/
ods output CrossList=crosslist;
proc freq data=have;
by flag1 flag2;
tables lss * grade / crosslist nopercent;
run;
ods output close;
ods trace off;
Now generate output to your target ODS destination (be it ExcelXP, html, pdf, etc)
First, the reference output: the regular Proc FREQ listing, for which an equivalent with masked values is to be produced.
* regular output of FREQ, to be compared to the masked output
* produced via REPORT;
proc freq data=have;
by flag1 flag2;
tables lss * grade / crosslist nopercent;
run;
Proc REPORT has great features for producing conditional output. The compute block is used to select either a value or a masked value indicator for output.
options missing = ' ';
proc format;
value $lss_report ' '= 'A0'x'Total';
value grade_report . = 'Total';
value blankfrq .b = '*masked*' ._=' ' other=[best8.];
value blankpct .b = '*masked*' ._=' ' other=[6.2];
proc report data=CrossList;
by flag1 flag2;
columns
('Table of lss by grade'
lss grade
Frequency RowPercent ColPercent
FreqMask RowPMask ColPMask
)
;
define lss / order order=formatted format=$lss_report. missing;
define grade / display format=grade_report.;
define Frequency / display noprint;
define RowPercent / display noprint;
define ColPercent / display noprint;
define FreqMask / computed format=blankfrq. 'Frequency' ;
define RowPMask / computed format=blankpct. 'Row/Percent';
define ColPMask / computed format=blankpct. 'Column/Percent';
compute FreqMask;
if 0 <= RowPercent < 10
then FreqMask = .b;
else FreqMask = Frequency;
endcomp;
compute RowPMask;
if 0 <= RowPercent < 10
then RowPMask = .b;
else RowPMask = RowPercent;
endcomp;
compute ColPMask;
if 0 <= RowPercent < 10
then ColPMask = .b;
else ColPMask = ColPercent;
endcomp;
run;
ods html close;
If you have to produce lots of cross listings for different data sets, the code is easily macro-ized.
When I've done this in the past, I've first generated the frequency to a dataset, then filtered out the N, then re-printed the dataset (using tabulate usually).
If you can't recreate the frequency table perfectly from the freq output, you can do a simple frequency, check which IDs or variables or what have you to exclude, and then filter them out from the input dataset and rerun the whole frequency.
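A rough sketch of that filter-and-rerun approach, using SASHELP.CARS as a stand-in. The data set names make_counts and cars_kept and the threshold of 10 are assumptions chosen for illustration.

```sas
/* Step 1: simple one-way frequency written to a data set */
proc freq data=sashelp.cars noprint;
  tables make / out=make_counts(keep=make count);
run;

/* Step 2: keep only rows whose MAKE has at least 10 observations */
proc sql;
  create table cars_kept as
  select *
  from sashelp.cars
  where make in (select make from make_counts where count >= 10);
quit;

/* Step 3: rerun the full table on the filtered input */
proc freq data=cars_kept;
  tables make*type / crosslist nopercent;
run;
```

Note that the excluded rows no longer contribute to the column totals and percents, so this only fits when the suppressed values may be dropped from the computation as well.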
I don't believe that you can with PROC FREQ, but you can easily replicate your code with PROC TABULATE and you can use a custom format there to mask the numbers. This example sets it to M for missing and N for less than 5 and with one decimal place for the rest of the values. You could also replace the M/N with a space (single space) to have no values shown instead.
*Create a format to mask values less than 5;
proc format;
value mask_fmt
. = 'M' /*missing*/
low - < 5='N' /*less than 5 */
other = [8.1]; /*remaining values with one decimal place*/
run;
*sort data for demo;
proc sort data=sashelp.cars out=cars;
by origin;
run;
ods tagsets.excelxp file='/folders/myfolders/demo.xml';
*values partially masked;
proc tabulate data=cars;
where origin='Asia';
by origin;
class make cylinders;
table make, cylinders*n*f=mask_fmt. ;
run;
ods tagsets.excelxp close;
This was tested on SAS UE.
EDIT: Forgot the percentage piece, so this likely will not work for that, primarily because I don't think you'll get the percentages the same as in PROC FREQ (appearance) so it depends on how important that is to you. The other possibility to accomplish this would be to modify the PROC FREQ template to use the custom format as above. Unfortunately I do not have time to mock this up for you but maybe someone else can. I'll leave this here to help get you started and delete it later on.
I have a dataset with some variables named sx for x = 1 to n.
Is it possible to write a freq which gives the same result as:
proc freq data=prova;
table s1 * s2 * s3 * ... * sn /list missing;
run;
but without listing all the names of the variables?
I would like an output like this:
S1  S2  S3  S4  Frequency
A                      10
A   E                 100
A   E   J   F         300
B                      10
B   E                 100
B   E   J   F         300
but with an instruction like this (which, of course, is invented):
proc freq data=prova;
table s1:sn /list missing;
run;
Why not just use PROC SUMMARY instead?
Here is an example using two variables from SASHELP.CARS.
So this is PROC FREQ code.
proc freq data=sashelp.cars;
where make in: ('A','B');
tables make*type / list;
run;
Here is a way to get counts using PROC SUMMARY.
proc summary missing nway data=sashelp.cars ;
where make in: ('A','B');
class make type ;
output out=want;
run;
proc print data=want ;
run;
If you need to calculate the percentages you can instead use the WAYS statement to get both the overall and the individual cell counts. And then add a data step to calculate the percentages.
proc summary missing data=sashelp.cars ;
where make in: ('A','B');
class make type ;
ways 0 2 ;
output out=want;
run;
data want ;
set want ;
retain total;
if _type_=0 then total=_freq_;
percent=100*_freq_/total;
run;
So if you have 10 variables you would use
ways 0 10 ;
class s1-s10 ;
If you just want to build up the string "S1*S2*..." then you could use a DO loop or a macro %DO loop and put the result into a macro variable.
data _null_;
length namelist $200;
do i=1 to 10;
namelist=catx('*',namelist,cats('S',i));
end;
call symputx('namelist',namelist);
run;
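The macro %DO loop variant mentioned above could look roughly like this. The macro name starlist and its parameters are made up for this sketch; it generates the text s1*s2*...*sN in place, so no DATA step is needed.

```sas
/* Hypothetical helper: emits prefix1*prefix2*...*prefixN as text */
%macro starlist(prefix=s, n=10);
  %local i list;
  %do i = 1 %to &n;
    %if &i = 1 %then %let list = &prefix&i;
    %else %let list = &list*&prefix&i;
  %end;
  &list
%mend starlist;

/* Usage: the macro call expands inside the TABLES statement */
proc freq data=have;
  tables %starlist(prefix=s, n=10) / list missing;
run;
```

This only covers names with a numeric suffix.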
But here is an easy way to make such a macro variable from ANY variable list not just those with numeric suffixes.
First get the variables names into a dataset. PROC TRANSPOSE is a good way if you use the OBS=0 dataset option so that you only get the _NAME_ column.
proc transpose data=have(obs=0) ;
var s1-s10 ;
run;
Then use PROC SQL to stuff the names into a macro variable.
proc sql noprint;
select _name_
into :namelist separated by '*'
from &syslast
;
quit;
Then you can use the macro variable in your TABLES statement.
proc freq data=have ;
tables &namelist / list missing ;
run;
Car':
In short, no. There is no shortcut syntax for specifying a variable list that crosses dimension.
In long, yes -- if you create a surrogate variable that is an equivalent crossing.
Discussion
Sample data generator:
%macro have(top=5);
%local index;
data have;
%do index = 1 %to &top;
do s&index = 1 to 2+ceil(3*ranuni(123));
%end;
array V s:;
do _n_ = 1 to 5*ranuni(123);
x = ceil(100*ranuni(123));
if ranuni(123) < 0.1 then do;
ix = ceil(&top*ranuni(123));
h = V(ix);
V(ix) = .;
output;
V(ix) = h;
end;
else
output;
end;
%do index = 1 %to &top;
end;
%end;
run;
%mend;
%have;
As you probably noticed, tables s: creates one frequency table per s* variable.
For example:
title "One table per variable";
proc freq data=have;
tables s: / list missing ;
run;
There is no shortcut syntax for specifying a variable list that crosses dimension.
NOTE: If you specify out=, the column names in the output data set will be the last variable in the level. So for above, the out= table will have a column "s5", but contain counts corresponding to combinations for each s1 through s5.
At each dimensional level you can use a variable list, as in level1 * (sublev:) * leaf. The same caveat for out= data applies.
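As an illustration of a variable list at one level, assuming the HAVE data set from the generator above with its default of five s variables, the following request should expand to three separate three-way tables rather than one five-way crossing.

```sas
/* The grouped list in the middle level is expanded per variable: */
/* s1*s2*s5, s1*s3*s5 and s1*s4*s5                                */
title "Variable list at the middle level";
proc freq data=have;
  tables s1*(s2-s4)*s5 / list missing;
run;
title;
```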
Now, reconsider the original request discretely (no-shortcut) crossing all the s* variables:
title "1 table - 5 columns of crossings";
proc freq data=have;
tables s1*s2*s3*s4*s5 / list missing out=outEach;
run;
And, compare to what happens when a data step view uses a variable list to compute a surrogate value corresponding to the discrete combinations reported above.
data haveV / view=haveV;
set have;
crossing = catx(' * ', of s:); * concatenation of all the s variables;
keep crossing;
run;
title "1 table - 1 column of concatenated crossings";
proc freq data=haveV;
tables crossing / list missing out=outCat;
run;
Reality check with COMPARE, I don't trust eyeballs. If zero rows with differences (per noequal) then the out= data sets have identical counts.
proc compare noprint base=outEach compare=outCat out=diffs outnoequal;
var count;
run;
----- Log -----
NOTE: There were 31 observations read from the data set WORK.OUTEACH.
NOTE: There were 31 observations read from the data set WORK.OUTCAT.
NOTE: The data set WORK.DIFFS has 0 observations and 3 variables.
NOTE: PROCEDURE COMPARE used (Total process time)
I have the following proc report
proc report data=sashelp.class;
col
sex
age
weight
;
define sex / group;
define age / group;
define weight / analysis sum;
run;
However I do not want to show the sum of weight. Instead I would like to have the proportion of the grouped sum. So first row should be 6.23%. How can I achieve this?
Now I have found a workaround:
proc sql noprint;
CREATE TABLE class AS
SELECT a.*
,b.sumweight
FROM sashelp.class a
LEFT JOIN (SELECT sex, sum(weight) as sumweight
FROM sashelp.class
GROUP BY sex
) b
ON a.sex=b.sex
;
quit;
proc report data=class;
col
sex
age
weight
sumweight
perc
;
define sex / group;
define age / group;
define weight / analysis sum;
define sumweight / analysis mean noprint;
define perc / computed format=percent6.2;
compute perc;
perc = weight.sum/sumweight.mean;
endcomp;
run;
But maybe there is a solution without additional proc sql step...
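One possible way to avoid the extra step, sketched here under the assumption that a COMPUTE BEFORE block is acceptable: capture the group total of weight in a temporary variable when each sex group starts, then divide by it in the computed column. The name grp_total is made up for illustration.

```sas
proc report data=sashelp.class;
  col sex age weight perc;
  define sex    / group;
  define age    / group;
  define weight / analysis sum;
  define perc   / computed format=percent8.2;

  /* At the start of each SEX group, weight.sum holds the group total. */
  /* grp_total is a temporary variable, retained across rows.          */
  compute before sex;
    grp_total = weight.sum;
  endcomp;

  /* Share of the group total for the current SEX and AGE row */
  compute perc;
    perc = weight.sum / grp_total;
  endcomp;
run;
```

Add NOPRINT to the DEFINE for weight if only the proportion should be shown.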
Data IV_SAS;
set IV;
Total_Loans=Goods+Bads;
Dist_Loans=Total_Loans/sum(Total_Loans);
Dist_Goods=Goods/Sum(Goods);
Dist_Bads=Bads/Sum(Bads);
Difference=Dist_Goods-Dist_Bads;
WOE=log10(Dist_goods/Dist_Bads);
IV=WOE*Difference;
run;
I am having trouble calculating the sum of Total_Loans: it is calculating the row total instead of the column total.
That's how Base SAS works - it operates on row level in the data step.
You would want to use PROC MEANS or PROC TABULATE or similar proc and find the column total there, then merge that on (or combine in another method).
For example:
proc means data=sashelp.class;
var age height weight;
output out=class_means sum(age)=age_sum sum(height)=height_sum sum(weight)=weight_Sum;
run;
data class;
if _n_=1 then set class_means;
set sashelp.class;
age_prop = age/age_sum;
height_prop = height/height_sum;
weight_prop = weight/weight_Sum;
run;
Alternately, use SAS/IML or PROC SQL, both of which will operate on the column level when asked inline (though I think the above solution is likely superior in speed to both due to lower overhead).
data a;
input goods bads;
datalines;
36945 33337
23820 21761
26990 24647
33195 30299
43755 39014
46100 41100
89765 79978
25940 23508
35940 32506
31840 28846
33430 30366
34480 31388
36640 33129
39640 35992
42490 38325
44240 40075
42840 38840
49690 44936
69190 64740
;
run;
proc sql;
create table b as
select goods,bads,
sum(goods,bads) as Total_Loans format=dollar10.,
sum(goods)as Column_goods_tot format=dollar10. ,
sum(bads) as Column_bads_tot format=dollar10. ,
sum(calculated Column_goods_tot, calculated Column_bads_tot) as Column_Total_Loans format=dollar10. ,
(calculated Total_Loans/calculated Column_Total_Loans) as Dist_Loans
/*add more code to calculate Dist_Goods, Dist_Bads, etc..*/
from a;
quit;
/*Column totals only*/
proc sql;
create table c as
select
sum(goods)as Column_goods_tot format=dollar10. ,
sum(bads) as Column_bads_tot format=dollar10. ,
sum(calculated Column_goods_tot, calculated Column_bads_tot) as Column_Total_Loans format=dollar12.
from a;
quit;
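A sketch of how the remaining columns from the question could be filled in with the same PROC SQL technique, assuming the data set a above. The output name woe is made up, and log10 is kept only because the question uses it (weight of evidence is often defined with the natural log instead).

```sas
/* Summary functions mixed with detail columns are remerged by  */
/* PROC SQL, so each row is divided by the column total.        */
proc sql;
  create table woe as
  select goods,
         bads,
         goods / sum(goods) as Dist_Goods,
         bads  / sum(bads)  as Dist_Bads,
         calculated Dist_Goods - calculated Dist_Bads as Difference,
         log10(calculated Dist_Goods / calculated Dist_Bads) as WOE,
         calculated WOE * calculated Difference as IV
  from a;
quit;
```

The log should include a note that the query requires remerging summary statistics back with the original data, which is expected here.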
I'm using PROC REPORT to generate a report of weighted sums. There are 2 columns that need to be summarized, both with the MEAN statistic. On top of that, I want to output the total weight.
I have 2 issues.
1. I cannot seem to get the title on each sum to reflect the variable being summed.
2. I need a different format for each column.
Here is some sample data:
data test;
format lev1-lev3 $3. weight percent10.2 duration 6.2 convexity 6.4;
informat weight percent10.2 duration 6.2 convexity 6.4;
input lev1 lev2 lev3 weight duration convexity;
datalines;
A C H 16.11% 3.21 0.6182
A C I 3.83% 9.06 1.2244
A D J 7.67% 2.21 3.4010
A D K 16.90% 3.98 0.0303
B E L 2.68% 1.88 1.9515
B E M 16.68% 4.36 3.1851
B F N 20.79% 2.64 0.1145
B F O 15.34% 5.55 2.4408
;
run;
I've tried a number of ways to define things in PROC REPORT. Here is one of many:
proc report data=test nowd out=report;
column lev1 lev2 lev3 duration,(SUMWGT MEAN) convexity,(Mean);
weight weight;
define lev1 / group;
define lev2 / group;
define lev3 / group;
define duration / 'Duration' ;
define sumwgt / 'Weight' format=percent10.2;
define mean / '' format=6.2;
define convexity / 'Convexity';
*define mean / 'Convexity' format=6.4;
break before lev1 / summarize ;
break before lev2 / summarize ;
rbreak before / summarize;
run;
My ultimate goal would be something like:
Lev1  Lev2  Lev3   Weight  Duration  Convexity
                  100.00%      3.88     1.3943
A                  44.51%      3.83     0.9267
...
I've also played with PROC TABULATE but I am less of a fan of the tables it presents.
Example TABULATE mess:
PROC TABULATE DATA=WORK.test;
VAR duration convexity;
CLASS LEV1 / ORDER=UNFORMATTED MISSING;
CLASS LEV2 / ORDER=UNFORMATTED MISSING;
CLASS LEV3 / ORDER=UNFORMATTED MISSING;
TABLE
/* Row Dimension */
ALL={LABEL="+"}
LEV1*(
ALL={LABEL="+"}
LEV2*(
ALL={LABEL="+"}
LEV3 ) )
,
/* Column Dimension */
duration={LABEL="Weight"}*SumWgt={LABEL=""}*f=percent10.2
duration={LABEL="Duration"}*Mean={LABEL=""}*f=6.2
convexity={LABEL="Convexity"}*Mean={LABEL=""}*f=6.4;
WEIGHT weight;
RUN;
I think you'll have challenges getting exactly what you want from PROC REPORT. Maybe Cynthia@SAS could figure it out, I don't know, but getting the row headers right in particular will be extremely challenging.
I would suggest pre-processing the means (using PROC MEANS or similar) and then REPORTing that result. Very easy to do.
This may be close to what you want, for example:
proc means data=test;
class lev1 lev2 lev3;
var duration convexity;
weight weight;
types () lev1 lev1*lev2 lev1*lev2*lev3;
output out=test_out
sumwgt(duration)=sumwgt mean(duration)= mean(convexity)=;
run;
proc report data=test_out;
columns lev1-lev3 sumwgt duration convexity;
define lev1/order missing;
define lev2/order missing;
define lev3/order missing;
define sumwgt/display format=percent9.2;
define duration/display format=6.2;
define convexity/display format=6.4;
run;