How to count the occurrences of each value per column in a table in SAS Enterprise Guide? - sas

I have a table in SAS Enterprise Guide like the one below:
COL1 | COL2 | COL3
-----|-------|------
111 | A | C
111 | B | C
222 | A | D
333 | A | D
And I need to aggregate the above table to find out how many times each value occurs in each column, so as to have something like below:
COL2_A | COL2_B | COL3_C | COL3_D
--------|--------|--------|--------
3 | 1 | 2 | 2
Because:
COL2_A = 3, because in COL2 the value "A" occurs 3 times
and so on...
How can I do that in SAS Enterprise Guide or in PROC SQL ?
I need the output as SAS dataset

Try this
data have;
input COL1 COL2 $ COL3 $;
datalines;
111 A C
111 B C
222 A D
333 A D
;
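/* Reshape to long form: one row per (column, value) pair */
data long;
set have;
array col[*] COL2 COL3;
do i = 1 to dim(col);
n = cats(vname(col[i]), '_', col[i]); /* e.g. COL2_A */
output;
end;
drop i;
run;
/* Count how many times each column/value pair occurs */
proc summary data = long nway;
class n;
output out = freq(drop = _TYPE_);
run;
/* Turn the counts into one wide row, using the pair names as column names */
proc transpose data = freq out = wide_freq(drop = _:);
id n;
run;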
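Since the question also asks about PROC SQL, here is a minimal sketch of the same count done in a single query. It assumes the distinct values (A, B, C, D) are known in advance and hard-codes them, so unlike the transpose approach above it will not pick up new values automatically. The output name want_sql is only an illustrative choice.

```sas
/* In SAS a comparison evaluates to 1 (true) or 0 (false),
   so summing it counts the matching rows.
   The listed values are hard-coded assumptions taken from the sample data. */
proc sql;
   create table want_sql as
   select sum(COL2 = 'A') as COL2_A,
          sum(COL2 = 'B') as COL2_B,
          sum(COL3 = 'C') as COL3_C,
          sum(COL3 = 'D') as COL3_D
   from have;
quit;
```

This trades flexibility for brevity: it is convenient for a handful of known values, while the data step plus PROC TRANSPOSE route scales to values you do not know up front.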

Related

How to aggregate col1 per ID and val1 per ID and values in col1 in SAS Enterprise Guide or PROC SQL?

I have table in SAS Enterprise Guide like below:
ID | COL1 | VAL1 |
----|------|------|
111 | A | 10 |
111 | A | 5 |
111 | B | 10 |
222 | B | 20 |
333 | C | 25 |
... | ... | ... |
And I need to aggregate the above table to know:
the number of times each value of COL1 occurs per ID
the sum of VAL1 per COL1 value per ID
So, as a result I need something like below:
ID | COL1_A | COL1_B | COL1_C | COL1_A_VAL1_SUM | COL1_B_VAL1_SUM | COL1_C_VAL1_SUM
----|--------|--------|---------|-----------------|-----------------|------------------
111 | 2 | 1 | 0 | 15 | 10 | 0
222 | 0 | 1 | 0 | 0 | 20 | 0
333 | 0 | 0 | 1 | 0 | 0 | 25
For example:
COL1_A = 2 for ID 111, because ID=111 has "A" 2 times in COL1
COL1_A_VAL1_SUM = 15 for ID 111, because ID=111 has 10+5=15 in VAL1 for "A" in COL1
How can I do that in SAS Enterprise Guide or in PROC SQL?
First, we'll create the counts that we need by group with SQL:
proc sql;
create table totals_by_group as
select id
, col1
, count(col1) as count_col1
, sum(val1) as sum_val1
from have
group by id, col1
;
quit;
This produces the following table:
id col1 count_col1 sum_val1
111 A 2 15
111 B 1 10
222 B 1 20
333 C 1 25
Now we need to transpose this into the way we want it. We'll do this with two transpose steps: one for count_col1, and one for sum_val1. proc transpose has a few handy options to make this easy, namely the id, prefix, and suffix options.
First, we'll consider col1 as the variable on the id statement. On its own, this creates columns named A, B, and C. For example:
id A B C
111 2 1 .
222 . 1 .
333 . . 1
The prefix and suffix options let us add a prefix and suffix to these names.
proc transpose
data = totals_by_group
out = count_by_group(drop=_NAME_)
prefix = COL1_;
by id;
id col1;
var count_col1;
run;
proc transpose
data = totals_by_group
out = sum_by_group(drop=_NAME_)
prefix = COL1_
suffix = _VAL1_SUM;
by id;
id col1;
var sum_val1;
run;
This gives us two tables:
COUNT_BY_GROUP
id COL1_A COL1_B COL1_C
111 2 1 .
222 . 1 .
333 . . 1
SUM_BY_GROUP
id COL1_A_VAL1_SUM COL1_B_VAL1_SUM COL1_C_VAL1_SUM
111 15 10 .
222 . 20 .
333 . . 25
Now we just need to merge them together, then set all missing values to 0 by iterating over each numeric column and checking if it's missing.
data want;
merge count_by_group
sum_by_group
;
by id;
array numvars[*] _NUMERIC_;
do i = 1 to dim(numvars);
if(missing(numvars[i])) then numvars[i] = 0;
end;
drop i;
run;
Final table:
id COL1_A COL1_B COL1_C COL1_A_VAL1_SUM COL1_B_VAL1_SUM COL1_C_VAL1_SUM
111 2 1 0 15 10 0
222 0 1 0 0 20 0
333 0 0 1 0 0 25
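If the values of col1 are known in advance, the same table can also be sketched in a single PROC SQL step with conditional aggregation. This is an alternative under that assumption, not a replacement for the transpose approach: the values A, B, and C are hard-coded, the data step merely reconstructs the question's sample rows, and want_sql is an illustrative name.

```sas
/* Sample data reconstructed from the question's table */
data have;
   input id col1 $ val1;
   datalines;
111 A 10
111 A 5
111 B 10
222 B 20
333 C 25
;
run;

/* sum(col1 = 'A') counts matching rows because a comparison is 1 or 0;
   the CASE expressions add val1 only for the matching rows, 0 otherwise */
proc sql;
   create table want_sql as
   select id,
          sum(col1 = 'A') as COL1_A,
          sum(col1 = 'B') as COL1_B,
          sum(col1 = 'C') as COL1_C,
          sum(case when col1 = 'A' then val1 else 0 end) as COL1_A_VAL1_SUM,
          sum(case when col1 = 'B' then val1 else 0 end) as COL1_B_VAL1_SUM,
          sum(case when col1 = 'C' then val1 else 0 end) as COL1_C_VAL1_SUM
   from have
   group by id;
quit;
```

Because every expression falls back to 0, no separate step is needed to replace missing values.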

How to delete variables with a high percentage of missing values from a table in SAS?

I have a table in SAS with missing values like below:
col1 | col2 | col3 | ... | coln
-----|------|------|-----|-------
111 | | abc | ... | abc
222 | 11 | C1 | ... | 11
333 | 18 | | ... | 12
... | ... | ... | ... | ...
And I need to delete from the above table the variables that have 80% or more missing values (>=80%).
How can I do that in SAS?
The macro below creates a macro variable named &drop_vars that holds the list of variables in your dataset whose proportion of missing values meets or exceeds the threshold. It works for both character and numeric variables. With a very large number of variables the macro will fail, since the whole generated select list has to fit in a single macro variable, but it can easily be modified to handle any number of variables. You can save and reuse this macro.
%macro get_missing_vars(lib=, dsn=, threshold=);
%global drop_vars;
/* Generate a select statement that calculates the proportion missing:
nmiss(var1)/count(*) as var1, nmiss(var2)/count(*) as var2, ... */
proc sql noprint;
select cat('nmiss(', strip(name), ')/count(*) as ', strip(name) )
into :calculate_pct_missing separated by ','
from dictionary.columns
where libname = upcase("&lib")
AND memname = upcase("&dsn")
;
quit;
/* Calculate the percent missing */
proc sql;
create table pct_missing as
select &calculate_pct_missing.
from &lib..&dsn.
;
quit;
/* Convert to a long table */
proc transpose data=pct_missing out=drop_list;
var _NUMERIC_;
run;
/* Get a list of variables to drop that are >= the drop threshold */
proc sql noprint;
select _NAME_
into :drop_vars separated by ' '
from drop_list
where COL1 GE &threshold.
;
quit;
%mend;
It has three parameters:
lib: Library of your dataset
dsn: Dataset name without the library
threshold: Proportion of missing values a variable must meet or exceed to be dropped
For example, let's generate some sample data and use this. col1, col2, and col3 each have 80% missing values.
data have;
array col[10];
do i = 1 to 10;
do j = 1 to 10;
col[j] = i;
if(i > 2 AND j in(1, 2, 3) ) then col[j] = .;
end;
output;
end;
drop i j;
run;
We'll run the macro and check the log:
%get_missing_vars(lib=work, dsn=have, threshold=0.8);
%put &drop_vars;
The log shows:
col1 col2 col3
Now we can pass this into a simple data step.
data want;
set have;
drop &drop_vars;
run;

Proc means output statement

I have a dataset with several variables like the one below:
Data have(drop=x);
call streaminit(1);
do x = 1 to 20 by 1;
if x < 11 then group = 'A';
else group = 'B';
var1 = rand('normal',0,1);
var2 = rand('uniform');
output;
end;
Run;
In my analysis I need to get some summary stats using PROC MEANS and output the results for each variable into one dataset. I tried doing it with the code below, but it only includes stats from the first variable in the dataset. How can I output the remaining variables into the same dataset?
Proc means data=have n sum mean;
By group;
Output out=want(drop=_freq_ _type_) n=n sum=sum mean=mean;
Run;
Output:
+-------+----+----------+----------+
| group | n | sum | mean |
+-------+----+----------+----------+
| A | 10 | 4.517081 | 0.451708 |
+-------+----+----------+----------+
| B | 10 | -0.77369 | -0.07737 |
+-------+----+----------+----------+
Desired output:
+----------+-------+----+----------+----------+
| variable | group | n | sum | mean |
+----------+-------+----+----------+----------+
| var1 | A | 10 | 4.517081 | 0.451708 |
+----------+-------+----+----------+----------+
| var1 | B | 10 | -0.77369 | -0.07737 |
+----------+-------+----+----------+----------+
| var2 | A | 10 | 7.947089 | 0.794709 |
+----------+-------+----+----------+----------+
| var2 | B | 10 | 5.003049 | 0.500305 |
+----------+-------+----+----------+----------+
You requested SAS to name the count n, the sum sum and the mean mean.
It can only do that for one variable.
This is the syntax to ask SAS to use different names for the statistics of each variable:
Output out=want(drop=_freq_ _type_)
n(var1 var2)=n1 n2
sum(var1 var2)=sum1 sum2
mean(var1 var2)=mean1 mean2;
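For context, a complete step using that syntax might look like the sketch below. It assumes the have dataset from the question; the output name want_wide, the NOPRINT option, and the explicit VAR statement are additions for illustration.

```sas
/* One output row per group, with a separate column
   for each statistic/variable combination */
proc means data=have noprint nway;
   by group;
   var var1 var2;
   output out=want_wide(drop=_freq_ _type_)
      n(var1 var2)=n1 n2
      sum(var1 var2)=sum1 sum2
      mean(var1 var2)=mean1 mean2;
run;
```

The result is wide (one row per group), which is not yet the one-row-per-variable layout asked for in the question.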
To get that output you will need to transpose the data. Either transpose beforehand and add the _NAME_ variable to the BY or CLASS statement:
data have;
call streaminit(1);
do x = 1 to 20 by 1;
if x < 11 then group = 'A';
else group = 'B';
var1 = rand('normal',0,1);
var2 = rand('uniform');
output;
end;
run;
proc transpose data=have out=tall;
by group x;
run;
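proc means data=tall nway n sum mean;
by group;
class _name_;
/* Analyze only the transposed values; without VAR, the numeric variable x would be picked up too */
var col1;
output out=want(drop=_freq_ _type_) n=n sum=sum mean=mean;
run;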
Or use /autoname and transpose the resulting dataset from one observation per GROUP to multiple observations.
proc means data=have(drop=x) nway n sum mean;
by group;
output out=wide(drop=_freq_ _type_) n= sum= mean= /autoname;
run;
proc transpose data=wide out=tall;
by group;
run;
data tall ;
set tall ;
stat=scan(_name_,-1,'_');
_name_=substrn(_name_,1,length(_name_)-length(stat) -1);
rename _name_=varname;
run;
proc sort data=tall;
by group varname;
run;
proc transpose data=tall out=want(drop=_name_);
by group varname ;
id stat;
var col1;
run;
proc print data=want;
run;
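A third possibility, sketched here on the assumption that you are running SAS 9.3 or later, is to let PROC MEANS stack the statistics itself with the STACKODSOUTPUT option and capture the printed table through ODS OUTPUT. The dataset name want_ods is illustrative, and the exact column names in the captured table are worth checking with PROC CONTENTS before relying on them.

```sas
/* STACKODSOUTPUT makes the ODS Summary table hold
   one row per analysis variable instead of one wide row */
proc means data=have stackodsoutput n sum mean;
   by group;
   var var1 var2;
   ods output summary=want_ods;
run;

proc print data=want_ods;
run;
```

This avoids the transpose steps entirely, at the cost of depending on the ODS table layout rather than on the OUTPUT statement.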

Rename all variables in SAS

I have a database where 12,000 variables are named "A0122_40", "A0122_45", "A0122_50" and so on. I would like to rename them by keeping in the initial name the numbers from 2 to 5. I would then like to create variables adding all the columns with the same name.
CASE 1
If you want to join two tables with same column names, and keep all data:
data class;
set sashelp.class (obs=2);
rename name=A0122_40
Sex=A0122_45
Weight=A0122_50
Height=A0122_55
;
run;
Prepared Table:
+----------+----------+-----+----------+----------+
| A0122_40 | A0122_45 | Age | A0122_55 | A0122_50 |
+----------+----------+-----+----------+----------+
| Alfred | M | 14 | 69 | 112.5 |
| Alice | F | 13 | 56.5 | 84 |
+----------+----------+-----+----------+----------+
Code to rename:
%macro renameCols(lb , ds);
proc sql noprint;
select name
into :rn_vr1-
from dictionary.columns
where LIBNAME=upcase("&lb") AND MEMNAME=upcase("&ds")
AND NAME LIKE 'A0122%'
;
quit;
%let num_vars = &sqlobs.;
/* Rename each collected column by adding the new_ prefix */
proc datasets library=&lb nolist nodetails nowarn;
modify &ds;
rename
%do i=1 %to &num_vars;
&&rn_vr&i=new_&&rn_vr&i
%end;
;
run;
quit;
%mend renameCols;
%renameCols(work,class);
Result Table:
+--------------+--------------+-----+--------------+--------------+
| new_A0122_40 | new_A0122_45 | Age | new_A0122_55 | new_A0122_50 |
+--------------+--------------+-----+--------------+--------------+
| Alfred | M | 14 | 69 | 112.5 |
| Alice | F | 13 | 56.5 | 84 |
+--------------+--------------+-----+--------------+--------------+
As you can see, all columns were renamed except Age.
CASE 2
If you want to "append" all columns that share the same prefix into one column, I suggest the following code:
data class;
set sashelp.class (obs=2);
rename name=A0123_40
Sex=A0123_45
Weight=A0122_50
Height=A0121_55
;
run;
Prepared Table:
+----------+----------+-----+----------+----------+
| A0123_40 | A0123_45 | Age | A0121_55 | A0122_50 |
+----------+----------+-----+----------+----------+
| Alfred | M | 14 | 69 | 112.5 |
| Alice | F | 13 | 56.5 | 84 |
+----------+----------+-----+----------+----------+
Code to rename:
%macro renameCols(lb , ds);
%macro d;
%mend d;
proc sql noprint;
select name
into :rn_vr1-
from dictionary.columns
where LIBNAME=upcase("&lb") AND MEMNAME=upcase("&ds")
AND NAME LIKE 'A012%'
;
%let num_vars = &sqlobs.;
/*Create a lot of tables with one column from one input*/
data
%do i=1 %to &num_vars;
&&rn_vr&i (rename=(&&rn_vr&i=%scan(&&rn_vr&i,1,_)) keep = &&rn_vr&i )
%end;
;
set &lb..&ds.;
run;
/*Count variable patterns and write it to macro*/
proc sql noprint;
select distinct scan(name,1,'_')
into :aggr_vars1-
from dictionary.columns
where LIBNAME=upcase("&lb") AND MEMNAME=upcase("&ds")
AND NAME LIKE 'A012%'
;
%let num_aggr_vars=&sqlobs.;
/*Append all tables that contains same column name pattern*/
%do i=1 %to &num_aggr_vars;
data _&&aggr_vars&i;
set &&aggr_vars&i:;
n=_n_;
run;
%end;
/*Merge that tables into one*/
data res (drop= n);
merge
%do i=1 %to &num_aggr_vars;
_&&aggr_vars&i
%end;
;
by n;
run;
%mend renameCols;
%renameCols(work,class);
The res table:
+-------+-------+--------+
| A0121 | A0122 | A0123 |
+-------+-------+--------+
| 69 | 112.5 | Alfred |
| 56.5 | 84 | Alice |
| . | . | M |
| . | . | F |
+-------+-------+--------+
Is this what you are looking for?
proc contents data=have out=cols noprint;
run;
proc sql noprint;
select distinct substr(name,1,6) into :colgrps separated by " " from cols;
quit;
%macro process;
data want;
set have;
%let ii = 1;
%do %while (%scan(&colgrps, &ii, %str( )) ~= );
%let grp = %scan(&colgrps, &ii, %str( ));
&grp._sum = sum(of &grp.:);
%let ii = %eval(&ii + 1);
%end;
run;
%mend;
%process;

Proc REPORT move group value (row header) closer to totals

I have some data that is structured as below. I need to create a table with subtotals, a total column that's TypeA + TypeB and a header that spans the columns as a table title. Also, it would be ideal to show different names in the column headings rather than the variable name from the dataset.
I cobbled together some preliminary code to get the subtotals and total, but not the rest.
data tabletest;
informat referral_total $50. referral_source $20.;
infile datalines delimiter='|';
input referral_total referral_source TypeA TypeB ;
datalines;
Long Org Name | SubA | 12 | 5
Long Org Name | SubB | 14 | 3
Longer Org Name | SubC | 0 | 1
Longer Org Name | SubD | 4 | 12
Very Long Org | SubE | 3 | 11
Very Long Org | SubF | 9 | 19
Very Long Org | SubG | 1 | 22
;
run;
Code that I wrote:
proc report data=tabletest nofs headline headskip;
column referral_total referral_source TypeA TypeB;
define referral_total / group ;
define referral_source / group;
define TypeA / sum ' ';
define TypeB / sum ' ';
break after referral_total / summarize style={background=lightblue font_weight=bold };
rbreak after /summarize;
compute referral_total;
if _break_ = 'referral_total' then
do;
referral_total = catx(' ', referral_total, 'Total');
end;
else if _break_ in ('_RBREAK_') then
do;
referral_total='Total';
end;
endcomp;
run;
This is the desired output:
The DEFINE statement has an option NOPRINT that causes the column to not be rendered, however, the variables for it are still available (in a left to right manner) for use in a compute block.
Stacking in the column statement allows you to customize the column headers and spans. In a compute block for non-group columns, the Proc REPORT data vector only allows access to the aggregate values at the detail or total line, so you need to reference them as variable.statistic (for example, TypeA.sum).
This sample code shows how the _total column is hidden and the _source cells in the sub- and report- total lines are 'injected' with the hidden _total value. The _source variable has to be lengthened to accommodate the longer values that are in the _total variable.
data tabletest;
* ensure referral_source big enough to accommodate _total || ' TOTAL';
length referral_total $50 referral_source $60;
informat referral_total $50. referral_source $20.;
infile datalines delimiter='|';
input referral_total referral_source TypeA TypeB ;
datalines;
Long Org Name | SubA | 12 | 5
Long Org Name | SubB | 14 | 3
Longer Org Name | SubC | 0 | 1
Longer Org Name | SubD | 4 | 12
Very Long Org | SubE | 3 | 11
Very Long Org | SubF | 9 | 19
Very Long Org | SubG | 1 | 22
;
run;
proc report data=tabletest;
column
( 'Table 1 - Stacking gives you custom headers and hierarchies'
referral_total
referral_source
TypeA TypeB
TypeTotal
);
define referral_total / group noprint; * hide this column;
define referral_source / group;
define TypeA / sum 'Freq(A)'; * field labels are column headers;
define TypeB / sum 'Freq(B)';
define TypeTotal / computed 'Freq(ALL)'; * specify custom computation;
break after referral_total / summarize style={background=lightblue font_weight=bold };
rbreak after /summarize;
/*
* no thanks, doing this in the _source compute block instead;
compute referral_total;
if _break_ = 'referral_total' then
do;
referral_total = catx(' ', referral_total, 'Total');
end;
else if _break_ in ('_RBREAK_') then
do;
referral_total='Total';
end;
endcomp;
*/
compute referral_source;
* the referral_total value is available because it is left of me. It just happens to be invisible;
* at the break lines override the value that appears in the _source cell, effectively 'moving it over';
select (_break_);
when ('referral_total') referral_source = catx(' ', referral_total, 'Total');
when ('_RBREAK_') referral_source = 'Total';
otherwise;
end;
endcomp;
compute TypeTotal;
* .sum is needed because the left of me are groups and only aggregate values available here;
TypeTotal = Sum(TypeA.sum,TypeB.sum);
endcomp;
run;