SAS - Totaling values in a column by a different column's value - sas

The title may be a little ambiguous. Essentially, using the SASHELP.SHOES dataset, I'm trying to summarize the data in a new table by totaling the Sales, and Returns for each region. For instance, instead of having 56 rows for shoes sold in Africa and their individual sales/returns values, I have one row for Africa with columns TotalSales and TotalReturns. I need to do this for each region in the original dataset.
I'm not familiar at all with SAS, this is more or less the first thing I've really had to program in it. I've tried a few variations of data steps with IN or WHERE conditions, proc means steps with SUM() statements, and DO/DO WHILE loops, but I've missed something each time.

In Proc MEANS
Use a CLASS statement to specify which variable(s) are to be used to group the data. In your case REGION.
Use the VAR statement to specify which variable(s) are to have statistics calculated for within each grouping.
Default output
Corresponding to the minimal syntax
ods listing;
proc means noprint data=SASHELP.SHOES;
class region;
var sales returns;
output out=shoes_stats;
run;
Creates data set WORK.SHOES_STATS with one row per statistic per region.
Other output structure
Use procedure option NWAY to only get summarizations for combinations of all the CLASS variables. (In your case this corresponds to rows with _TYPE_=1)
The output columns can have the statistic name automatically concatenated to the variable name using the OUTPUT statement option / autoname.
Use data set options to control variables that are kept or dropped.
proc means nway noprint data=SASHELP.SHOES;
class region;
var sales returns;
output out=shoes_sums(drop=_type_ _freq_) sum= / autoname;
run;
dm 'vt shoes_sums; column names' viewtable;

Related

SAS Proc Tabulate output dataset variable names

I am using proc tabulate to create an output dataset with statistics (n mean std min max p25 p75 median) for a variable with a long name (close to the 32 character maximum). The output dataset will add _n, _std, etc to our variable name, but the median variable is just named "Median" because the variable name with "_median" added to the end, the resulting variable name would be >32 characters.
Is there a way to specify the name of the variables in the output dataset from within the proc tabulate step? I am looping through 1000s of variable for this procedure, so it's not feasible to rename each variable in a data step. Also, it must be proc tabulate and not proc freq because we need to output a row for every possible value of each variable, not just those values that exist in the data.
proc tabulate data=DATA out=OUT ;
var VERY_LONG_VARIABLE_NAME;
table VERY_LONG_VARIABLE_NAME *(n mean std min max p25 p75 median)/printmiss;
run;
Unfortunately I don't know of a way to override the tabulate names. Even transposing the tabulate doesn't fix that - you still get the same result, sadly.
My suggestion is to use a different proc. Almost all of the procs you might use have a way to get what you want - the PRINTMISS equivalent; for example, PROC FREQ has the SPARSE option which does basically the same thing (despite its odd name), and PROC SUMMARY or PROC MEANS might be even better (with COMPLETETYPES on the class statement), just depending on your data.
Alternately, you could reshape your data, or reshape your process. For example, if you're really looping through thousands of variables, that's horribly inefficient; better would be to reshape to variable|value structure (vertical) and then do one proc tabulate; that would fix your issue right there (as it would make 'varname' be a CLASS or BY variable itself not a contributor to the output variable name) and make your process faster.
You could also add a VIEW step before the tabulate that performs the rename for you; that would cost very little even in a macro loop.
Either way, supply some sample data and an example of the total process you're doing and likely you can get a better answer.

SAS Proc Freq - frequency of each category for multiple variables

How can I produce a table that has this kind of info for multiple variables:
VARIABLE COUNT PERCENT
U 51 94.4444
Y 3 5.5556
This is what SAS spits out into the listing output for all variables when I run this program:
ods output nlevels=nlevels1 OneWayFreqs=freq1 ;
proc freq data=sample nlevels ;
tables _character_ / out=outfreq1;
run;
In the outfreq1 table there is the info for just the last variable in the data set (table shown above) but not for all for the variables.
In the nlevels1 table there is info of how many categories each variable has but no frequency data.
What I want though is to output the frequency info for all the variables.
Does anybody know a way to do this without a macro/loop?
You basically have two options, which are sort-of-similar in the kinds of problems you'll have with them: use PROC TABULATE, which more naturally deals with multiple table output, or use the onewayfreqs output that you already call for.
The problem with doing that is that variables may be of different types, so it doesn't have one column with all of that information - it has a pair of columns for each variable, which obviously gets a bit ... messy. Even if your variables are all the same type, SAS can't assume that as a general rule, so it won't produce a nice neat thing for you.
What you can do, though, particularly if you are able to use the formatted values (either due to wanting to, or due to them being identical!), is coalesce them into one result.
For example, given your freq1 dataset from the above:
data freq1_out;
set freq1;
value = coalesce(of f_:);
keep table value frequency percent;
run;
That combines the F_ variables into one variable (as always only one is ever populated). If you can't use the F_ variables and need the original ones, you will have to make your own variable list using a macro variable list (or some other method, or just type the names all out) to use coalesce.
Finally, you could probably use PROC SQL to produce a fairly similar table, although I probably wouldn't do it without using the macro language. UNION ALL is a handy tool here; basically you have separate subqueries for each variable with a group by that variable, so
proc sql;
create table my_freqs as
select 'HEIGHT' as var, height, count(1) as count
from sashelp.class
group by 1,height
union all
select 'WEIGHT' as var, weight, count(1) as count
from sashelp.class
group by 1,weight
union all
select 'AGE' as var, age, count(1) as count
from sashelp.class
group by 1,age
;
quit;
That of course can be trivially macrotized to something like
proc sql;
create table my_freqs as
%freq(table=sashelp.class,var=height)
union all
%freq(table=sashelp.class,var=weight)
union all
%freq(table=sashelp.class,var=age)
;
quit;
or even further either with a list processing or a macro loop.

PROC FREQ - How to find out the dataset name

I have a program where few datasets are created dynamically through a macro. Datasets are also named dynamically.
For example:
STAT_FILE_1,
STAT_FILE_2
Sometimes my macro may create two datasets,
STAT_FILE_1,
STAT_FILE_2
and sometimes, three
STAT_FILE_1,
STAT_FILE_2,
STAT_FILE_3
and the number may vary depending on the source data.
I am using PROC FREQ to have a summary data for the last/recent dataset.
Hence I am using the below code
PROC FREQ;
tables YEAR;
run;
I am getting the result, however i am not able to find out the dataset name used. Can some please help me in finding the dataset name which is used in PROC FREQ.
This will show you the last table created:
%put &syslast;

How to merge multiple datasets by common variable in SAS?

I have multiple datasets. Each of them has different number of attributes. I want to merge them all by common variable. This is 'union' if I use proc SQL. But there is hunderds of variables.
Example.
Dataset_Name Number of columns
dataset1 110
dataset2 120
dataset3 130
... ...
Say they have 100 columns in common. The final dataset which contains all dataset1,dataset2,dataset3..etc
only has common columns(in this case, 100 columns).
How do I do this?
And how do I get columns for each dataset this is not in common with the final dataset.
example: dataset1 will have 10 columns that are not in the final dataset, and list the name of 10 columns.
Thanks!!!!
UNION in SQL is equivalent to sequential SET in SAS.
data want;
set dataset1 dataset2 dataset3;
run;
Now, SAS by default includes all columns present in any dataset. To limit to just what's in all datasets, you have to use a keep statement.
You can determine this using proc sql, among other ways.
proc sql;
select name into :commonlist separated by ' '
from dictionary.columns C, dictionary.columns D
where C.libname=D.libname
and C.memname='DATASET1'
and D.memname='DATASET2'
and C.name=D.name
;
quit;
For more than two datasets it's more complicated and partially depends on your, but if you're comfortable in SQL you can figure that out pretty easily. A similar construct can create a list of just dataset 1 variables. The important part is the into :commonlist separated by ' ', which says to pull the select results into a macro variable called commonlist, separating rows by space. (The colon says to create a macro variable, not a table.)
So you can then run:
data want (keep=&commonlist.) dset1(keep=&dset1list.) dset2(keep=&dset2list.);
set dataset1(in=ds1) dataset2(in=ds2) dataset3(in=ds3);
output want;
if ds1 then output dset1;
else if ds2 then output dset2;
else if ds3 then output dset3;
run;
The in=xyz indicates which dataset a row came from. Each output dataset can have a separate list of variables to keep. You might want to keep the ID variable in those other datasets as well.
I will say that usually in SAS you don't do what you're doing here: it's not easy to do because it doesn't tend to be the best way to handle things - specifically, the little split off datasets. In general you would just keep those extra variables on the master dataset, and they'd just be nulls for anyone not in a dataset with that variable - assuming it makes sense to make this 'master' dataset at all.

error message box cox in proc transreg procedure in sas

I try to use the proc transreg procedure in SAS, to transform one of my variables in a dataset (var1). The var1 variable has values >=0.
My code is:
proc transreg data=data1 details;
model boxcox(var1/lambda=-1 to 1 by 0.125 convenient parameter=1)=identity(var2);
output out=BoxCox_Out;
run;
However I get the following error message:
"observation of nonblank TYPE not equal 'Score ' are excluded from the analysis and the output data set.
Could anyone help me?
_TYPE_ can be used for TRANSREG to allow you to take datasets with multiple kinds of rows and only use the SCORE rows (or whichever ones you choose), often outputs from earlier TRANSREG procedures.
However, _TYPE_ is also a common variable added by procedures like PROC MEANS to indicate which class combinations apply to the row. In this case, TRANSREG is getting confused and thinking you want something different.
Drop the _TYPE_ variable in the TRANSREG data source statement, and it should use all rows.
proc transreg data=data1(drop=_type_) details;