Proc GLM with a list of binary dummy variables - sas

I am running a regression. My outcome (dependent) is a continuous variable. I have two types of independent variables. One represents day of week. The second type of independent variable is a binary variable (yes/no). I have about 40 of these binary variables. I am only interested in the interaction term between the day of week and all 40 binary variables in my model. I've searched online but could not find a great way to code it:
Sample Code:
proc glm;
class dayofweek binvar1-binvar40;
model outcome = dayofweek*binvar1 dayofweek*binvar2...dayofweek*binvar40 / solution;
run;
Is there an easier way to write this?

Not sure whether this counts as an easier solution :), but you can construct a macro variable IALL that holds all 40 interaction terms:
DATA I;
DO i = 1 TO 40; OUTPUT; END;
RUN;
PROC SQL NOPRINT;
SELECT VAR into: IALL SEPARATED BY " " FROM (SELECT CATS("dayofweek*binvar",PUT(I,2.0)) AS VAR FROM I);
QUIT;
and use it in PROC GLM:
proc glm;
class dayofweek binvar1-binvar40;
model outcome = &IALL. / solution;
run;
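If you would rather not build a helper dataset, a macro %DO loop can generate the same list of terms. This is only a sketch: the macro name interaction_terms and its parameters are made up for illustration, and it assumes the variables really are named binvar1 to binvar40.

```sas
/* Hypothetical helper: emits dayofweek*binvar1 ... dayofweek*binvarN */
%macro interaction_terms(by=dayofweek, prefix=binvar, n=40);
  %local i;
  %do i = 1 %to &n;
    &by.*&prefix.&i
  %end;
%mend interaction_terms;

/* Inspect the generated text in the log before using it */
%put %interaction_terms(n=3);

/* As in the code above, no DATA= is given, so the most recently created dataset is used */
proc glm;
  class dayofweek binvar1-binvar40;
  model outcome = %interaction_terms() / solution;
run;
quit;
```

Changing n= or prefix= in the call is enough to adapt it to a different number of dummy variables.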

Creating table with cumulative values

I have a table like the first table in the picture.
It contains information about banks' deals on the FX market on a daily basis (buy minus sell). I would like to calculate cumulative results, as in the second table. The number of banks, their names, and the dates are not fixed. I'm new to SAS and tried to find solutions, but didn't find anything useful. I would be glad for any help.
When data such as this is in a wide format, it can be more difficult to process in SAS compared to a long format. Long data formats have numerous benefits in the form of by-group processing, indexing, filtering, etc. Many SAS procedures are designed around this concept.
For more information on the examples below, check out SAS's example on the Program Data Vector and by-group processing. Mastering these concepts will help you with data step programming.
Here are two ways you can solve it:
1. Use a sum statement and by-group processing.
In this example, we will:
Convert the data from wide to long in order to convert the bank name to a character variable
Perform a cumulative sum on each bank
Convert back from long to wide
By converting the bank name into a character variable, we can use by-group processing on it.
/* Convert from wide to long */
proc transpose data=raw
out=raw_transposed
name=bank
;
by date;
run;
proc sort data=raw_transposed;
by bank date;
run;
/* Use by-group processing to get cumulative values over time for each bank */
data cumulative_long;
set raw_transposed;
by bank date;
/* Reset the cumulative sum for each bank */
if(first.bank) then call missing(cumulative);
cumulative+COL1;
run;
proc sort data=cumulative_long;
by date bank;
run;
/* Convert from long to wide, using the cumulative values */
proc transpose data=cumulative_long
out=want(drop=_NAME_)
;
by date;
id bank;
var cumulative;
run;
The sum statement is a shortcut for the following code:
data cumulative_long;
set raw_transposed;
by bank date;
retain cumulative;
if(first.bank) then cumulative = 0;
/* sum() ignores missing values, just as the sum statement does */
cumulative = sum(cumulative, COL1);
run;
cumulative does not exist in the input dataset; we are creating it here. By default, a variable created in the data step is reset to missing at the start of each iteration, that is, whenever SAS moves on to read a new row. We want SAS to carry the last value forward instead. retain tells SAS to keep the last value until we change it.
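To see the difference on a small scale, here is a minimal sketch with made-up data (the dataset demo and the variables running and plain are hypothetical names). running uses the sum statement and is carried across rows; plain is an ordinary assigned variable and is reset on every row.

```sas
/* Made-up long-format data: two banks, two rows each, already sorted by bank */
data demo;
  input bank $ amount;
  datalines;
A 1
A 2
B 5
B 7
;
run;

data demo_cume;
  set demo;
  by bank;
  /* Restart the running total at the first row of each bank */
  if first.bank then running = 0;
  /* Sum statement: running is implicitly retained across rows */
  running + amount;
  /* Ordinary assignment: plain is missing at the start of every row,
     so this just copies amount */
  plain = sum(plain, amount);
run;

proc print data=demo_cume;
run;
```

With this input, running accumulates within each bank (1, 3, then 5, 12), while plain simply repeats amount (1, 2, 5, 7).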
2. Use macro variables and dictionary tables
A second option would be to read all of the bank names from a dictionary table to prevent transposing. We will:
Read the names of the banks from the special table dictionary.columns into a macro variable using PROC SQL
Use arrays to perform cumulative sums
This assumes the bank names always start with the prefix "Bank". If they do not follow a regular pattern, you can instead exclude all other variables in the initial SQL query.
proc sql noprint;
select name
, cats(name, '_cume')
into :banks separated by ' '
, :banks_cume separated by ' '
from dictionary.columns
where memname = 'RAW'
AND libname = 'WORK'
AND upcase(name) LIKE 'BANK%'
;
quit;
data want;
set raw;
array banks[*] &banks.;
array banks_cume[*] &banks_cume.;
do i = 1 to dim(banks);
banks_cume[i]+banks[i];
end;
drop i;
run;

PROC FREQ on multiple variables combined into one table

I have the following problem. I need to run PROC FREQ on multiple variables, but I want the output to all be on the same table. Currently, a PROC FREQ statement with something like TABLES ERstatus Age Race, InsuranceStatus; will calculate frequencies for each variable and print them all on separate tables. I just want the data on ONE table.
Any help would be appreciated. Thanks!
P.S. I tried using PROC TABULATE, but it did not calculate N correctly, so I'm not sure what I did wrong. Here is my code for PROC TABULATE. My variables are all categorical, so I just need to know N and percentages.
PROC TABULATE DATA = BCanalysis;
CLASS ERstatus PRstatus Race TumorStage InsuranceStatus;
TABLE (ERstatus PRstatus Race TumorStage) * (N COLPCTN), InsuranceStatus;
RUN;
The above code does not return the correct frequencies based on InsuranceStatus, where 0 = insured and 1 = uninsured, but PROC FREQ does. It also doesn't calculate correctly with ROWPCTN. So any way that I can get PROC FREQ to calculate multiple variables on one table, or PROC TABULATE to return the correct frequencies, would be appreciated.
Here is a nice image of my output in a simplified analysis of only ERstatus and InsuranceStatus. You can see that PROC FREQ returns 204 people with an ERstatus of 1 and InsuranceStatus of 1. That's correct. The values in PROC TABULATE are not.
[image: OUTPUT]
I'll answer this separately as this is answering the other possible interpretation of the question; when it's clarified I'll delete one or the other.
If you want this in a single printed table, then you either need to use proc tabulate or you need to normalize your data - meaning put it in the form of variable | value. PROC FREQ is not capable of doing multiple one-way frequencies in a single table.
For PROC TABULATE, likely your issue is missing data. Any variable that is on the class statement will be checked for missingness, and if any rows are missing data for any of the class variables, those rows are entirely excluded from the tabulation for all variables.
You can override this by adding the missing option on the class statement or on the proc tabulate statement. So:
PROC TABULATE DATA = BCanalysis;
CLASS ERstatus PRstatus Race TumorStage InsuranceStatus/missing;
TABLE (ERstatus PRstatus Race TumorStage) * (N COLPCTN), InsuranceStatus;
RUN;
This will result in a slightly different appearance than on your table, though, as it will include the missing rows in places you probably do not want them, and they'll be factored against the colpctn when again you probably don't want them.
Typically some manipulation is then necessary; the easiest is to normalize your data and then run a tabulation (using PROC TABULATE or PROC FREQ, whichever is more appropriate; TABULATE has better percentaging options though) against that normalized dataset.
Let's say we have this:
data class;
set sashelp.class;
if _n_=5 then call missing(age);
if _n_=3 then call missing(sex);
run;
And we want these two tables in one table.
proc freq data=class;
tables age sex;
run;
If we do this:
proc tabulate data=class;
class age sex;
tables (age sex),(N colpctn);
run;
Then we get an N=17 total for both subtables - that's not what we want, we want N=18. Then we can do:
proc tabulate data=class;
class age sex/missing;
tables (age sex),(N colpctn);
run;
But that's not quite right either; I want F to have 8/18 = 44.44% and M 10/18 = 55.56%, not 42% and 53% with 5% allocated to the missing row.
The way I do this is to normalize the data. This means you get a dataset with 2 variables, varname and val, or whatever makes sense for your data, plus whatever identifier/demographic/whatnot variables you might have. val has to be character unless all of your values are numeric.
So for example, here I normalize class with the age and sex variables. I don't keep any identifiers, but you certainly could in your data; I imagine InsuranceStatus would be kept there, if I understand what you're doing in that table. Once I have the normalized table, I just use those two variables and carefully construct a denominator definition in proc tabulate to get the right basis for my pctn value. It's not quite the same as the single table before (the variable name is in its own column, not on top of the list of values), but honestly that looks better in my opinion.
data class_norm;
set class;
length val $2;
varname='age';
val=put(age,2. -l);
if not missing(age) then output;
varname='sex';
val=sex;
if not missing(sex) then output;
keep varname val;
run;
proc tabulate data=class_norm;
class varname val;
tables varname=' '*val=' ',n pctn<val>;
run;
If you want something better than this, you'll probably have to construct it in proc report. That gives you the most flexibility, but it is also the most onerous to program.
You can use ODS OUTPUT to get all of the PROC FREQ output to one dataset.
ods output onewayfreqs=class_freqs;
proc freq data=sashelp.class;
tables age sex;
run;
ods output close;
or
ods output crosstabfreqs=class_tabs;
proc freq data=sashelp.class;
tables sex*(height weight);
run;
ods output close;
Crosstabfreqs is the name of the cross-tab output, while one-way frequencies are onewayfreqs. You can use ods trace to find out the name if you forget it.
You may (probably will) still need to manipulate this dataset some to get the structure you want ultimately.
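As a small illustration of the ods trace tip, the following sketch turns tracing on around a PROC FREQ call; the log then lists each output object with its Name (OneWayFreqs for the one-way tables, CrossTabFreqs for the cross-tab), which is the name to use in ods output.

```sas
/* Write the name of every ODS output object to the log */
ods trace on;

proc freq data=sashelp.class;
  tables age sex;   /* one-way tables */
  tables sex*age;   /* cross-tabulation */
run;

/* Stop tracing once the names are known */
ods trace off;
```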

How to display analysis variable (0 - 1) as percentage using Proc Tabulate in SAS?

Suppose I have this data set like this:
[image: Sample Data Set]
And I would like to set up a table using proc tabulate such that it will look like:
[image: Sample Tabulate]
So far I have code like this:
PROC TABULATE
DATA = EMPLOY;
CLASS RACE STATE;
VAR EMPLOYED;
TABLE RACE*STATE, N EMPLOYED*SUM EMPLOYED*PCTSUM<TABLE*RACE>;
RUN;
But it doesn't seem to give me what I want. Is there any way to fix it? I know it is odd to use 0-1 and treat employed as an analysis variable for this, but my boss doesn't want any 'N' (0) columns, just the 'Y' (1) columns.
Thank you!
Data= is an option for the Proc statement, not a separate statement.
PROC TABULATE DATA = EMPLOY;
You correctly identify CLASS and VAR variables
CLASS RACE STATE;
VAR EMPLOYED;
As boolean TRUE is coded as 1, you can sum it up to count the number employed, but PCTSUM gives you the column percent, i.e. what share of all employed subjects are Asians living in NC?
To calculate the fraction employed per race and state, you can average the boolean:
TABLE RACE*STATE, N EMPLOYED*(SUM MEAN);
RUN;
This gives you the required numbers, though the formatting is ugly.
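One way to tidy the output is to attach a format to each statistic, so that the mean of the 0/1 variable prints as a percentage. The sketch below uses a small made-up EMPLOY dataset (the rows are invented purely for illustration) and the standard PERCENTw.d format; the column labels are arbitrary.

```sas
/* Invented sample rows; the real EMPLOY data is not shown in the question */
data employ;
  input race $ state $ employed;
  datalines;
Asian NC 1
Asian NC 0
Asian SC 1
White NC 1
White NC 1
White SC 0
;
run;

proc tabulate data=employ;
  class race state;
  var employed;
  /* SUM counts the employed; MEAN of a 0/1 variable is the fraction employed,
     shown as a percentage by the PERCENT format */
  table race*state,
        n='N'
        employed=' '*(sum='Employed'*f=8. mean='Pct employed'*f=percent8.1);
run;
```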

SAS sum variables using name after a proc transpose

I have a table with postings by category (a number) that I transposed. I got a table with each column name as _number for example _16, _881, _853 etc. (they aren't in order).
I need to sum all of them in a proc sql step, but I don't want to create the variable in a data step, and I don't want to write out all of the column names either. I tried this, but it doesn't work:
proc sql;
select sum(_815-_16) as nnl
from craw.xxxx;
quit;
I tried going from the first number to the last, and also from the column in the first position to the one in the last position. Either way, it gives me a number that is not correct.
Any ideas?
Thanks!
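You can't use variable lists in SQL, so _: and var1-var6 and var1--var8 don't work. In SQL, _815-_16 is simply one column subtracted from another, which is why your query ran but returned the wrong number.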
The easiest way to do this is a data step view.
proc sort data=sashelp.class out=class;
by sex;
run;
*Make transposed dataset with similar looking names (the character _NAME_ column is dropped so that _: only matches numeric columns);
proc transpose data=class out=transposed(drop=_name_);
by sex;
id height;
var height;
run;
*Make view;
data transpose_forsql/view=transpose_forsql;
set transposed;
sumvar = sum(of _:); *I confirmed this does not include _N_ for some reason - not sure why!;
run;
proc sql;
select sum(sumvar) from transpose_Forsql;
quit;
I have no documentation to support this, but in my experience SAS assumes that any sum() call in SQL is the SQL aggregate function unless it has reason to believe otherwise.
The only way I can see for SAS to differentiate between the two is by the arguments passed in. In the example below, the inner sum() is passed 3 arguments, so SAS treats it as the SAS sum() function (the SQL aggregate function only allows a single argument). The result of the SAS function is then passed as the single argument to the SQL aggregate sum() function:
proc sql noprint;
create table test as
select sex,
sum(sum(height,weight,0)) as sum_height_and_weight
from sashelp.class
group by 1
;
quit;
Result:
proc print data=test;
run;
                  sum_height_
Obs    Sex         and_weight
  1    F               1356.3
  2    M               1728.6
Also note a trick I've used in the code: passing 0 to the SAS function is an easy way to add an extra argument without changing the intended result. Depending on your data, you may want to swap the 0 for a missing value (i.e. .).
EDIT: To address the issue of unknown column names, you can create a macro variable that contains the list of column names you want to sum together:
proc sql noprint;
select name into :varlist separated by ','
from sashelp.vcolumn
where libname='SASHELP'
and memname='CLASS'
and upcase(name) like '%T' /* MATCHES HEIGHT AND WEIGHT */
;
quit;
%put &varlist;
Result:
Height,Weight
Note that you would need to change the above wildcard to match your scenario - ie. matching fields that begin with an underscore, instead of fields that end with the letter T. So your final SQL statement will look something like this:
proc sql noprint;
create table test as
select sex,
sum(sum(&varlist,0)) as sum_of_fields_ending_with_t
from sashelp.class
group by 1
;
quit;
This provides an alternate approach to Joe's answer - though I believe using the view as he suggests is a cleaner way to go.
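One caveat when adapting the wildcard to names that begin with an underscore: in a LIKE pattern the underscore is itself a single-character wildcard, so '_%' would match every non-empty name. The sketch below sidesteps that by testing the first character with substr(). The dataset wide and its values are made up for illustration, and the type='num' filter is an added precaution to skip character columns such as _NAME_.

```sas
/* Made-up stand-in for the transposed table */
data wide;
  input id _16 _881 _853;
  datalines;
1 1 2 3
2 4 5 6
;
run;

proc sql noprint;
  /* Collect the numeric columns whose names start with an underscore */
  select name into :varlist separated by ','
  from dictionary.columns
  where libname = 'WORK'
    and memname = 'WIDE'
    and substr(name, 1, 1) = '_'
    and type = 'num'
  ;
  /* Inner sum() is the row-wise SAS function, outer sum() is the SQL aggregate */
  select sum(sum(&varlist, 0)) into :total trimmed
  from wide
  ;
quit;

%put &=varlist;
%put &=total;
```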

Using Tabulate for 3-way table

I am trying to output a three way frequency table. I am able to do this (roughly) with proc freq, but would like the control for variable to be joined. I thought proc tabulate would be a good way to customize the output. Basically I want to fill in the cells with frequency, and then customize the percents at a later time. So, have count and column percent in each cell. Is that doable with proc tabulate?
Right now I have:
proc freq data=have;
table group*age*level / norow nopercent;
run;
that gives me e.g.:
What I want:
Here is the code I am using:
proc tabulate data=ex1;
class age level group;
var age;
table age='Age Category',
mean=' '*group=''*level=''*F=10./ RTS=13.;
run;
Thanks!
You can certainly get close to that. You can't really get in 'one' cell, it needs to write each thing out to a different cell, but theoretically with some complex formatting (probably using CSS) you could remove the borders.
You can't list the same variable (age here) on both the VAR and CLASS statements, but since you're just doing percents, you don't need MEAN at all; you should just use N and COLPCTN. If you're dealing with already summarized data, you may need to do this differently. If so, post an example of your dataset (but that wouldn't work in PROC FREQ either without a WEIGHT statement).
data have;
do _t = 1 to 100;
age = ceil(3*rand('Uniform'));
group = floor(2*rand('Uniform'));
level = floor(5*rand('Uniform'));
output;
end;
drop _t;
run;
proc tabulate data=have;
class age level group;
table age='Age Category',
group=''*level=''*(n='n' colpctn='p')*F=10./ RTS=13.;
run;
This puts N and P (n and column %) in separate adjacent cells inside a single level.