I'm using the boxcox transformation in SAS with the proc transreg procedure, and I was wondering how does SAS handle missing data.
I have a dataset that includes one row per month per participant, with a continuous variable every month. For some months, the variable is missing. The formula of the Box-Cox transformation doesn't use the distribution of the variable or whatever. How is SAS working, does it exclude the missing data?
Below is my code to apply the boxcox transformation to my variable:
PROC TRANSREG DATA=myfile DETAILS;
MODEL BOXCOX(myvariable/ parameter=0.1) = identity(month);
OUTPUT OUT= transformed_myfile;
RUN;
Thanks!
From the documentation:
PROC TRANSREG can estimate missing values, with or without category or monotonicity constraints, so that the regression model fit is optimized. Several approaches to missing data handling are provided. All observations with missing values in IDENTITY, CLASS, POINT, EPOINT, QPOINT, SMOOTH, PBSPLINE, PSPLINE, and BSPLINE variables are excluded from the analysis. When METHOD=UNIVARIATE (specified in the PROC TRANSREG or MODEL statement), observations with missing values in any of the independent variables are excluded from the analysis. When you specify the NOMISS a-option, observations with missing values in the other analysis variables are excluded. Otherwise, missing data are estimated, and the variable means are the initial estimates.
(Emphasis added). You can add various transformations as you prefer, or go with SAS's default estimates.
Related
I am handling a SAS dataset with little observations and I need to put on relation a variable with its lag.
By doing this, I lose a record resulting in a missing values.
Do some of you know how SAS Base handles such items in the procedures as PROC REG or PROC CORR?
Thanks you all in advance.
It depends a bit on what exactly you're doing. Based on your references to PROC REG and CORR - it excludes the missing values - listwise. This means if any value is missing in a row that is being used or referenced in the PROC it will be excluded.
http://documentation.sas.com/?docsetId=lrcon&docsetTarget=p175x77t7k6kggn1io94yedqagl3.htm&docsetVersion=9.4&locale=en
PROC CORR has several options that allow you to specify how it handles the missing values. The NOMISS option tells SAS to use only complete cases.
https://blogs.sas.com/content/iml/2012/01/13/missing-values-pairwise-corr.html
I have the following code:
ods tagsets.excelxp file = 'G:\CPS\myworkwithoutmissing.xml'
style = printer;
proc tabulate data = final;
Class Year Self_Emp_Inc Self_Emp_Uninc Self_Emp Multi_Job P_Occupation Full_Part_Time_Status;
table Year, P_Occupation*n;
table Year, (P_Occupation*Self_Emp_Inc)*n;
table Year, (Self_Emp_Inc*P_Occupation)*n;
run;
ods tagsets.excelxp close;
When I run this code, I get the following error message:
WARNING: A class, frequency, or weight variable is missing on every observation.
WARNING: A class, frequency, or weight variable is missing on every observation.
WARNING: A class, frequency, or weight variable is missing on every observation.
Now in order to circumvent this issue, I add the "missing" option at the end of the class statement such that:
class year self_emp_inc ....... Full_Part_Time_Status/ missing;
This fixes the problem in that it doesn't give me the error message and creates the table. However, my chart now also counts the number of missing values, something that I do not want. For example my variable self_emp_inc has values of 1 and .(for missing). Now when I run the code with the missing option,I get a count of P_Occupation for all the missing values as well, but I only want the count for when the value of self_emp_Inc is 1. How can I accomplish that task?
This is one of those frustrating things in SAS that for some reason SAS hasn't given us a "good" option to work around. Depending on what you're working with, there are a few solutions.
The real problem here is not that you have missings - in a 1x1 table (1 var by 1 var), excluding missings is what you want. It's because you're calling for multiple tables and each table is affected by missings in the class variables in the other table.
As such, oftentimes the easiest answer is simply to split the tables into multiple proc tabulate statements. This might occasionally be too complicated or too onerous in terms of runtime, but I suspect the majority of the time this is the best solution - it often is for me, anyway.
Since you're only working with n, you could instead construct the tabulation with the missings, output to a dataset, then filter them out and re-print or export that dataset. That's the easiest solution, typically.
How exactly you want to do this of course depends on what exactly you want. For example:
data test_cars;
set sashelp.cars;
if _n_=5 then call missing(make);
if _n_=7 then call missing(model);
if _n_=10 then call missing(type);
if _n_=13 then call missing(origin);
run;
proc tabulate data=test_cars out=test_tabulate(rename=n=count);
class make model type origin/missing;
tables (make model type),origin*n;
run;
data test_tabulate_want;
set test_tabulate;
if cmiss(of make model type origin)>2 then delete;
length colvar $200;
colvar = coalescec(of make model type);
run;
proc tabulate data=test_tabulate_want missing;
class colvar origin/order=data;
var count;
tables colvar,origin*count*sum;
run;
This isn't perfect, though it can be made a lot better with some more work on the formatting - this is just a quick example.
If you're using percents, of course, this doesn't exactly work. You either need to refactor the percents in that data step - which is a bit of work, but doable - or you need separate tabulates for each class variable.
I try to use the proc transreg procedure in SAS, to transform one of my variables in a dataset (var1). The var1 variable has values >=0.
My code is:
proc transreg data=data1 details;
model boxcox(var1/lambda=-1 to 1 by 0.125 convenient parameter=1)=identity(var2);
output out=BoxCox_Out;
run;
However I get the following error message:
"observation of nonblank TYPE not equal 'Score ' are excluded from the analysis and the output data set.
Could anyone help me?
_TYPE_ can be used for TRANSREG to allow you to take datasets with multiple kinds of rows and only use the SCORE rows (or whichever ones you choose), often outputs from earlier TRANSREG procedures.
However, _TYPE_ is also a common variable added by procedures like PROC MEANS to indicate which class combinations apply to the row. In this case, TRANSREG is getting confused and thinking you want something different.
Drop the _TYPE_ variable in the TRANSREG data source statement, and it should use all rows.
proc transreg data=data1(drop=_type_) details;
This might be a weird question. I have a data set contains data like agree, neutral, disagree...for many questions. There is not so many observations so for some question, one or more options has frequency of 0, say neutral. When I run proc freq, since neutral shows up in that variable, the table does not contain a row for neutral. I end up with tables with different number of rows. I would like to know if there is a option to show these 0 frequency rows. I will also need to run proc gchart for the same data set, and I will run into the same problem for having different number of bars. Please help me on this. Thank you!
This depends on how exactly you are running your PROC FREQ. It has the sparse option, which tells it to create a value for every logical cell on the table when creating an output dataset; normally, while you would have a cell with a missing value (or zero) in a crosstab, if that is output to a dataset (which is vertical, ie each combination of x and y axis value are placed in one row) those rows are left off. Sparse makes sure that doesn't happen; and in a larger (n-dimensional) crosstab, it creates rows for every possible combination of every variable, even ones that don't occur in the data.
However, if you're just doing
proc freq data=mydata;
tables myvar;
run;
That won't help you, as SAS doesn't really have anything to go on to figure out what should be there.
For that, you have to use a class variable procedure. Proc Tabulate is one of such procedures, and is similar to Proc Freq in its syntax (sort of). You need to either use CLASSDATA on the proc statement, or PRINTMISS on the table statement. In the former case, you do not need to use a format, I don't believe. In the latter case (PRINTMISS), you need to create a format for your variable (if you don't already have one) that contains all levels of the data that you want to display (even if it's just an identity format, e.g. formatting character strings to identical character strings), and specify PRELOADFMT on the proc statement. See this man page for more details.
Background: I have a categorical variable, X, with four levels that I fit as separate dummy variables. Thus, there are three total dummy variables representing x=1, x=2, x=3 (x=0 is baseline).
Problem/issue: I want to be able to calculate the value of a linear combination (i.e. using SAS as a calculator) of these dummy variables. For example, 2*B1 + 2*B2 + B3.
In Stata, this can be done using the lincom command, which uses the stored beta estimates to calculate linear combinations of the parameters.
In SAS in a procedure such as PROC GLM, I think I should use the ESTIMATE statement, but I'm not sure how I would specify the "weights" for each variable in this case.
You are looking for PROC SCORE. This takes output regression or factor estimates and scores a new data set. See here for an example. http://support.sas.com/documentation/cdl/en/statug/66859/HTML/default/viewer.htm#statug_score_examples02.htm
FYI, PROC MODEL does allow this in the model statement, which may be less work than PROC SCORE. I know PROC MODEL can be used readily in place of PROC REG, but I'm not sure how advanced of modeling PROC MODEL does, so it may not be an option for more complex models. I was hoping for something with less coding, but given the nature of SAS, I think this and PROC SCORE are the best I'm going to get.
What if you add your linear combination as a variable in your input dataset?
data myDatasetWithLinCom;
set mydata;
LinComb=2*(x=1)+ 2*(x=2)+(x=3); /*equvilent to 2*B1 + 2*B2 + B3*/
run;
then you can specify LinComb as one of the explanatory variables and you can lookup the coefficient directly from the output.