SAS: choosing the right model - sas

I'm getting a bit lost in all the possible ways to find an association...
I have a dataset in which my subjects are categorized 1, 2 or 3 (depending on genotype polymorphisms). I want to know the association of either polymorphism with ballistic strength (which is a continuous variable).
Since I have one continuous dependent variable (strength) and one categorical independent variable (genotype polymorphism), I thought of using ANOVA, but I'm not sure which test to choose and how to find the exact one in SAS.
Thanks in advance!

This is better asked on Cross Validated (https://stats.stackexchange.com/).
That said, I would look at PROC GENMOD or PROC GLM to get an idea of how the genotypes affect the strength variable.
proc genmod data=test;
class genotype ;
model strength = genotype / type1;
run;
You can use PROC TTEST to compare any two levels. Here I compare A to B.
proc ttest data=test(where=(genotype in ('A', 'B')));
class genotype;
var strength;
run;
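PROC GLM, mentioned above, fits the same one-way model as a classical ANOVA. A minimal sketch, assuming the same test dataset with genotype and strength; the Tukey-adjusted comparisons and the Levene test are additions for illustration, not part of the original suggestion:

```sas
/* One-way ANOVA of strength on genotype (dataset and variable names as above) */
proc glm data=test;
  class genotype;
  model strength = genotype;
  /* All pairwise genotype comparisons, Tukey-adjusted */
  lsmeans genotype / pdiff adjust=tukey;
  /* Levene's test for equal variances across genotypes */
  means genotype / hovtest=levene;
run;
quit;
```

The adjusted pairwise comparisons cover all three genotype pairs in one step, which avoids running several unadjusted t-tests.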

Related

Kruskal-Wallis test vs. ANOVA in SAS with complex survey data?

I am analyzing the temporal trend (yr) of certain chemicals (a, b and c).
I used proc sgplot with a series statement to draw a plot and found a decreasing trend.
Because the data are right-skewed, I used the median concentration of each year to draw the plot.
Now I would like to conduct a statistical test on the trend. My data come from NHANES, so I need to use the proc survey** procedures for the analysis. I know I can do an ANOVA test with proc surveyreg by using the ANOVA option in the MODEL statement.
proc surveyreg data=a;
stratum stra;
cluster clus;
weight wt;
model a=yr/anova;
run;
But since the original data are right-skewed, I think it may be better to use a Kruskal-Wallis test on the original data. However, I don't know how to write the code in SAS, and I didn't find any information in the proc survey** documentation.
My plan B is to use log-transformed data and an ANOVA test, but I am not sure whether that is an appropriate approach. Can somebody tell me how to test the normality of the ANOVA residuals when using proc surveyreg? I would also like to know whether I can test a, b and c in one procedure, or whether I should write multiple procedures with different MODEL statements.
Looking forward to your engagement. Thank you!

proc tabulate missing values SAS

I have the following code:
ods tagsets.excelxp file = 'G:\CPS\myworkwithoutmissing.xml'
style = printer;
proc tabulate data = final;
Class Year Self_Emp_Inc Self_Emp_Uninc Self_Emp Multi_Job P_Occupation Full_Part_Time_Status;
table Year, P_Occupation*n;
table Year, (P_Occupation*Self_Emp_Inc)*n;
table Year, (Self_Emp_Inc*P_Occupation)*n;
run;
ods tagsets.excelxp close;
When I run this code, I get the following warnings:
WARNING: A class, frequency, or weight variable is missing on every observation.
WARNING: A class, frequency, or weight variable is missing on every observation.
WARNING: A class, frequency, or weight variable is missing on every observation.
To circumvent this issue, I add the "missing" option at the end of the class statement:
class year self_emp_inc ....... Full_Part_Time_Status/ missing;
This fixes the problem in that the warning goes away and the table is created. However, my table now also counts the missing values, which I do not want. For example, my variable self_emp_inc has values of 1 and . (for missing). When I run the code with the missing option, I get a count of P_Occupation for all the missing values as well, but I only want the count for when the value of self_emp_inc is 1. How can I accomplish that?
This is one of those frustrating things in SAS that for some reason SAS hasn't given us a "good" option to work around. Depending on what you're working with, there are a few solutions.
The real problem here is not that you have missings - in a 1x1 table (1 var by 1 var), excluding missings is what you want. It's because you're calling for multiple tables and each table is affected by missings in the class variables in the other table.
As such, oftentimes the easiest answer is simply to split the tables into multiple proc tabulate statements. This might occasionally be too complicated or too onerous in terms of runtime, but I suspect the majority of the time this is the best solution - it often is for me, anyway.
Since you're only working with n, you could instead construct the tabulation with the missings, output to a dataset, then filter them out and re-print or export that dataset. That's the easiest solution, typically.
How exactly you want to do this of course depends on what exactly you want. For example:
data test_cars;
set sashelp.cars;
if _n_=5 then call missing(make);
if _n_=7 then call missing(model);
if _n_=10 then call missing(type);
if _n_=13 then call missing(origin);
run;
proc tabulate data=test_cars out=test_tabulate(rename=n=count);
class make model type origin/missing;
tables (make model type),origin*n;
run;
data test_tabulate_want;
set test_tabulate;
if cmiss(of make model type origin)>2 then delete;
length colvar $200;
colvar = coalescec(of make model type);
run;
proc tabulate data=test_tabulate_want missing;
class colvar origin/order=data;
var count;
tables colvar,origin*count*sum;
run;
This isn't perfect, though it can be made a lot better with some more work on the formatting - this is just a quick example.
If you're using percents, of course, this doesn't exactly work. You either need to refactor the percents in that data step - which is a bit of work, but doable - or you need separate tabulates for each class variable.
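The "separate tabulates" route can be sketched as follows, using the dataset and variable names from the question. Each step lists only the class variables its tables need, so an observation is dropped only when one of those particular variables is missing. This is a rough sketch and has not been run against the original data:

```sas
/* Table 1 needs only Year and P_Occupation */
proc tabulate data=final;
  class Year P_Occupation;
  table Year, P_Occupation*n;
run;

/* Tables 2 and 3 additionally need Self_Emp_Inc; rows where it is missing are excluded */
proc tabulate data=final;
  class Year P_Occupation Self_Emp_Inc;
  table Year, (P_Occupation*Self_Emp_Inc)*n;
  table Year, (Self_Emp_Inc*P_Occupation)*n;
run;
```

Both steps can sit inside the same ods tagsets.excelxp block, so the output still lands in one file.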

Renaming Coefficients that Result from Proc Logistic/Problems Surrounding Variable Names Common to Multiple Datasets

I am estimating a model for firm bankruptcy that involves 11 factors. I have data from 1900 to 2000, and my goal is to estimate the model using proc logistic for the period 1900-1950 and then test its performance on the 1951 through 2000 data. Proc logistic runs fine, but the estimated coefficients have the same names as the factors used in the model. Suppose the dataset that contains all my observations is called myData, and the dataset that contains the estimated coefficients, which I obtain using the outest= option in proc logistic, is called factorEstimates. Both of these datasets have the variables factor1, factor2, ..., factorN. I want to form the dataset outOfSampleResults by doing something like the following:
data outOfSampleResults;
set myData factorEstimates;
newVar=factor1*factor1;
run;
Here the first mention of factor1 refers to the variable in myData and the second to the one in factorEstimates. How can I tell SAS which dataset it should read for a variable that is common to both datasets in the set statement? Alternatively, how could I quickly rename factor1, factor2, ..., factorN to factor1Estimate, factor2Estimate, ..., factorNEstimate in the factorEstimates dataset so as to circumvent this common variable name issue altogether?
Two quick ways to get estimates for a model already developed:
1. Use the SCORE statement in proc logistic:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_logistic_sect066.htm
2. Include the out-of-sample data in your original proc logistic call, but use a new dependent variable that is set to missing for the observations you want to predict. Those observations are left out of the estimation but still receive predicted probabilities in the output dataset.
data stacked;
set all;
if year >1950 then predicted=.;
else predicted=y;
run;
proc logistic data=stacked;
model predicted = factor1-factor11;
output out=out_predicted predicted=p;
run;
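The first option (the SCORE statement) might look like the sketch below. It assumes the dataset myData from the question and the variables year and y used in the code above; the model store name bankruptcy_model and the event level '1' are hypothetical:

```sas
/* Fit on 1900-1950 only and save the fitted model */
proc logistic data=myData(where=(year <= 1950)) outmodel=bankruptcy_model;
  model y(event='1') = factor1-factor11;
run;

/* Apply the saved model to the 1951-2000 observations */
proc logistic inmodel=bankruptcy_model;
  score data=myData(where=(year > 1950)) out=outOfSampleResults;
run;
```

Because the coefficients are applied by the procedure itself, the name clash between the factor variables and the coefficient dataset never comes up.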

One-way random-effects ANOVA in SAS: PROC GLM or MIXED?

I'm attempting to conduct a simple one-way random-effects ANOVA in SAS. I want to know if the population variance is significantly different than zero or not.
On UCLA's idre site, they state to use PROC MIXED as follows:
proc mixed data = in.hsb12 covtest noclprint;
class school;
model mathach = / solution;
random intercept / subject = school;
run;
This makes sense to me given my previous experience with using PROC MIXED.
However, in the text Biostatistical Design and Analysis Using R by Murray Logan, he says for a one-way ANOVA, fixed and random effects are not distinguished and conducts (in R) a "standard" one-way ANOVA even though he's testing the variance, not the means. I've found that in SAS, his R procedure is equivalent to using any of the following:
PROC ANOVA
PROC GLM (same as ANOVA, but with GLM in place of ANOVA)
PROC GLM with RANDOM statement
The p-values from the above three models are the same, but differ from the PROC MIXED model used by UCLA. For my data, it's a difference of p=0.2508 and p=0.3138. Although conclusions don't change in this instance, I'm not really comfortable with this difference.
Can anyone give advice on which one is more appropriate and also why there is this difference?
For your model, the difference between PROC ANOVA and PROC MIXED is only due to numerical noise (the REML estimator of PROC MIXED). However, the p-values mentioned in your question correspond to different tests. In order to get the F value using the output of COVTEST in PROC MIXED, you need to recalculate MS_groups taking into account the unequal sample sizes (either manually as explained on p.231 of http://bio.classes.ucsc.edu/bio286/MIcksBookPDFs/QK08.PDF, or just using PROC MIXED with the same fixed model spec as in PROC ANOVA). This paper (http://isites.harvard.edu/fs/docs/icb.topic1140782.files/S98.pdf) provides some examples of the use of PROC MIXED in addition to the SAS manual.
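As a rough sketch of the two routes, assuming the in.hsb12 data from the UCLA example: the first step uses the same fixed model specification as PROC ANOVA, so its F test for school should match the ANOVA/GLM p-value. The second step, with METHOD=TYPE3, is an added alternative that asks PROC MIXED for ANOVA-type estimation of the variance component instead of REML:

```sas
/* Same fixed-effects specification as PROC ANOVA / PROC GLM */
proc mixed data=in.hsb12;
  class school;
  model mathach = school;
run;

/* Random school effect, estimated by the Type 3 (ANOVA) method rather than REML */
proc mixed data=in.hsb12 method=type3;
  class school;
  model mathach = ;
  random school;
run;
```

Comparing these with the COVTEST output of the REML fit shows which p-value belongs to the F test and which to the Wald Z test of the variance component.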

Regression with both robust (white) standard errors and CLASS variable for fixed effects

proc glm makes it easy to add fixed effects without creating dummy variables for every possible value of the class variable.
proc reg is able to calculate robust (White) standard errors, but it requires you to create individual dummy variables.
Is there any way to combine these functionalities? I'd like to be able to add a number of class variables and receive White standard errors in my output. For example:
With proc glm, I can do this regression. This will give correct results no matter how many levels are contained in the class variables, but it won't calculate robust standard errors.
proc glm data=ds1;
class class1 class2 class3;
weight n;
model y = c class1 class2 class3 / solution;
run;
With proc reg, I can do:
proc reg data=ds2;
weight n;
model y = x / white;
run;
This gives White standard errors, but doesn't incorporate the fixed effects.
To do that, I might need 50 or more dummy variables and a model statement like model y = x class1_d1 class1_d2 ... class3_dn /white;. This would turn into a crazy number of dummy variables if I started adding interaction terms.
Obviously I could write a macro to create the dummy variables, but this seems like such a basic function that I can't help but think I am missing something obvious (Stata and R both have ways to do this easily). Why can't I either use the class statement in proc reg or get robust standard errors out of proc glm?
I think I found part of the answer although I would be interested in other solutions or tweaks to this one.
proc glmmod can be used to create the dataset for proc reg:
proc glmmod noprint outdesign=ds2 data=ds1;
class class1 class2 class3;
weight n;
model y = c class1 class2 class3;
run;
proc reg data=ds2;
weight n;
model y = col2-col50 / white;
run;
proc glmmod uses the GLM syntax and outputs a regression dataset with all of the dummy variables that proc reg needs.
Not as clean as a single-PROC solution (and you have to keep track of the labels to see what ColXX refers to), but it seems to work perfectly.
I think you can:
(1) remove observations with missing variables
(2) demean the independent variables using proc standard
(3) regress the dependent variables on the demeaned independent variables
http://pages.stern.nyu.edu/~adesouza/sasfinphd/index/node60.html
http://pages.stern.nyu.edu/~adesouza/sasfinphd/index/node61.html
The coefficients from the above procedure are exactly the same as those from proc glm (Frisch-Waugh Theorem). But, you do not have to create dummies (which is your main problem). To get robust standard errors, you can simply use proc reg on step(3) with white standard errors.
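A minimal sketch of these steps for a single fixed effect, assuming the names ds1, y, x and class1 from the question (the output dataset names are hypothetical). It demeans both the dependent and the independent variable within each level of class1, ignores the weight variable, and absorbs only one class variable:

```sas
/* (1) Drop incomplete observations and sort by the fixed-effect variable */
proc sort data=ds1(where=(not missing(y) and not missing(x)))
          out=ds1_sorted;
  by class1;
run;

/* (2) Subtract the within-group mean from y and x */
proc standard data=ds1_sorted mean=0 out=ds1_demeaned;
  by class1;
  var y x;
run;

/* (3) Regress demeaned y on demeaned x with White standard errors */
proc reg data=ds1_demeaned;
  model y = x / white;
run;
quit;
```

Note that proc reg does not know about the group means that were removed, so its degrees of freedom (and therefore the reported standard errors) are not adjusted for the absorbed fixed effects.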
Hope that helps.
I think I have an answer for this (or at least, if I don't, I might find out by posting my solution here).
According to this page one can compute robust standard errors with proc surveyreg by clustering the data so that each observation is its own cluster. Like this:
data mydata;
set mydata;
counter=_n_;
run;
proc surveyreg data=mydata;
cluster counter;
model y=x;
run;
But proc surveyreg takes a class statement, so that one can run e.g.
proc surveyreg data=mydata;
class t;
cluster counter;
model y= t x*t / solution;
run;