Correct way to do Principal Component Analysis in SAS

I am trying to do PCA in SAS; this is the original code I wrote:
ods output Eigenvalues=EVTABLE Eigenvectors=EVCTABLE;
PROC factor DATA=REPLACED_PRINCECOMP EIGENVECTORS;
RUN;
ods output close;
However, when I cross-check my results against Python's sklearn PCA, there is a huge difference in the cumulative explained variance ratio (Python attributes about 90% to the first eigenvector, while SAS attributes only 5%).
So I changed the code to the following, adding METHOD=principal:
ods output Eigenvalues=EVTABLE Eigenvectors=EVCTABLE;
PROC factor DATA=REPLACED_PRINCECOMP METHOD=principal COV EIGENVECTORS;
RUN;
ods output close;
Now the PCA results are in the same range as the Python results (around 90% of the variance explained by the first eigenvector).
What causes the difference? Is it METHOD=principal? According to the documentation the default method is already principal, which is why I didn't specify it in the first place.
The latter code also has a strange feature: it produces far fewer eigenvalues and eigenvectors than Python's sklearn PCA does. Is that because I didn't specify the NFACTORS= option?
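A possible explanation, sketched here rather than confirmed against the asker's data: the second call differs from the first in two options, not one. Besides METHOD=principal it also adds COV. PROC FACTOR analyzes the correlation matrix unless COV is given, whereas sklearn's PCA centers the variables without scaling them, which corresponds to a covariance-matrix analysis. The number of factors PROC FACTOR keeps is governed by criteria such as NFACTORS= and MINEIGEN=, which may explain the shorter output. PROC PRINCOMP keeps all components by default, so the following may be closer to what sklearn reports. The names PC_EIGENVALUES and PC_EIGENVECTORS are hypothetical.

```sas
/* Sketch: covariance-based PCA, intended to be comparable to sklearn's PCA. */
/* PC_EIGENVALUES and PC_EIGENVECTORS are hypothetical data set names.       */
ods output Eigenvalues=PC_EIGENVALUES Eigenvectors=PC_EIGENVECTORS;
proc princomp data=REPLACED_PRINCECOMP cov;
   var _numeric_;   /* analyze all numeric variables */
run;
```

Dropping the COV option from this sketch switches back to a correlation-matrix analysis, which is one way to check whether that option alone accounts for the gap.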

Related

Producing confidence intervals for sensitivity and specificity in SAS

I am using SAS to produce ROC curves, but PROC LOGISTIC does not give me confidence intervals for sensitivity and specificity.
Does anyone know whether there is an option to produce the lower and upper bounds for sensitivity and specificity?
If not, does anyone know another method?
Thanks a lot,
When I use basic stats, I use PROC FREQ for associations.
proc freq data=tempds noprint;
tables variable1*std_variable2 / chisq measures;
output out=outds pchi n OR FISHER;
run;
The output dataset "outds" now contains _RROR_ (odds ratio), L_RROR (lower CI), and U_RROR (upper CI). Is this what you are looking for?
If proc logistic doesn't directly support this, you could try bootstrapping - produce many ROC plots for random samples of your data (e.g. using proc surveyselect) and then calculate the p5 and p95 points for each x and y value in the plot using proc summary. This should give a good approximation provided that you use a large enough number of samples.
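At a single fixed cutoff, sensitivity and specificity are binomial proportions, so another option is to request binomial confidence limits from PROC FREQ. The following is a minimal sketch under stated assumptions: DIAG, disease, and test_pos are hypothetical names, and both variables are assumed to be coded 0/1.

```sas
/* Sketch: confidence limits for sensitivity and specificity at one cutoff. */
/* DIAG, disease and test_pos are hypothetical; both are assumed 0/1 coded. */

/* Sensitivity: proportion of test-positives among subjects with disease */
proc freq data=DIAG;
   where disease = 1;
   tables test_pos / binomial(level='1');
run;

/* Specificity: proportion of test-negatives among subjects without disease */
proc freq data=DIAG;
   where disease = 0;
   tables test_pos / binomial(level='0');
run;
```

This covers one cutoff only; bands along the whole ROC curve would still need something like the bootstrap approach described above.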

Suppressing Fisher's exact test for 2x2 table in proc freq

Specs: SAS 9.3 on an old Solaris install. Commenting on mobile; I'm sorry if my formatting gets wonky.
I have some largish datasets (n~=30k patients) and I want to run some 2x2 tables and get chi-square p-values for them. Unfortunately, in its infinite wisdom, SAS has decided to make the Fisher's exact test part of the default output when you ask for chi-square stats for a 2x2 table. Because of the large sample size, SAS throws a warning when it attempts the Fisher's exact test:
"WARNING: Fisher's exact test cannot be computed with sufficient precision for this sample size."
(If anyone from the SAS Institute is reading: there's a reason I didn't request that test, friends!)
I need this warning not to happen because I'm embedding this SAS call in a GNU make script, and make will stop on warnings. I am pretty sure NOWARN only suppresses the "chi square may not be accurate with cell sizes this small" warning and not this one. Is there a way to suppress Fisher's exact test itself in this instance? I also tried calculating chi square by hand, but I need an output dataset that includes the overall N, and I can't use an OUTPUT statement that doesn't call for any statistics besides N.
Edit: Here is one table that causes problems, with {Nij} rounded up.
var1,var2: N
-------------------
P,X: 10000
P,Y: 3600
Q,X: 13000
Q,Y: 1000
Assuming you're talking about the warning in the table itself (and not one in the log), you can exclude that portion with ODS (dest.) EXCLUDE.
Assuming HTML is your destination (otherwise modify that part to LISTING or PDF or whatnot):
ods html exclude fishersexact;
proc freq data=sashelp.snacks;
tables advertised*holiday/chisq;
run;
ods html exclude none;
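In case the warning still reaches the log, one fallback is to compute the Pearson chi-square in a DATA step, which never attempts Fisher's exact test and can carry the overall N in the same output data set. This is a minimal sketch using the rounded counts from the question; the data set and variable names (cells, chisq_by_hand, a, b, c, d) are hypothetical.

```sas
/* Sketch: Pearson chi-square for a 2x2 table computed without PROC FREQ. */
/* Counts are the rounded values from the question; all names are         */
/* hypothetical.                                                          */
data cells;
   input var1 $ var2 $ count;
   datalines;
P X 10000
P Y 3600
Q X 13000
Q Y 1000
;
run;

data chisq_by_hand (keep=n chisq df p_value);
   set cells end=last;
   retain a b c d 0;
   /* Collect the four cell counts */
   if      var1='P' and var2='X' then a = count;
   else if var1='P' and var2='Y' then b = count;
   else if var1='Q' and var2='X' then c = count;
   else if var1='Q' and var2='Y' then d = count;
   if last then do;
      n       = a + b + c + d;
      /* Shortcut formula for the 2x2 Pearson chi-square */
      chisq   = n * (a*d - b*c)**2 / ((a+b) * (c+d) * (a+c) * (b+d));
      df      = 1;
      p_value = sdf('chisquare', chisq, df);   /* upper-tail probability */
      output;
   end;
run;
```

With real data, the cells data set could instead come from a PROC FREQ run that uses OUT= on the TABLES statement without the CHISQ option.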

How does SAS calculate standard errors of coefficients in logistic regression?

I am doing a logistic regression of a binary dependent variable on a four-value multinomial (categorical) independent variable. Somebody suggested to me that it was better to put the independent variable in as multinomial rather than as three binary variables, even though SAS seems to treat the multinomial as if it were three binaries. Their reason was that, if given a multinomial, SAS would report standard errors and confidence intervals for the three binary variables 'relative to the omitted variable', whereas if given three binaries it would report them 'relative to all cases where the variable was zero'.
When I do the regression both ways and compare, I see that nearly all results are the same, including fit statistics, odds ratio estimates, and confidence intervals for the odds ratios. But the coefficient estimates and their confidence intervals differ between the two.
From my reading of the underlying theory, as presented in Hosmer and Lemeshow's 'Applied Logistic Regression', the estimates and confidence intervals reported by SAS for the coefficients are consistent with the theory for the regression using three binary independent variables, but not for the one using a four-value multinomial.
I think the difference may have something to do with SAS's choice of 'design variables', as for the binary regression the values are 0 and 1, whereas for the multinomial they are -1 and 1. But I don't really understand what SAS is doing there.
Does anybody know how SAS's approach differs between the two regressions, and/or can explain the differences in the outputs?
Here is a link to the SAS output:
SAS output
And here is the SAS code:
proc logistic data=tab descending;
class binB binC binD / descending;
model y = binD binC binB ;
run;
proc logistic data=tab descending;
class multi / descending;
model y = multi;
run;
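One likely source of the discrepancy, offered as a sketch and assuming no PARAM= option is in effect elsewhere: the CLASS statement in PROC LOGISTIC uses effect coding by default, whose design variables take the values 1, 0, and -1, so each coefficient compares a level with the average over all levels rather than with an omitted reference level. Odds ratios are reported for level-versus-reference comparisons under either coding, which would be consistent with the odds ratios matching while the coefficients differ. Requesting reference (0/1) coding should make the coefficients comparable to those from hand-made dummy variables:

```sas
/* Sketch: same model as above, but with reference (0/1) coding instead  */
/* of the default effect coding. By default the reference level is the   */
/* last level in the sort order, which DESCENDING reverses.              */
proc logistic data=tab descending;
   class multi / param=ref descending;
   model y = multi;
run;
```

The "Class Level Information" table in the output shows the design variables actually used, which makes it easy to confirm which coding and reference level were applied.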

use estimates from proc glm to make predictions on another dataset

I'm not very familiar with PROC GLM. So far I have only used it to output parameter estimates and predicted values for the training dataset. But I also need to use the fitted model to make predictions on a test dataset (both point estimates and interval estimates).
Here is my code.
ods output ParameterEstimates=Pi_Parameters FitStatistics=Pi_Summary PredictedValues=Pi_Fitted;
proc glm data=Train_Pi;
class Area Fo5 Tye M0 M1 M2 M3;
model Pi = Dow Area Fo5 Tye M0|HC M1|HC M2|HC M3|HC/solution p ss3 /*tolerance*/;
run;
But how do I proceed to the next step? I am looking for something like predict(Model_from_Train_Pi,Test_Pi)
If you're on SAS 9.4 see Jake's answer from this question:
How to predict probability in logistic regression in SAS?
If not on 9.4, my answer applies for adding the data in to the original data set.
A third option is PROC SCORE - documentation has an example for proc reg that's almost identical to your question:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_score_sect018.htm
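A further option on recent SAS/STAT releases is to save the fitted model with a STORE statement and score new data with PROC PLM. The following is a minimal sketch, not necessarily what the linked answers do: Pi_Model, Test_Pi_Scored, and the output variable names are hypothetical, and Test_Pi is assumed to contain the same predictors as Train_Pi.

```sas
/* Sketch: fit on the training data, store the model, score the test data. */
/* Pi_Model, Test_Pi_Scored and the output variable names are hypothetical. */
proc glm data=Train_Pi;
   class Area Fo5 Tye M0 M1 M2 M3;
   model Pi = Dow Area Fo5 Tye M0|HC M1|HC M2|HC M3|HC / solution;
   store Pi_Model;   /* save the fitted model as an item store */
run;
quit;

proc plm restore=Pi_Model;
   score data=Test_Pi out=Test_Pi_Scored
         predicted=Pi_Pred
         lclm=Mean_Lower uclm=Mean_Upper   /* confidence limits for the mean */
         lcl=Pred_Lower  ucl=Pred_Upper;   /* prediction limits              */
run;
```

Test_Pi_Scored then holds the test observations together with the point and interval estimates.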

how to find outliers in sas with proc means?

Is there a way to detect outliers with PROC MEANS while calculating the min, max, Q1, and Q3?
The box plot procedure is not working in my SAS installation, so I am trying to build a box plot in Excel with the values from SAS.
Assuming you have a specific definition for what an outlier is, PROC UNIVARIATE can calculate the value that appears at that percentile using the PCTLPTS keyword on the OUTPUT statement. It also will identify extreme observations individually, so you can see the top few observations (if you have few enough observations that the number of extremes is likely to be <= 5).
The paper A SAS Application to Identify and Evaluate Outliers goes over a few of the ways you can look at outliers, including box plots and PROC UNIVARIATE, and includes some regression-based approaches as well.
If you want a 'standard boxplot' use the outbox= option in SAS to create the standard data set used for a box plot.
proc boxplot data=sashelp.class;
plot age*sex / outbox = xyz;
run;
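If PROC BOXPLOT is unavailable, the quartiles from PROC MEANS can be merged back onto the data to flag outliers directly. This is a minimal sketch that assumes the common 1.5 x IQR box plot rule as the outlier definition; QUARTILES and FLAGGED are hypothetical data set names.

```sas
/* Sketch: flag outliers with the 1.5*IQR rule (an assumed definition). */
/* QUARTILES and FLAGGED are hypothetical data set names.               */
proc means data=sashelp.class noprint;
   var age;
   output out=QUARTILES q1=q1 q3=q3 min=min_age max=max_age;
run;

data FLAGGED;
   /* Read the single row of quartiles once; its values are retained */
   if _n_ = 1 then set QUARTILES (keep=q1 q3);
   set sashelp.class;
   iqr        = q3 - q1;
   is_outlier = (age < q1 - 1.5*iqr) or (age > q3 + 1.5*iqr);
run;
```

QUARTILES also holds the min, max, Q1, and Q3 values needed to draw the box in Excel.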