Listing values of Principal Components - pca

I have written the following in SAS:
data test;
infile 'C:\Users\Public\Documents\test.dat';
input a b c d e id;
run;
proc princomp cov out=a;
var a b c d e;
run;
proc corr;
var prin1 prin2 prin3 a b c d e;
run;
Is there a way to list the values of the principal components for each id? The output I receive is just summary statistics (i.e. max and min) and the correlations.

If you want separate analyses by ID then you can use a BY statement. This gives you separate principal components for each value of ID. The dataset has to be sorted by ID to use it in a BY statement.
proc sort data = test;
by id;
run;
proc princomp data = test cov out = scores outstat = stats;
var a b c d e;
by id;
run;
The output dataset, which I called SCORES, should contain all variables from TEST along with new variables that contain the principal component scores. The output dataset STATS contains various statistics, including the eigenvectors.
A good place to look for SAS solutions is in the extensive SAS online documentation. The documentation for PROC PRINCOMP is here.
I hope that helps!
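If the goal is only to see the component scores for each observation, rather than a separate analysis per ID, printing the OUT= dataset may be enough. This is a minimal sketch, assuming the TEST dataset from the question already exists; the dataset name SCORES is just an illustrative choice. With five analysis variables, PROC PRINCOMP names the score variables Prin1 through Prin5 by default.

```sas
/* Compute the principal components and keep the scores in SCORES */
proc princomp data=test cov out=scores noprint;
  var a b c d e;
run;

/* List the component scores next to each id */
proc print data=scores noobs;
  var id prin1-prin5;
run;
```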

Related

How can I create a prediction interval based on a linear model in SAS

I am trying to create a prediction interval in SAS. My SAS code is
Data M;
input y x;
datalines;
100 20
120 40
125 32
..
;
proc reg;
model y = x / clb clm alpha =0.05;
Output out=want p=Ypredicted;
run;
data want;
set want;
y1= Ypredicted;
proc reg data= want;
model y1 = x / clm cli;
run;
But when I run the code I could find the new Y1. How can I predict the new Y?
What you're trying to do is score your model, which takes the results from the regression and uses them to estimate new values.
The most common way to do this in SAS is simply to use PROC SCORE. This allows you to take the output of PROC REG and apply it to your data.
To use PROC SCORE, you need the OUTEST= option (think 'output estimates') on your PROC REG statement. The dataset that you assign there will be the input to PROC SCORE, along with the new data you want to score.
As Reeza notes in comments, this is covered, along with a bunch of other ways to do this that might work better for you, in Rick Wicklin's blog post, Scoring a regression model in SAS.
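The OUTEST= and PROC SCORE steps described above could look roughly like this. It is a sketch under assumptions: the data are only the three rows shown in the question, and the dataset names EST, NEW and PRED and the new x values are made up for illustration. Note that PROC SCORE returns predicted values only, not interval limits.

```sas
/* Sample data: the three rows shown in the question */
data M;
  input y x;
  datalines;
100 20
120 40
125 32
;
run;

/* Fit the model and save the parameter estimates in EST */
proc reg data=M outest=est noprint;
  model y = x;
run;
quit;

/* Hypothetical new x values to score */
data new;
  input x;
  datalines;
25
35
;
run;

/* Apply the estimates to the new data; PREDICT requests predicted values */
proc score data=new score=est out=pred type=parms predict;
  var x;
run;

proc print data=pred;
run;
```

The predicted values land in a variable named after the model label (MODEL1 when the MODEL statement has no label).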

SAS - Create Dummy Variables for All Variables

I have a dataset with X number of categorical variables for a given record. I would like to somehow turn this dataset into a new dataset with dummy variables, but I want to have one command / macro that will take the dataset and make the dummy variables for all variables in the dataset.
I also don't want to specify the name of each variable, because I could have a dataset with 50 variables, so it would be too cumbersome to have to specify each variable name.
Let's say I have a table like this, and I want the resulting table, with the above conditions that I want a single command or single macro without specifying each individual variable:
You can use PROC GLMSELECT to generate the design matrix, which is what you are asking for.
data test;
input id v1 $ v2 $ v3 $ ;
datalines;
1 A A A
2 B B B
3 C C C
4 A B C
5 B A A
6 C B A
;
proc glmselect data=test outdesign(fullmodel)=test_design noprint ;
class v1 -- v3;
model id = v1 -- v3 /selection=none noint;
run;
You can use -- to specify all variables between the first and the last. Notice I don't have to type v2. So if you know the first and the last, you can get what you want easily.
I prefer GLMMOD myself. One note, if you can, CLASS variables are usually a better way to go, but not supported by all PROCS.
/*Run model within PROC GLMMOD for it to create design matrix
Include all variables that might be in the model*/
proc glmmod data=sashelp.class outdesign=want outparm=p;
class sex age;
model weight=sex age height;
run;
/*Create rename statement automatically
THIS WILL NOT WORK IF YOUR VARIABLE NAMES WILL END UP OVER 32 CHARS*/
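data p;
/*Set the length up front; otherwise VAR takes its length (14) from the first
literal assigned and longer rename pairs are truncated*/
length var $100;
set p;
if _n_=1 and effname='Intercept' then
var='Col1=Intercept';
else
var=catt("Col", _colnum_, "=", catx("_", effname, vvaluex(effname)));
run;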
proc sql ;
select var into :rename_list separated by " " from p;
quit;
/*Rename variables*/
proc datasets library=work nodetails nolist;
modify want;
rename &rename_list;
run;
quit;
proc print data=want;
run;
Originally from here and the post has links to several other methods.
https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-dummy-variables-Categorical-Variables/ta-p/308484
Here is a worked example using your simple three-observation dataset and a modified version of the PROC GLMMOD method posted by @Reeza.
First let's make a sample dataset with a long character ID variable. We will introduce a numeric ROW variable that we can later use to merge the design matrix back with the input data.
data have;
input id :$21. education_lvl $ income_lvl $ ;
row+1;
datalines;
1 A A
2 B B
3 C C
;
You could set the list of variables into a macro variable since we will need to use it in multiple places.
%let varlist=education_lvl income_lvl;
Use PROC GLMMOD to generate the design matrix and the parameter list that we will later use to generate user friendly variable names.
proc glmmod data=have outdesign=design outparm=parm noprint;
class &varlist;
model row=&varlist / noint ;
run;
Now let's use the parameter list to generate rename statement to a temporary text file.
filename code temp;
data _null_;
set parm end=eof;
length rename $65 ;
rename = catx('=',cats('col',_colnum_),catx('_',effname,of &varlist));
file code ;
if _n_=1 then put 'rename ' ;
put @3 rename ;
if eof then put ';' ;
run;
Now let's merge back with the input data and rename the variables in the design matrix.
data want;
merge have design;
by row ;
%inc code / source2;
run;

PROC CORR Pearson's for categorical variable

I'm trying to find the Pearson correlation coefficient between weight and height for species Pike in sashelp.fish, but I'm having issues returning the results specifically for Pike. Here's my code:
proc corr data=sashelp.fish pearson;
var height width;
by species;
run;
And here's the error message:
Data set SASHELP.FISH is not sorted in ascending sequence. The current BY group has Species = Whitefish and the next BY group has Species = Parkki.
I tried using PROC SORT to sort the data by Species, but received the error message "User does not have appropriate authorization level for library SASHELP."
Thank you!
If you don't specify an output dataset then SAS by default will overwrite the input data with the new sorted data. However, you do not have write access to the sashelp library and can't replace the sashelp.fish dataset. You therefore need to create a new sorted output dataset that you can then run proc corr on:
Example using your temporary work library:
proc sort data = sashelp.fish out = work.fish;
by species;
run;
proc corr data=fish pearson;
var height width;
by species;
run;
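If only the Pike results are needed, a WHERE statement is an alternative that avoids sorting altogether. A minimal sketch, using the same variables as the question's code:

```sas
/* Restrict the analysis to one species; no BY statement, so no sort is needed */
proc corr data=sashelp.fish pearson;
  where species = 'Pike';
  var height width;
run;
```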

How to run PROC LOGISTIC on all variables in your dataset in SAS?

I have a dataset with 300+ variables and I want to perform stepwise selection in PROC LOGISTIC (I understand stepwise selection is a bad idea here but it's not up to me) on all these variables - some of which are numeric and some of which are categorical.
Without typing the name of each of the 300+ variables, how do I write the model statement so that the model is all variables in my data set except for my response variable? How do I write the class statement so that it knows to treat all the categorical variables as categorical?
You can quickly grab all the headings of your dataset to copy and paste with this:
proc contents data = X short;
run;
This will generate a list that you can copy and paste into your proc logistic statement.
Assuming your class variables are character based you can do the following:
proc contents data = X out=test;
run;
data test; set test;
if TYPE=2; /* keep character variables only (TYPE=2 in PROC CONTENTS output) */
run;
proc transpose data=test out=test2;
var name;
id name;
run;
proc contents data = test2 short;
run;
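To avoid copying and pasting, the same lists can be put into macro variables with PROC SQL and DICTIONARY.COLUMNS. This is a sketch under assumptions: the dataset is WORK.X as above, the response variable is called y (a hypothetical name), and every character variable should be treated as a CLASS variable.

```sas
proc sql noprint;
  /* all variables except the response */
  select name into :all_vars separated by ' '
    from dictionary.columns
    where libname = 'WORK' and memname = 'X'
      and upcase(name) ne 'Y';

  /* character variables only, to be used on the CLASS statement */
  select name into :class_vars separated by ' '
    from dictionary.columns
    where libname = 'WORK' and memname = 'X'
      and type = 'char' and upcase(name) ne 'Y';
quit;

proc logistic data=X;
  class &class_vars;
  model y = &all_vars / selection=stepwise;
run;
```

LIBNAME and MEMNAME are stored in upper case in DICTIONARY.COLUMNS, which is why the literals are capitalized. If the dataset has no character variables, the second query returns no rows and &class_vars is never created, so the CLASS statement would have to be dropped.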

Is there a way to name proc rank groups based on values within the group?

So I have multiple continuous variables that I have used proc rank to divide into 10 groups, i.e. for each observation there is now a "GPA" and a "GRP_GPA" value, ditto for Hmwrk_Hrs and GRP_Hmwrk_Hrs. But for each of the new group columns the values are between 1 - 10. Is there a way to change that value so that, rather than 1 for instance, it would be 1.2-2.8 if those were the min and max values within the group? I know I can do it by hand using proc format or if then or case in sql, but since I have something like 40 different columns that would be very time intensive.
It's not clear from your question if you want to store the min-max values or just format the rank columns with them. My solution below formats the rank column and utilises the ability of SAS to create formats from a dataset. I've obviously only used 1 variable to rank, for your data it will be a simple matter to wrap a macro around the code and run for each of your 40 or so variables. Hope this helps.
/* create ranked dataset */
proc rank data=sashelp.steel groups=10 out=want;
var steel;
ranks steel_rank;
run;
/* calculate minimum and maximum values per rank */
proc summary data=want nway;
class steel_rank;
var steel;
output out=want_min_max (drop=_:) min= max= / autoname;
run;
/* create dataset with formatted values */
data steel_rank_fmt;
set want_min_max (rename=(steel_rank=start));
retain fmtname 'stl_fmt' type 'N';
label=catx('-',steel_min,steel_max);
run;
/* create format from previous dataset */
proc format cntlin=steel_rank_fmt;
run;
/* apply formatted value to rank column */
proc datasets lib=work nodetails nolist;
modify want;
format steel_rank stl_fmt10.;
quit;
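The macro wrapper mentioned above might look something like this. It is only a sketch: the macro name, its parameters and the helper datasets _MINMAX and _FMT are made up for illustration, the output dataset is assumed to live in WORK, and the variable name must be short enough that adding _rank stays within 32 characters.

```sas
%macro rank_fmt(data=, var=, out=, groups=10);
  /* rank the variable into &groups groups */
  proc rank data=&data groups=&groups out=&out;
    var &var;
    ranks &var._rank;
  run;

  /* minimum and maximum of the variable within each rank */
  proc summary data=&out nway;
    class &var._rank;
    var &var;
    output out=_minmax (drop=_:) min=range_lo max=range_hi;
  run;

  /* control dataset: one format entry per rank, labelled min-max */
  data _fmt;
    set _minmax (rename=(&var._rank=start));
    retain fmtname "&var._f" type 'N';
    label = catx('-', range_lo, range_hi);
  run;

  proc format cntlin=_fmt;
  run;

  /* attach the new format to the rank column */
  proc datasets lib=work nodetails nolist;
    modify &out;
    format &var._rank &var._f.;
  quit;
%mend rank_fmt;

/* Example calls; the second one adds another ranked column to the same dataset */
%rank_fmt(data=sashelp.cars, var=enginesize, out=want);
%rank_fmt(data=want, var=horsepower, out=want);
```

The _f suffix on the format name is there because SAS format names cannot end in a digit.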
In addition to Keith's good answer, you can also do the following:
proc rank data = sashelp.cars groups = 10 out = test;
var enginesize;
ranks es;
run;
proc sql ;
select *, catx('-',min(enginesize), max(enginesize)) as esrange, es from test
group by es
order by make, model
;
quit;