HPSplit procedure in SAS - sas

Wondering if someone could help me figure out where I'm going wrong with this HPSplit procedure to form a decision tree? Here are the instructions for classification using decision tree. We are using SAS Studio (University Edition):
"Using the variable default as the response variable, fit a decision tree model with the predictor variables student, balance, and income."
The biggest problem seems to be the default and student variables are yes/no values and the others are numbers. I can't seem to get the predictor variables student and income to be a part of the tree. Here's the code I have. I've tried all sorts of combinations.
ODS GRAPHICS ON;
PROC HPSPLIT DATA=MYFOLDER.DEFAULT;
Class default student;
model default=balance student income;
output out=hpsliout;
prune costcomplexity;
run;
Here's what the tree looks like:
subtree starting at node=0
Thanks for the help!

Related

Interested in learning about something new in SAS

I am trying to create a decision tree in SAS. Whenever I run my SAS code I get an error message.
Could someone help me?
Sample code:
proc hpsplit data=credit;
class default student;
model default = student balance income
output out=hpsliout;
prune costcomplexity;
run;
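One likely cause, assuming the message is a syntax error: the model statement in the sample has no terminating semicolon, so SAS reads output out=hpsliout as a continuation of the model statement. A minimal sketch of the corrected step, keeping the data set and variable names from the sample code:

```sas
proc hpsplit data=credit;
   class default student;                    /* categorical response and predictor */
   model default = student balance income;   /* terminating semicolon added */
   output out=hpsliout;
   prune costcomplexity;
run;
```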

Linear Regression: Finding Significant Class Variables Using SAS

I'm attempting to use SAS to do a pretty basic regression problem but I'm having trouble getting the full set of results.
I'm using a data set that includes professors' overall quality (the dependent variable) and has the following independent variables: gender, numYears, pepper, discipline, easiness, and raterInterest.
I'm using the code below to generate the analysis of the data set:
proc glm data=WORK.IMPORT;
class gender pepper discipline;
model quality = gender numYears pepper discipline easiness raterInterest;
run;
I get the following results, which is mostly what I need, EXCEPT that I would like to see exactly which responses from the class variables (gender, pepper, discipline) are significant.
From these results, I can see that easiness, raterInterest, pepper, and discipline are significant; however, I'd like to see which specific values of pepper and discipline are significant. For example, pepper was answered as a 'yes' or 'no' by the student. I'd like to see whether quality correlates specifically with pepper=yes or pepper=no. Can anyone give me some advice about how to alter my code to return a breakdown of the class variables?
Here is also a link to the dataset, in case it's needed for reference:
https://drive.google.com/file/d/1Kc9cb_n-l7qwWRNfzXtZi5OsiY-gsYZC/view?usp=sharing (Rateprof)
I really, truly appreciate any assistance!
Add the solution option to your model statement to print a parameter estimate for each level of each class variable. However, reference parameterization is not available in proc glm: it uses a less-than-full-rank coding, so the estimates it prints are flagged as not uniquely estimable. There are ways around this that let you keep using proc glm, but the simplest solution is to use proc glmselect instead, which allows you to specify reference parameterization. Use the selection=none option to disable variable selection.
proc glmselect data=WORK.IMPORT;
class gender pepper discipline / param=reference;
model quality = gender numYears pepper discipline easiness raterInterest / selection=none;
run;
The interpretation of this would be:
All other variables held constant, the quality rating for females differs by -0.046782 units from the rating for males. This difference is not statistically significant.
The breakdown of each class level is a comparison to a reference value. By default, the reference value selected is the last level after all class values are internally sorted. You can specify a reference using the ref= option after each class variable. For example, if you wanted to use females as a reference value instead of males:
proc glmselect data=WORK.IMPORT;
class gender(ref='female') pepper discipline / param=reference;
model quality = gender numYears pepper discipline easiness raterInterest / selection=none;
run;
Note that you can also do this with proc mixed. For this specific purpose, the choice is up to you, based on the output style that you like. proc mixed is a more flexible way to run regressions, but would be a bit overkill here.
proc mixed data=import;
class gender pepper discipline;
model quality = gender numYears pepper discipline easiness raterInterest / solution;
run;

How to obtain individual estimates (slopes/intercepts) in proc mixed (SAS)

I'm interested in seeing how sedentary behaviors change throughout time (Time 1, 2, 3) and see in a second step how it relates to mental health.
Thus, I would like to obtain an estimate (slope/intercept) for each subject to allow me to do the 2nd step. I can't find online how to do it (not sure what to search for).
Here's my code so far, which gives me 2 estimates (boys and girls); I would rather have an estimate for every participant.
ods output LSMeans=Means1;
proc mixed data=sb.LFcomplete method=ml covtest;
class SexeF time;
model CompDay = Time SexeF Time*SexeF;
repeated time;
lsmeans time*sexeF;
run;
Thank you in advance!
Please check this website for a similar example:
https://www.stat.ncsu.edu/people/davidian/courses/st732/examples/ex10_1.sas
The professor was using HLM for longitudinal data analysis. He used gender like your SexeF, age like your Time, and child as the ID. The tricky part is that, when organizing the random-effects file, he sorted by ID and created Gender (the group variable, like your SexeF) for subsequent merging with the fixed-effects file. If your current ID variable is not aligned with your SexeF, you may need to sort by SexeF and create a new ID variable in SPSS before you import your data into SAS.
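The general idea behind that example is a random-coefficient model: each subject gets their own random intercept and slope, and the solution option on the random statement prints them. A minimal sketch, assuming a participant identifier named id and a numeric time variable named TimeNum (both names are hypothetical and do not appear in the question):

```sas
/* Assumptions: id identifies each participant; TimeNum is a numeric
   copy of time (1, 2, 3). Neither name comes from the original post. */
ods output SolutionR=SubjectEffects;   /* per-subject random-effect estimates */
proc mixed data=sb.LFcomplete method=ml covtest;
   class SexeF id;
   model CompDay = TimeNum SexeF TimeNum*SexeF / solution;
   /* random intercept and slope for each participant */
   random intercept TimeNum / subject=id type=un solution;
run;
```

The SolutionR table holds each subject's deviations from the fixed effects, so a subject's own intercept or slope is the corresponding fixed-effect estimate plus that deviation. With only three time points per subject, the unstructured covariance may be hard to estimate; treat this as a starting point rather than a final model.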

put already known coefficient to the new dataset to predict new y

Hi all,
I already have the coefficients from a logistic regression. I would like to apply those old coefficients to the variables in a new data set to predict a new "y". What should I do? I have already tried proc score, but I'm not sure it is the proper way.
You can use the INMODEL= option of PROC LOGISTIC together with a SCORE statement, as in the example from the SAS documentation:
proc logistic inmodel = your_coefficient_file_from_logistic_run;
score data= new_dataset_to_score out=new_scored_dataset;
run;
Let me know if you have any questions
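Note that INMODEL= expects the model data set written by the OUTMODEL= option of an earlier PROC LOGISTIC run, not a hand-built table of coefficients. A minimal sketch of the two steps, assuming hypothetical data sets train and new_data, a binary response y, and predictors x1 and x2:

```sas
/* Step 1: fit the model once and save it.
   train, y, x1, and x2 are hypothetical names. */
proc logistic data=train outmodel=model_store;
   model y(event='1') = x1 x2;
run;

/* Step 2: score new data with the stored model; nothing is refit here. */
proc logistic inmodel=model_store;
   score data=new_data out=scored;
run;
```

The new data set needs the same predictor names as the training data for the score statement to match them up.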

identify nature of missingness for categorical variables

Could you please give me some hints for identifying the nature of missingness in categorical variables? I did a quick search on Google Scholar but didn't find anything relevant. How can I tell whether missing values are missing completely at random, missing at random, or missing not at random? Other than studying the domain, I can't think of anything. Links to papers are appreciated. Thanks in advance.
(I'm working in SAS, but the question is not specific to that language.)
Since you've tagged this as SAS, one approach you could take would be to create a boolean variable for each of your categorical variables indicating whether or not it has a missing value in each row. Then you could do whatever analysis you like on the frequency of missing values, using the flags. E.g. you could use proc corr to see if missing values of one variable correlate with values of other variables.
E.g. suppose you have a situation like this:
data example;
set sashelp.class;
if AGE > 14 then call missing(SEX);
SEX_MISSING_FLAG = missing(SEX);
run;
Then you could spot it by running the following:
proc corr data = example outp= corr;
var age weight height sex_missing_flag;
run;
Output:
_TYPE_,_NAME_,Age,Weight,Height,SEX_MISSING_FLAG
MEAN,,13.32,100.03,62.34,0.26
STD,,1.49,22.77,5.13,0.45
N,,19.00,19.00,19.00,19.00
CORR,Age,1.00,0.74,0.81,0.78
CORR,Weight,0.74,1.00,0.88,0.64
CORR,Height,0.81,0.88,1.00,0.55
CORR,SEX_MISSING_FLAG,0.78,0.64,0.55,1.00