Probability Models - SAS

What sort of tests are available in SAS that allow you to figure out what the probability of an event occurring is? And I do mean probability, not odds.
Specifically, I would like to know what is the probability a person might say yes vs. no to a particular type of surgery based on their age or insurance status?
I have tried logistic regression, but it appears to only return odds, and again, I am interested in a statistical test that returns probabilities, not odds.
You would think I could just google "probability models (or tests) SAS" and get an answer, but strangely enough, I haven't found one - at least not a clear one. So here I am.
Any help is appreciated. Thank you!

If you are getting odds from logistic regression, this is how the probability should be calculated:
Probability = odds / (1 + odds)
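(Side note: PROC LOGISTIC can also output predicted probabilities directly, e.g. via the OUTPUT statement's PREDICTED= option.) Purely as an illustration of the arithmetic, here is a minimal Python sketch of the conversion and of turning a logistic model's linear predictor into a probability; the intercept and coefficients in the second function are made-up values, not estimates from any real model:

import math

def odds_to_probability(odds):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

print(odds_to_probability(3.0))   # odds of 3:1 in favour -> p = 0.75

# Logistic regression models log-odds; a predicted probability for a given
# covariate pattern is the inverse logit of the linear predictor.
# The intercept and coefficients below are made up purely for illustration.
def predicted_probability(age, insured, b0=-2.0, b_age=0.03, b_ins=0.8):
    log_odds = b0 + b_age * age + b_ins * insured
    return 1.0 / (1.0 + math.exp(-log_odds))

print(predicted_probability(age=55, insured=1))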

Related

How can I test overdispersion in STATA when using xtpoisson and xtnbreg?

I have balanced panel data, and my dependent variable is a count whose distribution has a lot of zeros.
I therefore think negative binomial regression might be more suitable than Poisson, but I cannot find out how to test whether xtnbreg or xtpoisson is appropriate for my data.
Can someone help me test for overdispersion so I can choose between the Poisson model and the negative binomial model?
Thank you in advance!

Interpreting results using J48 for a divided attribute of interest in x levels (WEKA)

I'm new to data mining and Weka. I built a classifier in the Weka GUI using J48 (evaluated on the training set) for an attribute of interest with five levels. I have to evaluate the precision of the model, but I don't know very well how to do it! Some information that may be of interest:
=== Detailed Accuracy By Class ===
Precision
0.80
?
0.67
0.56
?
?
First, I would like to know the meaning of the "?" in the precision column. When probing with an attribute of interest in two levels I got no "?". The tree is bigger now than when dividing into two levels, and I am wondering whether this means that taking an attribute of interest with five levels could generate a less efficient tree in terms of classification and computation time. This seems fairly clear, as the number of Correctly Classified Instances when the attribute had two levels was up to 72%.
Thank you in advance, all interesting answers will be rewarded!
"I would like to know the meaning of the "?" in the precision column"
Note that for these same classes the TP and FP rates are 0. It appears that J48 has not assigned any of your observations to these classes.
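As a minimal illustration of why those entries show up as "?": precision for a class c is TP_c / (TP_c + FP_c), which is 0/0 when the classifier never predicts c. The labels below are made up:

# Hypothetical true and predicted labels; class "c" is never predicted.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "b", "a", "a", "b"]

for cls in ("a", "b", "c"):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) > 0 else None  # None ~ Weka's "?"
    print(cls, precision)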
Are these classes relatively small? If so, you might want to consider using the ClassBalancer filter. This will use weights to make all classes look the same size.
Of course, after you get the model you need to "convert back" to the real situation. This is similar to correcting for physical oversampling or undersampling. See my answer here: https://stats.stackexchange.com/questions/211174/how-to-exact-prediction-from-over-sampled-dataundoing-oversampling/257507#257507
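The linked answer has the exact approach; as a rough sketch of the "convert back" idea, the standard Bayes prior-adjustment correction (with made-up numbers) looks like this:

def correct_for_resampling(p_balanced, true_prior, train_prior):
    """Map a probability predicted by a model trained on rebalanced data
    back to the original class priors (standard prior adjustment)."""
    num = p_balanced * (true_prior / train_prior)
    den = num + (1.0 - p_balanced) * ((1.0 - true_prior) / (1.0 - train_prior))
    return num / den

# Example: a class that is really 5% of the data but was balanced to 50%
# for training; a balanced-model score of 0.6 maps back to roughly 0.07.
print(correct_for_resampling(0.6, true_prior=0.05, train_prior=0.50))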

Sample Size and Power Calculation for Ordinal response (Ordinal Logistic Regression)

I'm working on a study looking at the impact of food insecurity on antiretroviral treatment (ART). ART is divided into three ordered groups: poor, fair, good; food insecurity is ordered into four groups: Severely food insecure, Moderately food insecure, Mildly food insecure, Food secure.
I need to calculate sample size and power for this study, and I've looked around, but haven't found clear answers. Some suggestions indicate estimating from binomial logistic, but that method seems debatable.
If it helps, I'm familiar with SAS's POWER procedure, but I welcome suggestions using R as well.
Thanks in advance!

Need help finding outliers in longitudinal data using SAS

I have a classroom of students with test scores taken weekly. I expect the test results to improve over time. I want to identify a poor performer as an outlier, based on not improving over time, using SAS (I have 9.2). Also, are there accepted criteria for being an outlier over part of the time interval but not the complete time interval? This is the bulk of my present code (not looking for outliers yet, just longitudinal analysis):
proc mixed data=XYZ_LONG;
    title1 'XYZ Analysis';
    class group day subject;
    model TV = group day group*day / ddfm=satterthwaite;
    repeated day / type=cs sub=subject;
run;
I don't think your definition of "poor performer" is a definition of an outlier. However:
If you want to find people who did not improve over time, that's pretty easy, but you have to define it more precisely. Did not improve between any two weeks? The first and last weeks? Something else?
And what exactly do you mean by "not improve"? Do you mean it literally (the same or a worse score at a later time)?
In either case, I'd use an array and find difference scores and then identify difference scores that were negative (or whatever you want).
However, if you are going to be doing modelling, then an outlier should probably be defined in terms of that model - that is, in your model, accounting for group. But if you have a lot of outliers and they aren't bad data, you should not throw out those people, but use a better model.
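The question is about SAS, but purely as an illustration of the difference-score idea, the same logic looks like this in Python/pandas (subject IDs and scores are made up):

import pandas as pd

# Hypothetical long-format data: one row per subject per week.
scores = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "week":    [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "score":   [50, 55, 60, 48, 47, 46, 52, 52, 58],
})

scores = scores.sort_values(["subject", "week"])
# Week-to-week change ("difference scores") within each subject.
scores["change"] = scores.groupby("subject")["score"].diff()

# Flag subjects with no improvement between any two consecutive weeks...
never_improved = scores.groupby("subject")["change"].max() <= 0
# ...and subjects with no improvement from the first to the last week.
first_last = scores.groupby("subject")["score"].agg(["first", "last"])
no_overall_gain = first_last["last"] <= first_last["first"]

print(never_improved[never_improved].index.tolist())
print(no_overall_gain[no_overall_gain].index.tolist())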

Random Forest with more features than data points

I am trying to predict whether a particular service ticket raised by a client needs a code change.
I have training data.
I have around 17k data points with problem description and tag (Y for code change required and N for no code change)
I did TF-IDF and it gave me 27k features. So I tried to fit RandomForestClassifier (sklearn python) with this 17k x 27k matrix.
I am getting very low scores on test set while training accuracy is very high.
Precision on train set: 89%
Precision on test set: 21%
Can someone suggest any workarounds?
I am using this model now:
sklearn.ensemble.RandomForestClassifier(n_jobs=3, n_estimators=100, class_weight='balanced', max_features=None, oob_score=True)
Please help!
EDIT:
I have 11k training data points with 900 positives (skewed). I tried LinearSVC sparsification but it didn't work as well as truncated SVD (latent semantic indexing). max_features=None performs better on the test set than the default.
I have also tried SVM, logistic regression (L2 and L1), and ExtraTrees. RandomForest is still working best.
Right now I'm at 92% precision on positives, but recall is only 3%.
Any other suggestions would be appreciated!
Update:
Feature engineering helped a lot. I pulled features out of the air (length in characters, length in words, their difference, their ratio, day of week the problem was reported, day of month, etc. - a sketch of these is below) and now I am at 19-20% recall with >95% accuracy.
Any thoughts on using averaged word2vec vectors as deep features for the free text instead of TF-IDF or bag of words?
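For reference, a minimal sketch of the kind of hand-crafted features mentioned in the update above (the column names and example tickets are hypothetical):

import pandas as pd

# Hypothetical ticket data: free-text problem description plus a report date.
tickets = pd.DataFrame({
    "description": ["App crashes when saving report",
                    "Password reset email never arrives after requesting it"],
    "reported_on": pd.to_datetime(["2017-03-06", "2017-03-11"]),
})

feats = pd.DataFrame(index=tickets.index)
feats["n_chars"] = tickets["description"].str.len()
feats["n_words"] = tickets["description"].str.split().str.len()
feats["char_word_diff"] = feats["n_chars"] - feats["n_words"]
feats["chars_per_word"] = feats["n_chars"] / feats["n_words"]
feats["day_of_week"] = tickets["reported_on"].dt.dayofweek
feats["day_of_month"] = tickets["reported_on"].dt.day

print(feats)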
[edited]
Random forest handles more features than data points quite well. RF is used, for example, in micro-array studies with data point/feature ratios around 100:5,000, or in single-nucleotide polymorphism (SNP) studies with ratios around 5,000:500,000.
I disagree with the diagnosis provided by @ncfirth, but the suggested treatment of variable selection may help anyway.
Your default random forest is not badly overfitted. It is just not meaningful to pay any attention to the non-cross-validated training-set prediction performance of an RF model, because any sample will end up in the terminal nodes/leaves it has itself defined. But the overall ensemble model is still robust.
[edit] If you changed max_depth or min_samples_split, the training precision would probably drop, but that is not the point. The non-cross-validated training error/precision of a random forest model, or of many other ensemble models, simply does not estimate anything useful.
[Before the edit I confused max_features with n_estimators; sorry, I mostly use R.]
Setting max_features=None is not random forest but rather 'bagged trees'. You may benefit from a somewhat lower max_features, which improves regularization and speed, or maybe not. I would try lowering max_features to somewhere between 27000/3 and sqrt(27000), the typical optimal range.
You may achieve better test-set prediction performance with feature selection. You can run one RF model, keep the top ~5-50% most important features, and then re-run the model with only those features. "L1 lasso" variable selection, as @ncfirth suggests, may also be a viable solution.
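A rough sketch of that importance-based selection step (synthetic data, an arbitrary top-10% cut-off, and a lower max_features as suggested above):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 500)                     # synthetic stand-in for the TF-IDF matrix
y = (X[:, 0] + X[:, 1] > 1).astype(int)     # outcome driven by only a few features

# 1) Initial forest with max_features="sqrt" instead of max_features=None.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            class_weight="balanced", oob_score=True,
                            n_jobs=-1, random_state=0).fit(X, y)

# 2) Keep the top ~10% most important features and refit on those only.
k = X.shape[1] // 10
top = np.argsort(rf.feature_importances_)[::-1][:k]
rf_small = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                  oob_score=True, n_jobs=-1,
                                  random_state=0).fit(X[:, top], y)

print(rf.oob_score_, rf_small.oob_score_)   # out-of-bag estimates, not training accuracy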
Your metric of prediction performance, precision, may not be optimal in the case of unbalanced data or if the costs of false negatives and false positives are quite different.
If your test set is still predicted much worse than the out-of-bag cross-validated training set, you may have a problem with the i.i.d. assumption that any supervised ML model relies on, or you may need to wrap the entire data processing in an outer cross-validation loop to avoid over-optimistic estimates of prediction performance due to, e.g., the variable selection step.
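A hedged sketch of that last point: putting the selection step inside a Pipeline means it is re-fit on each training fold, so it cannot leak information into the performance estimate (data here is synthetic):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(600, 300)
y = (X[:, 0] > 0.5).astype(int)

pipe = Pipeline([
    # L1-penalised linear SVM used only to pick features, inside each fold.
    ("select", SelectFromModel(LinearSVC(C=0.05, penalty="l1", dual=False))),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Each fold refits the whole pipeline, so the selection step never sees
# the held-out data used to score that fold.
print(cross_val_score(pipe, X, y, cv=5, scoring="precision").mean())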
It seems like you've overfit on your training set. Basically, the model has learnt the noise in the data rather than the signal. There are a few ways to combat this, but it seems fairly obvious that your model has overfit because of the incredibly large number of features you're feeding it.
EDIT:
It seems I was perhaps too quick to jump to the conclusion of overfitting, though it may still be the case (left as an exercise to the reader!). Either way, feature selection may still improve the generalisability and reliability of your model.
A good place to start for removing features in scikit-learn would be here. Using sparsity is a fairly common way to perform feature selection:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
import numpy as np
# Create some data
X = np.random.random((1800, 2700))
# Boolean labels as the y vector
y = np.random.random(1800)
y = y > 0.5
y = y.astype(bool)
lsvc = LinearSVC(C=0.05, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X_new.shape)
This returns a new matrix of shape (1800, 640). You can tune the number of features selected by altering the C parameter (called the penalty parameter in scikit-learn, but sometimes called the sparsity parameter).
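For example, a short sketch of how the number of retained features varies with C (random data again, so the exact counts will differ from run to run):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
X = rng.rand(1800, 2700)
y = rng.rand(1800) > 0.5

# Smaller C means a stronger L1 penalty and therefore fewer features kept.
for C in (0.01, 0.05, 0.1, 0.5):
    lsvc = LinearSVC(C=C, penalty="l1", dual=False).fit(X, y)
    n_kept = SelectFromModel(lsvc, prefit=True).transform(X).shape[1]
    print(C, n_kept)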