Selecting threshold for a model with binary class labels in python - python-2.7

Usecase : Selecting the "optimal threshold" for a Logistic model built with statsmodel's Logit to predict say, binary classes (or multinomial, but integer classes)
To select your threshold for a (say,logistic) model in Python, is there something that's inbuilt ? For small data sets, I remember, optimizing for the "threshold", by picking up the maximum buckets of true predicted labels (true "0" and true "1") , best seen from the graph here -
http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
I also know intuitively that if I set alpha values, it should give me a "threshold" that I can use below. How should I compute the threshold given a reduced model with variables, all of which are significant at 95% confidence ? Obviously setting the threshold >0.5 ->"1" would be too naive & since I am looking at 95% confidence this threshold should be "smaller" , meaning p >0.2 or something.
This would then mean something like range of "critical values" if the label should be "1" and "0" otherwise.
What I want is something like this -:
test_scores = smf.Logit(y_train,x_train,missing='drop').fit()
threshold =0.2
#test_scores.predict(x_train,transform=False) will give the continues probability class, so to transform it into labels, I need to compare it against a threshold, (or x_test if I am testing the model)
y_predicted_train = np.array(test_scores.predict(x_train,transform=False) > threshold, dtype=float)
table = np.histogram2d(y_train, y_predicted_train, bins=2)[0]
# will do the similar on "test" data
# crude way of selecting an optimal threshold
from scipy.stats import ks_2samp
import numpy as np
ks_2samp(y_train, y_predicted_train)
(0.39963996399639962, 0.958989)
# must get <95 % here & keep modifying the threshold as above till I fail to reject the Null at 95%
# where y_train is REAL values & y_predicted back on the TRAIN dataset . Note that to get y_predicted (as binary, I already did the thresholding as above
Question :-
1. How can I select the threshold in an objective way - ie reduce the percentage of misclassified labels (say I care more for missing "1" (true positives), but not so much if I mispredict a "0" as "1" ( false negatives) & try to reduce this error. This I get from ROC curve . The roc curve in statsmodels(roc_curve) assumes that I have done the labelling for y_predicted class, and I am just revalidating this over test ( point me if my understanding is incorrect). I also think, using the confusion matrix also will not solve picking up the threshold problem
2. Which bring me to - How should I consume the output of these inbuilt functions (oob , confusion_matrix) to suit for selecting the optimal threshold (first on train sample, & then fine tune it over Test & cross validation sample)
I also looked up the official documentation of K-S tests in scipy here-
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
Related -:
Statistics Tests (Kolmogorov and T-test) with Python and Rpy2

Related

Overfitting with random forest though very successful cross validation results

I have moderate experience with data science. I have a data set with 9500 observations and more than 4500 features most of which are highly correlated. Here is briefly what I have tried: I have dropped columns where there are less than 6000 non-NAs and have imputed NAs with their corresponding columns' median values when there are at least 6000 non-NAs. As for correlation, I have kept only features having at most 0.7 correlation with others. By doing so, I have reduced the number of features to about 750. Then I have used those features in my binary classification task in random forest.
My data set is highly unbalanced where ratio of (0:1) is (10:1). So when I apply RF with 10-fold cv, I observe too good results in each cv (AUC of 99%) which is to good to be true and in my test set I got way worse results such as 0.7. Here is my code:
import h2o
from h2o.estimators import H2ORandomForestEstimator
h2o.init(port=23, nthreads=4)
train = fs_rf[fs_rf['Year'] <= '201705']
test = fs_rf[fs_rf['Year'] > '201705']
train = train.drop('Year',axis=1)
test = test.drop('Year',axis=1)
test.head()
train = h2o.H2OFrame(train)
train['BestWorst2'] = train['BestWorst2'].asfactor()
test = h2o.H2OFrame(test)
test['BestWorst2'] = test['BestWorst2'].asfactor()
training_columns = train.drop('BestWorst2',axis=1).col_names
response_column = 'BestWorst2'
model = H2ORandomForestEstimator(ntrees=100, max_depth=20, nfolds=10, balance_classes=True)
model.train(x=training_columns, y=response_column, training_frame=train)
performance = model.model_performance(test_data=test)
print(performance)
How could I avoid this over-fitting? I have tried many different parameters in grid search but none of them improved the results.
This is not what I would call "overfitting". The reason you are seeing really good cross-validation metrics compared to your test metrics is that you have time-series data and so you can't use k-fold cross-validation to give you an accurate estimate of performance.
Performing k-fold cross-validation on a time-series dataset will give you overly-optimistic performance metrics because you are not respecting the time-series component in your data. Regular k-fold cross-validation will randomly sample from your whole dataset to create a train & validation set. Essentially, your validation strategy is "cheating" because you have "future" data included in your CV training sets (if that makes any sense).
I can see by your code that you understand that you need to train with "past" data and predict on "future" data, but if you want to read more about this topic, I'd recommend this article or this article.
One solution is to simply look at test set performance as way to evaluate your model. Another option is to use what's called "rolling" or "time-series" cross-validation, but H2O does not currently support that (though it seems like it might be added soon). Here's a ticket for this if you want to keep track of the progress.

Train program to understand high and low value in machine learning

I am generating alerts by reading dataset for KPI (key performance indicator) . My algorithm is looking into historical data and based on that I am able to capture if there's sudden spike in data. But I am generating false alarms . For example KPI1 is historically at .5 but reaches value 12, which is kind of spike .
Same way KPI2 also reaches from .5 to 12. But I know that KPI reaching from .5 to 12 is not a big deal and I need not to capture that . same way KPI2 reaching from .5 to 12 is big deal and I need to capture that.
I want to train my program to understand what is high value , low value or normal value for each KPI.
Could you experts tell me which is best ML algorithm is for this and any package in python I need to explore?
This is the classification problem. You can use classic logistic regression algorithm to classify any given sample into either high value, low value or normal value.
Quoting from the Wikipedia,
In statistics, multinomial logistic regression is a classification
method that generalizes logistic regression to multiclass problems,
i.e. with more than two possible discrete outcomes. That is, it is
a model that is used to predict the probabilities of the different
possible outcomes of a categorically distributed dependent variable,
given a set of independent variables (which may be real-valued,
binary-valued, categorical-valued, etc.)
To perform multi-class classification in python, sklearn library can be useful.
http://scikit-learn.org/stable/modules/multiclass.html

Random Forest with more features than data points

I am trying to predict whether a particular service ticket raised by client needs a code change.
I have training data.
I have around 17k data points with problem description and tag (Y for code change required and N for no code change)
I did TF-IDF and it gave me 27k features. So I tried to fit RandomForestClassifier (sklearn python) with this 17k x 27k matrix.
I am getting very low scores on test set while training accuracy is very high.
Precision on train set: 89%
Precision on test set: 21%
Can someone suggest any workarounds?
I am using this model now:
sklearn.RandomForestClassifier(n_jobs=3,n_estimators=100,class_weight='balanced',max_features=None,oob_score=True)
Please help!
EDIT:
I have 11k training data with 900 positives (skewed). I tried LinearSVC sparsify but didn't work as well as Truncated SVD (Latent Semantic Indexing). maxFeatures=None performs better on the test set than without it.
I have also tried SVM, logistic (l2 and l1), ExtraTrees. RandonForest still is working best.
Right now, going at 92% precision on positives but recall is 3% only
Any other suggestions would be appreciated!
Update:
Feature engineering helped a lot. I pulled features out of the air (len of chars, len of words, their, difference, ratio, day of week the problem was of reported, day of month, etc) and now I am at 19-20% recall with >95% accuracy.
Food for your thoughts on using word2vec average vectors as deep features for the free text instead of tf-idf or bag of words ???
[edited]
Random forest handles more features than data points quite fine. RF is e.g. used for micro-array studies with e.g. a 100:5000 data point/feature ratio or in single-nucleotide_polymorphism(SNP) studies with e.g 5000:500,000 ratio.
I do disagree with the diagnose provided by #ncfirth, but the suggested treatment of variable selection may help anyway.
Your default random forest is not badly overfitted. It is just not meaningful to pay any attention to a non-cross validated training set prediction performance for a RF model, because any sample will end in the terminal nodes/leafs it has itself defined. But the overall ensemble model is still robust.
[edit] If you would change the max_depth or min_samples_split, the training precision would probably drop, but that is not the point. The non-cross validated training error/precision of a random forest model or many other ensemble models simply does not estimate anything useful.
[I did before edit confuse max_features with n_estimators, sry I mostly use R]
Setting max_features="none" is not random forest, but rather 'bagged trees'. You may benefit from a somewhat lower max_features which improve regularization and speed, maybe not. I would try lowering max_features to somewhere between 27000/3 and sqrt(27000), the typical optimal range.
You may achieve better test set prediction performance by feature selection. You can run one RF model, keep the top ~5-50% most important features and then re-run the model with fewer features. "L1 lasso" variable selection as ncfirth suggests may also be a viable solution.
Your metric of prediction performance, precision, may not be optimal in case unbalanced data or if the cost of false-negative and false-positive is quite different.
If your test set is still predicted much worse than the out-of-bag cross-validated training set, you may have problems with your I.I.D. assumptions that any supervised ML model rely on or you may need to wrap the entire data processing in an outer cross-validation loop, to avoid over optimistic estimation of prediction performance due to e.g. the variable selection step.
Seems like you've overfit on your training set. Basically the model has learnt noise on the data rather than the signal. There are a few ways to combat this, but it seems fairly obvious that you're model has overfit because of the incredibly large number of features you're feeding it.
EDIT:
It seems I was perhaps too quick to jump to the conclusion of overfitting, however this may still be the case (left as an exercise to the reader!). However feature selection may still improve the generalisability and reliability of your model.
A good place to start for removing features in scikit-learn would be here. Using sparsity is a fairly common way to perform feature selection:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
import numpy as np
# Create some data
X = np.random.random((1800, 2700))
# Boolean labels as the y vector
y = np.random.random(1800)
y = y > 0.5
y = y.astype(bool)
lsvc = LinearSVC(C=0.05, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print X_new.shape
Which returns a new matrix of shape (1800, 640). You can tune the number of features selected by altering the C parameter (called the penalty parameter in scikit-learn but sometimes called the sparsity parameter).

how do generate confidence intervals/standard error of the mean for AUC?

When I use a classifier like LibSVM/SMO under 10-fold cross validation, I get an output for ROC area (AUC) but there is no confidence interval given (as it was performed 10 times). Is there a way to output this?
With "output predictions" in "classifier evaluation option", you get predictions for each instances; with these prediction values you can compute AUC and confidence interval with other program (for exaple R or http://www.biosoft.hacettepe.edu.tr/easyROC/). I suggest to use leave one out cross validation.

Regression Tree Forest in Weka

I'm using Weka and would like to perform regression with random forests. Specifically, I have a dataset:
Feature1,Feature2,...,FeatureN,Class
1.0,X,...,1.4,Good
1.2,Y,...,1.5,Good
1.2,F,...,1.6,Bad
1.1,R,...,1.5,Great
0.9,J,...,1.1,Horrible
0.5,K,...,1.5,Terrific
.
.
.
Rather than learning to predict the most likely class, I want to learn the probability distribution over the classes for a given feature vector. My intuition is that using just the RandomForest model in Weka would not be appropriate, since it would be attempting to minimize its absolute error (maximum likelihood) rather than its squared error (conditional probability distribution). Is that intuition right? Is there a better model to be using if I want to perform regression rather than classification?
Edit: I'm actually thinking now that in fact it may not be a problem. Presumably, classifiers are learning the conditional probability P(Class | Feature1,...,FeatureN) and the resulting classification is just finding the c in Class that maximizes that probability distribution. Therefore, a RandomForest classifier should be able to give me the conditional probability distribution. I just had to think about it some more. If that's wrong, please correct me.
If you want to predict the probabilities for each class explicitly, you need different input data. That is, you would need to replace the value to predict. Instead of one data set with the class label, you would need n data sets (for n different labels) with aggregated data for each unique feature vector. Your data would look something like
Feature1,...,Good
1.0,...,0.5
0.3,...,1.0
and
Feature1,...,Bad
1.0,...,0.8
0.3,...,0.1
and so on. You would need to learn one model for each class and run them separately on any data to be classified. That is, for each label you learn a model to predict a number that is the probability of being in that class, given a feature vector.
If you don't need the probabilities to be predicted explicitly, have a look at the Bayesian classifiers in Weka, which make use of probabilities in the models that they learn.