I'm using the RFECV module in sklearn to find the optimal number of features, i.e. the subset that yields the highest cross-validation score on 2 folds. I am using a ridge regressor as my estimator.
rfecv = RFECV(estimator=ridge, step=1, cv=KFold(n_splits=2))
rfecv.fit(df, y)
I have 5 features in my dataset, which I have standardized using StandardScaler.
When I run RFECV on my data, it says that 2 features are optimal. But when I remove the feature with the lowest regression coefficient and rerun RFECV, it now says that 3 features are optimal.
When I step through the features one at a time myself (as the recursive elimination should do), I find that 3 is in fact optimal.
I've tested this with other datasets, and have found that the optimal number of features changes as I remove features one at a time and rerun RFECV.
I might be missing something, but isn't that what RFECV is supposed to solve?
Any additional insight on RFECV is appreciated.
This actually makes sense. RFECV recommends a number of features based on the data it is given; when you remove a feature, you change the range of feature subsets being scored.
From the docs:
# Determine the number of subsets of features by fitting across
# the train folds and choosing the "features_to_select" parameter
# that gives the least averaged error across all folds.
...
n_features_to_select = max(
    n_features - (np.argmax(scores) * step),
    n_features_to_select)
n_features_to_select is then used to determine how many features the internal RFE keeps on any particular iteration (under the hood of RFECV).
rfe = RFE(estimator=self.estimator,
          n_features_to_select=n_features_to_select,
          step=self.step, verbose=self.verbose)
And so this is directly connected to the number of features you include in your initial rfecv.fit() step.
Also, removing the feature with the lowest regression coefficient is not the best way to trim features. The coefficient reflects a feature's impact on the dependent variable, not necessarily on the model's accuracy.
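For example, rather than manually dropping the feature with the smallest coefficient, you can inspect what the fitted selector itself decided. A sketch, assuming the rfecv and df from the question have already been fitted (note the attribute holding the per-subset CV scores differs between sklearn versions):
import numpy as np

print("Optimal number of features:", rfecv.n_features_)
print("Ranking (1 = selected):", rfecv.ranking_)
print("Selected columns:", list(df.columns[rfecv.support_]))

# Mean CV score per candidate subset size: grid_scores_ in older sklearn
# releases, cv_results_['mean_test_score'] in newer ones.
scores = getattr(rfecv, "grid_scores_", None)
if scores is None:
    scores = rfecv.cv_results_["mean_test_score"]
print("CV score per subset size:", np.asarray(scores))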
I am working on a multilabel text classification problem with 10 labels.
The dataset is small: about 7,000 items and about 7,500 labels in total. I am using Python scikit-learn, and something strange came up in the results. As a baseline I started out with CountVectorizer, planning to switch to TfidfVectorizer, which I thought would work better. But it doesn't: with CountVectorizer I get an F1 score about 0.1 higher (0.76 vs 0.65).
I cannot wrap my head around why this could be the case.
There are 10 categories, and one of them is called "miscellaneous". That one in particular gets much lower performance with tf-idf.
Does anyone know when tf-idf can perform worse than plain counts?
The question is, why not? They are two different solutions.
What is your dataset, how many words does it contain, how are the items labelled, and how do you extract your features?
CountVectorizer simply counts the words; if it does a good job, so be it.
There is no reason why IDF should add information for a classification task. It performs well for search and ranking, but classification needs to capture similarity, not singularity.
IDF is meant to spot what makes one sample singular relative to the rest of the corpus, whereas what you are looking for is what distinguishes one cluster from the others. IDF smooths away the intra-cluster TF similarity.
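One way to sanity-check this on your own data is to cross-validate both vectorizers in otherwise identical pipelines. A rough sketch, where texts and labels stand in for your dataset and the classifier and scoring choices are arbitrary:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Same classifier, same folds; only the vectorizer changes.
for vec in (CountVectorizer(), TfidfVectorizer()):
    pipe = make_pipeline(vec, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, texts, labels, cv=5, scoring="f1_macro")
    print(type(vec).__name__, round(scores.mean(), 3))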
I'm looking to perform feature selection with a multi-label dataset using sklearn. I want to get the final set of features across labels, which I will then use in another machine learning package. I was planning to use the method I saw here, which selects relevant features for each label separately.
from sklearn.svm import LinearSVC
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

clf = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
                ('svm', LinearSVC())])
multi_clf = OneVsRestClassifier(clf)
I then plan to extract the indices of the included features, per label, using this:
selected_features = []
for i in multi_clf.estimators_:  # available once multi_clf has been fitted
    selected_features += list(i.named_steps["chi2"].get_support(indices=True))
Now, my question is, how do I choose which selected features to include in my final model? I could use every unique feature (which would include features that were only relevant for one label), or I could do something to select features that were relevant for more labels.
My initial idea is to create a histogram of the number of labels a given feature was selected for, and to identify a threshold based on visual inspection. My concern is that this method is subjective. Is there a more principled way of performing feature selection for multilabel datasets using sklearn?
According to the conclusions in this paper:
[...] rank features according to the average or the maximum Chi-squared score across all labels, led to most of the best classifiers while using less features.
Then, in order to select a good subset of features you just need to do (something like) this:
import numpy as np
from sklearn.feature_selection import chi2, SelectKBest

selected_features = []
for label in labels:  # e.g. labels = Y.columns
    selector = SelectKBest(chi2, k='all')
    selector.fit(X, Y[label])
    selected_features.append(list(selector.scores_))

# MeanCS: keep features whose mean chi-squared score across labels exceeds the threshold
selected_features = np.mean(selected_features, axis=0) > threshold
# MaxCS (alternative to MeanCS; pick one):
# selected_features = np.max(selected_features, axis=0) > threshold
Note: in the code above I'm assuming that X is the output of some text vectorizer (the vectorized version of the texts) and Y is a pandas dataframe with one column per label (so I can select the column Y[label]). Also, there is a threshold variable that should be fixed beforehand.
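As a small usage sketch, the resulting boolean mask can then be applied to the vectorized matrix (converting it to integer column indices also works for scipy sparse matrices):
import numpy as np

keep = np.where(selected_features)[0]  # indices of the features that passed the threshold
X_selected = X[:, keep]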
http://scikit-learn.org/stable/modules/feature_selection.html
There is a multitude of options, but SelectKBest and Recursive feature elimination are two reasonably popular ones.
RFE works by leaving uninformative features out of the model, retraining, and comparing the results, so that the features left at the end are the ones that enable the best prediction accuracy.
What is best is highly dependent on your data and use case.
Aside from what can loosely be described as cross-validation approaches to feature selection, you can look at Bayesian model selection, which is a more theoretical approach and tends to favor simpler models over complex ones.
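For concreteness, here are minimal sketches of the two options mentioned above on a synthetic dataset (k=10 and the logistic-regression estimator are arbitrary illustrative choices):
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

# Univariate selection: keep the k features with the highest ANOVA F-score.
X_kbest = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Recursive feature elimination: repeatedly drop the weakest features and refit.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
X_rfe = X[:, rfe.support_]
print(X_kbest.shape, X_rfe.shape)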
I am trying to predict whether a particular service ticket raised by a client needs a code change.
I have training data.
I have around 17k data points with problem description and tag (Y for code change required and N for no code change)
I did TF-IDF and it gave me 27k features. So I tried to fit RandomForestClassifier (sklearn python) with this 17k x 27k matrix.
I am getting very low scores on the test set while training accuracy is very high.
Precision on train set: 89%
Precision on test set: 21%
Can someone suggest any workarounds?
I am using this model now:
sklearn.ensemble.RandomForestClassifier(n_jobs=3, n_estimators=100, class_weight='balanced',
                                        max_features=None, oob_score=True)
Please help!
EDIT:
I have 11k training data points with 900 positives (skewed). I tried sparsifying with LinearSVC, but it didn't work as well as Truncated SVD (latent semantic indexing). max_features=None performs better on the test set than leaving it at the default.
I have also tried SVM, logistic regression (L2 and L1), and ExtraTrees. RandomForest is still working best.
Right now I am getting 92% precision on positives, but recall is only 3%.
Any other suggestions would be appreciated!
Update:
Feature engineering helped a lot. I pulled features out of the air (length in characters, length in words, their difference and ratio, day of week the problem was reported, day of month, etc.), and now I am at 19-20% recall with >95% accuracy.
Food for thought: what about using averaged word2vec vectors as deep features for the free text instead of tf-idf or bag of words?
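If you wanted to try that, a rough sketch of averaged word2vec document vectors, assuming a pretrained model loaded with gensim (the file name vectors.bin is hypothetical and texts stands in for the ticket descriptions):
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # hypothetical pretrained model

def doc_vector(text):
    # Average the vectors of all in-vocabulary tokens; zeros if none match.
    tokens = [t for t in text.lower().split() if t in wv]
    if not tokens:
        return np.zeros(wv.vector_size)
    return np.mean([wv[t] for t in tokens], axis=0)

X_w2v = np.vstack([doc_vector(t) for t in texts])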
[edited]
Random forest handles having more features than data points quite well. RF is used, for example, in micro-array studies with a 100:5,000 data-point/feature ratio, or in single-nucleotide polymorphism (SNP) studies with a 5,000:500,000 ratio.
I do disagree with the diagnosis provided by ncfirth, but the suggested treatment of variable selection may help anyway.
Your default random forest is not badly overfitted. It is just not meaningful to pay attention to non-cross-validated training-set prediction performance for an RF model, because any sample will end up in the terminal nodes/leaves it has itself defined. The overall ensemble model is still robust.
[edit] If you changed max_depth or min_samples_split, the training precision would probably drop, but that is not the point. The non-cross-validated training error/precision of a random forest model (or many other ensemble models) simply does not estimate anything useful.
[Before the edit I confused max_features with n_estimators; sorry, I mostly use R.]
Setting max_features=None is not random forest but rather 'bagged trees'. You may benefit from a somewhat lower max_features, which improves regularization and speed, or maybe not. I would try lowering max_features to somewhere between 27000/3 and sqrt(27000), the typical optimal range.
You may achieve better test-set prediction performance through feature selection. You can run one RF model, keep the top ~5-50% most important features, and then re-train the model with only those features. L1 (lasso) variable selection, as ncfirth suggests, may also be a viable solution.
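A sketch of that keep-the-top-features-and-refit idea (X_train/X_test, y_train/y_test and the 90th-percentile cutoff are illustrative assumptions):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", n_jobs=3)
rf.fit(X_train, y_train)

# Keep roughly the top 10% most important features and retrain on them.
threshold = np.percentile(rf.feature_importances_, 90)
selector = SelectFromModel(rf, threshold=threshold, prefit=True)
X_train_small = selector.transform(X_train)
X_test_small = selector.transform(X_test)

rf_small = RandomForestClassifier(n_estimators=100, class_weight="balanced", n_jobs=3)
rf_small.fit(X_train_small, y_train)
print(rf_small.score(X_test_small, y_test))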
Your metric of prediction performance, precision, may not be optimal in the case of unbalanced data, or if the costs of false negatives and false positives are quite different.
If your test set is still predicted much worse than the out-of-bag cross-validated training set, you may have problems with the i.i.d. assumptions that any supervised ML model relies on, or you may need to wrap the entire data processing in an outer cross-validation loop to avoid overly optimistic estimates of prediction performance due to, e.g., the variable selection step.
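A sketch of that outer wrapping, putting the selection step inside the cross-validation pipeline so it is refit on each training fold (X, y, the L1 selector and the recall scorer are illustrative assumptions):
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Selection is refit inside each fold, so it cannot leak test-fold information.
pipe = make_pipeline(
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)),
    RandomForestClassifier(n_estimators=100, class_weight="balanced"),
)
print(cross_val_score(pipe, X, y, cv=5, scoring="recall").mean())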
It seems like you've overfit on your training set. Basically the model has learnt the noise in the data rather than the signal. There are a few ways to combat this, but it seems fairly obvious that your model has overfit because of the incredibly large number of features you're feeding it.
EDIT:
It seems I was perhaps too quick to jump to the conclusion of overfitting; however, this may still be the case (left as an exercise to the reader!). Either way, feature selection may still improve the generalisability and reliability of your model.
A good place to start for removing features in scikit-learn would be here. Using sparsity is a fairly common way to perform feature selection:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
import numpy as np

# Create some data
X = np.random.random((1800, 2700))
# Boolean labels as the y vector
y = np.random.random(1800) > 0.5

# L1-penalised linear SVM; the sparsity of its coefficients drives the selection
lsvc = LinearSVC(C=0.05, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X_new.shape)
This returns a new matrix of shape (1800, 640). You can tune the number of features selected by altering the C parameter (called the penalty parameter in scikit-learn, but sometimes called the sparsity parameter).
I've estimated a model via maximum likelihood in Stata and was surprised to find that estimated standard errors for one particular parameter are drastically smaller when clustering observations. I take it from the Stata manual on robust standard error estimation in ML that this can happen if the contributions of individual observations to the score (the derivative of the log-likelihood) tend to cancel each other within clusters.
I would now like to dig a little deeper into what exactly is happening and would therefore like to have a look at these score contributions. As far as I can see, however, Stata only gives me the total sum as e(gradient). Is there any way to pry the individual summands out of Stata?
If you have written your own command, you can create a new variable containing these scores using the ml score command. Official Stata commands and most finished user-written commands will often have score as an option for predict, which does the same thing with an easier syntax.
These will give you the score of the log likelihood ($\ell$) with respect to the linear predictor, $x\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$. To get the derivative of the log likelihood with respect to an individual parameter, say $\beta_1$, you just use the chain rule:
$\frac{\partial \ell}{\partial \beta_1} = \frac{\partial \ell }{\partial x\beta} \frac{\partial x\beta}{\partial \beta_1}$
The scores returned by Stata are $ \frac{\partial \ell }{\partial x\beta}$, and $\frac{\partial x\beta}{\partial \beta_1} = x_1$.
So, to get the score for $\beta_1$ you just multiply the score returned by Stata and $x_1$.
There are two parameters when using RBF kernels with Support Vector Machines: C and γ. It is not known beforehand which C and γ are best for a given problem; consequently, some kind of model selection (parameter search) must be done. The goal is to identify a good (C, γ) pair so that the classifier can accurately predict unknown data (i.e., testing data).
weka.classifiers.meta.GridSearch is a meta-classifier for tuning a pair of parameters. It seems, however, to take ages to finish when the dataset is rather large. What would you suggest doing to bring down the time required to accomplish this task?
According to A User's Guide to Support Vector Machines:
C: soft-margin constant. A smaller value of C allows the classifier to ignore points close to the boundary, and increases the margin.
γ > 0 is a parameter that controls the width of the Gaussian.
Hastie et al.'s SVMPath explores the entire regularization path for C and only requires about the same computational cost as training a single SVM model. From their paper:
Our R function SvmPath computes all 632 steps in the mixture example (n+ = n− = 100, radial kernel, γ = 1) in 1.44(0.02) secs on a pentium 4, 2Ghz linux machine; the svm function (using the optimized code libsvm, from the R library e1071) takes 9.28(0.06) seconds to compute the solution at 10 points along the path. Hence it takes our procedure about 50% more time to compute the entire path, than it costs libsvm to compute a typical single solution.
They released a GPLed implementation of the algorithm in R that you can download from CRAN here.
Using SVMPath should allow you to find a good C value for any given γ quickly. You would still need separate training runs for different γ values, but this should be much faster than doing a separate run for every (C, γ) pair.
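If a scikit-learn workflow is an option, the usual alternative is a coarse, log-spaced grid over (C, γ) that you then refine around the best cell. A sketch, where X, y and the grid bounds are placeholders:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": np.logspace(-2, 3, 6),      # coarse grid over C
    "gamma": np.logspace(-4, 1, 6),  # coarse grid over gamma
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)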