Random Forest : finding relevant features - python-2.7

I am trying to train a random forest (RF) model in sklearn for classification. The test accuracy I get with a specified set of features is quite low. I assume that the features I chose are misleading the model, so I tried RFE, RFECV, etc. to find a relevant subset of features, but that didn't improve the accuracy. I came up with a simple feature selection process, shown below:
ml_feats = [...]  # initial set of features (column names of X)
while True:
    feats_to_del = []
    prev_score = 0
    # grow the feature prefix one feature at a time
    for feat_len in range(2, len(ml_feats) + 1):
        classifier = RandomForestClassifier(**init_params)
        classifier.fit(X[ml_feats[:feat_len]], Y)
        score = classifier.score(Xt[ml_feats[:feat_len]], Yt)
        if score < prev_score:
            # the feature added last, ml_feats[feat_len - 1], caused the score to decrease
            print(ml_feats[feat_len - 1])
            feats_to_del.append(ml_feats[feat_len - 1])
        prev_score = score
    if len(feats_to_del) == 0:
        break
    # delete irrelevant features, keeping the remaining ones in their original order
    ml_feats = [f for f in ml_feats if f not in feats_to_del]
print(ml_feats)  # print all relevant features
Does the above code help figure out the right set of features?
Thanks

What you are doing is greedy feature selection. If you want to use RandomForestClassifier to select features, you can do something like:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
# xtrain : training data
# ytrain : training labels
clf = RandomForestClassifier()
sfm = SelectFromModel(estimator=clf, threshold='mean') # threshold of selection is mean of feature importances by random forest classifier
sfm.fit(xtrain, ytrain)
selected_xtrain = sfm.transform(xtrain)
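To see which features were actually kept, the fitted selector exposes a boolean mask through get_support(). The following is a minimal sketch on synthetic data; the dataset from make_classification, the feature names f0..f9 and the n_estimators/random_state settings are assumptions made for illustration, not part of the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption): 200 samples, 10 features, 3 informative
xtrain, ytrain = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
feature_names = ["f%d" % i for i in range(xtrain.shape[1])]  # hypothetical names

clf = RandomForestClassifier(n_estimators=100, random_state=0)
sfm = SelectFromModel(estimator=clf, threshold="mean")
sfm.fit(xtrain, ytrain)

# Boolean mask over the original columns: True where the feature's
# importance is at least the mean importance
mask = sfm.get_support()
selected_names = [name for name, keep in zip(feature_names, mask) if keep]
print(selected_names)

# transform keeps exactly the columns marked True in the mask
selected_xtrain = sfm.transform(xtrain)
```

The same mask can be applied to the test set with sfm.transform(xtest), so that training and test data end up with identical columns.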

Related

How to generate Scikit-Learn Gaussian Process regression with 2D input, 1D output

I have been looking for an answer to my question for quite a while, with no luck so far :(. I will keep my question as simple as possible. For simplicity I only have a 2D input (it will eventually grow). Let's say I am using two variables (features: vehicle odometer measurement, new car price) to predict the value of the car (target: old car price). How can I train sklearn.gaussian_process.GaussianProcessRegressor to predict what I am looking for?
import random
import numpy as np
from sklearn import gaussian_process
# two input features per sample, one target value per sample
X_train = np.array(X).reshape((-1, 2)).astype(int)
y_train = np.array(y).reshape(-1, 1).astype(int)
GPR = gaussian_process.GaussianProcessRegressor(normalize_y=False, n_restarts_optimizer=3)
GPR.fit(X_train, y_train)
# creating random points for testing the data
X_test_Odometer = np.linspace(0, 268000, 1000)[:, None]
X_test_Price = random.sample(range(5000, 13000), 1000)
X_test = np.column_stack((X_test_Odometer, X_test_Price)).astype(int)
GPR.predict(X_test)
This prediction does not work at all. I do not know whether I need to customize a kernel, and if so, I do not know how. I am new to scikit-learn and any help would be appreciated :)
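One possible way to approach the two-feature case is sketched below. It is not a confirmed fix for the problem above: the data is synthetic, the relationship between the inputs and the target is made up, and the choices of input scaling, an RBF kernel with one length scale per feature, and an added WhiteKernel noise term are all assumptions made for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in data (hypothetical): odometer reading and new-car price
odometer = rng.uniform(0, 268000, size=60)
new_price = rng.uniform(5000, 13000, size=60)
X_train = np.column_stack((odometer, new_price))  # shape (n_samples, 2)
# Made-up relationship, only so the example has something to learn
y_train = new_price * np.exp(-odometer / 150000.0) + rng.normal(0, 100, size=60)

# Scale the inputs: the two features differ by orders of magnitude
scaler = StandardScaler().fit(X_train)

# One length scale per input dimension, plus a noise term
kernel = RBF(length_scale=[1.0, 1.0]) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                               n_restarts_optimizer=3, random_state=0)
gpr.fit(scaler.transform(X_train), y_train)

# Test points must have the same two columns and go through the same scaler
X_test = np.column_stack((np.linspace(0, 268000, 20),
                          rng.uniform(5000, 13000, size=20)))
y_mean, y_std = gpr.predict(scaler.transform(X_test), return_std=True)
```

The default kernel has a length scale of 1.0 in raw input units, which is tiny compared with odometer values in the hundreds of thousands; scaling the inputs (or choosing length scales in the data's units) is one plausible reason such a model starts to give usable predictions.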

How to fine-tune ResNet50 in Keras?

I'm trying to fine-tune the existing models in Keras to classify my own dataset. So far I have tried the following code (taken from the Keras docs: https://keras.io/applications/), in which Inception V3 is fine-tuned on a new set of classes.
from keras.applications.inception_v3 import InceptionV3
from keras.preprocessing import image
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D
from keras import backend as K
# create the base pre-trained model
base_model = InceptionV3(weights='imagenet', include_top=False)
# add a global spatial average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)
# let's add a fully-connected layer
x = Dense(1024, activation='relu')(x)
# and a logistic layer -- let's say we have 200 classes
predictions = Dense(200, activation='softmax')(x)
# this is the model we will train
model = Model(inputs=base_model.input, outputs=predictions)
# first: train only the top layers (which were randomly initialized)
# i.e. freeze all convolutional InceptionV3 layers
for layer in base_model.layers:
    layer.trainable = False
# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# train the model on the new data for a few epochs
model.fit_generator(...)
# at this point, the top layers are well trained and we can start fine-tuning
# convolutional layers from inception V3. We will freeze the bottom N layers
# and train the remaining top layers.
# let's visualize layer names and layer indices to see how many layers
# we should freeze:
for i, layer in enumerate(base_model.layers):
    print(i, layer.name)
# we chose to train the top 2 inception blocks, i.e. we will freeze
# the first 172 layers and unfreeze the rest:
for layer in model.layers[:172]:
    layer.trainable = False
for layer in model.layers[172:]:
    layer.trainable = True
# we need to recompile the model for these modifications to take effect
# we use SGD with a low learning rate
from keras.optimizers import SGD
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9), loss='categorical_crossentropy')
# we train our model again (this time fine-tuning the top 2 inception blocks
# alongside the top Dense layers
model.fit_generator(...)
Can anyone please guide me on what changes I should make to the above code to fine-tune the ResNet50 model present in Keras?
Thanks in advance.
It is difficult to make out a specific question. Have you tried anything more than just copying the code without any changes?
That said, there is an abundance of problems in the code: it is a simple copy/paste from keras.io, not functional as it is, and needs some adaptation before it works at all (regardless of whether you use ResNet50 or InceptionV3):
1): You need to define the input_shape when loading InceptionV3, specifically replace base_model = InceptionV3(weights='imagenet', include_top=False) with base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=(299,299,3))
2): Further, you need to adapt the number of the classes in the last added layer, e.g. if you have only 2 classes to: predictions = Dense(2, activation='softmax')(x)
3): Change the loss-function when compiling your model from categorical_crossentropy to sparse_categorical_crossentropy
4): Most importantly, you need to define the data generator before calling model.fit_generator() and add steps_per_epoch. If you have your training images in ./data/train with every category in a different subfolder, this can be done e.g. like this:
from keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator()
train_generator = train_datagen.flow_from_directory(
    "./data/train",
    target_size=(299, 299),
    batch_size=50,
    class_mode='binary')
model.fit_generator(train_generator, steps_per_epoch=100)
This of course only does basic training; you will, for example, need to define save calls to hold on to the trained weights. Only once you get the code working for InceptionV3 with the changes above do I suggest proceeding to implement this for ResNet50. As a start you can replace InceptionV3() with ResNet50() (of course only after from keras.applications.resnet50 import ResNet50), and change the input_shape to (224,224,3) and target_size to (224,224).
The above mentioned code-changes should work on Python 3.5.3 / Keras 2.0 / Tensorflow backend.
Beyond the important points mentioned in the above answer for ResNet50 (provided your images are shaped into the same square format as in the original Keras code, (224,224), and are not rectangular), you may substitute:
# add a global spatial average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)
by
x = base_model.output
x = Flatten()(x)  # requires: from keras.layers import Flatten
EDIT: Please read @Yu-Yang's comment below
I think I experienced the same issue. It appeared to be a complex problem, which has a decent thread on GitHub (https://github.com/keras-team/keras/issues/9214). The problem is in the Batch Normalization layers of the unfrozen blocks of the net. You have two solutions:
Only change the top layer (leaving the blocks as they are)
Add the patch from the GitHub thread above.

How to train a classifier with an array of arrays?

I want to use a decision tree classifier in order to predict something.
As you can see here:
from sklearn import tree
sample1 = [120,1]
sample2 = [123,3]
features = [sample1,sample2]
labels = [0,1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
I have two samples:
Sample one: [120,1] which I labelled as 0
Sample two: [123,3] which I labelled as 1
So far so good.
But now, instead of these samples, I want to train using an array, something like:
features = [[120,120.2][1, 1.2]]
and the respective label for this sample is:
label = [1]
So my code should be:
from sklearn import tree
features = [[120,120.2][1, 1.2]]
labels = [1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
I'm getting the following error:
TypeError: list indices must be integers, not tuple
I understand that the classifier wants a list of integers, and not tuples.
And a solution may be:
features = [[120, 120.2, 1, 1.2]]
labels = [1]
But I don't want to mix up the data, since I have it separately into arrays.
Is there any way I can train my classifier with arrays of arrays of data?
Thanks
No, you can't use this format for your data; you need to aggregate the values into one array per sample.
The expected shape is (n_samples, n_features).
This is also more logical: each sample is described by its features, and the expected format describes your data better.
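If the data lives in separate arrays per sample, the groups can be concatenated just before fitting. The sketch below uses only the standard library; the second sample, its values and its label are made up for illustration. Note also that the TypeError in the question comes from the missing comma in [[120,120.2][1, 1.2]], which Python reads as indexing the list [120, 120.2] with the tuple (1, 1.2).

```python
from itertools import chain

# Hypothetical per-sample groups kept as separate arrays, as in the question
sample_groups = [
    [[120, 120.2], [1, 1.2]],   # sample 1: two groups of measurements
    [[123, 122.8], [3, 3.1]],   # sample 2 (made-up values)
]
labels = [1, 0]

# Flatten each sample's groups into one row, giving (n_samples, n_features)
features = [list(chain.from_iterable(groups)) for groups in sample_groups]
print(features)
# [[120, 120.2, 1, 1.2], [123, 122.8, 3, 3.1]]
```

The resulting features and labels can be passed to clf.fit(features, labels) as in the question. The data is not really mixed up: columns 0-1 always hold the first group and columns 2-3 the second, so the original arrays remain identifiable by position.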

using weka Filter in java code

I have a problem using the Weka API in Java. There are 41 features (attributes) in my training and testing datasets. I want to keep only 25 attributes (e.g. 1, 3, 5, 7, 8, 10, ...) and remove the others when training and testing the classifier. I have read Weka's filter documentation at http://weka.wikispaces.com/Use+WEKA+in+your+Java+code#Filter and http://grepcode.com/file/repo1.maven.org/maven2/nz.ac.waikato.cms.weka/weka-stable/3.6.6/weka/filters/unsupervised/attribute/Remove.java but I could not understand how to apply a filter to my problem. Could you please help me write the code for this situation? Your suggestions/help will be highly appreciated.
My code is like this....
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
Instances train = ...
Instances test = ...
// Here I want to take only 25 attributes (i.e. column values) out of 41.
Classifier cls = new J48();
cls.buildClassifier(train);
// evaluate classifier and print some statistics
Evaluation eval = new Evaluation(train);
eval.evaluateModel(cls, test);
.....
.....
Assuming you have this, as you said:
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
Instances train = ...
Instances test = ...
Then set up the array of column indices you want to keep. I'm assuming you're doing this in a for loop or something, but I've just put 6 indices in manually so you get the idea. Note that setAttributeIndicesArray takes 0-based indices, whereas the string form, e.g. setAttributeIndices("1,3,5"), is 1-based.
int[] indicesOfColumnsToUse = {1, 3, 5, 7, 8, 10};
Then initialize and set up your removal filter (initialize it, then set the column indices, then invert your selection so that you remove the ones you don't want, then set the "input format" based on your training data)
Remove remove = new Remove();
remove.setAttributeIndicesArray(indicesOfColumnsToUse);
remove.setInvertSelection(true);
remove.setInputFormat(train);
Then apply the removal to your training and test sets
Instances trainingSubset = Filter.useFilter(train, remove);
Instances testSubset = Filter.useFilter(test, remove);
And then go on as you said, except train and evaluate the classifier on the subsets that you just created:
Classifier cls = new J48();
cls.buildClassifier(trainingSubset);
// evaluate classifier and print some statistics
Evaluation eval = new Evaluation(trainingSubset);
eval.evaluateModel(cls, testSubset);

scikit-learn PCA doesn't have 'score' method

I am trying to identify the type of noise based on this article:
Model selection with Probabilistic (PCA) and Factor Analysis (FA)
I am using scikit-learn-0.14.1.win32-py2.7 on win8 64bit
I know that it refers to version 0.15; however, the version 0.14 documentation mentions that the score method is available for PCA, so I guess it should normally work:
sklearn.decomposition.ProbabilisticPCA
The problem is that no matter which PCA I use for *cross_val_score*, I always get a TypeError saying that the estimator PCA does not have a score method:
*TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator PCA(copy=True, n_components=None, whiten=False) does not.*
Any ideas why that is happening?
Many thanks in advance
Christos
X has 1000 samples of 40 features
here is a portion of the code:
import numpy as np
import csv
from scipy import linalg
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.cross_validation import cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.covariance import ShrunkCovariance, LedoitWolf
#read in the training data
train_path = '<train data path>/train.csv'
reader = csv.reader(open(train_path,"rb"),delimiter=',')
train = list(reader)
X = np.array(train).astype('float')
n_samples = 1000
n_features = 40
n_components = np.arange(0, n_features, 4)
def compute_scores(X):
    pca = PCA()
    pca_scores = []
    for n in n_components:
        pca.n_components = n
        pca_scores.append(np.mean(cross_val_score(pca, X, n_jobs=1)))
    return pca_scores
pca_scores = compute_scores(X)
n_components_pca = n_components[np.argmax(pca_scores)]
Ok, I think I found the problem. It does not work with PCA, but it does work with PPCA.
However, when no cv number is provided, cross_val_score automatically uses 3-fold cross-validation,
which created 3 sets with sizes 334, 333 and 333 (my initial training set contains 1000 samples).
Since numpy.mean cannot make a comparison between sets with different sizes (334 vs 333), Python raises an exception.
thx
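The fold sizes mentioned above follow from how an unshuffled k-fold split distributes the remainder: the first n_samples % n_splits folds get one extra sample. The sketch below reproduces that arithmetic in plain Python. The helper name kfold_sizes is hypothetical, and the per-sample scores are placeholders under the assumption that the scorer returned one value per sample; they are not real results.

```python
def kfold_sizes(n_samples, n_splits):
    """Fold sizes for a k-fold split without shuffling: the first
    n_samples % n_splits folds get one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


sizes = kfold_sizes(1000, 3)
print(sizes)  # [334, 333, 333]

# Per-sample scores from folds of different sizes cannot be stacked into
# one rectangular array; reducing each fold to a single number first
# sidesteps that (placeholder scores, not real results)
per_sample_scores = [[0.0] * n for n in sizes]
fold_means = [sum(fold) / len(fold) for fold in per_sample_scores]
overall = sum(fold_means) / len(fold_means)
```

Choosing a cv value that divides the number of samples evenly (for example 4 or 5 for 1000 samples) would give equally sized folds.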