Scikit-learn RandomForestClassifier() feature selection, just select the train set? - python-2.7

I'm using scikit-learn for machine learning.
I have 800 samples with 2048 features, therefore I want to reduce my features to get hopefully a better accuracy.
It is a multiclass problem (class 0-5), and the features consists of 1's and 0's: [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0....,0]
I'm using the ensemble method, RandomForestClassifier().
Should I just feature select the training data ?
Is it enough if I'm using this code:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = .3 )
clf = RandomForestClassifier( n_estimators = 200,
warm_start = True,
criterion = 'gini',
max_depth = 13
)
clf.fit( X_train, y_train ).transform( X_train )
predicted = clf.predict( X_test )
expected = y_test
confusionMatrix = metrics.confusion_matrix( expected, predicted )
Cause the accuracy didn't get higher. Is everything ok in the code or am I doing something wrong?
I'll be very grateful for your help.

I'm not sure I understood your question correctly so I'll answer to what I thought I understood =)
First, reducing the dimension of your features (from 2048 to 500 e.g.) might not provide you with better results. It all depends on the capacity of your model to catch the geometry of your data. You can get much better results for example with a linear model if you reduce dimension through non-linear methods that would catch a particular geometry and 'linearize' it, instead of directly using this linear model on the raw data. But this is because your data would intrinsicaly be non-linear and the linear model is not good therefore in the original space to catch this geometry (think of a circle in 2D).
In the code you gave, you did not reduce dimension though, you splitted the data into two dataset (feature dimension is the same, 2048, only the number of samples changed). Training on a smaller dataset most of the time results in worst accuracy (data = information, when you leave some out you lose information). But splitting data allows you to test overfitting in particular, which is very impotant. But once the best parameters chosen (see cross-validation) you should learn on all the data you have!
Given your 0.7*800=560 samples, I think a depth of 13 is pretty big and you might overfit. You may want to play with this parameter first if you want to improve your accuracy!

1) Often reducing the features space does not help with accuracy, and using a regularized classifier leads to better results.
2) To do feature selection, you need two methods: one to reduce the set of features, another that does the actual supervised task (classification here).
Have you tried just using the standard classifiers? Clearly you tried the RF, but I'd also try a linear method like LinearSVC/LogisticRegression or a kernel SVC.
If you want to do feature selection, what you need to do is something like this:
feature_selector = LinearSVC(penalty='l1') #or maybe start with SelectKBest()
feature_selector.train(X_train, y_train)
X_train_reduced = feature_selector.transform(X_train)
X_test_reduced = feature_selector.transform(X_test)
classifier = RandomForestClassifier().fit(X_train_reduced, y_train)
prediction = classifier.predict(X_test_reduced)
Or you use a pipeline, as here: http://scikit-learn.org/dev/auto_examples/feature_selection/feature_selection_pipeline.html
Maybe we should add a version without the pipeline to the examples?
[cross-posted from the mailing list where this was originally asked]

Dimensionality reduction or feature selection is definitely advisable if you have more features than samples. You could look into Principal Component Analysis and other modules in sklearn.decomposition to reduce the number of features. There is also a useful section on Feature Selection in the scikit-learn documentation.
After fitting sklearn.decomposition.PCA, you could inspect the explained_variance_ratio_ to determine an advisable number of features (n_components) to reduce to (the point of PCA here is to find a reduced number of features that captures most of the variance in your original feature space). Some might like to retain features that have a cumulative explained_variance_ratio_ above 0.9, 0.95 etc, some like to drop features beyond which the explained_variance_ratio_ drops suddenly. Then refit the PCA with the n_components you like, transform your X_train and X_test, and fit your classifier as above.

Related

the prediction figures I got during training a semantic segmentation model

I have tried to experiment with some existing semantics segmentation code against the ultrasound nerve data set.
The implementation is based on u-net architecture.
During the training process, I can capture the verification plot for each epoch. In the following figures, the left one is the raw image, the middle one is the ground truth and the right one is the predicted one (or the probability map).
As shown in the following figures, we can see that the prediction for the epoch0 is just all black, then it seems to me that it started to capture some distribution of original image, then it gets all black again.
I just want to know how to explain the training process based on these plots, why it go back to the result of the first epoch after several training epochs.
Besides, why the predicted result tends to reproduce the distribution of original image during the training process.
Are there any insight can be derived from these training observations?
I generally followed the tutorial to generate the training set. (I use the same function of create_train_data in the tutorial.
The only difference is that I add a background channel, to make the mask image with shape (1,image_row, image_col,2)
img_mask = io.imread(os.path.join(raw_data_path, image_mask_name))
img_mask = img_mask//255
img_mask_background = 1-img_mask
After loading the npy file generated from above, I normalize the raw image of training set
imgs_train = np.load(os.path.join(train_data_path,"imgs_train.npy"))
imgs_mask_train = np.load(os.path.join(train_data_path,"imgs_mask_train.npy"))
imgs_train = imgs_train.astype('float32')
mean = np.mean(imgs_train) # mean for data centering
std = np.std(imgs_train) # std for data normalization
imgs_train -= mean
imgs_train /= std
I follow this implementation to train the model. I did not change anything except this one
self.learning_rate_node = tf.train.exponential_decay(learning_rate=learning_rate,
global_step=global_step,
decay_steps=training_iters,
decay_rate=decay_rate,
staircase=True)
I change it to
global_step = global_step*self.batch_size
epoch 0
epoch 4
epoch 12
epoch 16

extracting features from all bounding boxes of Fast R-CNN

I just installed Fast RCNN and have run the demo,
and I came to wonder if it's possible to extract features from all bounding boxes in the image (and do this for the entire dataset).
For example, if Fast RCNN detects cat, dog, and a car from an image,
I'd like to extract separate CNN features for each of cat, dog, and car.
And do this for tens of thousands of images.
The feature extraction example on Fast RCNN's Github (https://github.com/rbgirshick/caffe-fast-rcnn/tree/master/examples/feature_extraction) seems to be the replica of feature extraction using caffe for the entire image, not each bounding box.
Could anyone help me on this?
UPDATED:
Apparently, feature extraction for each bounding box is done in the following part of the code from https://github.com/rbgirshick/fast-rcnn/blob/master/lib/fast_rcnn/test.py:
# When mapping from image ROIs to feature map ROIs, there's some aliasing
# (some distinct image ROIs get mapped to the same feature ROI).
# Here, we identify duplicate feature ROIs, so we only compute features
# on the unique subset.
if cfg.DEDUP_BOXES > 0:
v = np.array([1, 1e3, 1e6, 1e9, 1e12])
hashes = np.round(blobs['rois'] * cfg.DEDUP_BOXES).dot(v)
_, index, inv_index = np.unique(hashes, return_index=True,
return_inverse=True)
blobs['rois'] = blobs['rois'][index, :]
boxes = boxes[index, :]
# reshape network inputs
net.blobs['data'].reshape(*(blobs['data'].shape))
net.blobs['rois'].reshape(*(blobs['rois'].shape))
blobs_out = net.forward(data=blobs['data'].astype(np.float32, copy=False),
rois=blobs['rois'].astype(np.float32, copy=False))
if cfg.TEST.SVM:
# use the raw scores before softmax under the assumption they
# were trained as linear SVMs
scores = net.blobs['cls_score'].data
else:
# use softmax estimated probabilities
scores = blobs_out['cls_prob']
if cfg.TEST.BBOX_REG:
# Apply bounding-box regression deltas
box_deltas = blobs_out['bbox_pred']
pred_boxes = _bbox_pred(boxes, box_deltas)
pred_boxes = _clip_boxes(pred_boxes, im.shape)
else:
# Simply repeat the boxes, once for each class
pred_boxes = np.tile(boxes, (1, scores.shape[1]))
if cfg.DEDUP_BOXES > 0:
# Map scores and predictions back to the original set of boxes
scores = scores[inv_index, :]
pred_boxes = pred_boxes[inv_index, :]
return scores, pred_boxes
I'm trying to figure out how to tweak this to save the features, as we do with Caffe for features of the entire images, which are saved to a mdb file.
UPDATE
During the process of determining the right bounding boxes, Fast-RCNN extracts CNN features from a high (~800-2000) number of image regions, called object proposals. These regions are obtained through different algorithms, typically selective search. After this computation, it uses those features to recognize the "right" proposals and find out the "right" bounding box. This is called bounding box regression.
Of course Fast-RCNN optimizes this process, but still has to extract CNN features features from many more regions than the ones related with the object of interest.
Shortly, if you were to save the variable blobs_out in the code snap you pasted, you will save the features relative to all the object proposals, including the "wrong" proposals. But you can save all that and then try to prune and retrieve only the desired ones. To save the features, just use pickle.dump().
Look at the end of the test_net function, here. The nms_dets variable seems to store the final boxes. There may be a way to take the blobs_out you stored and throw the undesired features off, but it doesn't seem so straightforward.
The simplest solution I'm able to think about is as follows.
Let's Fast-RCNN compute the final bounding boxes. Then, extract the relative image patches, with something like the following (I'm assuming Python):
img = cv2.imread('/path/to/image')
for bbox in bboxes_list:
x0, y0, x1, y1 = bbox
cut = img[y0:y1, x0:x1]
extract_cnn_features(cut)
The feature extraction is identical to the entire image case:
net = Caffe.NET('deploy.prototxt', 'caffemodel', caffe.TEST)
# preprocess input
net.blobs['data'].data[...] = net_input
net.forward()
feats = net.blobs['my_layer'].data.copy()
Of course this method is computationally expensive, since you are basically compute twice the CNN features. It depends on your requirements about speed and the size of the CNN models.

Hu moments and SVM does not work

I have come across one problem when trying to train data with SVM.
I get some different regions (set of connected pixels) from face images, and regions from eyes are very similar, so I want to use Hu moments for shape description and SVM for training.
But SVM does not work properly, method svm.predict evaluates afterwards everything as non-eye, moreover the same regions which were labeled and used in traning phase as eye, are evaluated as non-eye.
Feature data consists only of 7 Hu moments. I will post here some samples of source code in a moment, thanks in advance :)
Additional info:
input image:
http://i.stack.imgur.com/GyLO0.png
Setting up basic svm for 1 image:
int image_regions = 10;
Mat training_mat(image_regions ,7,CV_32FC1); // 7 hu moments
Mat labels(image_regions ,1,CV_32FC1); // for labels 1 (eye) and -1 (non eye)
// computing hu moments
Moments moments2=moments(croppedImage,false);
double hu[7];
HuMoments(moments2,hu);
// putting them into svm traning mat
for (int k=0;k<huCounter;k++)
training_mat.at<float>(counter,k) = hu[k]; // counter is current number of region
if (isEye(...))
{
labels.at<float>(counter,0)=1.0;
}
else
{
labels.at<float>(counter,0)=-1.0;
}
//I use the following:
CvSVM svm;
CvSVMParams params;
params.svm_type = CvSVM::C_SVC;
params.kernel_type = CvSVM::LINEAR;
params.term_crit = cvTermCriteria(CV_TERMCRIT_ITER, 1000, 1e-6);
// ... do the above mentioned phase, and then:
svm.train(training_mat, labels, Mat(), Mat(), params);
I hope the following suggestions can help you…..
The simplest task is to use a clustering algorithm and try to cluster the data into two classes. If an algorithm like ‘k-means’ can do the job why make things complex by using SVM and Neural Nets. I suggest you use this technique because your feature vector dimension is of a very small size (7 Hu Moments) as well as your number of samples.
Perform feature Normalization (specified in point 4) to make sure the values fall in a limited range.
Check out “is your data really separable?” As your data is small, take a few samples from positive images and a few samples from negative images and plot the feature vectors. If you can visually see the difference surely any learning algorithm can do the job for you. As I said earlier simple tricks can do better than complex math.
Only if you then decide to use SVM you should know the following:
• As I can see from your code you are using a Linear SVM, may be your data is non-separable by a linear kernel. Try using some polynomial kernel or other kernels. There is one option bool CvSVM::train_auto in openCV just have a look.
• Try to check whether the feature vector values you are getting are proper values or not (make sure that they are not some garbage values).
• Also you can perform feature normalization “ZERO MEAN and UNIT VARIENCE” before you use it for training.
• Most importantly increase the number of images for training, both positively and negatively labeled.
• Last but not least SVM is not magic, at the end of the day it is just drawing a line between two sets of points. So don’t expect it to classify anything you give it as input.
If nothing works “Just improve your feature extraction technique”

Plotting Multiple ROC curves, or an average one from multi class labels (multinomial regression)

I have a data set with multiple discrete labels, say 4,5,6. On this I run the ExtraTreesClassifier (I will also run Multinomial logit afterword on the same data, this is just a short example) as below. :
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_curve, auc
clf = ExtraTreesClassifier(n_estimators=200,random_state=0,criterion='gini',bootstrap=True,oob_score=1,compute_importances=True)
# Also tried entropy for the information gain
clf.fit(x_train, y_train)
#y_test is test data and y_predict is trained using ExtraTreesClassifier
y_predicted=clf.predict(x_test)
fpr, tpr, thresholds = roc_curve(y_test, y_predicted,pos_label=4) # recall my labels are 4,5 and 6
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc)
The question is - is there something like a average ROC curve - basically I could add up all the tpr & fpr seperately for EACH label value and then take means (will that make sense by the way?) - and then just call
# Would this be statistically correct, and would mean something worth interpreting?
roc_auc_avearge = auc(fpr_average, tpr_average)
print("Area under the ROC curve : %f" % roc_auc)
I am assuming, I will get something similar to this afterword - but how do I interpret thresholds in this case ?
How to plot a ROC curve for a knn model
Hence, please also mention if I can/should get individual thresholds in this case and why would one approach be better(statistically) over the other?
What I tried so far (besides averaging):
On changing pos_label = 4 , then 5 & 6 and plotting the roc curves, I see very poor performance, even lesser than the y=x (perfectly random and tpr=fpr case) How should I approach this problem ?
ROC curve averaging has been proposed by Hand & Till in 2001. They basically compute the ROC curves for all comparison pairs (4 vs. 5, 4 vs. 6 and 5 vs. 6) and average the result.
When you compute the ROC curve with pos_label=4, you implicitly say that the other labels are the negatives (5 and 6). Note that this is slightly different from what was proposed by Hand & Till.
A few notes:
You should make sure that your classifier was trained in a way that makes sense with your ROC analysis. If you say pos_label=5 in the roc_curve, and your classifier was train to recognize 5 as intermediate between 4 and 6, you will for sure get nothing useful here
If you get AUC < 0.5, it means you are looking at it in the wrong way (and you should reverse your predictions)
In general ROC analysis is useful for a binary classification. Whether it makes sense for multiclass problems is case-dependant, and it might not be the case for you.

How to find/detect optimal parameters of a Grid Search in Libsvm+Weka?

I'm trying to use SVM with Weka framework. So i'm using Libsvm. I'm new to SVM and reading the guide on the site of Libsvm I read that is possible to discover optimal parameters for SVM (cost and gamma) using GridSearch. So i choose Grid Search on Weka and I obtained a bad classification results (TN rate around 1%). So how do I have to interpret these results? If using optimal parameter I got bad results is there no chance for me to get better classification?What I mean is: Grid Search give me the Best results that i can obtain using SVM?
My dataset is formed by 1124 instances (89% negative class, 11% positive class) and there are 31 attributes (2 of them are nominal others are numeric). I'm using a cross validation (10-fold) on the whole dataset to test the model.
I tried to use GridSearch (I normalized each attribute values between 0 and 1, no features selection but I change class value from 0 and 1 to 1 and -1 accroding to SVM theory but T don't know if it useful) with these parameters: cost from 1 to 18 with 1.0 step and gamma from -5 to 10 with 1.0 step. Results are sensitivity 93,6% and specificity 64.8% but these takes around 1 hour to complete computation!!
I'd like to get better results compared with decision tree. Using Features Selection (Info Gain ranking) + SMOTE oversampling + Cost Sensitive Learning I obtained sensitivity 91% and specificity 80%. Is there a way to tune SVM without trying every possible range of values for cost and gamma?