Recommended values for OpenCV SVM parameters - c++

Any idea on the recommended parameters for OpenCV SVM? I'm playing with the letter_recog.cpp in the OpenCV sample directory, however, the SVM accuracy is very poor! In one run I only got 62% accuracy:
$ ./letter_recog_modified -data /home/cobalt/opencv/samples/data/letter-recognition.data -save svm_letter_recog.xml -svm
The database /home/cobalt/opencv/samples/data/letter-recognition.data is loaded.
Training the classifier ...
data.size() = [16 x 20000]
responses.size() = [1 x 20000]
Recognition rate: train = 64.3%, test = 62.2%
The default parameters are:
model = SVM::create();
model->setType(SVM::C_SVC);
model->setKernel(SVM::LINEAR);
model->setC(1);
model->train(tdata);
Setting it to trainAuto() didn't help; it gave me a weird 0 % test accuracy:
model = SVM::create();
model->setType(SVM::C_SVC);
model->setKernel(SVM::LINEAR);
model->trainAuto(tdata);
Result:
Recognition rate: train = 0.0%, test = 0.0%
Update using yangjie's answer:
$ ./letter_recog_modified -data /home/cobalt/opencv/samples/data/letter-recognition.data -save svm_letter_recog.xml -svm
The database /home/cobalt/opencv/samples/data/letter-recognition.data is loaded.
Training the classifier ...
data.size() = [16 x 20000]
responses.size() = [1 x 20000]
Recognition rate: train = 58.8%, test = 57.5%
The result is no longer 0% but the accuracy is worse than the 62% earlier.
Using the RBF kernel with trainAuto() is worst?
$ ./letter_recog_modified_rbf -data /home/cobalt/opencv/samples/data/letter-recognition.data -save svm_letter_recog.xml -svm
The database /home/cobalt/opencv/samples/data/letter-recognition.data is loaded.
Training the classifier ...
data.size() = [16 x 20000]
responses.size() = [1 x 20000]
Recognition rate: train = 18.5%, test = 11.6%
Parameters:
model = SVM::create();
model->setType(SVM::C_SVC);
model->setKernel(SVM::RBF);
model->trainAuto(tdata);

I suggest trying RBF kernel instead of linear.
In many, many cases it is the best choice...

I debugged the sample code and found the reason.
The responses is a Mat of ASCII code of the letters.
However, the predicted labels returned from SVM trained by SVM::trainAuto are ranging from 0-25, which correspond to the 26 classes. This can also be observed by looking at <class_labels>...</class_labels> in the output file svm_letter_recog.xml.
Therefore in test_and_save_classifier, r = model->predict( sample ) and responses.at<int>(i) are apparently not equal.
I also found that if we use SVM::train, the class labels would be from 65-89 instead, which is why you can get normal result at first.
Solution
I am not sure whether it is a bug. But if you want to use SVM::trainAuto in this sample now, you can change
test_and_save_classifier(model, data, responses, ntrain_samples, 0, filename_to_save);
in build_svm_classifier to
test_and_save_classifier(model, data, responses, ntrain_samples, 'A', filename_to_save);
Update
trainAuto and train should have the same behavior in class_labels. The problem is due to a bug fix before. So I have created a pull request to OpenCV to fix the problem.

Related

ml::KNearest->findNearest() inconsistent result when category label changes

I am using OpenCV 3.4.1.
I am working on a video classification project and am trying to use KNearest to classify between 2 categories. I have 8 areas of interest in each video frame. To make decision on each frame, each KNearest is done on the pixel values on each area. The majority win (and favor to one category if it is a tie). So, I have 8 sets of training data (one for each area of interest).
Problem: The response generated from the knn model changed when I labelled the categories differently.
The training data sets are organized as rows of:
[category label], data0, data1, data2....etc. (different dimensions for each
training set)
where dataX = pixel data of a frame (1 row = 1 frame)
Then, I build the model by:
Ptr<TrainData> tdata = TrainData::loadFromCSV(filename, 0, 0, -1, String("cat"));
Mat raw = tdata->getTrainSamples();
Mat res = tdata->getResponses();
PCA pca(raw, noArray(), PCA::DATA_AS_ROW, 0.99);
Mat knnIn = pca.project(raw);
Ptr<ml::KNearest> knn = ml::KNearest::create();
knn->train(knnIn, ml::ROW_SAMPLE, res);
After that, testing data is passed to the pca and knn to get the response.
To testing it, I put 1 set of testing data to the 8 knn models.
If I use 0 & 1 as the [category label] in the training data sets, the 8 responses from KNN are 1,1,1,1,1,0,0,1.
If I change the label to 2 & 1 instead (replace all '0' by '2' in the first column of the training data), the 8 responses becomes 1,1,1,1,1,1,1,1, while I am expecting 1,1,1,1,1,2,2,1.
Some observations:
no editing error on the training data(while changing the category labels).
The KNN model isClassifier()=true. DefaultK=10. AlgorithmType=1 (BRUTE_FORCE).
The result is consistent with the same training data set and testing data set.
I don't see any pattern on the difference responses for the 2 label sets (after using different
training data sets and testing data sets).
Please shed some light. Thank you very much!

RAY - RLLIB - Failing to train DQN using offline sample batch - episode_len_mean: .nan value

RAY - RLLIB library - estimate a DQN model using offline batch data. Model fails to learn. episode_len_mean: .nan For CartPole example as well as personal-domain-specific dataset
Ubuntu
Ray library - RLIB
DQN
Offline
environment:- tried with Cartpole-v0 as well as with custom environment example.
episode_len_mean: .nan
episode_reward_max: .nan
episode_reward_mean: .nan
episode_reward_min: .nan
episodes_this_iter: 0
episodes_total: 0
Generate data using PG
rllib train --run=PG --env=CartPole-v0 --config='{"output": "/tmp/cartpole-out", "output_max_file_size": 5000000}' --stop='{"timesteps_total": 100000}'
Train model on offline data
rllib train --run=DQN --env=CartPole-v0 --config='{"input": "/tmp/cartpole-out","input_evaluation": ["is", "wis"],"soft_q": true, "softmax_temp": 1.0}'
Expected :-
episode_len_mean: numerical values
episode_reward_max: numerical values
episode_reward_mean: numerical values
episode_reward_min: numerical values
Actual Results (No improvement observed in tensorboard as well) :-
episode_len_mean: .nan
episode_reward_max: .nan
episode_reward_mean: .nan
episode_reward_min: .nan
I had more or less the same problem and it was linked to the fact that the episode was never finishing because I didn't properly set the "done" value in the step function. Until an episode is "done", Ray doesn't calculate the metrics. In my case I had to specify a counter in the environment init function called self.count_steps and incremented on each step.
def step(self, action):
# Changes the yaw angles
self.cur_yaws = action
self.farm.calculate_wake(yaw_angles=action)
# power output
power = self.farm.get_farm_power()
reward = (power-self.best_power)/self.best_power
#if power > self.best_power:
# self.best_power = power
self.count_steps +=1
if self.count_steps > 50:
done = True
else:
done = False
return self.cur_yaws, reward, done, {}

Tensorflow return similar images

I want to use Google's Tensorflow to return similar images to an input image.
I have installed Tensorflow from http://www.tensorflow.org (using PIP installation - pip and python 2.7) on Ubuntu14.04 on a virtual machine CPU.
I have downloaded the trained model Inception-V3 (inception-2015-12-05.tgz) from http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz that is trained on ImageNet Large Visual Recognition Challenge using the data from 2012, but I think it has both the Neural network and the classifier inside it (as the task there was to predict the category). I have also downloaded the file classify_image.py that classifies an image in 1 of the 1000 classes in the model.
So I have a random image image.jpg that I an running to test the model. when I run the command:
python /home/amit/classify_image.py --image_file=/home/amit/image.jpg
I get the below output: (Classification is done using softmax)
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 3
I tensorflow/core/common_runtime/direct_session.cc:58] Direct session inter op parallelism threads: 3
trench coat (score = 0.62218)
overskirt (score = 0.18911)
cloak (score = 0.07508)
velvet (score = 0.02383)
hoopskirt, crinoline (score = 0.01286)
Now, the task at hand is to find images that are similar to the input image (image.jpg) out of a database of 60,000 images (jpg format, and kept in a folder at /home/amit/images). I believe this can be done by removing the final classification layer from the inception-v3 model, and using the feature set of the input image to find cosine distance from the feature set all the 60,000 images, and we can return the images having less distance (cos 0 = 1)
Please suggest me the way forward for this problem and how do I do this using Python API.
I think I found an answer to my question:
In the file classify_image.py that classifies the image using the pre trained model (NN + classifier), I made the below mentioned changes (statements with #ADDED written next to them):
def run_inference_on_image(image):
"""Runs inference on an image.
Args:
image: Image file name.
Returns:
Nothing
"""
if not gfile.Exists(image):
tf.logging.fatal('File does not exist %s', image)
image_data = gfile.FastGFile(image, 'rb').read()
# Creates graph from saved GraphDef.
create_graph()
with tf.Session() as sess:
# Some useful tensors:
# 'softmax:0': A tensor containing the normalized prediction across
# 1000 labels.
# 'pool_3:0': A tensor containing the next-to-last layer containing 2048
# float description of the image.
# 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG
# encoding of the image.
# Runs the softmax tensor by feeding the image_data as input to the graph.
softmax_tensor = sess.graph.get_tensor_by_name('softmax:0')
feature_tensor = sess.graph.get_tensor_by_name('pool_3:0') #ADDED
predictions = sess.run(softmax_tensor,
{'DecodeJpeg/contents:0': image_data})
predictions = np.squeeze(predictions)
feature_set = sess.run(feature_tensor,
{'DecodeJpeg/contents:0': image_data}) #ADDED
feature_set = np.squeeze(feature_set) #ADDED
print(feature_set) #ADDED
# Creates node ID --> English string lookup.
node_lookup = NodeLookup()
top_k = predictions.argsort()[-FLAGS.num_top_predictions:][::-1]
for node_id in top_k:
human_string = node_lookup.id_to_string(node_id)
score = predictions[node_id]
print('%s (score = %.5f)' % (human_string, score))
I ran the pool_3:0 tensor by feeding in the image_data to it. Please let me know if I am doing a mistake. If this is correct, I believe we can use this tensor for further calculations.
Tensorflow now has a nice tutorial on how to get the activations before the final layer and retrain a new classification layer with different categories:
https://www.tensorflow.org/versions/master/how_tos/image_retraining/
The example code:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py
In your case, yes, you can get the activations from pool_3 the layer below the softmax layer (or the so-called bottlenecks) and send them to other operations as input:
Finally, about finding similar images, I don't think imagenet's bottleneck activations are very pertinent representation for image search. You could consider to use an autoencoder network with direct image inputs.
(source: deeplearning4j.org)
Your problem sounds similar to this visual search project

PCA for dimensionality reduction before Random Forest

I am working on binary class random forest with approximately 4500 variables. Many of these variables are highly correlated and some of them are just quantiles of an original variable. I am not quite sure if it would be wise to apply PCA for dimensionality reduction. Would this increase the model performance?
I would like to be able to know which variables are more significant to my model, but if I use PCA, I would be only able to tell what PCs are more important.
Many thanks in advance.
My experience is that PCA before RF is not an great advantage if any. Principal component regression(PCR) is e.g. when, PCA assists to regularize training features before OLS linear regression and that is very needed for sparse data-sets. As RF itself already performs a good/fair regularization without assuming linearity, it is not necessarily an advantage. That said, I found my self writing a PCA-RF wrapper for R two weeks ago. The code includes a simulated data set of a data set of 100 features comprising only 5 true linear components. Under such cercumstances it is infact a small advantage to pre-filter with PCA
The code is a seamless implementation, such that every RF parameters are simply passed on to RF. Loading vector are saved in model_fit to use during prediction.
#I would like to be able to know which variables are more significant to my model, but if I use PCA, I would be only able to tell what PCs are more important.
The easy way is to run without PCA and obtain variable importances and expect to find something similar for PCA-RF.
The tedious way, wrap the PCA-RF in a new bagging scheme with your own variable importance code. Could be done in 50-100 lines or so.
The souce-code suggestion for PCA-RF:
#wrap PCA around randomForest, forward any other arguments to randomForest
#define as new S3 model class
train_PCA_RF = function(x,y,ncomp=5,...) {
f.args=as.list(match.call()[-1])
pca_obj = princomp(x)
rf_obj = do.call(randomForest,c(alist(x=pca_obj$scores[,1:ncomp]),f.args[-1]))
out=mget(ls())
class(out) = "PCA_RF"
return(out)
}
#print method
print.PCA_RF = function(object) print(object$rf_obj)
#predict method
predict.PCA_RF = function(object,Xtest=NULL,...) {
print("predicting PCA_RF")
f.args=as.list(match.call()[-1])
if(is.null(f.args$Xtest)) stop("cannot predict without newdata parameter")
sXtest = predict(object$pca_obj,Xtest) #scale Xtest as Xtrain was scaled before
return(do.call(predict,c(alist(object = object$rf_obj, #class(x)="randomForest" invokes method predict.randomForest
newdata = sXtest), #newdata input, see help(predict.randomForest)
f.args[-1:-2]))) #any other parameters are passed to predict.randomForest
}
#testTrain predict #
make.component.data = function(
inter.component.variance = .9,
n.real.components = 5,
nVar.per.component = 20,
nObs=600,
noise.factor=.2,
hidden.function = function(x) apply(x,1,mean),
plot_PCA =T
){
Sigma=matrix(inter.component.variance,
ncol=nVar.per.component,
nrow=nVar.per.component)
diag(Sigma) = 1
x = do.call(cbind,replicate(n = n.real.components,
expr = {mvrnorm(n=nObs,
mu=rep(0,nVar.per.component),
Sigma=Sigma)},
simplify = FALSE)
)
if(plot_PCA) plot(prcomp(x,center=T,.scale=T))
y = hidden.function(x)
ynoised = y + rnorm(nObs,sd=sd(y)) * noise.factor
out = list(x=x,y=ynoised)
pars = ls()[!ls() %in% c("x","y","Sigma")]
attr(out,"pars") = mget(pars) #attach all pars as attributes
return(out)
}
A run code example:
#start script------------------------------
#source above from separate script
#test
library(MASS)
library(randomForest)
Data = make.component.data(nObs=600)#plots PC variance
train = list(x=Data$x[ 1:300,],y=Data$y[1:300])
test = list(x=Data$x[301:600,],y=Data$y[301:600])
rf = randomForest (train$x, train$y,ntree =50) #regular RF
rf2 = train_PCA_RF(train$x, train$y,ntree= 50,ncomp=12)
rf
rf2
pred_rf = predict(rf ,test$x)
pred_rf2 = predict(rf2,test$x)
cat("rf, R^2:",cor(test$y,pred_rf )^2,"PCA_RF, R^2", cor(test$y,pred_rf2)^2)
cor(test$y,predict(rf ,test$x))^2
cor(test$y,predict(rf2,test$x))^2
pairs(list(trueY = test$y,
native_rf = pred_rf,
PCA_RF = pred_rf2)
)
You can have a look here to get a better idea. The link says use PCA for smaller datasets!! Some of my colleagues have used Random Forests for the same purpose when working with Genomes. They had ~30000 variables and large amount of RAM.
Another thing I found is that Random Forests use up a lot of Memory and you have 4500 variables. So, may be you could apply PCA to the individual Trees.

Scikit-learn 0.15.2 - OneVsRestClassifier not works due to predict_proba not available

I am trying to do onevsrest classification like below:
classifier = Pipeline([('vectorizer', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', OneVsRestClassifier(SVC(kernel='rbf')))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
And I get the error 'predict_proba is not available when probability = false'. I saw that there was a bug reported, the one below:
https://github.com/scikit-learn/scikit-learn/issues/1946
And it was closed as fixed, so I killed scikit-learn from my Windows PC and completely re-downloaded scikit-learn to have version 0.15.2. But I still get this error. Any suggestions? Or I understood this wrong, and I still can't use SVC with OneVSRestClassifier unless I specify probability=true?
UPDATE: just to clarify, I am trying to actually achieve multi-label classification, here is data source:
df = pd.read_csv(fileIn, header = 0, encoding='utf-8-sig')
rows = random.sample(df.index, int(len(df) * 0.9))
work = df.ix[rows]
work_test = df.drop(rows)
X_train = []
y_train = []
X_test = []
y_test = []
for i in work[[i for i in list(work.columns.values) if i.startswith('Change')]].values:
X_train.append(','.join(i.T.tolist()))
X_train = np.array(X_train)
for i in work[[i for i in list(work.columns.values) if i.startswith('Corax')]].values:
y_train.append(list(i))
for i in work_test[[i for i in list(work_test.columns.values) if i.startswith('Change')]].values:
X_test.append(','.join(i.T.tolist()))
X_test = np.array(X_test)
for i in work_test[[i for i in list(work_test.columns.values) if i.startswith('Corax')]].values:
y_test.append(list(i))
lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train)
And after that I send it to pipeline mentioned earlier
Ok, I did some investigation in code. OneVsRestClassifier tries to call decision_function first and if it fails - it goes for predict_proba function of base classifier (svm.svc in our case).
As far as I see, my X_test is numpy.array of lists of strings. After it undergoes a sequence of transformations specified in pipeline CountVectorizer -> TfidfTransformer it becomes a sparse matrix (by design of these things). As I see currently decision_function is not available for sparse matrices, and there is even an open suggestion on github: https://github.com/scikit-learn/scikit-learn/issues/73
So, to summarize, looks like you can't make a multilabel classification using svm.svc unless you specify probability=True. If you do this you introduce some overhead to the classifier.fit process but it will work.