cvSVM training produces poor results for HOGDescriptor - c++

My objective is to train an SVM and get support vectors which I can plug into OpenCV's HOGDescriptor for object detection.
I have gathered ~4000 positives and ~15000 negatives, and I train using the SVM provided by OpenCV. The results give me too many false positives (up to 20 per image). I then clip out the false positives and add them to the pool of negatives to retrain, and at times I end up with even more false positives! I have tried raising the L2HysThreshold of my HOGDescriptor up to 300 without significant improvement. Is my pool of positives and negatives large enough?
The SVM training is also much faster than expected. I have tried feature vector sizes of 2916 and 12996, using grayscale images and color images on separate tries. SVM training has never taken longer than 20 minutes, and I use auto_train. I am new to machine learning, but from what I hear, training with a dataset as large as mine should take at least a day, no?
I believe cvSVM is not doing much learning, and according to http://opencv-users.1802565.n2.nabble.com/training-a-HOG-descriptor-td6363437.html, it is not suited for this purpose. Does anyone with experience with cvSVM have more input on this?
I am considering using SVMLight (http://svmlight.joachims.org/), but it looks like there isn't a way to visualize the SVM hyperplane. What are my options?
I use OpenCV 2.4.3 and have tried the following setups for the HOGDescriptor:
// Configuration 1: 12996-dimensional feature vector
hog.winSize = cv::Size(100,100);
hog.cellSize = cv::Size(5,5);
hog.blockSize = cv::Size(10,10);
hog.blockStride = cv::Size(5,5);

// Configuration 2: 2916-dimensional feature vector
hog.winSize = cv::Size(100,100);
hog.cellSize = cv::Size(10,10);
hog.blockSize = cv::Size(20,20);
hog.blockStride = cv::Size(10,10);
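For reference, the resulting descriptor length of each setup can be read back from OpenCV itself rather than computed by hand; a minimal sketch (the 9 orientation bins are OpenCV's default and are assumed here):

#include <opencv2/objdetect/objdetect.hpp>
#include <iostream>

int main()
{
    // constructor order: winSize, blockSize, blockStride, cellSize, nbins
    cv::HOGDescriptor hogSmallCells(cv::Size(100,100), cv::Size(10,10),
                                    cv::Size(5,5),     cv::Size(5,5), 9);
    cv::HOGDescriptor hogLargeCells(cv::Size(100,100), cv::Size(20,20),
                                    cv::Size(10,10),   cv::Size(10,10), 9);

    std::cout << hogSmallCells.getDescriptorSize() << std::endl; // 12996
    std::cout << hogLargeCells.getDescriptorSize() << std::endl; // 2916
    return 0;
}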

Your first descriptor dimension is far too large to be useful. To form a reliable SVM hyperplane, you need at least as many positive and negative samples as you have descriptor dimensions, because ideally you need separating information in every dimension of the hyperplane.
The number of positive and negative samples should be roughly the same, unless you provide your SVM trainer with a bias parameter (which may not be available in cvSVM).
There is no guarantee that HOG is a good descriptor for the type of problem you are trying to solve. Can you visually confirm that the object you are trying to detect has a distinct shape with similar orientation in all samples? A single type of flower for example may have a unique shape, however many types of flowers together don't have the same unique shape. A bamboo has a unique shape but may not be distinguishable from other objects easily, or may not have the same orientation in all sample images.
cvSVM is normally not the tool used to train SVMs for OpenCV's HOG. Use the binary form of SVMLight (not free for commercial purposes) or libSVM (OK for commercial purposes):
Calculate HOGs for all samples using your C++/OpenCV code and write them to a text file in the correct input format for SVMLight/libSVM.
Use either program to train a model with a linear kernel and the optimal C. Find the optimal C by searching for the best accuracy while changing C in a loop.
Calculate the detector vector (an N+1 dimensional vector, where N is the dimension of your descriptor): take every support vector, multiply it by its corresponding alpha value, and sum the results dimension by dimension to obtain an N-dimensional vector. As the last element append -b, where b is the hyperplane bias (you can find it in the model file produced by SVMLight/libSVM training).
Feed this N+1 dimensional detector to HOGDescriptor::setSVMDetector() and use HOGDescriptor::detect() or HOGDescriptor::detectMultiScale() for detection.
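A minimal sketch of the detector-vector step, assuming the support vectors, their alpha coefficients and the bias b have already been parsed out of the SVMLight/libSVM model file (the function and variable names here are placeholders):

#include <opencv2/objdetect/objdetect.hpp>
#include <vector>

// supportVectors: one support vector per row (CV_32F, N columns)
// alphas: the corresponding alpha (or alpha*y) coefficients, one per support vector
// b: the hyperplane bias taken from the model file
std::vector<float> buildHogDetector(const cv::Mat& supportVectors,
                                    const std::vector<float>& alphas,
                                    float b)
{
    const int N = supportVectors.cols;
    std::vector<float> detector(N + 1, 0.f);

    // w = sum_i alpha_i * sv_i, accumulated dimension by dimension
    for (int i = 0; i < supportVectors.rows; ++i)
        for (int j = 0; j < N; ++j)
            detector[j] += alphas[i] * supportVectors.at<float>(i, j);

    detector[N] = -b;   // last element is -bias, as HOGDescriptor expects
    return detector;
}

// Usage sketch:
// cv::HOGDescriptor hog( /* same parameters used when computing the training features */ );
// hog.setSVMDetector(buildHogDetector(supportVectors, alphas, b));
// hog.detectMultiScale(image, foundLocations);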

I have had successful results using SVMLight to learn SVM models on HOG features computed with OpenCV, but I haven't used cvSVM, so I can't compare.
The hogDraw function from http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html will visualise your descriptor.

Related

OpenCV SVM Classifier Image Recognition

I am using C++ and OpenCV 3.3.1
I am trying to train an SVM with OpenCV. My steps are:
Preprocess the image
Feature extraction with SURF
Create the dataset for learning from positive and negative images
Reshape each image's feature Mat into a single row (one row = one sample)
Create the labels Mat with -1 for negative and +1 for positive
Train the SVM
Predict
And now my problem:
Let's say my images are 128 x 128. After feature extraction I get a Mat
with 16 rows and 128 columns, and after the reshape I get 1 row and 2048 columns; the SVM is then trained with samples of that size. When I try to predict with my SVM, the problem is that the SVM expects a feature Mat of the same size (1 row and 2048 columns), but my prediction image has more features than the training images, so its Mat is far bigger than allowed.
Prediction with the same image I used for training works well, so I guess the SVM itself works.
How can I use the SVM for bigger Images?
Using SURF/SIFT descriptors by flattening them into a single 1 x 2048 feature vector is not a very good idea, for two reasons:
You are restricting the number of usable features per image (= 16), and whenever the number of features differs from 16 you get this error. Even if you force exactly 16 features every time, you may end up losing features and the results will degrade.
You are training an SVM classifier in 2048 dimensions without using any relation between the extracted feature descriptors.
A more robust and standard way of doing this is the Bag of Words approach.
With bag of words and a histogram over the visual dictionary, you get a K-dimensional descriptor from the SIFT features, and you then train the SVM classifier on these K-dimensional descriptors, which have the same length for every image.
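A minimal sketch of that pipeline with OpenCV 3.x (SURF comes from the opencv_contrib xfeatures2d module; the image list, label Mat and dictionary size K are placeholders, and every image is assumed to yield at least one keypoint):

#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <opencv2/ml.hpp>
#include <vector>

// trainImages: the positive and negative training images
// labels: one CV_32S label per image (+1 / -1)
// K: dictionary size (number of visual words)
void trainBowSvm(const std::vector<cv::Mat>& trainImages, const cv::Mat& labels, int K)
{
    cv::Ptr<cv::Feature2D> surf = cv::xfeatures2d::SURF::create();

    // 1. Pool SURF descriptors from all training images and cluster them into K words.
    cv::BOWKMeansTrainer bowTrainer(K);
    for (const cv::Mat& img : trainImages) {
        std::vector<cv::KeyPoint> kp;
        cv::Mat desc;
        surf->detectAndCompute(img, cv::noArray(), kp, desc);
        if (!desc.empty()) bowTrainer.add(desc);
    }
    cv::Mat vocabulary = bowTrainer.cluster();

    // 2. Encode every image as a fixed-length K-dimensional word histogram.
    cv::Ptr<cv::DescriptorMatcher> matcher = cv::DescriptorMatcher::create("FlannBased");
    cv::BOWImgDescriptorExtractor bowExtractor(surf, matcher);
    bowExtractor.setVocabulary(vocabulary);

    cv::Mat trainData;
    for (const cv::Mat& img : trainImages) {
        std::vector<cv::KeyPoint> kp;
        surf->detect(img, kp);
        cv::Mat bowHist;
        bowExtractor.compute(img, kp, bowHist);
        trainData.push_back(bowHist);
    }

    // 3. Train the SVM on the K-dimensional histograms instead of raw descriptors.
    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setKernel(cv::ml::SVM::LINEAR);
    svm->train(trainData, cv::ml::ROW_SAMPLE, labels);
    svm->save("bow_svm.yml");
}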
This link might be helpful for you,
https://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
If you want to use MATLAB, then vlfeat has an implementation of the whole pipeline.

Best way to train a pedestrian detector using dlib

I am trying to train a pedestrian detector using dlib and the INRIA Person Dataset.
So far I have used 27 images; the training is fast, but the results are unsatisfying (on other images pedestrians are rarely recognized). Here is the result of my training using the train_object_detector program that comes with dlib (in the /examples directory):
Saving trained detector to object_detector.svm
Testing detector on training data...
Test detector (precision,recall,AP): 1 0.653061 0.653061
Parameters used:
threads: 4
C: 1
eps: 0.01
target-size: 6400
detection window width: 47
detection window height: 137
upsample this many times : 0
I am aware that more images need to be added to the training in order to get better results, but before doing that I want to be sure of the meaning of every value printed in the result (precision, recall, AP, C, eps, ...). I am also wondering if you have any recommendations regarding the training: what images should I choose? How many images are needed? Do I need to annotate every object in the image? Do I need to ignore some regions in the image? ...
One last question: is there any trained detector (svm file) that I can use to compare my results against?
Thank you for your answers
I am not familiar with dlib in particular, but let me tell you that you will not get good results with 27 images. In order to generalize well, your classifier needs to see many images with a variety of data. It won't do you any good to supply it with 10,000 images of the same person, wearing the same outfit. You want different people, clothing, settings, angles, and lighting. The INRIA dataset should cover most of those.
Your detection window dimensions and upsample settings determine how large people must appear in the image for your trained classifier to detect them reliably. Your settings will detect people at only one scale, where they are around 137 pixels tall and 47 pixels wide. If you upsample even once, you'll be able to detect people at a smaller scale (upsampling makes the person look bigger than they are). I suggest you use a larger dataset and increase the upsampling number (by how much you upsample is another discussion - that appears to be built into the library). Things will take longer, but that is the nature of training classifiers - tweak parameters, retrain, compare the results.
For precision/recall I'll refer you to this Wikipedia article. These are not parameters but results of your classifier; you want both to be as close to 1 as possible.
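For intuition, both numbers can be reproduced from raw detection counts; a minimal sketch, where the 32/0/17 split is purely hypothetical (chosen only because it reproduces the 1 and 0.653061 figures printed above):

#include <iostream>

int main()
{
    // hypothetical counts: 49 annotated pedestrians, 32 detected, no false alarms
    double tp = 32, fp = 0, fn = 17;
    double precision = tp / (tp + fp);   // fraction of detections that are correct -> 1
    double recall    = tp / (tp + fn);   // fraction of annotated pedestrians found -> ~0.653061
    std::cout << "precision " << precision << "  recall " << recall << std::endl;
    return 0;
}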

How to train a svm for classifying images of english alphabet?

My objective is to detect text in an image and recognize it.
I have achieved detecting characters using the stroke width transform.
What should I do to recognize them?
As far as I know, I should train the SVM with my dataset of letter images in different fonts by detecting feature points and extracting feature vectors from each and every image. (I have used SIFT feature vectors and built the dictionary using k-means clustering.)
Once I have detected a character, I will extract the SIFT feature vector for it, and I thought of feeding this into the SVM prediction function.
I don't know how to do the recognition with an SVM. I am confused! Help me and correct me wherever I went wrong with the concept.
I followed this tutorial for the recognition part. Can this tutorial be applied to recognizing characters?
http://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
SVM is a supervised classifier. To use it, you will need to have training data that is of the type of objects you are trying to recognize.
Step 1 - Prepare training data
The training data consists of pairs of feature vectors and their corresponding class labels. In your case, it appears that you have extracted a SIFT-based "bag-of-words" (BOW) feature vector for the characters you detected. So, for your training data, you will need to find many examples of the different characters, extract this feature vector for each of them, and associate it with a label (sometimes called a class label, and typically an integer) which you will perhaps map to a textual description (e.g., the integer 0 could be mapped to the character 'a', and so on).
Step 2 - Training the classifier
The SVM classifier takes in as input an array/Mat of feature vectors (one per row) and their associated labels. Tune the parameters of the SVM (i.e., the regularization parameter C, and if applicable, any other parameters for kernels) on a separate validation set.
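A minimal sketch of this step, assuming the OpenCV 3.x ml module and that trainData/trainLabels and valData/valLabels (CV_32F feature rows, CV_32S labels) have already been prepared as described in Step 1:

#include <opencv2/ml.hpp>
#include <iostream>

cv::Ptr<cv::ml::SVM> trainLinearSvm(const cv::Mat& trainData, const cv::Mat& trainLabels,
                                    const cv::Mat& valData,   const cv::Mat& valLabels)
{
    cv::Ptr<cv::ml::SVM> best;
    double bestAcc = -1.0;

    // Sweep the regularization parameter C and keep the model that scores best
    // on the held-out validation set.
    for (double C = 0.01; C <= 100.0; C *= 10.0) {
        cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
        svm->setType(cv::ml::SVM::C_SVC);
        svm->setKernel(cv::ml::SVM::LINEAR);
        svm->setC(C);
        svm->train(trainData, cv::ml::ROW_SAMPLE, trainLabels);

        int correct = 0;
        for (int i = 0; i < valData.rows; ++i)
            if (static_cast<int>(svm->predict(valData.row(i))) == valLabels.at<int>(i))
                ++correct;
        double acc = static_cast<double>(correct) / valData.rows;

        if (acc > bestAcc) { bestAcc = acc; best = svm; }
    }
    std::cout << "best validation accuracy: " << bestAcc << std::endl;
    return best;
}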
Step 3 - Predict for unseen data
At test time, given a sample that was not seen by the SVM during training, you compute a feature vector (your SIFT-based BOW vector) for it. Pass this feature vector to the SVM's predict function, and it will return you an integer. Remember that when preparing your training data you associated an integer with each class? This is the label predicted by the SVM for this sample, and you can map it back to a character. For example, if you associated 0 with 'a', 1 with 'b', etc., you can use a vector/hashmap to map the integer to its textual counterpart.
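A minimal sketch of this step, assuming the trained SVM from Step 2 and that the integer labels 0..25 were mapped to the characters 'a'..'z' when the training data was prepared (that mapping is an assumption for illustration):

#include <opencv2/ml.hpp>

// bowVector must be a 1 x K CV_32F row, computed exactly like the training vectors
char classifyCharacter(const cv::Ptr<cv::ml::SVM>& svm, const cv::Mat& bowVector)
{
    int label = static_cast<int>(svm->predict(bowVector));
    return static_cast<char>('a' + label);   // map the predicted integer back to its character
}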
Additional Notes
You can check out OpenCV's SVM tutorial here for details.
NOTE: Often, for beginners, the hardest part (after getting the data) is tuning the classifier. My advice is to first try a simple classifier with few parameters to tune; a decent choice is the linear SVM, which only requires you to adjust the single parameter C. Once you manage to get somewhat decent results (which gives some assurance that the rest of your code is working), you can move on to more "sophisticated" classifiers.
Lastly, the training data and feature vectors you extract are very important. The training data must be "similar" to the test data you are trying to predict. For example, if you are predicting characters found in road signs, which come with different fonts, lighting conditions and pose differences, then training data consisting of characters taken from, say, a newspaper/book archive may not give you good results. This is an issue of domain adaptation in machine learning.

Ratio of positive to negative data to use when training a cascade classifier (opencv)

So I'm using OpenCV's LBP detector. The shapes I'm detecting are all roughly circular (differing mostly in aspect ratio), with some wide changes in brightness/contrast, and a little bit of occlusion.
OpenCV's guide on how to train the detector is here
My main question to anyone with experience using it is: how should numPos and numNeg relate to each other? I have roughly 1000 positive samples (so ~900 being used per stage).
What I need to decide is how many negative samples to use per stage for training. I have about 20000 images from which to draw negative data, so redundancy isn't really an issue.
In general the rule I hear is 1:2, but that seems like under-utilization given how much negative data I have at my disposal. On the flip side, what effects should I expect if I train my detector with a 1:20 ratio? How should I determine the proper ratio?

OpenCv/C++ - Find similarities from a picture with a big database easily

I would like to compare a query image with the pictures in a database (about 2000 of them).
Before posting on this website I read a lot of papers about methods for matching a picture against a big database, and a lot of posts on Stack Overflow.
Concerning the papers, there is some interesting material, but it is quite technical and it is hard to understand the algorithms well. (I have just begun to specialize in this field.)
Posts (the most interesting) :
Simple and fast method to compare images for similarity;
Nearest neighbors in high-dimensional data?;
How to understand Locality Sensitive Hashing?;
Image fingerprint to compare similarity of many images;
C++/SIFT/SQL - If there a way to compare efficiently a SIFT descriptor of an image with a SIFT descriptor in a SQL database?
Papers:
Object retrieval with large vocabularies and fast spatial matching,
Image Similarity Search with Compact Data Structures,
LSH,
Near Duplicate Image Detection min-Hash and tf-idf Weighting
Vocabulary tree
Aggregating locals descriptors
But I'm still confused.
The first thing I did was to implement BoW. I trained the Bag of Words (with ORB as detector and descriptor, and using VLAD features) with 5 classes in order to test its efficiency. After a long training, I launched it. It worked well, with an accuracy of 94%. That's pretty good.
But there is a problem for me:
I don't want to do classification. In my database I'll have about 2000 different pictures. I just want to find the best matches between my query and the database.
So if I have 2000 different pictures, to be consistent I would have to treat these 2000 pictures as 2000 different classes, and obviously that's impossible...
On this first point, do you agree with me? Is it clearly not the best method for what I want to do?
Maybe there is another way to use BoW in order to find similarities in the database?
The second thing I did is simpler.
I compute the descriptors of my query. Then I loop over the whole database, compute the descriptors of each picture, and add all those descriptors to a vector.
std::vector<cv::Mat> all_descriptors_database;
for (int i = 0; i < 2000; ++i)
{
    cv::Mat request = cv::imread(img);        // load the current database image
    computeKeypoints(request);
    computeDescriptors(request);
    all_descriptors_database.push_back(descriptors_of_request);
}
At the end I have one big vector which contains all the descriptors of the whole database (and the same for all the keypoints).
Then, this is where I get confused.
At first I wanted to compute the matching inside the loop, that is to say: for each image in the database, compute its descriptors and match them against the query. But it took a lot of time.
So after reading a lot of papers about how to find similarities in big databases, I found the LSH algorithm, which seems appropriate for that kind of search.
Therefore I wanted to use this method.
So inside my loop I did something like this:
//Create Flann LSH index
cv::flann::Index flannIndex(all_descriptors_database.at(i), cv::flann::LshIndexParams(12, 20, 2), cvflann::FLANN_DIST_HAMMING);
cv::Mat results, dists;
int k=2; // find the 2 nearest neighbors
// search (nearest neighbor)
flannIndex.knnSearch(query_descriptors, results, dists, k, cv::flann::SearchParams() );
However i have some questions :
It takes more than 5 seconds to loop over my whole database (2000 images), whereas I thought it would take less than 1 second (in the papers they have huge databases, unlike me, and LSH is supposed to be more efficient). Did I do something wrong?
I found some libraries on the internet which implement LSH, such as http://lshkit.sourceforge.net/ or http://www.mit.edu/~andoni/LSH/. So what is the difference between these libraries and the four lines of code I wrote using OpenCV? I checked the libraries, and for a beginner like me it was very difficult to figure out how to use them. I got a bit confused.
The third thing:
I wanted to compute a kind of fingerprint of each picture's descriptors (in order to compute the Hamming distance against the database), but it seems to be impossible to do that: OpenCV / SURF How to generate a image hash / fingerprint / signature out of the descriptors?
So I have been stuck on this task for 3 days. I don't know if I'm on the wrong track or not.
Maybe I missed something.
I hope this is clear enough for you. Thanks for reading.
Your question is kind of big. I'll give you some hints, though.
Bag of Words can work, but classification is unnecessary. A BoW pipeline typically consists of:
keypoint detection - ORB
keypoint description (feature extraction) - ORB
quantization - VLAD (fisher encoding might be better, but plain old kmeans might be enough in your case)
classification - you probably can skip this stage
You can treat the quantization result (e.g. the VLAD encoding) of each image as its fingerprint. Computing the distance between fingerprints yields a similarity measure. You still have to do 1-vs-all matching, though, which becomes tremendously expensive as your database grows.
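A minimal sketch of that idea, assuming each database image has already been reduced to a single fixed-length fingerprint row (its VLAD encoding, for example):

#include <opencv2/core.hpp>
#include <vector>
#include <algorithm>
#include <numeric>

// Returns database indices sorted from most to least similar to the query,
// i.e. a brute-force 1-vs-all ranking over the fingerprints.
std::vector<int> rankBySimilarity(const cv::Mat& queryFingerprint,
                                  const std::vector<cv::Mat>& fingerprints)
{
    std::vector<double> dist(fingerprints.size());
    for (size_t i = 0; i < fingerprints.size(); ++i)
        dist[i] = cv::norm(queryFingerprint, fingerprints[i], cv::NORM_L2);

    std::vector<int> order(fingerprints.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&dist](int a, int b) { return dist[a] < dist[b]; });
    return order;   // order[0] is the closest match
}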
I didn't get your point.
I'd suggest reading G. Hinton's papers (e.g. this one) on dimensionality reduction with deep autoencoders and convolutional neural networks; he claims to beat LSH. As for tools, I'd recommend taking a look at BVLC's Caffe, a great neural network library.