I am using C++ and OpenCV 3.3.1.
I am trying to train an SVM with OpenCV. My steps are:
Preprocess the image
Extract features with SURF
Create the dataset for learning from positive and negative images
Reshape each image's features into a single row
Create the labels Mat with -1 for negative and +1 for positive
Train the SVM
Predict
And now my problem:
Let's say my images are 128 x 128, and after feature extraction I get a Mat with 16 rows and 128 columns. After the reshape I have 1 row and 2048 columns, and the SVM is trained with this size. When I try to predict, the SVM expects a feature Mat of the same size (1 row and 2048 columns), but my image for prediction yields more features than the training images, so the Mat for prediction is much bigger than expected.
Prediction with the same image I used for training works well, so I guess the SVM itself works.
How can I use the SVM for bigger images?
Flattening SURF/SIFT descriptors into a single 1 x 2048 feature is not a very good idea, for two reasons:
You are restricting the number of useful features per image (= 16), and if the number of features differs from 16 you get the error. Even if you force 16 features every time, you might end up losing features, and the results will degrade.
You are training an SVM classifier on 2048 dimensions without utilizing any relation between the extracted feature descriptors.
A more robust and standard way of doing this is Bag of Words.
Using Bag of Words with a histogram approach, you get a K-dimensional descriptor from the SIFT features, and then you train the SVM classifier on these K-dimensional descriptors, which have the same length for every image.
This link might be helpful:
https://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
If you want to use MATLAB, vlfeat has an implementation of the whole pipeline.
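To make the Bag of Words idea above concrete, here is a minimal sketch in plain C++ (no OpenCV). It assumes a codebook of K cluster centres has already been computed, for example with k-means; the names `Descriptor`, `nearestWord` and `bowHistogram` are made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Descriptor = std::vector<float>;  // one SIFT/SURF descriptor (hypothetical alias)

// Index of the codebook word closest to d (squared Euclidean distance).
std::size_t nearestWord(const Descriptor& d, const std::vector<Descriptor>& codebook) {
    assert(!codebook.empty());
    std::size_t best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t k = 0; k < codebook.size(); ++k) {
        assert(codebook[k].size() == d.size());
        float dist = 0.0f;
        for (std::size_t j = 0; j < d.size(); ++j) {
            const float diff = d[j] - codebook[k][j];
            dist += diff * diff;
        }
        if (dist < bestDist) {
            bestDist = dist;
            best = k;
        }
    }
    return best;
}

// K-bin histogram of visual words, normalized to sum to 1.
// Its length is always K, no matter how many descriptors the image produced.
std::vector<float> bowHistogram(const std::vector<Descriptor>& descriptors,
                                const std::vector<Descriptor>& codebook) {
    std::vector<float> hist(codebook.size(), 0.0f);
    for (const Descriptor& d : descriptors) {
        hist[nearestWord(d, codebook)] += 1.0f;
    }
    if (!descriptors.empty()) {
        for (float& h : hist) {
            h /= static_cast<float>(descriptors.size());
        }
    }
    return hist;
}
```

An image with 16 descriptors and an image with 300 descriptors both end up as a vector of length K, which is what lets the same SVM handle images of different sizes.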
Related
I have a dataset of around 20K images that are human labelled. Labels are as follows:
Label = 1 if the image is sharp and well lit, and
Label = 0 for those blurry/out of focus/grainy images.
The images are of documents such as Identity cards.
I want to build a Computer Vision model that can do the classification task.
I tried using VGG-16 with transfer learning for this task, but it did not give good results (precision = .65 and recall = .73). My sense is that VGG-16 is not suitable for this task. It is trained on ImageNet and has very different low-level features. Interestingly, the model is under-fitting.
We also tried EfficientNet 7. Though the model performed decently on training and validation, test performance remains bad.
Can someone suggest a more suitable model to try for this task?
I think your problem with VGG and other NNs is the resizing of images:
VGG expects a 224x224 input image. I assume your dataset has a much larger resolution, and thus you significantly downscale the input images before feeding them to your network.
What happens to blur/noise when you downscale an image?
Blurry and noisy images become sharper and cleaner as you decrease the resolution. Therefore, in many of your training examples, the net sees a perfectly good image while you label them as "corrupt". This is not good for training.
An interesting experiment would be to see which types of degradation your net classifies correctly and which it fails on. You report 65% precision and 73% recall. Can you look at the classified images at that point and group them by degradation type?
That is, what is the precision/recall for only blurry images? What is it for noisy images? What about grainy images?
What can you do?
Do not resize images at all! If the network needs a fixed-size input, then crop rather than resize.
Taking advantage of the "resizing" effect, you can approach the problem using a "discriminator". Train a network that discriminates between an image and its downscaled version. If the image is sharp and clean, this discriminator will find it difficult to succeed. For blurred/noisy images, however, the task should be rather easy.
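As a small illustration of the "crop rather than resize" suggestion, the sketch below computes a centred crop window of the size the network expects. `CropRect` and `centerCrop` are hypothetical names for this example, and a centred crop is only one possible choice (random crops or several crops per image would work the same way).

```cpp
#include <cassert>

// Top-left corner and size of a crop window, in pixels.
struct CropRect {
    int x;
    int y;
    int width;
    int height;
};

// Centred crop of the requested size. Pixels are copied 1:1, so blur and
// noise keep their original scale instead of being averaged away by resizing.
CropRect centerCrop(int imageWidth, int imageHeight, int cropWidth, int cropHeight) {
    assert(cropWidth > 0 && cropHeight > 0);
    assert(cropWidth <= imageWidth && cropHeight <= imageHeight);
    return CropRect{(imageWidth - cropWidth) / 2,
                    (imageHeight - cropHeight) / 2,
                    cropWidth,
                    cropHeight};
}
```

For a 1000x600 scan and a 224x224 network input, `centerCrop(1000, 600, 224, 224)` gives a window whose top-left corner is at (388, 188).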
For this task, I think using OpenCV is sufficient to solve the issue. In fact, comparing the variance of the Laplacian of the image with a threshold (cv2.Laplacian(image, cv2.CV_64F).var()) gives a decision on whether an image is blurred or not.
You can find an explanation of the method and the code in the following tutorial: detection with opencv
I think that training a classifier that takes the output of one of your neural network models and the variance of the Laplacian as features will improve the classification results.
I also recommend experimenting with ResNet and DenseNet.
I would look at the change in color between pixels, then rank the photos on the median delta between pixels... a sharp change from RGB (0,0,0) to (255,255,255) on each of the adjoining pixels would be the max possible score, the more blur you have the lower the score.
I have done this in the past trying to estimate areas of fields with success.
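A rough sketch of that "median delta between adjoining pixels" score is shown below. It simplifies the idea to a single grayscale channel (the description above uses RGB), compares each pixel with its right and bottom neighbour, and takes the upper median for even counts; `medianNeighborDelta` is a name invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Median absolute difference between horizontally and vertically adjacent
// pixels of a row-major grayscale image with values in 0..255.
// Higher means sharper transitions; more blur lowers the score.
int medianNeighborDelta(const std::vector<int>& gray, int width, int height) {
    assert(width >= 2 && height >= 2);
    assert(gray.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    std::vector<int> deltas;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const int c = gray[y * width + x];
            if (x + 1 < width) {
                deltas.push_back(std::abs(c - gray[y * width + x + 1]));
            }
            if (y + 1 < height) {
                deltas.push_back(std::abs(c - gray[(y + 1) * width + x]));
            }
        }
    }
    // Partial sort: places the median element at position size/2.
    auto mid = deltas.begin() + static_cast<std::ptrdiff_t>(deltas.size() / 2);
    std::nth_element(deltas.begin(), mid, deltas.end());
    return *mid;
}
```

Photos can then be ranked by this score; a black/white checkerboard reaches the maximum of 255, and a uniform image scores 0.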
I am trying to train a pedestrian detector using dlib and the INRIA Person Dataset.
So far I have used 27 images. The training is fast but the results are unsatisfying (on other images pedestrians are rarely recognized). Here is the result of my training using the train_object_detector program that comes with dlib (in the /examples directory):
Saving trained detector to object_detector.svm
Testing detector on training data...
Test detector (precision,recall,AP): 1 0.653061 0.653061
Parameters used:
threads: 4
C: 1
eps: 0.01
target-size: 6400
detection window width: 47
detection window height: 137
upsample this many times : 0
I am aware that more images need to be added to the training set to get better results, but before doing that I want to be sure of the meaning of every parameter printed in the result (precision, recall, AP, C, eps, ...). I am also wondering if you have any recommendations regarding the training: which images to choose? How many images are needed? Do I need to annotate every object in the image? Do I need to ignore some regions in the image?
One last question: is there any trained detector (svm file) that I can use to compare my results?
Thank you for your answers
I am not familiar with dlib in particular, but let me tell you that you will not get good results with 27 images. In order to generalize well, your classifier needs to see many images with a variety of data. It won't do you any good to supply it with 10,000 images of the same person, wearing the same outfit. You want different people, clothing, settings, angles, and lighting. The INRIA dataset should cover most of those.
Your detection window dimensions and upsample settings determine how large people must look in the image for your trained classifier to detect them reliably. Your settings will detect people only at one scale, where they are around 137 pixels tall and 47 pixels wide. If you upsample even once, you'll be able to detect people at a smaller scale (upsampling makes the person look bigger than they are). I suggest you use a larger dataset and increase the upsampling number (by how much you upsample is another discussion; that appears to be built into the library). Things will take longer, but that is the nature of training classifiers: tweak parameters, retrain, compare the results.
For precision/recall I'll refer you to this Wikipedia article. These are not parameters, but results of your classifier. You want both to be as close to 1 as possible.
I am trying to classify MRI images of brain tumors into benign and malignant using C++ and OpenCV. I am planning on using bag-of-words (BoW) method after clustering SIFT descriptors using kmeans. Meaning, I will represent each image as a histogram with the whole "codebook"/dictionary for the x-axis and their occurrence count in the image for the y-axis. These histograms will then be my input for my SVM (with RBF kernel) classifier.
However, the disadvantage of using BoW is that it ignores the spatial information of the descriptors in the image. Someone suggested using spatial pyramid matching (SPM) instead. I read about it and came across this link giving the following steps:
Compute K visual words from the training set and map all local features to its visual word.
For each image, initialize K multi-resolution coordinate histograms to zero. Each coordinate histogram consist of L levels and each level i has 4^i cells that evenly partition the current image.
For each local feature (let's say its visual word ID is k) in this image, pick out the k-th coordinate histogram, and then accumulate one count to each of the L corresponding cells in this histogram, according to the coordinate of the local feature. The L cells are cells where the local feature falls in in L different resolutions.
Concatenate the K multi-resolution coordinate histograms to form a final "long" histogram of the image. When concatenating, the k-th histogram is weighted by the probability of the k-th visual word.
To compute the kernel value over two images, sum up all the cells of the intersection of their "long" histograms.
Now, I have the following questions:
What is a coordinate histogram? Doesn't a histogram just show the counts for each grouping in the x-axis? How will it provide information on the coordinates of a point?
How would I compute the probability of the k-th visual word?
What will be the use of the "kernel value" that I will get? How will I use it as input to the SVM? If I understand it right, the kernel value is used in the testing phase and not in the training phase? If yes, then how will I train my SVM?
Or do you think I don't need to burden myself with the spatial info and can just stick with normal BoW for my situation (benign and malignant tumors)?
Someone please help this poor little undergraduate. You'll have my forever gratefulness if you do. If you have any clarifications, please don't hesitate to ask.
Here is the link to the actual paper, http://www.csd.uwo.ca/~olga/Courses/Fall2014/CS9840/Papers/lazebnikcvpr06b.pdf
MATLAB code is provided here http://web.engr.illinois.edu/~slazebni/research/SpatialPyramid.zip
A co-ordinate histogram (mentioned in your post) is just a histogram computed over a sub-region of the image. These slides explain it visually: http://web.engr.illinois.edu/~slazebni/slides/ima_poster.pdf.
You have multiple histograms here, one for each different region in the image. The probability (or the number of items) would depend on the SIFT points in that sub-region.
I think you need to define your pyramid kernel as mentioned in the slides.
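To make the coordinate histograms and the kernel more tangible, here is a simplified sketch in plain C++. It assumes feature coordinates are normalized to [0, 1) and each feature already carries its visual word ID. It deliberately leaves out the per-word probability weighting and the per-level weights, so it is not the full method from the paper; `Feature`, `pyramidHistogram` and `intersectionKernel` are names invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Feature {
    double x;  // normalized column position in [0, 1)
    double y;  // normalized row position in [0, 1)
    int word;  // visual word ID in [0, numWords)
};

// "Long" histogram: for every word, level i contributes a 2^i x 2^i grid
// (4^i cells), and each feature adds one count to the cell it falls into
// at every level.
std::vector<double> pyramidHistogram(const std::vector<Feature>& features,
                                     int numWords, int numLevels) {
    assert(numWords > 0 && numLevels > 0);
    std::size_t cellsPerWord = 0;
    for (int level = 0; level < numLevels; ++level) {
        cellsPerWord += std::size_t{1} << (2 * level);  // 4^level
    }
    std::vector<double> hist(static_cast<std::size_t>(numWords) * cellsPerWord, 0.0);
    for (const Feature& f : features) {
        assert(f.word >= 0 && f.word < numWords);
        assert(f.x >= 0.0 && f.x < 1.0 && f.y >= 0.0 && f.y < 1.0);
        std::size_t offset = static_cast<std::size_t>(f.word) * cellsPerWord;
        for (int level = 0; level < numLevels; ++level) {
            const int grid = 1 << level;
            const int cx = static_cast<int>(f.x * grid);
            const int cy = static_cast<int>(f.y * grid);
            hist[offset + static_cast<std::size_t>(cy * grid + cx)] += 1.0;
            offset += static_cast<std::size_t>(grid) * static_cast<std::size_t>(grid);
        }
    }
    return hist;
}

// Histogram intersection: sum of the element-wise minima.
double intersectionKernel(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += std::min(a[i], b[i]);
    }
    return sum;
}
```

The kernel is needed in training as well as testing: an SVM with a custom kernel is trained on the matrix of kernel values between all pairs of training images, and at test time it needs the kernel values between the test image and the training images.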
A Convolutional Neural Network may be better suited for your task if you have enough training samples. You can probably have a look at Torch or Caffe.
My objective is to train an SVM and get support vectors which I can plug into OpenCV's HOGDescriptor for object detection.
I have gathered ~4000 positives and ~15000 negatives and I train using the SVM provided by OpenCV. The results give me too many false positives (up to 20 per image). I clip out the false positives and add them to the pool of negatives to retrain, and at times I end up with even more false positives! I have tried adjusting the L2HysThreshold of my HOGDescriptor upwards to 300 without significant improvement. Is my pool of positives and negatives large enough?
The SVM training is also much faster than expected. I have tried feature vector sizes of 2916 and 12996, using grayscale images and color images on separate tries. SVM training has never taken longer than 20 minutes. I use train_auto. I am new to machine learning, but from what I hear, training with a dataset as large as mine should take at least a day, no?
I believe cvSVM is not doing much learning and, according to http://opencv-users.1802565.n2.nabble.com/training-a-HOG-descriptor-td6363437.html, it is not suited for this purpose. Does anyone with experience with cvSVM have more input on this?
I am considering using SVMLight http://svmlight.joachims.org/ but it looks like there isn't a way to visualize the SVM hyperplane. What are my options?
I use OpenCV 2.4.3 and have tried the following setups for HOGDescriptor:
hog.winSize = cv::Size(100,100);
hog.cellSize = cv::Size(5,5);
hog.blockSize = cv::Size(10,10);
hog.blockStride = cv::Size(5,5); //12996 feature vector
hog.winSize = cv::Size(100,100);
hog.cellSize = cv::Size(10,10);
hog.blockSize = cv::Size(20,20);
hog.blockStride = cv::Size(10,10); //2916 feature vector
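The two feature vector sizes in the comments above can be checked with a little arithmetic: blocks per side = (winSize - blockSize) / blockStride + 1, and each block holds (blockSize / cellSize)^2 cells with one orientation histogram each. The sketch below assumes a square window and 9 orientation bins (the HOGDescriptor default); `hogDescriptorLength` is a helper name invented here, not an OpenCV function.

```cpp
#include <cassert>
#include <cstddef>

// Length of one HOG window descriptor for a square window, block and cell.
std::size_t hogDescriptorLength(int winSize, int blockSize, int blockStride,
                                int cellSize, int nbins = 9) {
    assert(winSize >= blockSize && blockStride > 0 && cellSize > 0 && nbins > 0);
    assert((winSize - blockSize) % blockStride == 0);  // blocks tile the window exactly
    assert(blockSize % cellSize == 0);                 // cells tile the block exactly
    const int blocksPerSide = (winSize - blockSize) / blockStride + 1;
    const int cellsPerBlockSide = blockSize / cellSize;
    return static_cast<std::size_t>(blocksPerSide) * blocksPerSide *
           cellsPerBlockSide * cellsPerBlockSide * nbins;
}
```

For the first setup this gives 19 x 19 blocks x 4 cells x 9 bins = 12996, and for the second 9 x 9 x 4 x 9 = 2916, matching the sizes in the comments.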
Your first descriptor dimension is far too large to be useful. To form a reliable SVM hyperplane, you need at least as many positive and negative samples as you have descriptor dimensions. This is because ideally you need separating information in every dimension of the hyperplane.
The number of positive and negative samples should be more or less the same unless you provide your SVM trainer with a bias parameter (may not be available in cvSVM).
There is no guarantee that HOG is a good descriptor for the type of problem you are trying to solve. Can you visually confirm that the object you are trying to detect has a distinct shape with similar orientation in all samples? A single type of flower for example may have a unique shape, however many types of flowers together don't have the same unique shape. A bamboo has a unique shape but may not be distinguishable from other objects easily, or may not have the same orientation in all sample images.
cvSVM is normally not the tool used to train SVMs for OpenCV HOG. Use the binary form of SVMLight (not free for commercial purposes) or libSVM (OK for commercial purposes). Calculate HOGs for all samples using your C++/OpenCV code and write them to a text file in the correct input format for SVMLight/libSVM. Use either of the programs to train a model using a linear kernel with the optimal C. Find the optimal C by searching for the best accuracy while changing C in a loop. Calculate the detector vector (an N+1 dimensional vector, where N is the dimension of your descriptor) by finding all the support vectors, multiplying each support vector by its corresponding alpha value, and then summing the resulting vectors dimension by dimension to obtain an N-dimensional vector. As the last element add -b, where b is the hyperplane bias (you can find it in the model file coming out of SVMLight/libSVM training). Feed this N+1 dimensional detector to HOGDescriptor::setSVMDetector() and use HOGDescriptor::detect() or HOGDescriptor::detectMultiScale() for detection.
I have had successful results using SVMLight to learn SVM models when training from OpenCV, but haven't used cvSVM, so I can't compare.
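The detector vector computation described above can be sketched as follows. The sketch assumes the alpha values read from the model file are already signed (that is, already multiplied by the class label), and it appends -b exactly as described; whether the sign of b has to be flipped depends on the convention of the trainer that produced the model, so treat that as something to verify. `buildDetectorVector` is a name made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Collapses a linear SVM into a single (N+1)-dimensional detector:
// w[j] = sum_i alpha_i * sv_i[j], followed by -bias as the last element.
std::vector<float> buildDetectorVector(const std::vector<std::vector<float>>& supportVectors,
                                       const std::vector<float>& alphas,
                                       float bias) {
    assert(!supportVectors.empty());
    assert(supportVectors.size() == alphas.size());
    const std::size_t n = supportVectors.front().size();
    std::vector<float> detector(n, 0.0f);
    for (std::size_t i = 0; i < supportVectors.size(); ++i) {
        assert(supportVectors[i].size() == n);
        for (std::size_t j = 0; j < n; ++j) {
            detector[j] += alphas[i] * supportVectors[i][j];
        }
    }
    detector.push_back(-bias);  // last element, as described in the text
    return detector;
}
```

The resulting `std::vector<float>` has the length that HOGDescriptor::setSVMDetector() expects for a descriptor of dimension N, namely N + 1.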
The hogDraw function from http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html will visualise your descriptor.
I understand that Histograms of Oriented Gradients (HOG) in OpenCV are typically used on image patches in order to detect and classify objects in an image.
However, I would like to use HOG to build a feature vector that can be used to classify an entire image. Using the following:
std::vector<float> temp_FV_out;
cv::HOGDescriptor hog;
hog.compute(img_in, temp_FV_out);
gives very long feature vectors each of different lengths, due to the varying size of the image - larger images have more 64 x 128 windows, and each of these contributes to the feature vector's length.
How can I get OpenCV to give a short feature vector (about 5-20 bins) from each image, where the length of the feature vector remains constant regardless of the image's size? I would rather not use bag of words to build a dictionary of HOG 'words'.
The first step is to normalize the image size: choose the smallest size you want to process, and resize the rest to this base size. You can also establish a small size as the default (100x100, for example). You may need to crop the images if they do not have the same aspect ratio.
Next, you can select a number of features from your vector, based on various algorithms (PCA, decision trees, AdaBoost, etc.), which can help you extract the most significant values from your data.
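To see why normalizing the image size fixes the varying length, the sketch below estimates how long the output of hog.compute() is for a given image size. It assumes the default HOGDescriptor parameters (64x128 window, 16x16 blocks, 8x8 block stride and cells, 9 bins, so 3780 values per window), a window stride equal to the cell size, and no padding; these are assumptions about the defaults, so check them against your OpenCV version. `hogComputeLength` is a hypothetical helper, not an OpenCV function.

```cpp
#include <cassert>
#include <cstddef>

// Estimated length of the vector returned by hog.compute() on a whole image:
// (number of 64x128 windows that fit at a stride of 8) * 3780.
std::size_t hogComputeLength(int imageWidth, int imageHeight) {
    const int winWidth = 64;
    const int winHeight = 128;
    const int winStride = 8;                          // assumed equal to the cell size
    const std::size_t perWindow = 7 * 15 * 4 * 9;     // blocks x cells per block x bins = 3780
    assert(imageWidth >= winWidth && imageHeight >= winHeight);
    const int windowsX = (imageWidth - winWidth) / winStride + 1;
    const int windowsY = (imageHeight - winHeight) / winStride + 1;
    return static_cast<std::size_t>(windowsX) * static_cast<std::size_t>(windowsY) * perWindow;
}
```

Under these assumptions a 64x128 image yields 3780 values, while a 128x256 image yields 153 windows and 578340 values. Once every image is resized to the same base size the length is constant, and a reduction step such as PCA can then bring it down to a handful of values.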