Reading the YOLOv1 paper, it is mentioned[1] that the first part of the network, i.e. the convolutional layers, is first trained at an input resolution of 224x224 on the ImageNet dataset. After that, the model is converted to perform detection, and the input resolution is increased from 224x224 to 448x448. I am wondering how this conversion can be done: if the input of the network is at first 224x224, then the number of parameters should differ from that of 448x448, which would mean that the convolutional layers trained on the ImageNet dataset cannot be reused for detection.
What am I missing here?
[1]: At the end of section "2.2 Training"
if the input of the network is at first 224x224, then the number of parameters should differ from that of 448x448
This is your misunderstanding.
The convolution operation has no constraints on the size of the input, and thus on the size of the output. When you train a CNN that has fully connected layers at the end for classification, you're constraining the input to be of a fixed size, because the number of inputs that an FC layer can accept is fixed.
But if you remove the classification head from the network and only use the trained weights of the CNN as a feature extractor, you'll notice that, given an input of any size (>= the size the network was trained on), the output will be a set of feature maps whose spatial extent increases as the spatial extent of the input increases.
Hence, in YOLO the network is initially trained to perform classification at a resolution of 224x224. In this way, the weights of the convolutional layers plus the weights of the FC layers at the end learn to extract and classify meaningful features.
After this first training, the FC layers are thrown away and only the feature extraction part is kept. In this way, you can reuse a good feature extractor, which has already learned to extract meaningful features, in a convolutional fashion (i.e., producing not a feature vector but a feature map as output, which can be post-processed as YOLO does).
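A minimal sketch of this point (in PyTorch, which neither paper uses; the layer sizes here are made up): the same convolutional weights can be applied to a 224x224 or a 448x448 input, and only the spatial size of the output feature map changes.

import torch
import torch.nn as nn

# Toy feature extractor: three conv layers, each downsampling by 2.
# The parameter count depends only on kernel sizes and channel counts,
# never on the input resolution.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)

print("parameters:", sum(p.numel() for p in features.parameters()))  # same for any input size

for size in (224, 448):
    out = features(torch.randn(1, 3, size, size))
    print(size, "->", tuple(out.shape))  # 224 -> (1, 64, 28, 28), 448 -> (1, 64, 56, 56)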
In the post A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way, it says
A ConvNet is able to successfully capture the Spatial and Temporal dependencies in an image through the application of relevant filters. The architecture performs a better fitting to the image dataset due to the reduction in the number of parameters involved and reusability of weights.
I don't see how it reduces parameters and reuses weights. Could anyone give an example?
Consider a filter (or kernel) of 9 pixels (3x3) and an image of 49 pixels (7x7).
In a fully connected layer, we'll have 9*49 = 441 weights.
In a CNN, by contrast, this same filter keeps moving (convolving) over the entire image. All pixel values in the image are multiplied by those same 9 filter values (hence we say the weights are reused). So we need just 9 weights per filter instead of 441 in the FC layer.
The job of a filter is to identify features (such as texture, lines etc), which could be anywhere in an image. So, we want to reuse this same filter over the entire image.
We can calculate the parameters for the Convolution layer using the formula: ((width_of_Kernel * height_of_Kernel * input_channel)+1) * output_channel
Here we can see that the kernel size, input channels and output channels determine the number of parameters. By altering them, we can reduce the number of parameters, which in turn reduces the model size.
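A small sketch of that formula in Python (the layer sizes below are just illustrative):

def conv_params(kernel_w, kernel_h, input_channels, output_channels):
    # ((width_of_Kernel * height_of_Kernel * input_channel) + 1) * output_channel,
    # where the +1 accounts for one bias per output channel.
    return ((kernel_w * kernel_h * input_channels) + 1) * output_channels

def fc_params(num_inputs, num_outputs):
    # Every input connects to every output, plus one bias per output.
    return (num_inputs + 1) * num_outputs

# A 3x3 convolution from 3 to 64 channels costs the same number of parameters
# whether the image is 224x224 or 448x448:
print(conv_params(3, 3, 3, 64))        # 1792

# A fully connected layer from a flattened 224x224x3 image to just 64 units:
print(fc_params(224 * 224 * 3, 64))    # 9633856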
I am wondering how the multi-scale training in YOLOv2 works.
In the paper, it is stated that:
The original YOLO uses an input resolution of 448 × 448. With the addition of anchor boxes we changed the resolution to 416 × 416. However, since our model only uses convolutional and pooling layers it can be resized on the fly. We want YOLOv2 to be robust to running on images of different sizes so we train this into the model. Instead of fixing the input image size we change the network every few iterations. Every 10 batches our network randomly chooses a new image dimension size. Since our model downsamples by a factor of 32, we pull from the following multiples of 32: {320, 352, ..., 608}. Thus the smallest option is 320 × 320 and the largest is 608 × 608. We resize the network to that dimension and continue training.
I don't get how a network with only convolutional and pooling layers can accept inputs of different resolutions. From my experience of building neural networks, if you change the resolution of the input to a different scale, the number of parameters of the network changes, that is, the structure of the network changes.
So, how does YOLOv2 change this on the fly?
I read the configuration file for yolov2, but all I got was a random=1 statement...
If you only have convolutional layers, the number of weights does not change with the spatial size of the layers (it would change if you altered the number of channels, though).
For example (in an imagined network), if you have 224x224x3 input images and a 3x3x64 convolutional layer, you will have 64 different 3x3x3 convolutional filter kernels, i.e. 1728 weights. This value does not depend on the size of the image at all, since a kernel is applied at each position of the image independently. This is the most important property of convolution and convolutional layers, the reason why CNNs can go so deep, and why in Faster R-CNN you can simply crop the regions out of your feature map.
If there were any fully connected layers or something similar, it would not work this way, since there a bigger 2D layer dimension would lead to more connections and more weights.
In YOLOv2 there is one thing that might still look like it doesn't fit. If you double the image size in each dimension, you'll end up with twice the number of features in each dimension right before the final 1x1xN filter; if your grid was 7x7 for the original network size, the resized network might have a 14x14 grid. But then you just get 14x14 * B*(5+C) regression results, which is perfectly fine.
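A minimal sketch of that behaviour (PyTorch rather than the original Darknet; B, C and the layer widths are made-up values): a purely convolutional network that downsamples by 32, followed by a 1x1 convolution predicting B*(5+C) values per cell, simply produces a larger grid when the input grows.

import torch
import torch.nn as nn

B, C = 5, 20  # boxes per cell and number of classes (illustrative values)

# Five stride-2 convolutions downsample by a factor of 32, then a 1x1 conv
# predicts B*(5+C) numbers per grid cell.
layers = []
channels = [3, 16, 32, 64, 128, 256]
for cin, cout in zip(channels[:-1], channels[1:]):
    layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.LeakyReLU(0.1)]
layers.append(nn.Conv2d(channels[-1], B * (5 + C), kernel_size=1))
net = nn.Sequential(*layers)

for size in (320, 416, 608):
    out = net(torch.randn(1, 3, size, size))
    print(size, "->", tuple(out.shape))  # 320 -> (1, 125, 10, 10), 416 -> 13x13, 608 -> 19x19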
In YOLO, if you are only using convolutional layers, the size of the output grid changes with the input size.
For example, with an input size of:
320x320, output size is 10x10
608x608, output size is 19x19
You then calculate the loss on these with respect to the ground-truth grid, which is adjusted in the same way.
Thus you can backpropagate the loss without adding any more parameters.
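As a plain-Python sketch (the box coordinates are made up), both the output grid size and the cell a ground-truth box is assigned to follow directly from the input size and the fixed downsampling factor of 32:

DOWNSAMPLE = 32

def grid_size(input_size):
    # The network downsamples by a fixed factor, so the grid scales with the input.
    return input_size // DOWNSAMPLE

def responsible_cell(cx_rel, cy_rel, input_size):
    # Grid cell (column, row) for a box center given in relative [0, 1] coordinates.
    s = grid_size(input_size)
    return int(cx_rel * s), int(cy_rel * s)

for size in (320, 608):
    s = grid_size(size)
    print(size, "-> grid", s, "x", s,
          "| box centered at (0.62, 0.31) falls in cell", responsible_cell(0.62, 0.31, size))
# 320 -> grid 10 x 10 | cell (6, 3)
# 608 -> grid 19 x 19 | cell (11, 5)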
Refer to the YOLOv1 paper for the loss function:
[Image: loss function from the paper]
Thus, in theory, you only need to adjust this function, which depends on the grid size and not on any model parameters, and you should be good to go.
Paper Link: https://arxiv.org/pdf/1506.02640.pdf
In the video explanation, the author mentions the same.
Time: 14:53
Video Link
I am working on a comparison of the Histogram of Oriented Gradients (HOG) and a Convolutional Neural Network (CNN) for weed detection. I have two datasets of two different weeds.
The CNN architecture is a 3-layer network.
1) The first dataset contains two classes and has 18 images.
The dataset is enlarged using data augmentation (rotation, adding noise, illumination changes).
Using the CNN I am getting a test accuracy of 77%, and for HOG with SVM 78%.
2) The second dataset contains leaves of two different plants.
Each class contains 2500 images, without data augmentation.
For this dataset, using the CNN I am getting a test accuracy of 94%, and for HOG with SVM 80%.
My question is: why am I getting higher accuracy for HOG on the first dataset? The CNN should be much better than HOG.
The only reason that comes to my mind is that the first dataset has only 18 images and is less diverse compared to the second dataset. Is this correct?
Yes, your intuition is right: such a small data set (just 18 images before data augmentation) can cause the worse performance. In general, for CNNs you usually need at least thousands of images. The SVM does not perform that badly because of the regularization (which you most probably use) and because the model probably has a much lower number of parameters. There are ways to regularize deep nets, e.g., with your first data set you might want to give dropout a try, but I would rather try to acquire a substantially larger data set.
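For example, a minimal sketch of a small CNN with dropout (PyTorch, with made-up layer sizes; not your actual architecture):

import torch.nn as nn

# A tiny 3-layer CNN for two-class weed classification, with dropout as regularization.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(p=0.5),   # randomly zeroes half of the features during training
    nn.Linear(64, 2),    # two classes
)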
My objective is to detect text in an image and recognize it.
I have achieved detecting characters using the stroke width transform.
What should I do to recognize them?
As per my knowledge, I thought of training the SVM with my dataset of letter images in different fonts, by detecting feature points and extracting a feature vector from each and every image. (I have used SIFT feature vectors and built the dictionary using k-means clustering.)
Once I have detected a character, I will extract the SIFT feature vector for this character, and I thought of feeding this into the SVM prediction function.
I don't know how to do the recognition using the SVM. I am confused! Help me and correct me wherever I went wrong with the concept.
I followed this tutorial for the recognition part. Can this tutorial be applied to recognizing characters?
http://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
SVM is a supervised classifier. To use it, you will need to have training data that is of the type of objects you are trying to recognize.
Step 1 - Prepare training data
The training data consists of pairs of feature vectors and their corresponding class labels. In your case, it appears that you have extracted a SIFT-based "Bag-of-word" (BOW) feature vector for the characters you detected. So, for your training data, you will need to find many examples of the different characters, extract this feature vector for each of them, and associate them with a label (sometimes called a class label, and typically an integer) which you will perhaps map to a textual description (for e.g., the number 0 could be mapped to the character 'a', and so on.)
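A minimal sketch of that preparation in Python/NumPy (bow_feature() is a placeholder standing in for whatever SIFT + bag-of-words pipeline you already built, and the file names are made up):

import numpy as np

def bow_feature(image_path):
    # Placeholder: replace with your SIFT + k-means bag-of-words histogram.
    return np.random.rand(100).astype(np.float32)  # e.g. 100 visual words

# Map integer class labels to the characters they represent.
label_to_char = {0: 'a', 1: 'b'}                   # extend to your full alphabet
char_to_label = {c: l for l, c in label_to_char.items()}

samples, labels = [], []
examples = {'a': ['a_font1.png', 'a_font2.png'],   # made-up file names
            'b': ['b_font1.png', 'b_font2.png']}
for char, paths in examples.items():
    for path in paths:
        samples.append(bow_feature(path))
        labels.append(char_to_label[char])

samples = np.array(samples, dtype=np.float32)      # one feature vector per row
labels = np.array(labels, dtype=np.int32)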
Step 2 - Training the classifier
The SVM classifier takes in as input an array/Mat of feature vectors (one per row) and their associated labels. Tune the parameters of the SVM (i.e., the regularization parameter C, and if applicable, any other parameters for kernels) on a separate validation set.
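A sketch of that step using OpenCV's Python bindings (the newer cv2.ml API; the old CvSVM interface differs), continuing from the samples and labels arrays above; the value of C is just a starting point to tune on a validation set:

import cv2

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_LINEAR)  # start simple: a linear kernel with a single parameter, C
svm.setC(1.0)                     # tune this on a separate validation set
svm.train(samples, cv2.ml.ROW_SAMPLE, labels)
svm.save("char_svm.yml")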
Step 3 - Predict for unseen data
At test time, given a sample that was not seen by the SVM during training, you compute a feature vector (your SIFT-based BOW vector) for the sample. Pass this feature vector to the SVM's predict function, and it will return you an integer. Remember earlier when preparing your training data, you have associated an integer with each label? This is the label predicted by the SVM for this sample. You can then map this label to a character. For e.g., if you have associated 0 with 'a', 1 with 'b' etc., you can use a vector/hashmap to map the integer to its textual counterpart.
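Continuing the same sketch, prediction and mapping the integer back to a character could look like this (the test image path is, again, made up):

# Feature vector for a newly detected character, using the same BOW pipeline as in training.
test_vec = bow_feature("detected_char.png").reshape(1, -1)

_, result = svm.predict(test_vec)
predicted_label = int(result[0, 0])
print("recognized character:", label_to_char[predicted_label])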
Additional Notes
You can check out OpenCV's SVM tutorial here for details.
NOTE: Often, for beginners, the hardest part (after getting the data) is tuning the classifier. My advice is to first try a simple classifier which has few parameters to tune; a decent choice is the linear SVM, which only requires you to adjust one parameter, C. Once you manage to get somewhat decent results (which gives some assurance that the rest of your code is working) you can move on to more "sophisticated" classifiers.
Lastly, the training data and feature vectors you extract are very important. The training data must be "similar" to the test data you are trying to predict. For e.g., if you are predicting characters found in road signs, which come with different fonts, lighting conditions, and pose differences, then using training data consisting of characters taken from, say, a newspaper/book archive may not give you good results. This is an issue of domain adaptation in machine learning.
My objective is to train an SVM and get support vectors which I can plug into OpenCV's HOGDescriptor for object detection.
I have gathered ~4000 positives and ~15000 negatives and I train using the SVM provided by OpenCV. The results give me too many false positives (up to 20 per image). I would clip out the false positives and add them to the pool of negatives to retrain, and at times I would end up with even more false positives! I have tried adjusting the L2HysThreshold of my HOGDescriptor upwards to 300 without significant improvement. Is my pool of positives and negatives large enough?
The SVM training is also much faster than expected. I have tried with feature vector sizes of 2916 and 12996, using grayscale images and color images on separate tries. SVM training has never taken longer than 20 minutes. I use auto_train. I am new to machine learning, but from what I hear, training with a dataset as large as mine should take at least a day, no?
I believe cvSVM is not doing much learning and, according to http://opencv-users.1802565.n2.nabble.com/training-a-HOG-descriptor-td6363437.html, it is not suited for this purpose. Does anyone with experience with cvSVM have more input on this?
I am considering using SVMLight (http://svmlight.joachims.org/) but it looks like there isn't a way to visualize the SVM hyperplane. What are my options?
I use OpenCV 2.4.3 and have tried the following setups for the HOGDescriptor:
hog.winSize = cv::Size(100,100);
hog.cellSize = cv::Size(5,5);
hog.blockSize = cv::Size(10,10);
hog.blockStride = cv::Size(5,5); //12996 feature vector
hog.winSize = cv::Size(100,100);
hog.cellSize = cv::Size(10,10);
hog.blockSize = cv::Size(20,20);
hog.blockStride = cv::Size(10,10); //2916 feature vector
Your first descriptor dimension is way too large to be of any use. To form a reliable SVM hyperplane, you need at least as many positive and negative samples as descriptor dimensions. This is because ideally you need separating information in every dimension of the hyperplane.
The number of positive and negative samples should be more or less the same unless you provide your SVM trainer with a bias parameter (may not be available in cvSVM).
There is no guarantee that HOG is a good descriptor for the type of problem you are trying to solve. Can you visually confirm that the object you are trying to detect has a distinct shape with similar orientation in all samples? A single type of flower for example may have a unique shape, however many types of flowers together don't have the same unique shape. A bamboo has a unique shape but may not be distinguishable from other objects easily, or may not have the same orientation in all sample images.
cvSVM is normally not the tool used to train SVMs for OpenCV HOG. Use the binary form of SVMLight (not free for commercial purposes) or libSVM (OK for commercial purposes):
1. Calculate HOGs for all samples using your C++/OpenCV code and write them to a text file in the correct input format for SVMLight/libSVM.
2. Use either of the programs to train a model with a linear kernel and the optimal C. Find the optimal C by searching for the best accuracy while changing C in a loop.
3. Calculate the detector vector (an N+1 dimensional vector, where N is the dimension of your descriptor) by taking all the support vectors, multiplying each one by its alpha value, and summing the results dimension by dimension to obtain an N-dimensional vector. As the last element, append -b, where b is the hyperplane bias (you can find it in the model file produced by SVMLight/libSVM training).
4. Feed this N+1 dimensional detector to HOGDescriptor::setSVMDetector() and use HOGDescriptor::detect() or HOGDescriptor::detectMultiScale() for detection.
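A minimal NumPy/OpenCV-Python sketch of steps 3 and 4 (the workflow above is C++; the alphas, support vectors and bias below are placeholders standing in for values you would parse from the SVMLight/libSVM model file):

import numpy as np
import cv2

# Placeholders for values parsed from the SVMLight/libSVM model file:
# alphas: shape (num_sv,), support_vectors: shape (num_sv, N), b: hyperplane bias.
N = 2916
alphas = np.random.rand(10).astype(np.float32)
support_vectors = np.random.rand(10, N).astype(np.float32)
b = 0.5

# Weighted sum of the support vectors gives the N-dimensional weight vector;
# -b is appended as the last element.
w = (alphas[:, None] * support_vectors).sum(axis=0)
detector = np.append(w, -b).astype(np.float32)  # N+1 elements

# HOGDescriptor matching the second configuration from the question (2916-dim descriptor).
hog = cv2.HOGDescriptor((100, 100), (20, 20), (10, 10), (10, 10), 9)
hog.setSVMDetector(detector)

img = cv2.imread("test.jpg")                    # placeholder test image path
found, weights = hog.detectMultiScale(img)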
I have had successful results using SVMLight to learn SVM models when training from OpenCV, but haven't used cvSVM, so can't compare.
The hogDraw function from http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html will visualise your descriptor.