Perform multi-scale training (yolov2) - computer-vision

I am wondering how the multi-scale training in YOLOv2 works.
In the paper, it is stated that:
"The original YOLO uses an input resolution of 448 × 448. With the addition of anchor boxes we changed the resolution to 416 × 416. However, since our model only uses convolutional and pooling layers it can be resized on the fly. We want YOLOv2 to be robust to running on images of different sizes so we train this into the model. Instead of fixing the input image size we change the network every few iterations. Every 10 batches our network randomly chooses a new image dimension size. Since our model downsamples by a factor of 32, we pull from the following multiples of 32: {320, 352, ..., 608}. Thus the smallest option is 320 × 320 and the largest is 608 × 608. We resize the network to that dimension and continue training."
I don't get how a network with only convolutional and pooling layers allows inputs of different resolutions. In my experience of building neural networks, if you change the resolution of the input to a different scale, the number of parameters of the network changes, that is, the structure of the network changes.
So, how does YOLOv2 change this on the fly?
I read the configuration file for yolov2, but all I got was a random=1 statement...

If you only have convolutional layers, the number of weights does not change with the spatial (2D) size of the layers (it would change if you also changed the number of channels).
For example (imagined network), if you have 224x224x3 input images and a 3x3x64 convolutional layer (64 filters of size 3x3), you will have 64 different 3*3*3 convolutional filter kernels = 1728 weights. This value does not depend on the size of the image at all, since the same kernel is applied at each position of the image independently. This is the most important property of convolution and convolutional layers; it is the reason why CNNs can go so deep, and why in Faster R-CNN you can just crop the regions out of your feature map.
If there were any fully connected layers, it would not work this way, since a bigger 2D layer dimension would lead to more connections and more weights.
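As a minimal sketch of this counting argument (the function names are illustrative and not taken from any library, and the 64-unit fully connected layer is a made-up comparison), the following contrasts the weight count of the 3x3x64 convolutional layer with that of a fully connected layer on the flattened image, ignoring biases:

```python
def conv_weight_count(kernel_h, kernel_w, in_channels, out_channels):
    """Weights of a conv layer (biases ignored): no dependence on image size."""
    return kernel_h * kernel_w * in_channels * out_channels


def fc_weight_count(height, width, channels, out_units):
    """Weights of a fully connected layer on the flattened input (biases ignored)."""
    return height * width * channels * out_units


for size in (224, 448):
    conv = conv_weight_count(3, 3, 3, 64)    # same value for every image size
    fc = fc_weight_count(size, size, 3, 64)  # grows with the image area
    print(f"{size}x{size}x3 input: conv weights = {conv}, fc weights = {fc}")
```

Doubling the image side leaves the convolutional count at 1728 but multiplies the fully connected count by four.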
In YOLOv2, there is one thing that might still look like it does not fit. For example, if you double the image size in each dimension, you end up with twice the number of features in each dimension right before the final 1x1xN filter: if your grid was 7x7 for the original network size, the resized network might have 14x14. But then you just get 14x14 * B*(5+C) regression results, which is fine.
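A small sketch of how the output shape follows the input size, assuming a total downsampling factor of 32 (as stated in the paper quote in the question). The defaults of B = 5 boxes and C = 20 classes are illustrative assumptions, and `yolo_output_shape` is a hypothetical helper, not part of any YOLO implementation:

```python
def yolo_output_shape(input_size, stride=32, num_boxes=5, num_classes=20):
    """Return (grid, grid, B * (5 + C)) for a square input of side `input_size`."""
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the stride")
    grid = input_size // stride
    return (grid, grid, num_boxes * (5 + num_classes))


for size in (224, 448):
    print(size, yolo_output_shape(size))
```

Only the first two entries (the grid) change with the input size; the last entry depends on B and C, which are fixed by the weights of the final 1x1 filter.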

In YOLO, if you are only using convolutional layers, the size of the output grid changes with the input size.
For example, for an input size of:
320x320, output size is 10x10
608x608, output size is 19x19
You then calculate the loss on these with respect to the ground-truth grid, which is adjusted in the same way.
Thus you can backpropagate the loss without adding any more parameters.
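The schedule described in the paper quote (a new size from {320, 352, ..., 608} every 10 batches) could be sketched as follows. This is not the Darknet implementation; `multiscale_schedule` and its `seed` parameter are hypothetical names used for illustration only:

```python
import random

STRIDE = 32
SIZES = list(range(320, 608 + 1, STRIDE))  # 320, 352, ..., 608


def multiscale_schedule(num_batches, every=10, seed=0):
    """Yield (batch_index, input_size, grid_size), re-drawing the size every `every` batches."""
    rng = random.Random(seed)
    size = None
    for batch in range(num_batches):
        if batch % every == 0:
            size = rng.choice(SIZES)  # new resolution for the next `every` batches
        # A real training loop would resize the images and rebuild the
        # ground-truth grid at size // STRIDE cells per side here.
        yield batch, size, size // STRIDE


for batch, size, grid in multiscale_schedule(30):
    if batch % 10 == 0:
        print(f"batch {batch}: input {size}x{size}, grid {grid}x{grid}")
```

The weights are never touched when the size changes; only the input tensors and the target grid are rebuilt.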
Refer to the YOLOv1 paper for the loss function:
[Image: loss function from the paper]
In theory, you only need to adjust this function, which depends on the grid size and not on any model parameters, and you should be good to go.
Paper Link: https://arxiv.org/pdf/1506.02640.pdf
The author mentions the same point in the video explanation, at 14:53.

Related

How CNN reduce parameter and reuse weight?

In the post A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way, it says
A ConvNet is able to successfully capture the Spatial and Temporal
dependencies in an image through the application of relevant filters.
The architecture performs a better fitting to the image dataset due to
the reduction in the number of parameters involved and reusability of
weights.
I don't see how it reduces the number of parameters and reuses weights. Could anyone give an example?
Consider a filter (or kernel) of 9 pixels (e.g. 3x3) and an image of 49 pixels (e.g. 7x7).
In a fully connected layer that connects all 49 pixels to 9 output units, we'd have 9*49 = 441 weights.
In a CNN, the same filter keeps moving (convolving) over the entire image. All pixel values in the image are multiplied by those same 9 filter values (hence we say the weights are reused). So we need just 9 weights per filter instead of the 441 of the FC layer.
The job of a filter is to identify features (such as textures, lines, etc.), which could be anywhere in an image. So we want to reuse the same filter over the entire image.
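To make the weight reuse concrete, here is a minimal pure-Python sketch that slides one 3x3 kernel over a 7x7 image (stride 1, no padding). The pixel values and the kernel are made up for illustration, and, as in most deep learning libraries, the kernel is not flipped (strictly speaking, this is cross-correlation):

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over a 2D image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[r * 7 + c for c in range(7)] for r in range(7)]  # 49 made-up pixel values
kernel = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # 9 weights; this kernel copies the centre pixel
feature_map = conv2d_valid(image, kernel)
print(len(feature_map), "x", len(feature_map[0]), "outputs from 9 shared weights")
```

The same 9 numbers produce all 25 outputs, and the function would accept a larger image without any change to the kernel.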
We can calculate the number of parameters of a convolutional layer using the formula: ((width_of_Kernel * height_of_Kernel * input_channel) + 1) * output_channel, where the +1 accounts for one bias per output channel.
Here we can see that the kernel size, the number of input channels, and the number of output channels determine the number of parameters. By reducing them, we can reduce the number of parameters and therefore the size of the model.
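The formula above translates directly into a few lines of Python. The function name `conv_params` and the example layer sizes are chosen here for illustration only:

```python
def conv_params(kernel_w, kernel_h, in_channels, out_channels):
    """Parameters of a conv layer: one kernel plus one bias per output channel."""
    return (kernel_w * kernel_h * in_channels + 1) * out_channels


# A 3x3 layer on an RGB input with 64 output channels.
print(conv_params(3, 3, 3, 64))
# The same layer with a 5x5 kernel needs more parameters.
print(conv_params(5, 5, 3, 64))
```

Note that neither the image width nor the image height appears in the formula.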

Why in CNN for image recognition tasks, the filters are always chosen to be extremely localized?

In CNNs, the filters are usually 3x3 or 5x5 spatially. Could their size be comparable to the image size? One reason for keeping them small is to reduce the number of parameters to be learnt. Apart from this, are there any other key reasons? For example, do people want to detect edges first?
You have answered part of the question yourself. Another reason is that most of these useful features may be found in more than one place in an image. So it makes sense to slide a single kernel all over the image in the hope of extracting that feature in different parts of the image using the same kernel. If you use a big kernel, the features could be interleaved and not clearly detected.
In addition to your own answer, the reduction in computational cost is a key point. Since we use the same kernel for different sets of pixels in an image, the same weights are shared across these pixel sets as we convolve over them. And as there are fewer weights than in a fully connected layer, we have fewer weights to backpropagate through.

Change resolution after training (have got a pre-trained model)

The YOLOv1 paper mentions[1] that the first part of the network, that is, the convolutional layers, is first trained at an input resolution of 224x224 on the ImageNet dataset. After that, the model is converted to perform detection, and the input resolution is increased from 224x224 to 448x448. I am wondering how this conversion can be done: if the input of the network is at first 224x224, then the number of parameters should differ from that of 448x448, which means that the convolutional layers trained on the ImageNet dataset cannot be reused for detection.
What am I missing here?
[1]: At the end of section "2.2 Training"
if the input of the network is at first 224x224, then the number of parameters should differ from that of 448x448
This is your misunderstanding.
The convolution operation places no constraints on the size of the input, and thus none on the size of the output. When you train a CNN that has fully connected layers at the end for classification, you constrain the input to a fixed size, because the number of inputs that an FC layer can accept is fixed.
But if you remove the classification head from the network and use only the trained weights of the CNN as a feature extractor, you'll notice that, given an input of any dimension (>= the dimension the network has been trained on), the output will be a set of feature maps whose spatial extent increases as the spatial extent of the input increases.
Hence, in YOLO, the network is initially trained to perform classification at a resolution of 224x224. In this way, the weights of the convolutional layers plus the weights of the FC layers at the end learn to extract and classify meaningful features.
After this first training, the FC layers are thrown away and only the feature extraction part is kept. In this way, you can use a good feature extractor, one that has already learned to extract meaningful features, in a convolutional fashion (i.e., producing not a feature vector but a feature map as output, which can be post-processed as YOLO does).
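The growth of the feature map with the input can be checked with the usual output-size formula floor((n + 2p - k) / s) + 1. The layer stack below is a made-up example, not the actual YOLO architecture, and the helper names are hypothetical:

```python
def conv_out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


# Hypothetical stack of (kernel, stride, padding): conv 7x7/2, then
# alternating 2x2/2 pooling and 3x3/1 convolutions with padding 1.
LAYERS = [(7, 2, 3), (2, 2, 0), (3, 1, 1), (2, 2, 0), (3, 1, 1), (2, 2, 0)]


def feature_map_size(input_size, layers=LAYERS):
    """Apply the same layers (and hence the same weights) to any input size."""
    size = input_size
    for kernel, stride, padding in layers:
        size = conv_out_size(size, kernel, stride, padding)
    return size


for n in (224, 448):
    s = feature_map_size(n)
    print(f"input {n}x{n} -> feature map {s}x{s}")
```

The layer list, and therefore the set of weights, is identical for both inputs; only the spatial size of the result differs.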

Is there a heuristic for homogenizing image dimensions before using them to train neural net?

I am training a neural net on a set of images with heterogeneous dimensions. Of course, they all have to have the same dimensions to be fed to the NN, and it is simple enough to use scipy.misc.imresize() for this. But, how should I choose width and height? My first instinct was to plot histograms of both and eyeball values around the 75th percentile. I also thought maybe I should scale all images up to the max values for both height and width, so that no details are discarded from the higher-pixel images. Is there a best practice for addressing this problem? Thanks!
For reference, I am using python 2.7 and keras with theano backend and dimension ordering.
I don't think there is a standard approach on this. In machine learning, in many cases we have to try and see.
If I were you and had to build a custom neural network, I would start with the mean image size and then gradually increase the size until reaching the optimum score.
If you are using a pretrained neural network then just resize your images to network's default.
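As a rough sketch of the candidate sizes discussed in this thread (mean, 75th percentile, and maximum), the following summarizes a list of (width, height) pairs using only the standard library. The dimensions are invented, `suggest_sizes` is a hypothetical helper, and `statistics.quantiles` requires Python 3.8 or later, which is newer than the Python 2.7 setup mentioned in the question:

```python
import statistics


def suggest_sizes(dimensions):
    """Summarize (width, height) pairs into candidate target sizes."""
    widths = [w for w, _ in dimensions]
    heights = [h for _, h in dimensions]
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is the 75th percentile.
    p75_w = statistics.quantiles(widths, n=4)[2]
    p75_h = statistics.quantiles(heights, n=4)[2]
    return {
        "mean": (round(statistics.mean(widths)), round(statistics.mean(heights))),
        "p75": (round(p75_w), round(p75_h)),
        "max": (max(widths), max(heights)),
    }


dims = [(100, 80), (200, 160), (300, 240), (400, 320)]  # made-up image sizes
print(suggest_sizes(dims))
```

Each candidate is only a starting point; which one works best still has to be found by trying them.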

Setting up a CNN network in keras?

I am currently trying to implement a CNN that maps an input to an output.
The input consists of the STFT of audio files, and the output is a feature vector.
Because the audio files have different lengths, the total number of samples always differs, but each sample has a frame length of 25 ms and a 10 ms overlap. The input shape is (x, 2050).
The output feature vector has shape (x, 13).
I thought a CNN seemed appropriate here, since each input contains some information from the previous sample due to the overlap.
Is it possible in Keras to design a model that makes use of this, so that a convolutional sum is calculated for each row of the matrix, and that is somehow aware of the 25 ms frame length and the 10 ms overlap?
Yes it is, see line 220 of this file [1]. This is an implementation of Wavenet in Keras using convolutions. Even though they've created wrapper layers, this should give you the intuition on how to model audio samples.
[1] https://github.com/basveeling/wavenet/blob/master/wavenet.py#L220
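Independently of the linked implementation, the idea of a convolution along the time axis that accepts any number of frames can be sketched in pure Python. This is only an illustration of what a 1D convolution layer over time computes, not Keras code; `temporal_conv` is a hypothetical helper, and the tiny sizes (4 features in, 2 features out) stand in for the 2050 and 13 of the question:

```python
def temporal_conv(frames, weights, bias):
    """Map T x F frames to T x D outputs with a kernel spanning K neighbouring frames.

    weights has shape K x F x D (K odd); frames outside the signal count as zeros.
    """
    k = len(weights)
    half = k // 2
    n_feat = len(frames[0])
    n_out = len(bias)
    output = []
    for t in range(len(frames)):
        row = list(bias)
        for dk in range(k):
            src = t + dk - half  # neighbouring frame index
            if 0 <= src < len(frames):
                for f in range(n_feat):
                    for d in range(n_out):
                        row[d] += frames[src][f] * weights[dk][f][d]
        output.append(row)
    return output


K, F, D = 3, 4, 2  # kernel width, input features, output features (toy sizes)
weights = [[[1.0] * D for _ in range(F)] for _ in range(K)]
bias = [0.0] * D
for num_frames in (5, 9):  # two "audio files" of different length
    frames = [[1.0] * F for _ in range(num_frames)]
    out = temporal_conv(frames, weights, bias)
    print(num_frames, "frames in ->", len(out), "x", len(out[0]), "out")
```

The weight tensor has the same shape for both lengths, so a model built this way yields one output row per input row regardless of x.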