Questions about R-CNN, RPN, anchor box - computer-vision

My brief understanding of Faster R-CNN
S1. assign anchor boxes
S2. find each anchor box's objectness label (positive, negative, or neither) using the ground truth (is this ROI pooling?)
S3. select a set of positive and negative boxes to train the region proposal network (training it to predict objectness scores on the test set?)
S4-1. train the classification layer using the positive anchor boxes
S4-2. train the box regression layer using the positive anchor boxes
.....
Questions
Q1. When an anchor box has a high IoU with several ground-truth boxes, is the box with the highest IoU chosen as the target? Can an anchor box have only one target?
Q2. How are a positive anchor box and its target box matched in code? (Most explanations say an anchor box object contains four variables: center x, center y, width, height, which leaves no field for its target.)
Q3. Does ROI pooling refer to stage two?
Q4. Is the third-stage training used to predict the objectness score for the test set?
Q5. Is there a reason, other than training speed, for not training on anchor boxes that are labeled neither positive nor negative? (Aren't all boxes' objectness scores estimated during training, including positive, negative, and unlabeled ones?)
Q6. Shouldn't the 4-1 classification happen after the box has been moved by the 4-2 box regression? (Explanations say the two layers are independent and run simultaneously.) Shouldn't the two layers have an order?
Q7. Is the probability of an anchor box used for NMS the classification score calculated in stage 4-1?
Q8. Unlike the RPN stage, where all anchor boxes are used for training, the classification and regression stages only use a few positive anchor boxes. How does back-propagation happen when training a model like this, where some stages use only part of the training set? (I heard that one advantage of Faster R-CNN is that all stages are connected as a single deep-learning model. But if the later stages only use the chosen boxes, and if the classification/box-regression stage works independently, how can the full model work as one?)

Related

How do the anchor box mechanism and NMS work for Faster R-CNN?

How can anchor boxes and NMS be applied to a test set when the Ground Truth is unknown?
Without the ground truth, the IoU of an anchor box is unknown, so are all the anchor boxes used to predict the box locations?
Then why are they used in training in the first place?
1 - You don't need the GT in the validation or test phase, except for measuring the performance of the model, because the model has already learned the regression coefficients in the training phase. These coefficients are used to fit the default anchor boxes to the predictions.
2 - NMS eliminates region proposals without using the GT. It aims to select the best proposal by comparing the proposals against each other.
3 - If you don't know the GT, you can't learn the classes and coordinates of objects during the training phase. So you need to compare your predictions with the GT.
You can check out this great article to get a comprehensive understanding of the R-CNN family.
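To make points 1 and 2 concrete, here is a minimal NumPy sketch (my own illustration, not code from any specific framework) of how the learned regression coefficients are applied to the default anchor boxes at test time, followed by a simple greedy NMS pass. The (tx, ty, tw, th) parameterization is the usual one from the R-CNN papers, and the thresholds are just example values.

import numpy as np

def decode_anchors(anchors, deltas):
    # anchors: (N, 4) default boxes as (x1, y1, x2, y2); deltas: (N, 4) predicted
    # regression coefficients (tx, ty, tw, th) learned during training.
    w = anchors[:, 2] - anchors[:, 0]
    h = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + 0.5 * w
    cy = anchors[:, 1] + 0.5 * h
    # Shift the center and rescale the size with the predicted coefficients.
    pred_cx = cx + deltas[:, 0] * w
    pred_cy = cy + deltas[:, 1] * h
    pred_w = w * np.exp(deltas[:, 2])
    pred_h = h * np.exp(deltas[:, 3])
    return np.stack([pred_cx - 0.5 * pred_w, pred_cy - 0.5 * pred_h,
                     pred_cx + 0.5 * pred_w, pred_cy + 0.5 * pred_h], axis=1)

def nms(boxes, scores, iou_thresh=0.7):
    # Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou < iou_thresh]
    return keep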

Generating anchor boxes using K-means clustering, YOLO

I am trying to understand the working of YOLO and how it detects objects in an image. My question is: what role does k-means clustering play in detecting the bounding box around an object? Thanks.
The K-means clustering algorithm is a very well-known algorithm in data science. It aims to partition n observations into k clusters. Its main steps are:
Initialization: k means (i.e. centroids) are generated at random.
Assignment: clusters are formed by associating each observation with its nearest centroid.
Cluster update: the centroid of each newly formed cluster is recomputed as the mean of its members.
The assignment and update steps are repeated until convergence.
The final result is that the sum of squared errors between the points and their respective centroids is minimized.
EDIT :
Why use K-means?
K-means is computationally faster and more efficient compared to other unsupervised learning algorithms; don't forget its time complexity is linear.
It produces tighter clusters than hierarchical clustering, and a larger number of clusters helps produce a more accurate end result.
An instance can change cluster (move to another cluster) when the centroids are re-computed.
It works well even if some of its assumptions are broken.
What it really does in determining anchor boxes
It creates thousands of anchor boxes (i.e. clusters in k-means) for each predictor, representing shape, location, size, etc.
For each anchor box, calculate which object's bounding box has the highest intersection divided by union with it. This is called Intersection over Union, or IoU.
If the highest IoU is greater than 50% (this can be customized), tell the anchor box that it should detect the object with which it has the highest IoU.
Otherwise, if the IoU is greater than 40%, tell the neural network that the true detection is ambiguous and not to learn from that example.
If the highest IoU is less than 40%, then the anchor box should predict that there is no object.
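As a small sketch of this assignment rule (my own illustration; the 50% and 40% thresholds above become configurable parameters):

import numpy as np

def assign_anchor(anchor_ious, pos_thresh=0.5, ignore_thresh=0.4):
    # anchor_ious: 1-D array holding this anchor's IoU with every ground-truth box.
    # Returns (label, gt_index): label is 1 (positive), -1 (ambiguous/ignored) or 0 (negative).
    best_gt = int(np.argmax(anchor_ious))
    best_iou = anchor_ious[best_gt]
    if best_iou > pos_thresh:        # detect the object it overlaps most
        return 1, best_gt
    if best_iou > ignore_thresh:     # ambiguous: excluded from the loss
        return -1, None
    return 0, None                   # no object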
Thanks!
In general, bounding boxes for objects are given by tuples of the form
(x0,y0,x1,y1) where x0,y0 are the coordinates of the lower left corner and x1,y1 are the coordinates of the upper right corner.
We need to extract the width and height from these coordinates and normalize them with respect to the image width and height.
Metric for K-means
Euclidean distance
IoU (Jaccard index)
IoU turns out to be better than the former.
Jaccard index = (Intersection between selected box and cluster head box)/(Union between selected box and cluster head box)
At initialization we can choose k random boxes as our cluster heads. Each ground-truth box is then assigned to the cluster head with which it has the highest IoU (i.e. the smallest 1 - IoU distance), the cluster heads are updated, and the mean IoU of each cluster tracks how well the anchors fit.
This process can be repeated until convergence.
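Putting the last few points together, here is a minimal NumPy sketch (my own illustration) of the usual YOLO-style procedure: ground-truth boxes are reduced to normalized (width, height) pairs and clustered with k-means using 1 - IoU as the distance, so the final cluster heads become the anchor shapes.

import numpy as np

def iou_wh(box, centroids):
    # IoU between one (w, h) box and each centroid, with all boxes aligned at a corner.
    inter = np.minimum(box[0], centroids[:, 0]) * np.minimum(box[1], centroids[:, 1])
    union = box[0] * box[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k, iters=100, seed=0):
    # boxes_wh: (N, 2) array of normalized (width, height) pairs; returns k anchor shapes.
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the cluster head with the highest IoU (smallest 1 - IoU).
        assignments = np.array([np.argmax(iou_wh(b, centroids)) for b in boxes_wh])
        new_centroids = np.array([boxes_wh[assignments == c].mean(axis=0)
                                  if np.any(assignments == c) else centroids[c]
                                  for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids

# Usage: widths and heights extracted from (x0, y0, x1, y1) and normalized by image size, e.g.
# anchors = kmeans_anchors(np.stack([(x1 - x0) / img_w, (y1 - y0) / img_h], axis=1), k=5)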

What's the difference between "BB regression algorithms used in R-CNN variants" vs "BB in YOLO" localization techniques?

Question:
What's the difference between the bounding box (BB) produced by "BB regression algorithms in region-based object detectors" vs the "bounding box in single-shot detectors"? And can they be used interchangeably? If not, why?
While studying variants of the R-CNN and YOLO algorithms for object detection, I came across two major techniques for object detection: region-based (R-CNN) and sliding-window based (YOLO).
Both use different variants (from complicated to simple) in each regime, but in the end they are just localizing objects in the image using bounding boxes. I am focusing only on localization below (assuming classification is happening), since that is more relevant to the question, and I have briefly explained my understanding:
Region-based:
Here, we let the neural network predict continuous variables (the BB coordinates) and refer to this as regression.
The regressor (which is not linear at all) is just a CNN or one of its variants (all layers are differentiable); its outputs are four values (r, c, h, w), where (r, c) specify the position of the left corner and (h, w) the height and width of the BB.
To train this NN, a smooth L1 loss is used to learn the precise BB by penalizing the network when its outputs are very different from the labeled (r, c, h, w) in the training set.
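For reference, a minimal NumPy version of the smooth L1 loss described above (the beta = 1 form; deep-learning frameworks expose equivalent built-ins):

import numpy as np

def smooth_l1(pred, target, beta=1.0):
    # Quadratic for small errors, linear for large ones, so it is less sensitive
    # to outliers than a plain L2 loss.
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).sum()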
Sliding-window based (implemented convolutionally):
First, we divide the image into, say, 19x19 grid cells.
An object is assigned to a grid cell by taking the object's midpoint and assigning the object to whichever grid cell contains that midpoint. So even if an object spans multiple grid cells, it is assigned to only one of the 19x19 grid cells.
Now, you take the coordinates of this grid cell and calculate the precise BB (bx, by, bh, bw) for that object, where
(bx, by, bh, bw) are relative to the grid cell: bx and by are the center point, and bh and bw are the height and width of the precise BB, specified as fractions of the grid cell's size (so bh and bw can be > 1).
There are multiple ways of calculating the precise BB, as specified in the paper.
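Here is a small sketch (my own, following the description above) of how one ground-truth box could be encoded into the grid-cell-relative (bx, by, bh, bw) targets; the 19x19 grid is the example size from above.

def encode_yolo_target(box, image_w, image_h, grid=19):
    # box: (x0, y0, x1, y1) in pixels. Returns the responsible grid cell and
    # (bx, by, bh, bw) relative to that cell.
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    cell_w, cell_h = image_w / grid, image_h / grid
    # The object is assigned to the single cell that contains its midpoint.
    col, row = int(cx // cell_w), int(cy // cell_h)
    bx = cx / cell_w - col             # center offset inside the cell, in [0, 1)
    by = cy / cell_h - row
    bw = (box[2] - box[0]) / cell_w    # size as a fraction of the cell, can be > 1
    bh = (box[3] - box[1]) / cell_h
    return (row, col), (bx, by, bh, bw)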
Both Algorithms:
output precise bounding boxes;
work in supervised-learning settings, using labeled datasets where the labels are bounding boxes (manually marked by an annotator using tools like labelImg) stored for each image in JSON/XML format.
I am trying to understand the two localization techniques on a more abstract level (while also having an in-depth idea of both) to get more clarity on:
in what sense are they different?
why were both created, i.e. what are the failure/success points of one versus the other?
and can they be used interchangeably? If not, why?
Please feel free to correct me if I am wrong somewhere; feedback is highly appreciated! Citing a particular section of a research paper would be even better.
The essential difference is that two-stage, Faster R-CNN-like architectures are more accurate, while single-stage, YOLO/SSD-like architectures are faster.
In two-stage architectures, the first stage usually performs region proposal, while the second stage performs classification and more accurate localization. You can think of the first stage as similar to a single-stage architecture, the difference being that the region proposal only separates "object" from "background", while the single-stage detector distinguishes between all object classes. More explicitly, in the first stage, also in a sliding-window-like fashion, an RPN says whether an object is present or not and, if there is one, roughly gives the region (bounding box) in which it lies. This region is used by the second stage for classification and bounding-box regression (for better localization): the relevant features are first pooled from the proposed region and then passed through a Fast R-CNN-like head, which does the classification + regression.
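As a rough schematic of that two-stage flow (my own sketch; backbone, rpn, roi_pool and head are placeholder callables standing in for the real networks, not any library's API):

def two_stage_detect(image, backbone, rpn, roi_pool, head, objectness_thresh=0.7):
    # Stage 1: shared features, then class-agnostic proposals ("object" vs. "background").
    features = backbone(image)
    proposals, objectness = rpn(features)
    proposals = [p for p, s in zip(proposals, objectness) if s > objectness_thresh]
    # Stage 2: pool features from each proposed region, then classify and refine the box.
    results = []
    for box in proposals:
        roi_feat = roi_pool(features, box)
        cls_scores, box_deltas = head(roi_feat)
        results.append((box, cls_scores, box_deltas))
    return results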
Regarding your question about interchanging between them - why would you want to do so? Usually you would choose an architecture by your most pressing needs (e.g. latency/power/accuracy), and you wouldn't want to interchange between them unless there's some sophisticated idea which will help you somehow.

Conceptual Question Regarding the Yolo Object Detection Algorithm

My understanding is that the motivation for anchor boxes (in the YOLOv2 algorithm) is that in the first version of YOLO (YOLOv1) it is not possible to detect multiple objects in the same grid cell. I don't understand why this is the case.
Also, the original paper by the authors (Yolo v1) has the following quote:
"Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts."
Doesn't this indicate that a grid cell can recognize more than one object? In their paper, they take B as 2. Why not take B as some arbitrarily higher number, say 10?
Second question: how are the Anchor Box dimensions tied to the Bounding Box dimensions, for detecting a particular object? Some websites say that the Anchor Box defines a shape only, and others say that it defines a shape and a size. In either case, how is the Anchor Box tied to the Bounding Box?
Thanks,
Sandeep
You're right that YOLOv1 has multiple (B) bounding boxes, but these are not assigned to ground truths in an effective or systematic way, and therefore the inferred bounding boxes are not accurate enough.
As you can read on blog posts over the internet, an Anchor/Default Box is a box in the original image which corresponds to a specific cell in a specific feature map, which is assigned with specific aspect ratio and scale.
The scale is usually dictated by the feature map (deeper feature map -> larger anchor scale), and the aspect ratios vary, e.g. {1:1, 1:2, 2:1} or {1:1, 1:2, 2:1, 1:3, 3:1}.
By the scale and aspect ratio, a specific shape is dictated, and this shape, with a position dictated by the position of the current cell in the feature map, is compared to ground truth bounding boxes in the original image.
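To make the scale/aspect-ratio combination concrete, here is a small sketch (my own illustration) that enumerates the anchor shapes for one feature-map cell, keeping the area roughly constant for a given scale:

import numpy as np

def anchor_shapes(scale, aspect_ratios=(1.0, 0.5, 2.0)):
    # scale: roughly sqrt(anchor area) in input-image pixels; each aspect ratio is w / h.
    shapes = []
    for ar in aspect_ratios:
        w = scale * np.sqrt(ar)
        h = scale / np.sqrt(ar)
        shapes.append((w, h))      # w * h == scale**2, w / h == ar
    return shapes

# Deeper feature maps get larger scales, e.g. anchor_shapes(64), anchor_shapes(128), anchor_shapes(256).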
Different papers have different assignment schemes, but it usually goes like this: (1) if the IoU of the anchor on the original image with a GT is over some threshold (e.g. 0.5), then it is a positive assignment for that anchor; (2) if it is under some lower threshold (e.g. 0.1), then it is a negative assignment; and (3) if it falls in the gap between the two thresholds, the anchor is ignored (in the loss computation).
This way, an anchor is in fact like a "detector head" responsible for the specific cases that are most similar to it shape-wise. It is therefore responsible for detecting objects with a shape similar to its own, and it infers both a confidence for each class and bounding-box parameters relative to itself, i.e. how much to modify the anchor's height, width, and center (in the two axes) to obtain the correct bounding box.
Because of this assignment scheme, which distributes the responsibility effectively between the different anchors, the bounding box prediction is more accurate.
Another downside of YOLOv1's scheme is that it decouples bounding-box prediction and classification. On one hand this saves computation, but on the other hand the classification is done at the level of the grid cell, so the B bounding-box options all share the same class prediction. This means, for example, that if there are multiple objects of different classes with the same center (e.g. a person holding a cat), then the classification of all but at most one of them will be wrong. Note that it is theoretically possible that predictions of adjacent grid cells will compensate for this wrong classification, but it is not guaranteed, in particular since under YOLOv1's scheme the center is the assignment criterion.

Is there a way to identify regions that are not very similar from a set of images?

Given an image, I would like to extract several subimages from it, but the resulting subimages must not be overly similar to each other. If the center of each ROI is chosen randomly, then we must make sure that each subimage shares at most a small percentage of its area with the other subimages.
Alternatively, we could decompose the image into small regions over a regular grid and then randomly choose a subimage within each region. This option, however, does not ensure that all subimages are sufficiently different from each other. Obviously I have to choose a good way to compare the resulting subimages, as well as a similarity threshold.
The above procedure must be performed on many images, and none of the extracted subimages should be too similar. Is there a way to identify regions that are not very similar across a set of images (e.g. by inspecting all the histograms)?
One possible way is to split your image into n x n squares (handling the edge cases), as you pointed out, reduce each of them to a single value, and group them according to the k nearest values (relative to the other pieces). After you group them, you can select, for example, one image from each group. Something potentially better is to use a more relevant metric inside each group; see Comparing image in url to image in filesystem in python for two such metrics. Using such a metric, you can select more than one piece from each group.
Here is an example using a duck image I found around. It uses n = 128. To reduce each piece to a single number, it calculates the Euclidean distance to a pure black n x n piece.
(* Download the example image and split it into 128 x 128 grayscale pieces *)
f = Import["http://fohn.net/duck-pictures-facts/mallard-duck.jpg"];
pieces = Flatten[ImagePartition[ColorConvert[f, "Grayscale"], 128]]
(* Reduce each piece to a single number: its Euclidean distance to an all-black piece *)
black = Image[ConstantArray[0, {128, 128}]];
dist = Map[ImageDistance[#, black, DistanceFunction -> EuclideanDistance] &, pieces];
(* Nearest-neighbor lookup from a distance value back to the corresponding pieces *)
nf = Nearest[dist -> pieces];
Then we can see the grouping by considering k = 2:
(* Link each piece to its two nearest pieces (by distance value) and plot the resulting groups *)
GraphPlot[
  Flatten[Table[
    Thread[pieces[[i]] -> nf[dist[[i]], 2]], {i, Length[pieces]}]],
  VertexRenderingFunction -> (Inset[#2, #, Center, .4] &),
  SelfLoopStyle -> None]
Now you could use a metric (better than the distance to black) inside each of these groups to select the pieces you want from there.
Since you would like to apply this to a large number of images, and you already suggested it, let's discuss how to solve this problem by selecting different tiles.
The first step could be to define what "similar" is, so a similarity metric is needed. You already mentioned the tiles' histogram as one source of metric, but there may be many more, for example:
mean intensity,
90th percentile of intensity,
10th percentile of intensity,
mode of intensity, as in peak of the histogram,
variance of pixel intensity in the whole tile,
granularity, which you could quickly approximate by the difference between the raw and the Gaussian-filtered image, or by calculating the average variance in small sub-tiles.
If your image has two channels, the above list already leaves you with 12 metric components. Moreover, there are characteristics that you can obtain from the combination of channels, for example the correlation of pixel intensities between channels. With two channels that is only one characteristic, but with three channels there are already three.
To pick different tiles from this high-dimensional cloud, you could consider that some if not many of these metrics will be correlated, so a principal component analysis (PCA) would be a good first step. http://en.wikipedia.org/wiki/Principal_component_analysis
Then, depending on how many sample tiles you would like to choose, you could look at the projection. For seven tiles, for example, I would look at the first three principal components, choose from the two extremes of each, and then also pick the one tile closest to the center (3 * 2 + 1 = 7).
If you are concerned that choosing from the very extremes of each principal component may not be robust, the 10th and 90th percentiles may be. Alternatively, you could use a clustering algorithm to find well-separated examples, but this would depend on what your cloud looks like. Good luck.
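As a sketch of that selection step (my own illustration in plain NumPy; the feature list mirrors a subset of the metrics above, and the helper names are made up): build a feature vector per tile, standardize, run PCA via the SVD, then pick the two extremes of each leading component plus the tile closest to the center.

import numpy as np

def tile_features(tile):
    # tile: 2-D array of pixel intensities for one channel.
    return np.array([
        tile.mean(),                # mean intensity
        np.percentile(tile, 90),    # 90th percentile of intensity
        np.percentile(tile, 10),    # 10th percentile of intensity
        tile.var(),                 # variance over the whole tile
    ])

def pick_diverse_tiles(tiles, n_components=3):
    # Standardize the per-tile feature vectors so no single metric dominates.
    X = np.stack([tile_features(t) for t in tiles])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)
    # PCA via the SVD of the standardized feature matrix.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    k = min(n_components, vt.shape[0])
    proj = X @ vt[:k].T
    chosen = set()
    for c in range(k):
        chosen.add(int(np.argmin(proj[:, c])))   # one extreme of component c
        chosen.add(int(np.argmax(proj[:, c])))   # the other extreme
    chosen.add(int(np.argmin(np.linalg.norm(proj, axis=1))))  # tile closest to the center
    return sorted(chosen)                        # indices into the tile list (at most 2*k + 1)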