How do the anchor box mechanism and NMS work in Faster R-CNN? - computer-vision

How can anchor boxes and NMS be applied to a test set when the ground truth is unknown?
Without the ground truth, the IoU of each anchor box is unknown, so are all the anchor boxes used to predict the box location?
Then why are they used in training in the first place?

1 - You don't need GT in the validation or test phase, except for measuring the performance of the model, because the model has already learned the regression coefficients in the training phase. These coefficients are used to fit the default anchor boxes to the predicted boxes.
2 - NMS eliminates redundant region proposals without using GT. It selects the best proposal by comparing the proposals against each other.
3 - If you don't know the GT, you can't learn the classes and coordinates of objects during the training phase. So you need to compare your predictions with the GT during training.
You can check out this great article to get a comprehensive understanding of the R-CNN family.
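For concreteness, here is a minimal NumPy sketch of the two test-time steps from points 1 and 2: decoding the learned regression coefficients against the default anchors, and then running NMS on the resulting proposals. The (dx, dy, dw, dh) parameterization and the 0.7 IoU threshold follow the common Faster R-CNN convention; the function names are illustrative, not from any particular library.

```python
import numpy as np

def decode_boxes(anchors, deltas):
    """Apply predicted regression coefficients (dx, dy, dw, dh) to anchors.

    anchors: (N, 4) as (x1, y1, x2, y2); deltas: (N, 4) predicted by the network.
    No ground truth is needed here -- the coefficients were learned at training time.
    """
    widths = anchors[:, 2] - anchors[:, 0]
    heights = anchors[:, 3] - anchors[:, 1]
    ctr_x = anchors[:, 0] + 0.5 * widths
    ctr_y = anchors[:, 1] + 0.5 * heights

    dx, dy, dw, dh = deltas.T
    pred_ctr_x = ctr_x + dx * widths
    pred_ctr_y = ctr_y + dy * heights
    pred_w = np.exp(dw) * widths
    pred_h = np.exp(dh) * heights

    return np.stack([pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
                     pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h], axis=1)

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best box with the remaining ones -- proposals are compared
        # against each other, never against ground truth.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou < iou_threshold]
    return keep
```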

Related

Questions about R-CNN, RPN, anchor box

My brief understanding of Faster R-CNN
S1. assign anchor boxes
S2. find each anchor box's objectness (whether positive, negative, or neither) using the ground truth (ROI pooling?)
S3. select a set of positive and negative boxes to train the region proposal network (training it to predict the objectness score on the test set?)
S4-1. train the classification layer using positive anchor boxes
S4-2. train the box regression layer using positive anchor boxes
.....
Questions
Q1. When an anchor box has a high IoU with several ground truth boxes, is the box with the highest IoU chosen as the target? Can an anchor box have only one target?
Q2. How are a positive anchor box and its target box matched in code? (Most explanations say an anchor box object contains four variables: center x, center y, width, height, with no field for its target.)
Q3. Does ROI pooling refer to stage two?
Q4. Is the third-stage training used to predict the objectness score for the test set?
Q5. Is there a reason, other than training speed, for not training on anchor boxes that are labeled as neither positive nor negative? (Aren't all boxes' objectness scores estimated during training, including positive, negative, and neither?)
Q6. Shouldn't the 4-1 classification happen after the box has been moved by the 4-2 box regression? (Explanations say the two layers are independent and run simultaneously.) Shouldn't the two layers have an order?
Q7. Does the probability of an anchor box used for NMS refer to the classification score calculated in stage 4-1?
Q8. Unlike the RPN stage, where all anchor boxes are used for training, the classification and regression stages only use a few positive anchor boxes. How does back-propagation work when training a model like this, where some stages use only part of the training set? (I heard one advantage of Faster R-CNN is that all stages are connected as a single deep-learning model, but if the later stages only use chosen boxes, and if the classification/box regression stages work independently, how can the full model work as one?)
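(For S2/Q1/Q2, a minimal NumPy sketch of the usual IoU-based anchor labelling may help. The 0.7/0.3 thresholds and the matching rules below follow the common Faster R-CNN convention, so treat the details as an assumption rather than a definitive answer to the questions above.)

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between (N, 4) anchors and (M, 4) ground-truth boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return (labels, matched_gt_index): 1 = positive, 0 = negative, -1 = ignored.

    Each anchor gets at most one target: the GT box with the highest IoU (Q1/Q2).
    """
    ious = iou_matrix(anchors, gt_boxes)          # (N, M)
    best_gt = ious.argmax(axis=1)                 # index of the single target per anchor
    best_iou = ious.max(axis=1)

    labels = -np.ones(len(anchors), dtype=int)    # ignored by default
    labels[best_iou >= pos_thresh] = 1
    labels[best_iou < neg_thresh] = 0
    # The paper also marks, for every GT box, the anchor with the highest IoU as
    # positive, so that no object is left without at least one positive anchor.
    labels[ious.argmax(axis=0)] = 1
    return labels, best_gt
```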

What's the difference between "BB regression algorithms used in R-CNN variants" vs "BB in YOLO" localization techniques?

Question:
What's the difference between the bounding box (BB) produced by "BB regression algorithms in region-based object detectors" and the "bounding box in single-shot detectors"? Can they be used interchangeably, and if not, why?
While studying the variants of the R-CNN and YOLO algorithms for object detection, I came across two major techniques for performing object detection, i.e. region-based (R-CNN) and sliding-window based (YOLO).
Both regimes come in different variants (from complicated to simple), but in the end they are just localizing objects in the image using bounding boxes. Below I focus only on the localization (assuming classification is happening!), since that is more relevant to the question, and briefly explain my understanding:
Region-based:
Here, we let the neural network predict continuous variables (the BB coordinates) and refer to that as regression.
The regression that is defined (which is not linear at all) is just a CNN or one of its variants (all layers differentiable); the outputs are four values (r, c, h, w), where (r, c) specify the position of the left corner and (h, w) the height and width of the BB.
To train this NN, a smooth L1 loss is used to learn the precise BB by penalizing outputs that are very different from the labelled (r, c, h, w) in the training set.
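As a worked example of that loss, here is a small NumPy sketch of the smooth L1 function applied to the four box targets; the beta = 1 cut-over follows the usual Fast R-CNN formulation, and the example numbers are purely illustrative.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss used for box regression in the R-CNN family.

    Quadratic for small errors (|x| < beta), linear for large ones, so a few
    badly predicted coordinates do not dominate the gradient.
    """
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.sum()

# Example: predicted vs. labelled (r, c, h, w) for one box.
pred = np.array([12.0, 30.0, 48.0, 96.0])
target = np.array([10.0, 32.0, 50.0, 100.0])
print(smooth_l1(pred, target))
```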
Sliding-window (convolutionally implemented!) based:
First, we divide the image into, say, 19x19 grid cells.
The way you assign an object to a grid cell is by taking the midpoint of the object and assigning the object to whichever single grid cell contains that midpoint. So each object, even if it spans multiple grid cells, is assigned to only one of the 19x19 grid cells.
Now, you take the coordinates of this grid cell and calculate the precise BB (bx, by, bh, bw) for that object, where (bx, by, bh, bw) are relative to the grid cell: bx and by are the center point, and bh and bw are the height and width of the precise BB, each specified as a fraction of the grid cell's size, so bh and bw can be > 1.
There are multiple ways of calculating the precise BB specified in the paper; a small sketch follows below.
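A minimal sketch of the grid-cell encoding just described, assuming a 19x19 grid and box coordinates normalized to [0, 1]; the real detectors add anchor boxes/priors per cell on top of this, so this is only the basic idea.

```python
import numpy as np

def encode_yolo_target(box, grid_size=19):
    """Encode one ground-truth box (x1, y1, x2, y2), normalized to [0, 1],
    as (cell_row, cell_col, bx, by, bh, bw) relative to its grid cell.

    The object is assigned to the single cell containing the box midpoint;
    bx, by are the midpoint offsets inside that cell, and bh, bw are the box
    height/width as a fraction of the cell size (so they can be > 1).
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0      # midpoint of the object
    col = min(int(cx * grid_size), grid_size - 1)   # which cell it falls into
    row = min(int(cy * grid_size), grid_size - 1)

    bx = cx * grid_size - col                        # offset within the cell, in [0, 1)
    by = cy * grid_size - row
    bw = (x2 - x1) * grid_size                       # width in cell units, may exceed 1
    bh = (y2 - y1) * grid_size
    return row, col, bx, by, bh, bw

# A box spanning several cells is still assigned to exactly one cell.
print(encode_yolo_target((0.10, 0.20, 0.45, 0.70)))
```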
Both Algorithms:
output precise bounding boxes;
work in a supervised learning setting, using a labeled dataset where the labels are bounding boxes (manually marked by an annotator using tools like labelImg) stored for each image in a JSON/XML file.
I am trying to understand the two localization techniques on a more abstract level (as well as to have an in-depth idea of both!) to get more clarity on:
in what sense are they different?
why were both created, i.e. what are the failure/success points of one versus the other?
and can they be used interchangeably? If not, why not?
Please feel free to correct me if I am wrong somewhere; feedback is highly appreciated! Citing a particular section of a research paper would be even more rewarding!
The essential differences are that two-stage detectors (Faster R-CNN-like) are more accurate, while single-stage detectors (YOLO/SSD-like) are faster.
In two-stage architectures, the first stage is usually region proposal, while the second stage does classification and more accurate localization. You can think of the first stage as similar to the single-stage architectures, where the difference is that the region proposal only separates "object" from "background", while a single-stage detector distinguishes between all object classes. More explicitly, in the first stage, also in a sliding-window-like fashion, an RPN says whether there's an object present or not, and if there is, it roughly gives the region (bounding box) in which it lies. This region is used by the second stage for classification and bounding box regression (for better localization), by first pooling the relevant features from the proposed region and then running them through the Fast R-CNN-like architecture (which does the classification + regression).
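To make the two-stage pipeline concrete, here is a short inference sketch using torchvision's reference Faster R-CNN implementation (assuming a reasonably recent torch/torchvision install); internally the RPN proposes class-agnostic regions and the RoI heads then classify them and refine the boxes, which is exactly the split described above.

```python
import torch
import torchvision

# Reference two-stage detector: backbone + RPN (stage 1) + RoI heads (stage 2).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)           # stand-in for a real RGB image in [0, 1]
with torch.no_grad():
    predictions = model([image])           # no ground truth needed at inference

# Each prediction holds the refined boxes, class labels and confidence scores
# produced by the second stage (the RPN proposals are consumed internally).
print(predictions[0]["boxes"].shape, predictions[0]["labels"][:5], predictions[0]["scores"][:5])
```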
Regarding your question about interchanging between them - why would you want to do so? Usually you would choose an architecture by your most pressing needs (e.g. latency/power/accuracy), and you wouldn't want to interchange between them unless there's some sophisticated idea which will help you somehow.

Ways to combine two deep learning based classifiers

I want to have a primary CNN based classifier and a similar secondary classifier for image regions.
Both classifiers will be used on image regions. I need the first classifier to operate on a primary region, while the secondary classifier will operate on assistive regions and support the decision made by the first classifier with further evidence.
Thus the primary image region and the assistive ones will be used to infer one class label at a time.
What other ways or architectures exist these days to perform such a task, instead of ROI Pooling?
Ideally, I would like to have a classifier scheme similar to the one of this paper but without the use of ROI Pooling.
https://arxiv.org/pdf/1505.01197.pdf
You can take a look at https://arxiv.org/pdf/1611.10012.pdf, which contains a comprehensive survey of recent detection architectures. Basically, there are 3 meta-architectures, and all models fall into one of these categories:
Faster R-CNN: Similar to the paper you referenced, this is the improved version of Fast R-CNN which does not use selective search and instead integrates proposal generation directly into the network via a region proposal network (RPN).
R-FCN: similar in architecture to 1, except that RoI pooling is performed differently, via position-sensitive RoI pooling.
SSD: modifies the RPN in Faster R-CNN to directly output class probabilities, eliminating the need for the per-RoI computation done in RoI pooling. This is the fastest architecture type; YOLO falls into this category.
Based on my rough read-through of the paper you referenced, I think type 3 is the one you are looking for. However, in terms of implementation, equation 3 can be a little tricky: you may need to stop backpropagating gradients to regions that do not overlap with the primary region (or at least think about how they could affect the final results), since this architecture type computes probabilities over the whole image.
I also note that there are in fact no primary/secondary "classifiers". The paper describes primary/secondary "regions": the primary region is the one that contains the person (i.e. use a person detector to find the primary region first), and the secondary regions are those that overlap with the primary region. For activity classification there is only one classifier, except that the primary region carries more weight and the secondary regions each contribute a little to the final prediction score.
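To make that last point concrete, here is a rough NumPy sketch of the kind of scoring I read the paper as describing: one shared classifier scores the primary region and each secondary region, and the final per-class score is the primary score plus the maximum over the secondary regions. Treat this combination rule as my reading of equation 3 and verify it against the paper.

```python
import numpy as np

def combined_scores(primary_scores, secondary_scores):
    """Combine region scores for activity classification.

    primary_scores:   (C,)   class scores for the primary (person) region
    secondary_scores: (S, C) class scores for S overlapping secondary regions
    Returns a (C,) vector: primary score plus the best secondary score per class.
    """
    best_secondary = secondary_scores.max(axis=0)   # most supportive region per class
    return primary_scores + best_secondary

primary = np.array([1.2, -0.3, 0.5])                 # e.g. 3 activity classes
secondary = np.array([[0.1, 0.8, -0.2],
                      [0.6, 0.0, 0.4]])
scores = combined_scores(primary, secondary)
probs = np.exp(scores) / np.exp(scores).sum()        # softmax over classes
print(probs)
```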
Yaw Lin's answer contains a good amount of information; I'll just build on what he said in his last paragraph. I think the essence of what you want to do is not so much to process the person and the background independently and compare the results (that's clearly what you said you're doing), but to process the background first and infer from it the kinds of expectations you have for the primary region. Once you have some expectations, you can compare the primary region to the most significant ones.
For example, from Figure 1 (b) in your Arxiv link, if you can process the background and determine that it's outdoors in a highly populated region, then you can focus a lot of the probability density function of what the person is doing in social outdoor activities, making jogging much more likely as a guess before you even process the figure you're interested in. In contrast, for Figure 1 (a), if you can process the background and tell that it's indoors and contains computers, then you can focus probability on lone indoor computer-based activities, skyrocketing the probability of "working on a computer".

Turn on a boundary layer if more than one record is in the map extent/layout view?

Does anyone have any ideas on how I can automate the process of turning on an administrative boundary layer for map labeling, using a Python script from within a map document, whenever there is more than one administrative boundary area in the map extent?
For instance, if there are multiple county boundaries within the visible map extent (say the area of interest overlaps two counties) to turn on the boundary layer? I do not want to tabulate intersection on the area of interest as it does not cover the entire layout. In effect, if only one county is displayed in the map extent/layout, do not turn on the county layer. However, if it does display more than one county, turn on the county layer for display in the map extent/layout. I am trying to automate map production and am stuck on this one as I am “tabulating the intersection” of the map/layout extent, not a specific feature class.
Make sense? Thanks for any and all guidance as to how to approach this.
Using ArcGIS 10.1 SP1 Advanced
I discovered a way to do this. I snagged a script that creates a polygon from the current map extent, then performed a tabulate intersection on the boundary layer using this polygon. If the length of the resulting table was larger than 1, I turned the layer on.
Create polygon from map extent script link
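For anyone attempting the same thing, a rough arcpy 10.x sketch of that workflow might look like the following; the "Counties" layer name is a placeholder, and the tabulate intersection step is replaced here by a simple select-by-location count, which performs the same "more than one county visible" test.

```python
import arcpy

mxd = arcpy.mapping.MapDocument("CURRENT")
df = arcpy.mapping.ListDataFrames(mxd)[0]
county_lyr = arcpy.mapping.ListLayers(mxd, "Counties", df)[0]   # placeholder layer name

# Build a polygon from the current data frame extent.
ext = df.extent
corners = arcpy.Array([arcpy.Point(ext.XMin, ext.YMin),
                       arcpy.Point(ext.XMin, ext.YMax),
                       arcpy.Point(ext.XMax, ext.YMax),
                       arcpy.Point(ext.XMax, ext.YMin)])
extent_poly = arcpy.Polygon(corners, df.spatialReference)

# Count how many county features intersect the visible extent.
arcpy.SelectLayerByLocation_management(county_lyr, "INTERSECT", extent_poly)
count = int(arcpy.GetCount_management(county_lyr).getOutput(0))
arcpy.SelectLayerByAttribute_management(county_lyr, "CLEAR_SELECTION")

# Turn the boundary layer on only when more than one county is visible.
county_lyr.visible = count > 1
arcpy.RefreshTOC()
arcpy.RefreshActiveView()
```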

Shape-matching of plots using non-linear least squares

What would be the best way to implement a simple shape-matching algorithm to match a plot interpolated from just 8 (x, y) points against a database of similar plots (>12,000 entries), each plot having >100 nodes? The database has 6 categories of plots (signals measured under 6 different conditions), and the main aim is to find the right category (so for every category there are around 2,000 plots to compare against).
The 8-node plot would represent actual measurement data, but for now I am simulating this by selecting a random plot from the database, picking 8 points from it, and then smearing them using a Gaussian random number generator.
What would be the best way to implement non-linear least squares to compare the shape of the 8-node plot against each plot from the database? Are there any C++ libraries you know of that could help with this?
Is it necessary to find the actual formula (f(x)) of the 8-node plot to use it with least squares, or will it be sufficient to use interpolation in requested points, such as interpolation from the gsl library?
You can certainly use least squares without knowing the actual formula. If all of your plots are measured at the same x values, then this is easy -- you simply compute the sum in the normal way:
chi^2 = sum_i (y_i - Y(x_i))^2 / sigma_i^2
where y_i is a point in your 8-node plot, sigma_i is the error on that point, and Y(x_i) is the value of the plot from the database at the same x position as y_i. You can see why this is trivial if all your plots are measured at the same x values.
If they're not, you can get Y(x_i) either by fitting the plot from the database with some function (if you know it) or by interpolating between the points (if you don't know it). The simplest interpolation is just to connect the points with straight lines and find the value of the straight lines at the x_i that you want. Other interpolations might do better.
In my field, we use ROOT for this kind of thing. However, scipy has a great collection of functions, and it might be easier to get started with -- if you don't mind using Python.
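If Python is an option, a minimal NumPy sketch of that comparison might look like this; it assumes each database plot is stored as a pair of (x, y) arrays with increasing x and that the 8 measured points come with per-point errors. All names are illustrative.

```python
import numpy as np

def chi2(x_meas, y_meas, sigma, x_db, y_db):
    """Chi-squared between the 8 measured points and one database plot.

    The database plot is linearly interpolated at the measured x positions,
    so no analytic formula for either curve is needed.
    """
    y_interp = np.interp(x_meas, x_db, y_db)
    return np.sum(((y_meas - y_interp) / sigma) ** 2)

def best_category(x_meas, y_meas, sigma, database):
    """database: dict mapping category name -> list of (x_db, y_db) plots.

    Returns the category containing the plot with the smallest chi-squared.
    """
    best = None
    for category, plots in database.items():
        for x_db, y_db in plots:
            value = chi2(x_meas, y_meas, sigma, x_db, y_db)
            if best is None or value < best[0]:
                best = (value, category)
    return best[1]
```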
One major problem you could have would be that the two plots are not independent. Wikipedia suggests McNemar's test in this case.
Another problem you could have is that you don't have much information in your test plot, so your results will be affected greatly by statistical fluctuations. In other words, if you only have 8 test points and two plots match, how will you know if the underlying functions are really the same, or if the 8 points simply jumped around (inside their error bars) in such a way that it looks like the plot from the database -- purely by chance! ... I'm afraid you won't really know. So the plots that test well will include false positives (low purity), and some of the plots that don't happen to test well were probably actually good matches (low efficiency).
To solve that, you would need to either use a test plot with more points or else bring in other information. If you can throw away plots from the database that you know can't match for other reasons, that will help a lot.