Conceptual Question Regarding the Yolo Object Detection Algorithm - computer-vision

My understanding is that the motivation for Anchor Boxes (in the Yolo v2 algorithm) is that in the first version of Yolo (Yolo v1) it is not possible to detect multiple objects in the same grid box. I don't understand why this is the case.
Also, the original paper by the authors (Yolo v1) has the following quote:
"Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts."
Doesn't this indicate that a grid cell can recognize more than one object? In their paper, they take B as 2. Why not take B as some arbitrarily higher number, say 10?
Second question: how are the Anchor Box dimensions tied to the Bounding Box dimensions, for detecting a particular object? Some websites say that the Anchor Box defines a shape only, and others say that it defines a shape and a size. In either case, how is the Anchor Box tied to the Bounding Box?
Thanks,
Sandeep

You're right that YOLOv1 predicts multiple (B) bounding boxes per grid cell, but these are not assigned to ground truths in an effective or systematic way, and therefore the inferred bounding boxes are not accurate enough.
As you can read in blog posts around the internet, an anchor (or default) box is a box in the original image that corresponds to a specific cell in a specific feature map and is assigned a specific aspect ratio and scale.
The scale is usually dictated by the feature map (deeper feature map -> larger anchor scale), while the aspect ratios vary, e.g. {1:1, 1:2, 2:1} or {1:1, 1:2, 2:1, 1:3, 3:1}.
The scale and aspect ratio together dictate a specific shape, and this shape, positioned according to the current cell's location in the feature map, is compared to the ground-truth bounding boxes in the original image.
Different papers have different assignment schemes, but it usually goes like this: (1) if the IoU of the anchor (on the original image) with a ground truth is over some threshold (e.g. 0.5), then this is a positive assignment to the anchor; (2) if it's under some lower threshold (e.g. 0.1), then it's a negative assignment; and (3) anchors that fall between the two thresholds are ignored (in the loss computation).
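To make the assignment scheme concrete, here is a minimal sketch (my own illustration, not code from any particular paper) of IoU-based anchor assignment with the example thresholds above; the (x1, y1, x2, y2) box format and the helper names are assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) in image coordinates."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def assign_anchor(anchor, gt_boxes, pos_thresh=0.5, neg_thresh=0.1):
    """Return 'positive', 'negative', or 'ignore' for one anchor."""
    best_iou = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best_iou >= pos_thresh:
        return "positive"   # the anchor is made responsible for a ground truth
    if best_iou < neg_thresh:
        return "negative"   # pure background
    return "ignore"         # in the gap between thresholds: excluded from the loss
```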
This way, an anchor is in fact like a "detector head" responsible for the specific cases that are most similar to it shape-wise. It is therefore responsible for detecting objects with a shape similar to its own, and it predicts both a confidence for each class and bounding box parameters relative to itself, i.e. how much to modify the anchor's height, width, and center (along both axes) to obtain the correct bounding box.
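As an illustration of "parameters relative to the anchor", here is a sketch of the YOLOv2-style decoding, where the network outputs (tx, ty, tw, th) for a given anchor; the function and variable names are mine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h):
    """Turn raw network outputs into a box, relative to a grid cell and an anchor.

    (cell_x, cell_y) is the top-left corner of the cell (in cell units),
    (anchor_w, anchor_h) is the anchor's prior width/height.
    """
    bx = cell_x + sigmoid(tx)      # center x, kept inside the cell
    by = cell_y + sigmoid(ty)      # center y, kept inside the cell
    bw = anchor_w * np.exp(tw)     # width as a multiplicative change of the anchor
    bh = anchor_h * np.exp(th)     # height as a multiplicative change of the anchor
    return bx, by, bw, bh
```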
Because of this assignment scheme, which distributes the responsibility effectively between the different anchors, the bounding box prediction is more accurate.
Another downside of YOLOv1's scheme is that it decouples bounding box prediction from classification. On one hand this saves computation, but on the other hand the classification is done at the level of the grid cell, so all B bounding box candidates of a cell share the same class prediction. This means, for example, that if there are multiple objects of different classes with the same center (e.g. a person holding a cat), then the classification of all but at most one of them will be wrong. Note that it is theoretically possible that predictions from adjacent grid cells will compensate for the wrong classification, but this is not guaranteed, in particular because in YOLOv1's scheme the object's center is the assignment criterion.

Related

KLT-Tracker: How to avoid losing detected person on recalibration

Reading from the ground truth, I have an initial bounding box. I then calculate a foreground mask and use cv2.goodFeaturesToTrack to get points I can follow using cv2.calcOpticalFlowPyrLK. I calculate the bounding box by taking the rectangle spanned by the outermost points (roughly as described in "How to efficiently find the bounding box of a collection of points?").
However, every now and then I need to recalculate the goodFeaturesToTrack to avoid the person "losing" all the points over time.
Whenever I recalculate, points may land on other people if they stand within the bounding box of the person to track, and they will then be followed too. Because of that, my bounding box fails to be of any use after such a recalculation. What are some methods to avoid this behavior?
I am looking for resources and general explanations and not specific implementations.
Ideas I had
Take the ratio of the previous bounding box size to the current bounding box size into account and ignore the update if the size changes too much.
Take the ratio of the previous foreground mask's white-fullness to the current foreground mask's white-fullness into account. Do not recalculate the bounding box if the foreground mask is too full; other people are probably crossing the box.
Calculate a general movement vector for the bounding box from the median of all points calculated using optical flow. Alter the bounding box only within some relation to the vector to avoid rapid changes of the bounding box.
Filter the found good features to track points using some additional metric.
In general, I guess I am looking for a method that bases the new goodFeaturesToTrack more strongly on the previous goodFeaturesToTrack, or on the points derived from them via optical flow.
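One way to do that (a sketch under the assumption that you keep the previously tracked points around; the radius and corner-detection parameters are arbitrary) is to restrict cv2.goodFeaturesToTrack to a mask built from small disks around the previous points, so re-detection can only happen near the person you are already tracking:

```python
import cv2
import numpy as np

def reseed_near_previous(gray, prev_pts, radius=15, max_corners=200):
    """Re-detect features only inside disks around the previously tracked points."""
    mask = np.zeros(gray.shape, dtype=np.uint8)
    for x, y in prev_pts.reshape(-1, 2):
        cv2.circle(mask, (int(x), int(y)), radius, 255, -1)   # filled disk around each old point
    return cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                   qualityLevel=0.01, minDistance=5, mask=mask)
```

You could also intersect this mask with your foreground mask, so that re-seeded points never land on the background or on people outside the tracked region.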

What's the difference between "BB regression algorithms used in R-CNN variants" vs "BB in YOLO" localization techniques?

Question:
What's the difference between the bounding box (BB) produced by "BB regression algorithms in region-based object detectors" vs the "bounding box in single-shot detectors"? Can they be used interchangeably, and if not, why?
While trying to understand variants of the R-CNN and YOLO algorithms for object detection, I came across two major techniques for performing object detection, i.e. region-based (R-CNN) and sliding-window based (YOLO).
Both use different variants (from complicated to simple) within their regimes, but in the end they are just localizing objects in the image using bounding boxes. Below, I focus on localization (assuming classification is happening), since that is more relevant to the question asked, and briefly explain my understanding:
Region-based:
Here, we let the neural network predict continuous variables (the BB coordinates) and refer to that as regression.
The regression that is defined (which is not linear at all) is just a CNN or one of its variants (all layers are differentiable); its outputs are four values (r, c, h, w), where (r, c) specify the position of the corner and (h, w) the height and width of the BB.
In order to train this NN, a smooth L1 loss is used to learn the precise BB by penalizing the network when its outputs are very different from the labeled (r, c, h, w) in the training set.
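For reference, the smooth L1 loss is quadratic for small errors and linear for large ones, so outliers do not dominate the gradient the way they would with plain L2; a minimal NumPy sketch (function and variable names are mine):

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1 loss, summed over the box coordinates (r, c, h, w)."""
    diff = np.abs(pred - target)
    per_coord = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return per_coord.sum()
```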
Sliding-window (convolutionally implemented) based:
First, we divide the image into, say, 19x19 grid cells.
An object is assigned to a grid cell by taking the object's midpoint and assigning the object to whichever single grid cell contains that midpoint. So each object, even if it spans multiple grid cells, is assigned to only one of the 19x19 grid cells.
Now, you take the coordinates of this grid cell and calculate the precise BB (bx, by, bh, bw) for that object, where
(bx, by, bh, bw) are relative to the grid cell: (bx, by) is the center point and (bh, bw) are the height and width of the precise BB, specified as fractions of the grid cell's dimensions, so bh and bw can be > 1.
There are multiple ways of calculating the precise BB specified in the paper.
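As a sketch of what "relative to the grid cell" means in practice (my own helper, not something from the paper), converting (bx, by, bh, bw) of a cell at (row, col) in an SxS grid back to absolute pixel coordinates could look like:

```python
def to_absolute(bx, by, bh, bw, row, col, S, img_w, img_h):
    """(bx, by) in [0, 1] within the cell; (bh, bw) in units of the cell size (can be > 1)."""
    cell_w, cell_h = img_w / S, img_h / S
    cx = (col + bx) * cell_w            # absolute center x
    cy = (row + by) * cell_h            # absolute center y
    w, h = bw * cell_w, bh * cell_h     # absolute width / height
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2   # (x1, y1, x2, y2)
```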
Both Algorithms:
output precise bounding boxes;
work in a supervised learning setting, using a labeled dataset where the labels are bounding boxes (manually marked by an annotator using tools like labelImg), stored for each image in a JSON/XML file.
I am trying to understand the two localization techniques on a more abstract level (while also having an in-depth idea of both), to get more clarity on:
in what sense are they different?
why were both created, i.e. what are the failure/success points of one relative to the other?
can they be used interchangeably, and if not, why?
Please feel free to correct me if I am wrong somewhere; feedback is highly appreciated! Citing any particular section of a research paper would be even more rewarding!
The essential difference is that two-stage, Faster R-CNN-like architectures are more accurate, while single-stage, YOLO/SSD-like architectures are faster.
In two-stage architectures, the first stage is usually region proposal, while the second stage performs classification and more accurate localization. You can think of the first stage as similar to the single-stage architectures, where the difference is that the region proposal only separates "object" from "background", while a single-stage detector distinguishes between all object classes. More explicitly, in the first stage, also in a sliding-window-like fashion, an RPN says whether an object is present or not, and if there is, it roughly gives the region (bounding box) in which it lies. This region is used by the second stage for classification and bounding box regression (for better localization), by first pooling the relevant features from the proposed region and then passing them through a Fast R-CNN-like head (which does the classification + regression).
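If you just want to see the two-stage pipeline run end to end, here is a minimal sketch using an off-the-shelf Faster R-CNN from torchvision (this is my illustration, not part of the original answer; it assumes a recent torchvision installation and uses a dummy image tensor):

```python
import torch
import torchvision

# Pretrained two-stage detector: backbone + RPN (stage 1) + RoI heads (stage 2).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = torch.rand(3, 480, 640)   # dummy 3xHxW image in [0, 1]; replace with a real image tensor

with torch.no_grad():
    out = model([img])[0]       # internally: RPN proposals -> RoI pooling -> classification + box regression

print(out["boxes"].shape, out["labels"].shape, out["scores"].shape)
```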
Regarding your question about interchanging between them - why would you want to do so? Usually you would choose an architecture by your most pressing needs (e.g. latency/power/accuracy), and you wouldn't want to interchange between them unless there's some sophisticated idea which will help you somehow.

How to classify edge-images which belong to two highly changeable but distinguishable classes according to edges (contours) "curviness"

Introduction
I am working on a project in OpenCV in C++, and I am trying to classify a small image (a frame from a video) containing many edges into one of two groups. I would also like to retain information about how close the image is to class A or to class B, because this does not seem to be a simple binary problem - sometimes the image is a mixture of classes A and B.
Class A can be roughly determined by existence of curvy/smooth edges, similar to arches or parts of elliptic structures, which often are oriented into some kind of center (like in a tunnel).
The other class, class B, is usually very chaotic and the edges in such an image are definitely less curvy; they are often winding and usually don't have any kind of "center of attention".
Classes:
The images from both classes are at the following links:
Group A
Group B
Previous approaches and ideas
I tried to separate each sufficiently long contour and then calculate some kind of curvature/curviness coefficient - basically, I downsampled the contour (by a factor of 10), then traversed along the new contour and calculated the average absolute angle between the two segments formed by every 3 consecutive points. Based on this value I determined whether the current contour is "curvy" or not, and from that I calculated a few features (a sketch of this measure follows the list below):
total length of curvy contours / total length of all contours
number of curvy contours / number of all contours
etc.
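For concreteness, here is a sketch of the curviness coefficient described above, written in Python/NumPy rather than C++ for brevity; the downsampling and angle averaging are as I understood them from the description, so treat the details as assumptions.

```python
import numpy as np

def curviness(contour, step=10):
    """Average absolute turning angle (radians) along a downsampled contour."""
    pts = contour.reshape(-1, 2).astype(np.float32)[::step]   # downsample the contour
    if len(pts) < 3:
        return 0.0
    v1 = pts[1:-1] - pts[:-2]                                  # segment before each middle point
    v2 = pts[2:] - pts[1:-1]                                   # segment after each middle point
    ang = np.arctan2(v2[:, 1], v2[:, 0]) - np.arctan2(v1[:, 1], v1[:, 0])
    ang = np.abs((ang + np.pi) % (2 * np.pi) - np.pi)          # wrap to [0, pi]
    return float(np.mean(ang))
```

A contour from cv2.findContours can be passed in directly; whether a contour counts as "curvy" is then a threshold on this value.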
However, the curvature calculation itself does not seem to work very robustly (in one frame a contour is considered curvy, in the next the same contour with a slightly changed shape is not, etc.), and the threshold that determines which contour is curvy (based on the "average curviness of a single contour") is difficult to set properly. Also, such an approach in no way takes into account the specific "shape" of class A, and therefore the classification results are very poor.
I was thinking about some kind of ellipse fitting, but as you can see, class A is more like a group of arches than an actual ellipse or circle.
I was reading about some ways of comparing edge maps, such as Hausdorff matching, but it does not seem to be very helpful in my case. It is also important that I keep the algorithm simple, because it has to work in real time and it is only a part of a bigger piece of software.
Finally, my question is:
Do you have any ideas of any other, better features and calculations that I can use to describe and then classify such edges/images? Is there a robust solution to describe my classes?

Finding Circle Edges :

Here are the two sample images that I have posted.
I need to find the edges of the circle.
Is it possible to develop one generic circle-detection algorithm that could find all possible circles in all scenarios? For example:
1. The circle may be a different color (white, black, gray, red)
2. The background color may be different
3. The circle may differ in size
http://postimage.org/image/tddhvs8c5/
http://postimage.org/image/8kdxqiiyb/
Please suggest some ideas for an algorithm that would work on the circles above.
Sounds like a job for the Hough circle transform.
I have not used it myself so far, but it is included in OpenCV. Among other parameters, you can give it a minimum and maximum radius.
Here are links to documentation and a tutorial.
I'd imagine your second example picture will be very hard to detect, though.
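For reference, a minimal OpenCV (Python) sketch of the Hough circle transform; the file name is hypothetical and the parameter values are guesses that you would need to tune per image:

```python
import cv2
import numpy as np

img = cv2.imread("circle.png")                      # hypothetical file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.medianBlur(gray, 5)                      # reduce noise before the transform

circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1.2, minDist=50,
                           param1=100, param2=30, minRadius=10, maxRadius=200)
if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        cv2.circle(img, (x, y), r, (0, 255, 0), 2)  # draw each detected circle
```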
You could apply an edge detection transformation to both images.
Here is what I did in Paint.NET using the outline effect:
You could test edge detect too but that requires more contrast in the images.
Another thing to take into consideration is what exactly it is that you want to detect: in the first image, do you want to detect the white ring or the disc inside it? In the second image, do you want to detect all the circles (there are many tiny ones) or just the big one(s)? These requirements will influence which transformation to use and how to initialize it.
After transforming the images into versions that 'highlight' the circles you'll need an algorithm to find them.
Again, there are more options than just one. Here is a paper describing an algorithm.
Searching the web for "image processing circle recognition" gives lots of results.
I think you will have to use a couple of different feature calculations that can be used for segmentation. In the first picture the circle is recognizable by intensity alone, so that one is easy. In the second picture it is mostly the texture that differentiates the circle edge; in that case a feature image based on some kind of texture filter will be needed - calculating the local variance, for instance, will result in a scalar image that can segment out the circle. If there are other features that define the circle in other scenarios (different colors for background/foreground etc.), you might need other explicit filters that give a scalar difference for those cases.
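A sketch of the local-variance feature image mentioned above (box-filter based; the kernel size is an arbitrary choice):

```python
import cv2
import numpy as np

def local_variance(gray, ksize=9):
    """Per-pixel variance in a ksize x ksize neighbourhood: E[x^2] - (E[x])^2."""
    f = gray.astype(np.float32)
    mean = cv2.blur(f, (ksize, ksize))
    mean_sq = cv2.blur(f * f, (ksize, ksize))
    return np.clip(mean_sq - mean * mean, 0, None)   # clip tiny negatives from rounding
```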
When you have scalar images where the circles stand out you can use the circular Hough transform to find the circle. Either run it for different circle sizes or modify it to detect a range of sizes.
If you know that there will be only one circle and you know the kind of noise that will be present (vertical/horizontal lines etc) an alternative approach is to design a more specific algorithm e.g. filter out the noise and find center of gravity etc.
Answer to comment:
The idea is to separate the algorithm into independent stages. I do not know how your specific algorithm works, but presumably it could take a binary or grayscale image where high values mean a pixel is part of the circle and low values mean it is not; the present algorithm also needs to give some kind of confidence value for the circle it finds. This present algorithm would then represent some stage(s) at the end of the complete pipeline.
You will then have to add a first stage, which generates feature images for every kind of input you want to handle. For the two examples it should suffice to have one intensity image (simply grayscale) and one image where each pixel represents the local variance. In the color case, do a color transform and use the hue value, perhaps?
For every input, feed all feature images to the later stage and use the confidence value to select the most likely candidate. If there are other unknowns that your algorithm needs as input parameters (circle size etc.), just iterate over the possible values and make sure your later stages return confidence values.

Counting objects on a grid with OpenCV

I'm relatively new to OpenCV, and I'm working on a project where I need to count the number of objects on a grid. The grid is the background of the image, and there is either an object in each space or there isn't; I need to count the number present, and I don't really know where to start. I've searched here and other places, but can't seem to find what I'm looking for. I will need to track the space numbers of the grid in the future, so I will also eventually need to know whether each grid space is occupied or empty. I'm not going so far as to ask for a coded example, but does anybody know of any sources or tutorials for accomplishing this task or one similar to it? Thanks for your help!
Further Details: images will come from a stable-mounted camera, objects are of relatively uniform shape, but varying size and color.
I would first answer a few questions:
Will an object be completely enclosed in a grid cell? Or can it be placed on top of a grid line? (In other words, will the object hide a line from the camera?)
Will more than one object be in one cell?
Can an object occupy more than one cell? (closely related to question 1)
Given reasonable answers to those questions, I believe the problem can be broken into two parts: first, identify the centers of each grid space. To count objects, you can then sample that region to see if anything "not background" is there.
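Since the camera is stable-mounted, one simple way to do that sampling (a sketch, assuming you can capture one reference image of the empty grid; the threshold is arbitrary) is to compare each cell's patch against the same patch in the empty reference:

```python
import numpy as np

def cell_occupied(frame_gray, empty_gray, cell_rect, thresh=20.0):
    """True if the cell region differs enough from the empty-grid reference."""
    x, y, w, h = cell_rect
    patch = frame_gray[y:y + h, x:x + w].astype(np.float32)
    ref = empty_gray[y:y + h, x:x + w].astype(np.float32)
    return float(np.mean(np.abs(patch - ref))) > thresh
```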
You can then assume that a grid space is defined by four strong, regularly-placed, corner features. (For the sake of discussion, I'll assume you've performed the initial image preparation as needed: histogram equalization, gaussian blur for noise reduction, etc.) From there, you might try some of OpenCV's methods for finding corners (Harris corner detector, cvGoodFeaturesToTrack, etc). It's likely that you can borrow some of the techniques found in OpenCV's square finding example (samples/c/square.c). For this task, it's probably sufficient to assume that the grid center is just the centroid of each set of "adjacent" (or sufficiently near) corners.
Alternatively, you might use the Hough transform to identify the principal horizontal and vertical lines in the image. You can then determine the intersection points to identify the extents of each grid cell. This implementation might be more challenging since inferring structure (or adjacency) from "nearby" vertices in order to find a grid center seems more difficult.
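A rough sketch of the Hough-lines variant (the Canny and accumulator thresholds are guesses; each intersection is found by solving the two rho/theta line equations):

```python
import cv2
import numpy as np

def grid_line_intersections(gray):
    """Find near-horizontal/vertical Hough lines and return their intersection points."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)
    if lines is None:
        return []
    horiz, vert = [], []
    for rho, theta in lines[:, 0]:
        # theta near 0 or pi -> vertical line in the image; theta near pi/2 -> horizontal
        (vert if abs(np.sin(theta)) < 0.3 else horiz).append((rho, theta))
    points = []
    for rho1, t1 in horiz:
        for rho2, t2 in vert:
            A = np.array([[np.cos(t1), np.sin(t1)], [np.cos(t2), np.sin(t2)]])
            x, y = np.linalg.solve(A, [rho1, rho2])   # solve x*cos(t) + y*sin(t) = rho
            points.append((x, y))
    return points
```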