How to interpret anchor boxes in Yolo or R-CNN? - computer-vision

Algorithms like YOLO or R-CNN use the concept of anchor boxes for predicting objects. https://pjreddie.com/darknet/yolo/
The anchor boxes are trained on a specific dataset; the set for the COCO dataset is:
anchors = 0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828
However, I don't understand how to interpret these anchor boxes. What does a pair of values such as (0.57273, 0.677385) mean?

In the original YOLO (YOLOv1), predictions were made without any prior assumption about the shape of the target objects. Say the network tries to detect humans. We know that, generally, a human fits a tall vertical rectangle rather than a square. However, the original YOLO treated rectangular and square boxes as equally likely.
This is not efficient and might slow down prediction.
So in YOLOv2 we build some assumptions about object shapes into the model. These are the anchor boxes. They are usually fed to the network as a flat list of numbers, which is a series of (width, height) pairs:
anchors = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828]
In the above example, (0.57273, 0.677385) represents a single anchor box, whose two elements are width and height respectively. That is, this list defines 5 different anchor boxes. Note that these values are relative to the output grid size. For example, YOLOv2 outputs a 13x13 feature map, and you can get the absolute values by multiplying the anchor values by 13.
Using anchor boxes made prediction a little faster, but accuracy might decrease slightly. The YOLOv2 paper says:
Using anchor boxes we get a small decrease in accuracy. YOLO only predicts 98 boxes per image but with anchor boxes our model predicts more than a thousand. Without anchor boxes our intermediate model gets 69.5 mAP with a recall of 81%. With anchor boxes our model gets 69.2 mAP with a recall of 88%. Even though the mAP decreases, the increase in recall means that our model has more room to improve.

That's what I understood: YOLO divides a 416x416 image into a 13x13 grid, so each grid cell is 32 pixels wide. The anchor box sizes are relative to the size of a grid cell.
So an anchor box with width 0.57273 and height 0.677385 (in grid-cell units) actually has a size of
w = 0.57273 * 32 = 18.3 pixels
h = 0.677385 * 32 = 21.67 pixels
If you convert all those values, you can then plot them on a 416x416 image to visualise them.
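A minimal Python sketch of that conversion, assuming the 416x416 input and 13x13 grid above; using matplotlib to draw the boxes is just one option, and centring them in the image is an arbitrary choice made purely for visualisation:

import matplotlib.pyplot as plt
import matplotlib.patches as patches

# COCO anchors for YOLOv2, as flat (width, height) pairs in grid-cell units
anchors = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843,
           5.47434, 7.88282, 3.52778, 9.77052, 9.16828]
CELL = 32   # 416 / 13 = 32 pixels per grid cell
IMG = 416

fig, ax = plt.subplots()
ax.set_xlim(0, IMG)
ax.set_ylim(0, IMG)
for w_cell, h_cell in zip(anchors[0::2], anchors[1::2]):
    w_px, h_px = w_cell * CELL, h_cell * CELL            # grid-cell units -> pixels
    ax.add_patch(patches.Rectangle((IMG / 2 - w_px / 2, IMG / 2 - h_px / 2),
                                   w_px, h_px, fill=False))
    print(f"anchor: {w_px:.1f} x {h_px:.1f} px")
plt.show()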

Related

Disparity Map Block Matching

I am writing a disparity matching algorithm using block matching, but I am not sure how to find the corresponding pixel values in the secondary image.
Given a square window of some size, what techniques exist to find the corresponding pixels? Do I need to use feature matching algorithms or is there a simpler method, such as summing the pixel values and determining whether they are within some threshold, or perhaps converting the pixel values to binary strings where the values are either greater than or less than the center pixel?
I'm going to assume you're talking about Stereo Disparity, in which case you will likely want to use a simple Sum of Absolute Differences (read that wiki article before you continue here). You should also read this tutorial by Chris McCormick before you read more here.
side note: SAD is not the only method, but it's really common and should solve your problem.
You already have the right idea. Make windows, move windows, sum pixels, find minimums. So I'll give you what I think might help:
To start:
If you have color images, first you will want to convert them to black and white. In Python you might use a simple per-pixel function like this, where x is a pixel containing RGB values:
def rgb_to_bw(x):
    # Weighted sum of the R, G, B channels (standard luma weights)
    return int(x[0]*0.299 + x[1]*0.587 + x[2]*0.114)
You will want this to be black and white to make the SAD easier to compute. If you're wondering why you don't lose significant information from this, you might be interested in learning what a Bayer filter is. The Bayer filter, which is typically RGGB, also explains the weighting ratios of the red, green, and blue portions of the pixel.
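If your images happen to be NumPy arrays (an assumption; the question doesn't say what library is used), you can convert a whole image at once instead of going pixel by pixel:

import numpy as np

def rgb_to_bw_image(img):
    # img: H x W x 3 RGB array; returns an H x W grayscale array
    weights = np.array([0.299, 0.587, 0.114])
    return (img[..., :3] @ weights).astype(np.uint8)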
Calculating the SAD:
You already mentioned that you have a window of some size, which is exactly what you want to do. Let's say this window is n x n in size. You would also have some window in your left image WL and some window in your right image WR. The idea is to find the pair that has the smallest SAD.
So, for each left-window pixel pl at some location (x,y) in the window, you take the absolute value of its difference from the right-window pixel pr at the same (x,y). You also keep a running value, which is the sum of these absolute differences. In pseudocode:
SAD = 0
for x = 0 to n:
    for y = 0 to n:
        SAD = SAD + |pl(x, y) - pr(x, y)|
After you calculate the SAD for this pair of windows, WL and WR you will want to "slide" WR to a new location and calculate another SAD. You want to find the pair of WL and WR with the smallest SAD - which you can think of as being the most similar windows. In other words, the WL and WR with the smallest SAD are "matched". When you have the minimum SAD for the current WL you will "slide" WL and repeat.
Disparity is calculated by the distance between the matched WL and WR. For visualization, you can scale this distance to be between 0-255 and output that to another image. I posted 3 images below to show you this.
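A minimal Python/NumPy sketch of the whole procedure, assuming rectified grayscale images given as 2D arrays; the window size, the maximum disparity, and searching only to the left of x (typical for a left reference image) are assumptions, not details from the question:

import numpy as np

def disparity_sad(left, right, window=7, max_disp=64):
    # Naive SAD block matching; left/right are 2D grayscale arrays of equal shape.
    h, w = left.shape
    half = window // 2
    disp = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            wl = left[y - half:y + half + 1, x - half:x + half + 1].astype(np.int32)
            best_sad, best_d = np.inf, 0
            # "Slide" WR and keep the offset with the smallest SAD
            for d in range(0, min(max_disp, x - half) + 1):
                wr = right[y - half:y + half + 1,
                           x - d - half:x - d + half + 1].astype(np.int32)
                sad = np.abs(wl - wr).sum()
                if sad < best_sad:
                    best_sad, best_d = sad, d
            disp[y, x] = best_d
    # Scale to 0-255 for visualisation, as described above
    return (disp / max(disp.max(), 1) * 255).astype(np.uint8)

This is far too slow for production use, but it maps one-to-one onto the window/slide/minimum description above.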
Typical Results:
Left Image:
Right Image:
Calculated Disparity (from the left image):
You can get test images here: http://vision.middlebury.edu/stereo/data/scenes2003/

How to place text in the certain position when using SceneKit?

I have tried to follow the same simple logic as for a cylinder, box, …, just by defining the position of the textNode, but it is not working.
func makeText(text3D: String, position: SCNVector3, depthOfText: CGFloat, color: NSColor, transparency: CGFloat) -> SCNNode
{
    let textTodraw = SCNText(string: text3D, extrusionDepth: depthOfText)
    textTodraw.firstMaterial?.transparency = transparency
    textTodraw.firstMaterial?.diffuse.contents = color
    let textNode = SCNNode(geometry: textTodraw)
    textNode.position = position
    return textNode
}
The default font for SCNText is 36 point Helvetica, and a "point" in font size is the same as a unit of scene space. (Well, of local space for the node containing the SCNText geometry. But unless you've set a scale factor on your node, local space units are the same as scene space units.) That means even a short label can be tens of units tall and hundreds of units wide.
It's typical to build SceneKit scenes with smaller scope — for example, simple test scenes like you might throw together in a Swift playground using the default sizes for SCNBox, SCNSphere, etc might be only 3-4 units wide. (And if you're using SceneKit with ARKit, scene units are meters, so some text in 36 "point" font is the size of a few office blocks downtown.)
Also, the anchor point for a text geometry relative to its containing node is at the lower left corner of the text. Put all this together and it's entirely possible that there are giant letters looming over the rest of your scene, hiding just out of camera view.
Note that if you try to fix this by setting a much smaller font on your SCNText, the text might get jagged and chunky. That's because the flatness property is measured relative to the point size of the text (more precisely, it's measured in a coordinate system where one unit == one point of text size). So if you choose a font size that'd be tiny by screen/print standards, you'll need to scale down the flatness accordingly to still get smooth curves in your letters.
Alternatively, you can leave font sizes and flatness alone — instead, set a scale factor on the node containing the text geometry, or set that node's pivot to a transform matrix that scales down its content. For example, if you set a scale factor of 1/72, one unit of scene space is the same as one "inch" (72 points) of text height — depending on the other sizes in your scene, that might make it a bit easier to think of font sizes the way you do in 2D.
The fact is you generally just use "small numbers" for font sizes in SceneKit.
In 3D you always use real meters. A humanoid robot must be about "2" units tall, a car is about "3" units long and so on.
A very typical size for the font is about "0.1".
Note that the flatness value is SMALL, usually about one hundredth the size of the font. (Which is obvious, it's how long the line segments are.)
Typical:
t.font = UIFont(name: "Blah", size: 0.10)
t.flatness = 0.001
Set the flatness to about 1/4 the size of the font (hence, 0.025 in the example) to understand what "flatness" is.
I would never change the scale of the type node, that's a bad idea for many reasons. There's absolutely no reason to do so, and it makes it very difficult to genuinely set the flatness appropriately.
But note ...
That being said, across different platforms and versions, SCNText() often does a basically bad job of drawing text, and it can fall apart at small numbers. So yes, you may indeed have to scale in practice, if the text construction is poor at (very) small values :/

Crop image by detecting a specific large object or blob in image?

Please can anyone help me resolve my issue? I am working on an image-processing project and I am stuck at one point. I got this image after some processing, and for further processing I need to crop out or detect only the deer and remove the other portions of the image.
This is my Initial image:
And my result should be something like this:
It would be even better if I could get only the single biggest blob in the image and save it as an image.
It looks like the deer in your image is pretty much connected and closed. What we can do is use regionprops to find all of the bounding boxes in your image. Once we do this, we can find the bounding box that gives the largest area, which will presumably be your deer. Once we find this bounding box, we can crop your image and focus on the deer entirely. As such, assuming your image is stored in im, do this:
im = im2bw(im); %// Just in case...
bound = regionprops(im, 'BoundingBox', 'Area');
%// Obtaining Bounding Box co-ordinates
bboxes = reshape([bound.BoundingBox], 4, []).';
%// Obtain the areas within each bounding box
areas = [bound.Area].';
%// Figure out which bounding box has the maximum area
[~,maxInd] = max(areas);
%// Obtain this bounding box
%// Ensure all floating point is removed
finalBB = floor(bboxes(maxInd,:));
%// Crop the image
out = im(finalBB(2):finalBB(2)+finalBB(4), finalBB(1):finalBB(1)+finalBB(3));
%// Show the images
figure;
subplot(1,2,1);
imshow(im);
subplot(1,2,2);
imshow(out);
Let's go through this code slowly. We first convert the image to binary just in case. Your image may be an RGB image with intensities of 0 or 255... I can't say for sure, so let's just do a binary conversion just in case. We then call regionprops with the BoundingBox property to find every bounding box of every unique object in the image. This bounding box is the minimum spanning bounding box to ensure that the object is contained within it. Each bounding box is a 4 element array that is structured like so:
[x y w h]
Each bounding box is delineated by its origin at the top left corner of the box, denoted as x and y, where x is the horizontal co-ordinate while y is the vertical co-ordinate. x increases positively from left to right, while y increases positively from top to bottom. w,h are the width and height of the bounding box. Because these points are in a structure, I extract them and place them into a single 1D vector, then reshape it so that it becomes a M x 4 matrix. Bear in mind that this is the only way that I know of that can extract values in arrays for each structuring element efficiently without any for loops. This will facilitate our searching to be quicker. I have also done the same for the Area property. For each bounding box we have in our image, we also have the attribute of the total area encapsulated within the bounding box.
Thanks to @Shai for the spot, we can't simply use the bounding box co-ordinates to determine whether or not something has the biggest area within it as we could have a thin diagonal line that could drive the bounding box co-ordinates to be higher. As such, we also need to rely on the total area that the object takes up within the bounding box as well. Simply put, it's just the sum of all of the pixels that are contained within the object.
Therefore, we search the entire area vector that we have created to see which has the maximum area. This corresponds to your deer. Once we find this location, extract the bounding box locations, then use this to crop the image. Bear in mind that the bounding box values may have floating point numbers. As the image co-ordinates are in integer based, we need to remove these floating point values before we decide to crop. I decided to use floor. I then write code that displays the original image, with the cropped result.
Bear in mind that this will only work if there is just one object in the image. If you want to find multiple objects, check bwboundaries in MATLAB. Otherwise, I believe this should get you started.
Just for completeness, we get the following result:
While object detection is a very general CV task, you can start with something simple if the assumptions are strong enough and you can guarantee that the input images will contain a single prominent white blob well described by a bounding box.
One very simple idea is to subdivide the picture into 3x3=9 patches, calculate statistics for each patch, and compute some objective function. In the simplest case you just do a grid search over various partitions and select the one with the highest objective value. Here's an illustration:
If every line is a parameter (x_1, x_2, y_1 and y_2), then you want to optimize the objective function F(x_1, x_2, y_1, y_2), either by
grid search (try all x_i, y_i in some quantization steps)
genetic-algorithm-like random search
gradient descent (move every parameter in that direction that optimizes the target function)
The target function F can be defined over statistics of the patches, e.g. like this:
F(9 patches) {
    brightest_patch = max(patches)
    others = patches \ brightest_patch
    score = brightness(brightest_patch) - 1/8 * brightness(others)
    return score
}
or anything else that incorporates relevant statistics of the patches as well as their size. This also allows you to incorporate "prior knowledge": if you expect the blob to appear in the middle of the image, then you can define a "regularization" term that will penalize F if the parameters x_i and y_i deviate from the expected position too much.
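A small NumPy sketch of the grid-search variant, assuming a 2D grayscale array; using mean brightness and scoring the candidate box against everything outside it is a simplification of the nine-patch scoring above, and the step size is arbitrary:

import numpy as np

def score(img, x1, x2, y1, y2):
    # F: brightness inside the candidate box minus brightness outside it
    inner = img[y1:y2, x1:x2]
    mask = np.ones(img.shape, dtype=bool)
    mask[y1:y2, x1:x2] = False
    outer = img[mask]
    if inner.size == 0 or outer.size == 0:
        return -np.inf
    return inner.mean() - outer.mean()

def grid_search(img, step=16):
    # Try all (x1, x2, y1, y2) on a coarse grid and keep the best-scoring box
    h, w = img.shape
    best, best_box = -np.inf, None
    xs, ys = range(0, w + 1, step), range(0, h + 1, step)
    for x1 in xs:
        for x2 in xs:
            if x2 <= x1:
                continue
            for y1 in ys:
                for y2 in ys:
                    if y2 <= y1:
                        continue
                    s = score(img, x1, x2, y1, y2)
                    if s > best:
                        best, best_box = s, (x1, x2, y1, y2)
    return best_box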
Thanks to all who answered and commented on my question. With your help I got my exact solution. I am posting my final code and result for others.
img = im2bw(imread('deer.png'));
[L, num] = bwlabel(img, 4);
%%// Get biggest blob or object
count_pixels_per_obj = sum(bsxfun(@eq,L(:),1:num));
[~,ind] = max(count_pixels_per_obj);
biggest_blob = (L==ind);
%%// crop only deer
bound = regionprops(biggest_blob, 'BoundingBox');
%// Obtaining Bounding Box co-ordinates
bboxes = reshape([bound.BoundingBox], 4, []).';
%// Obtain this bounding box
%// Ensure all floating point is removed
finalBB = floor(bboxes);
out = biggest_blob(finalBB(2):finalBB(2)+finalBB(4),finalBB(1):finalBB(1)+finalBB(3));
%%// Show images
figure;
imshow(out);

find the same area between 2 images

I want to merge 2 images. How can I remove the area that the 2 images have in common?
Can you suggest an algorithm to solve this problem? Thanks.
The two images are screenshots. They have the same width, and image 1 is always above image 2.
When two images have the same width and there is no X offset at the left side, this shouldn't be too difficult.
You should create two vectors of integers and store the CRC of each pixel row in the corresponding vector element. After doing this for both pictures, you look up the CRC of the first line of the lower image in the first vector. This gives the offset in the upper picture. Then you check that all following CRCs from both pictures are identical. If not, you look up the next occurrence of that initial CRC in the upper image and try again.
After checking that the CRCs of both pictures match when you apply the offset, you can use the bitblt function of your graphics library to build the composite picture.
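A minimal Python sketch of that row-hashing idea, using zlib.crc32 over the raw row bytes; representing the images as NumPy arrays and stacking with np.vstack are assumptions, not requirements of the approach:

import zlib
import numpy as np

def row_crcs(img):
    # img: H x W (or H x W x C) array; returns one CRC per pixel row
    return [zlib.crc32(row.tobytes()) for row in img]

def merge_vertical(upper, lower):
    up_crc, lo_crc = row_crcs(upper), row_crcs(lower)
    # Find where the first row of the lower image appears in the upper image,
    # then verify that every following row matches as well.
    for offset in range(len(up_crc)):
        if up_crc[offset] != lo_crc[0]:
            continue
        n = len(up_crc) - offset
        if up_crc[offset:] == lo_crc[:n]:
            # Rows [offset:] of the upper image overlap the lower image
            return np.vstack([upper[:offset], lower])
    # No overlap found: just stack the two images
    return np.vstack([upper, lower])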
I haven't come across something similar before but I think the following might work:
Convert both to grey-scale.
Enhance the contrast, the grey box might become white for example and the text would become more black. (This is just to increase the confidence in the next step)
Apply some threshold, converting the pictures to black and white.
Afterwards, you could find the similar areas (and thus the offset of the overlap) with a good degree of confidence. To find the similar parts, you could use harper's method (which is good, but I don't know how reliable it would be without the said filtering), or you could apply some DSP operation(s) like convolution.
Hope that helps.
If your images are the same width and image 1 is always on top, I don't see how hard it could be.
Just store the bytes of the last line of image 1.
Then, from the first line of image 2 to the last, make this test:
If the current line of image 2 is not equal to the last line of image 1 -> continue
else -> break the loop
You then have to define a new byte container for your new image:
Just store all the lines of image 1 + all the lines of image 2 starting at (the found line + 1).
What would make you sweat here is finding the libraries to manipulate all these data structures. But after a bit of linking and documentation digging, you should be able to implement that easily.

Designing a grid overlay based on longitudes and latitudes

I'm trying to figure out the best way to approach the following:
Say I have a flat representation of the earth. I would like to create a grid that overlays this with each square on the grid corresponding to about 3 square kilometers. Each square would have a unique region id. This grid would just be stored in a database table that would have a region id and then probably the long/lat coordinates of the four corners of the region, right? Any suggestions on how to generate this table easily? I know I would first need to find out the width and height of this "flattened earth" in kms, calculate the number of regions, and then somehow assign the long/lats to each intersection of vertical/horizontal line; however, this sounds like a lot of manual work.
Secondly, once I have that grid table created, I need to design a function that takes a long/lat pair and determines which logical "region" it is in. I'm not sure how to go about this.
Any help would be appreciated.
Thanks.
Assume the Earth is a sphere with radius R = 6371 km.
Start at (lat, long) = (0, 0) deg. Around the equator, 3km corresponds to a change in longitude of
dlong = 3 / (2 * pi * R) * 360
= 0.0269796482 degrees
If we walk around the equator and put a marker every 3km, there will be about (2 * pi * R) / 3 = 13343.3912 of them. "About" because it's your decision how to handle the extra 0.3912.
From (0, 0), we walk north 3 km to (lat, long) = (0.0269796482, 0). We will walk around the Earth again on a path that is locally parallel to the first path we walked. Because it is a little closer to the N Pole, the radius of this circle is a bit smaller than that of the first circle we walked. Let's use lower case r for this radius:
r = R * cos(lat)
After one 3 km row the shrinkage is tiny (cos(0.027 deg) is about 0.9999999, so r is still about 6 371 km), but it compounds as we keep adding rows. By the time we reach latitude 1.546 deg, roughly 57 rows (about 172 km) north of the equator,
r = 6371 * cos(1.546)
= 6 368.68141 km
We calculate dlong again using this smaller radius,
dlong = 3 / (2 * pi * r) * 360
= 0.0269894704 deg
We put down this row of flags. This time there are about (2 * pi * r) / 3 = 13 338.5352 of them. There were 13,343 at the equator, but now there are 13,338. What's that? Five fewer.
How do we draw a ribbon of squares when there are five fewer corners in the top line than in the bottom one? In fact, as we kept adding rows toward the pole, we'd find that we started off with pretty good squares, but that the shapes of the regions sheared out into increasingly extreme parallelograms.
We need a different strategy that gives us the same number of corners above and below. If the lower boundary (SW-SE) is 3 km long, then the top should be a little shorter, to make a ribbon of trapeziums.
There are many ways to craft a compromise that approximates your ideal square grid. This wikipedia article on map projections that preserve a metric property, links to several dozen such strategies.
The specifics of your app may allow you to simplify things considerably, especially if you don't really need to map the entire globe.
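To see the effect numerically, here is a small Python sketch of the walk described above; the spherical Earth and the 3 km cell edge are the same assumptions as before, and the sampled rows are arbitrary:

import math

R = 6371.0       # Earth radius in km (sphere assumption, as above)
CELL_KM = 3.0    # target cell edge length

def ring_stats(lat_deg):
    # Circumference of the circle of latitude and how many 3 km cells fit on it
    r = R * math.cos(math.radians(lat_deg))
    circumference = 2 * math.pi * r
    n_cells = circumference / CELL_KM
    dlong = CELL_KM / circumference * 360   # longitude step per cell, in degrees
    return n_cells, dlong

# Walk north from the equator one 3 km row at a time
dlat = CELL_KM / (2 * math.pi * R) * 360    # ~0.02698 degrees per row
for row in (0, 57, 1000, 2000, 3000):
    lat = row * dlat
    n, dlong = ring_stats(lat)
    print(f"lat {lat:7.3f} deg: ~{n:9.1f} cells per ring, dlong = {dlong:.7f} deg")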
Microsoft has been investing in spatial data types in its SQL Server 2008 offering, which could help you out here: it has data types to represent your flattened-earth regions, operators to determine when a set of coordinates is inside a geometry, and so on. Even if you choose not to use this, consider checking out the following links. The second one in particular has a lot of good background information on the problem and a discussion of some of the industry-standard data formats for spatial data.
http://www.microsoft.com/sqlserver/2008/en/us/spatial-data.aspx
http://jasonfollas.com/blog/archive/2008/03/14/sql-server-2008-spatial-data-part-1.aspx
First, Paul is right. Unfortunately the earth is round which really complicates the heck out of this stuff.
I created a grid similar to this for a topographical mapping server many years ago. I just recorded the coordinates of the upper-left corner of each region, and I used UTM coordinates instead of lat/long. If you know that each region covers 3 square kilometers, then since UTM is based on meters, it is straightforward to do a range query to discover the right region.
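A minimal sketch of that range-query idea in Python, assuming the query point has already been converted to UTM and each region's upper-left corner is stored in the same UTM zone; the in-memory list, the example corners, and the ~1732 m edge (square root of 3 km^2) are illustrative assumptions, and in practice this would be a SQL range query against the table:

# Hypothetical regions table: (region_id, easting of upper-left, northing of upper-left)
EDGE_M = 1732   # cell edge in metres (~sqrt(3 km^2)); use your actual edge length

regions = [
    (1, 500000.0, 4649776.0),   # made-up example corners
    (2, 501732.0, 4649776.0),
    (3, 500000.0, 4648044.0),
]

def find_region(easting, northing):
    # Range query: which cell contains this UTM point?
    for region_id, e_ul, n_ul in regions:
        if e_ul <= easting < e_ul + EDGE_M and n_ul - EDGE_M < northing <= n_ul:
            return region_id
    return None

print(find_region(500100.0, 4649000.0))   # -> 1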
You do realize that, because the earth is a sphere, "3 square km" is going to be a different number of degrees near the poles than near the equator, right? And that at the top and bottom of the map your grid squares will actually represent pie-shaped parts of the world, right?
I've done something similar with my database - I've broken it up into quad cells. So what I did was divide the earth into four quarters (-180,-90)-(0,0), (-180,0)-(0,90) and so on. As I added point entities to my database, if the "cell" got more than X entries, I split the cell into 4. That means that in areas of the world with lots of point entities, I have a lot of quad cells, but in other parts of the world I have very few.
My database for the quad tree looks like:
\d areaids;
                    Table "public.areaids"
    Column    |            Type             | Modifiers
--------------+-----------------------------+-----------
 areaid       | integer                     | not null
 supercededon | timestamp without time zone |
 supercedes   | integer                     |
 numpoints    | integer                     | not null
 rectangle    | geometry                    |
Indexes:
    "areaids_pk" PRIMARY KEY, btree (areaid)
    "areaids_rect_idx" gist (rectangle)
Check constraints:
    "enforce_dims_rectangle" CHECK (ndims(rectangle) = 2)
    "enforce_geotype_rectangle" CHECK (geometrytype(rectangle) = 'POLYGON'::text OR rectangle IS NULL)
    "enforce_srid_rectangle" CHECK (srid(rectangle) = 4326)
I'm using PostGIS to help find points in a cell. If I look at a cell, I can tell if it's been split because supercededon is not null. I can find its children by looking for ones that have supercedes equal to its id. And I can dig down from top to bottom until I find the ones that cover the area I'm concerned about by looking for ones with supercededon null and whose rectangle overlaps my area of interest (using the PostGIS && operator).
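An in-memory Python sketch of the quad-cell splitting described above, without PostGIS; the MAX_POINTS threshold and the class layout are assumptions for illustration only:

MAX_POINTS = 100   # split a cell once it holds more than this many points

class QuadCell:
    def __init__(self, lon_min, lat_min, lon_max, lat_max):
        self.bounds = (lon_min, lat_min, lon_max, lat_max)
        self.points = []
        self.children = None   # set once this cell has been split (superseded)

    def insert(self, lon, lat):
        if self.children is not None:
            self._child_for(lon, lat).insert(lon, lat)
            return
        self.points.append((lon, lat))
        if len(self.points) > MAX_POINTS:
            self._split()

    def _split(self):
        lon_min, lat_min, lon_max, lat_max = self.bounds
        lon_mid, lat_mid = (lon_min + lon_max) / 2, (lat_min + lat_max) / 2
        self.children = [
            QuadCell(lon_min, lat_min, lon_mid, lat_mid),
            QuadCell(lon_mid, lat_min, lon_max, lat_mid),
            QuadCell(lon_min, lat_mid, lon_mid, lat_max),
            QuadCell(lon_mid, lat_mid, lon_max, lat_max),
        ]
        for lon, lat in self.points:       # push existing points down
            self._child_for(lon, lat).insert(lon, lat)
        self.points = []

    def _child_for(self, lon, lat):
        lon_min, lat_min, lon_max, lat_max = self.bounds
        lon_mid, lat_mid = (lon_min + lon_max) / 2, (lat_min + lat_max) / 2
        return self.children[(1 if lon >= lon_mid else 0) + (2 if lat >= lat_mid else 0)]

# The whole earth as four root quarters, as described above
roots = [QuadCell(-180, -90, 0, 0), QuadCell(-180, 0, 0, 90),
         QuadCell(0, -90, 180, 0), QuadCell(0, 0, 180, 90)]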
There's no way you'll be able to do this with rectangular cells, but I've just finished an R package dggridR which would make this easy to do using a grid of hexagonal cells. However, the 3km cell requirement might yield so many cells as to overload your machine.
You can use R to generate the grid:
install.packages('devtools')
install.packages('rgdal')
library(devtools)
devtools::install_github('r-barnes/dggridR')
library(dggridR)
library(rgdal)
#Construct a discrete global grid (geodesic) with cells of ~3 km^2
dggs <- dgconstruct(area=100000, metric=FALSE, resround='nearest')
#Get a hexagonal grid for the whole earth based on this dggs
grid <- dgearthgrid(dggs,frame=FALSE)
#Save the grid
writeOGR(grid, "grid_3km_cells.kml", "cells", "KML")
The KML file then contains the ids and edge vertex coordinates of every cell.
The grid looks a little like this:
My package is based on Kevin Sahr's DGGRID which can generate this same grid to KML directly, though you'll need to figure out how to compile it yourself.