Final Descriptor in SIFT - computer-vision

I am new to computer vision and have started learning a very popular topic in the computer vision community: SIFT. But I am confused by one implementation detail:
After a key point is detected, we have to construct 4 by 4 local histograms, which together serve as the final SIFT descriptor, right? Each local histogram collects the gradient orientations of a local neighborhood of 4 by 4 pixels. So overall we have 16 times 16 equals 256 pixels within a neighborhood around the key point, which means this neighborhood is a 16 by 16 grid of pixels.
But how exactly is this neighborhood determined? Is the neighborhood rotated according to the orientation of the key point? Are the pixels within this 256-pixel neighborhood spaced according to the scale at which the key point is detected?
Thanks in advance for any help!

First, SIFT keypoints are extracted at multiple scales. The descriptors are computed using the respective scale. So, I would not say 'pixels' since it can be ambiguous. For your question, I would like to quote the original paper (Section 6.1):
"First the image gradient magnitudes and orientations are sampled around the keypoint location, using the scale of the keypoint to select the level of Gaussian blur for the image."
"In order to achieve orientation invariance, the coordinates of the descriptor and the gradient orientations are rotated relative to the keypoint orientation."
"A Gaussian weighting function with σ equal to one half the width of the descriptor window is used to assign a weight to the magnitude of each sample point."
I hope this answers your question. Please do not hesitate to ask if something is unclear.
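To make the quoted steps concrete, here is a minimal sketch of how a single sample could be mapped into the descriptor frame. It is not taken from any SIFT implementation: the `Sample` struct, the function `toDescriptorFrame` and the `cellSize` parameter (the width of one histogram cell in pixels, which grows with the keypoint scale) are names I made up for illustration, and the window is assumed to be 4 cells wide.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type for illustration; not part of any SIFT library.
struct Sample {
    double u;       // x coordinate in the descriptor frame, in histogram cells
    double v;       // y coordinate in the descriptor frame, in histogram cells
    double weight;  // Gaussian weight to multiply the gradient magnitude by
    double angle;   // gradient orientation relative to the keypoint, in [0, 2*pi)
};

// dx, dy    : offset of the sample from the keypoint, in pixels of the image
//             blurred at the keypoint's scale
// gradAngle : gradient orientation at the sample (radians)
// kpAngle   : keypoint orientation (radians)
// cellSize  : assumed width of one histogram cell in pixels (scale dependent)
Sample toDescriptorFrame(double dx, double dy, double gradAngle,
                         double kpAngle, double cellSize) {
    assert(cellSize > 0.0);
    const double twoPi = 2.0 * std::acos(-1.0);
    const double c = std::cos(kpAngle);
    const double s = std::sin(kpAngle);

    // Rotate the offset by -kpAngle so the keypoint orientation becomes the
    // u axis, then express it in units of histogram cells.
    const double u = ( c * dx + s * dy) / cellSize;
    const double v = (-s * dx + c * dy) / cellSize;

    // The window is assumed to be 4 cells wide; sigma is half of that width.
    const double sigma = 0.5 * 4.0;
    const double w = std::exp(-(u * u + v * v) / (2.0 * sigma * sigma));

    // Gradient orientations are measured relative to the keypoint orientation.
    double a = std::fmod(gradAngle - kpAngle, twoPi);
    if (a < 0.0) a += twoPi;

    return {u, v, w, a};
}
```

So the 16 by 16 grid is not a fixed block of image pixels: both the spacing (through the scale) and the direction of the axes (through the orientation) depend on the keypoint.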

Related

Finding regions of higher numbers in a matrix

I am working on a project to detect certain objects in an aerial image, and as part of this I am trying to utilize elevation data for the image. I am working with Digital Elevation Models (DEMs), basically a matrix of elevation values. When I am trying to detect trees, for example, I want to search for tree-shaped regions that are higher than their surrounding terrain. Here is an example of a tree in a DEM heatmap:
https://i.stack.imgur.com/pIvlv.png
I want to be able to find small regions like that that are higher than their surroundings.
I am using OpenCV and GDAL for my actual image processing. Do either of those already contain techniques for what I'm trying to accomplish? If not, can you point me in the right direction? One idea I've had is to go through each pixel and calculate the rate of change relative to its surrounding pixels, which would hopefully mean that pixels with high rates of change (steep slopes) mark the edge of a raised area.
Note that the elevations will change from image to image, and this needs to work with any elevation. So the ground might be around 10 meters in one image but 20 meters in another.
Supposing you can put the DEM information into a 2D Mat where each "pixel" holds the elevation value, you can find local maxima by applying dilate and then subtracting the result from the original image: the difference is zero exactly at the pixels that are the highest in their neighbourhood.
There's a related post with code examples in: http://answers.opencv.org/question/28035/find-local-maximum-in-1d-2d-mat/
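As a rough illustration of the dilate-and-compare idea, here is a sketch in plain C++ without OpenCV, so the mechanics are visible. The 3x3 neighbourhood, the `Grid` alias, the function names and the `minRise` parameter are my own assumptions; `minRise` is there so that flat ground, which also survives dilation unchanged, is not reported as a peak.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Grid = std::vector<std::vector<float>>;

// 3x3 grey-scale dilation: every output cell is the maximum of its
// neighbourhood (cells outside the grid are ignored).
Grid dilate3x3(const Grid& dem) {
    assert(!dem.empty() && !dem[0].empty());
    const int rows = static_cast<int>(dem.size());
    const int cols = static_cast<int>(dem[0].size());
    Grid out(rows, std::vector<float>(cols));
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float m = dem[r][c];
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    const int rr = r + dr, cc = c + dc;
                    if (rr >= 0 && rr < rows && cc >= 0 && cc < cols)
                        m = std::max(m, dem[rr][cc]);
                }
            }
            out[r][c] = m;
        }
    }
    return out;
}

// A cell is marked when dilation leaves it unchanged (it is the highest in
// its neighbourhood) and it rises at least minRise above its lowest neighbour.
// minRise is a hypothetical tuning parameter, relative to the local terrain.
std::vector<std::vector<bool>> localMaxima(const Grid& dem, float minRise) {
    const Grid dilated = dilate3x3(dem);

    // Neighbourhood minimum via duality: erode(x) = -dilate(-x).
    Grid negated = dem;
    for (auto& row : negated)
        for (auto& v : row) v = -v;
    const Grid negDilated = dilate3x3(negated);

    std::vector<std::vector<bool>> mask(dem.size(),
                                        std::vector<bool>(dem[0].size(), false));
    for (std::size_t r = 0; r < dem.size(); ++r) {
        for (std::size_t c = 0; c < dem[r].size(); ++c) {
            const float localMin = -negDilated[r][c];
            mask[r][c] = (dem[r][c] == dilated[r][c]) &&
                         (dem[r][c] - localMin >= minRise);
        }
    }
    return mask;
}
```

Because the test compares each cell with its own neighbourhood, it does not matter whether the ground sits at 10 meters or 20 meters. For objects the size of a tree you would probably want a larger neighbourhood than 3x3.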

Decode a 2D circle colour barcode

I am new to OpenCV, coding in C++. I have been given the task of decoding a 2D circle barcode using an encoded array. I have got as far as centering the figure and finding the line using Hough transforms.
I need help with how to read the colours in the image. Note that each pair of adjacent blocks corresponds to a letter.
Any pointers will be highly appreciated. Thanks.
First, you need to load the image. I suspect this isn't a problem because you are already using Hough transforms on it, but:
Mat img = imread(filename);
Once the image is loaded, you can grab any of the pixels using:
Vec3b bgr = img.at<Vec3b>(y, x);  // a colour image from imread has 3 channels, stored in B, G, R order
However, what you need to do is threshold the image. As I mentioned in the comments, the image colors are either 0 or 255 for each RGB channel. This is on purpose for encoding the data in case there are image artifacts. If the channel is above a certain color value, then you will consider that it's 'on' and if below, it's 'off'.
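Threshold the image using adaptiveThreshold. Note that adaptiveThreshold only accepts single-channel 8-bit images, so you would have to split the channels first; plain threshold accepts multi-channel input. I would threshold down to binary 1 or 0. This will produce RGB triplets that are one of eight (2^3) possible combinations, from (0,0,0) to (1,1,1).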
Then you need to walk the pixels. This is where it gets interesting.
You say each pair of adjacent blocks forms a single letter. That's 2^6 or 64 different letters. The next question is: are the letters arranged in scan lines, left to right, top to bottom? If yes, then it will be important to orient the image using the crosshair in the center.
If the image is encoded radially (using polar coordinates) then things get a little trickier. You need to use cvLinearPolar (linearPolar in the C++ API, or warpPolar in newer OpenCV versions) to remap the image.
Otherwise you need to walk the whole image, stepping the size of the RGB blocks and discard any pixels whose distance from the center is greater than the radius of the circle. After reading all of the pixels into an array, group them by pairs.
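Here is a small sketch of the per-block logic described above, in plain C++ so it does not depend on how you load the image. The `Bgr` struct, the function names, the threshold of 128 and the bit order (R, G, B from most to least significant, first block before second) are all assumptions on my part; the real order has to come from the encoding table you were given.

```cpp
#include <cassert>
#include <cstdint>

// One pixel (or averaged block colour) in OpenCV's B, G, R channel order.
struct Bgr {
    std::uint8_t b, g, r;
};

// Threshold each channel to a single bit and pack the three bits into a
// value from 0 to 7. The R-G-B bit order is an assumption.
unsigned colourBits(Bgr px, std::uint8_t thresh = 128) {
    const unsigned r = px.r >= thresh ? 1u : 0u;
    const unsigned g = px.g >= thresh ? 1u : 0u;
    const unsigned b = px.b >= thresh ? 1u : 0u;
    return (r << 2) | (g << 1) | b;
}

// Two adjacent blocks give 3 + 3 = 6 bits, i.e. a symbol index from 0 to 63
// that can be looked up in the encoding array.
unsigned symbolIndex(Bgr first, Bgr second, std::uint8_t thresh = 128) {
    return (colourBits(first, thresh) << 3) | colourBits(second, thresh);
}

// True when (x, y) lies inside or on the circle; used to discard blocks
// that fall outside the barcode.
bool insideCircle(double x, double y, double cx, double cy, double radius) {
    assert(radius >= 0.0);
    const double dx = x - cx;
    const double dy = y - cy;
    return dx * dx + dy * dy <= radius * radius;
}
```

In practice you would sample the colour at the center of each block (or average over the block) rather than use a single edge pixel, since edges are where compression artifacts show up.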
At some point, I would say that using OpenCV to do this is heading towards machine learning. There has to be some point where you can cut in and use Neural Networks to decode the image for you. Once you have the circle (cutoff radius) and the image centered, you can convert to polar coordinates and discard everything outside the circle by cutting off everything greater than the radius of the circle. Remember, polar coordinates are (r, theta), so you should be able to cut off the right part of the polar image.
Then you could train a Neural Network to take the polar image as input and spit out the paragraph.
You would have to provide lots of training data, and the trained model would still be reliant on your ability to pre-process the image. This will include any affine transforms in case the image is tilted or rotated. At that point you would say to yourself that you've done all the heavy lifting and the last little bit really isn't that hard.
However, once you get a process working for a clean image, you can start adding steps that introduce ML to work on dirty images. HoughCircles can be used to detect the part of an image to run detection on. Next, you need to decide if the image inside the circle is a barcode or not.
A good barcode system will have parity bits or some other form of error correction, but you can use machine learning to clean up the output.
My 2 cents anyways.

relationship between SIFT keypoint orientation and SIFT description orientation

I am using VLfeat open source for extracting SIFT keypoints and their descriptions. The image below shows one of them. The yellow disc indicates the keypoint's scale (radius) and orientation (line). The green frame indicates its description (i.e., 4x4 8-bin orientation histogram).
The question itself is simple.
Why is the "orientation of a keypoint (yellow line)" different from the "major (most frequent) orientation in its description (most popular bin in green)" here?
As I understand it, the orientation of a keypoint is determined by the peak of the pixel gradient orientations around it. Then, shouldn't it be natural for that orientation to also show up in green? Is it because the green frame is much bigger than the keypoint's scale?
(source: young at me.berkeley.edu)
There are at least three things to consider in order to explain why this need not be the case:
The first one is the fact that the main (yellow) orientation comes from a 36-bin histogram, while the descriptor (green) orientations use 8 bins; this allows for a discrepancy of up to roughly 30 degrees.
The second one is that the descriptor histograms (green) are calculated after the feature area has been rotated by its main (yellow) orientation, so they would, at the very least, be shifted by this rotation.
But the most important reason is that both orientations are calculated around the same location but from different neighbourhoods (different in size and position) altogether, so their gradients need not be similar at all.
I think this is just a matter of visualization used in VLfeat. As described here
(source: vlfeat.org)
the "standard oriented frame" will be visualized as a circle with a radius pointing downwards.
The same applies here. If you rotate the frame such that the radius points downwards, then the major gradient direction of the frame should be horizontal, which agrees with most of the histograms inside the 4x4 squares.
I think this convention makes sense, because the radius pointing downwards is aligned with the "main strokes" of the frame (which is visually intuitive), but orthogonal to the major gradient direction.

Get all the image pixels with certain pixel values with K-nearest neighbor

I want to obtain all the pixels in an image whose values are closest to those of certain reference pixels in the image. For example, I have an image which has a view of ocean (deep blue), clear sky (light blue), beach, and houses. I want to find all the pixels that are closest to deep blue in order to classify them as water. My problem is that the sky also gets classified as water. Someone suggested using the K-nearest neighbor algorithm, but the few examples online use the old C-style API. Can anyone provide an example of K-NN using OpenCV C++?
"Classify it as water" and "obtain all the pixels in an image with pixel values closest to certain pixels in an image" are not the same task. Color alone is not enough for the classification you described. You will always have a number of points with the same color on both water and sky. So you have to use a more detailed analysis. For instance, if you know your object is connected, you can use something like a watershed segmentation to fill this region and ignore distant, unconnected regions in the sky that have the same color as water (supposing you can detect, with an edge detector, the horizon line that splits water and sky).
You can also use more information about the object you want to select, such as its structure: calculate its entropy, etc. Then you can use the K-nearest neighbor algorithm in a multi-dimensional space where the first 3 dimensions are color, the 4th is entropy, and so on. But you can also simply check, for every image pixel, whether it lies in an epsilon-neighborhood of the selected pixels (I mean in the color-entropy 4D space: 3 dimensions from color + 1 dimension from entropy) using a simple Euclidean metric. It is pretty fast and could be accelerated on a GPU.
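The epsilon-neighborhood check is simple enough to sketch in plain C++. This is only an illustration: the `Feature` layout (three color channels plus one entropy value), the function names and the idea of passing a list of hand-picked seed pixels are assumptions, and in practice the entropy dimension needs to be scaled so it is comparable with the color channels.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Assumed feature layout: B, G, R color values plus a local entropy value.
using Feature = std::array<double, 4>;

// Squared Euclidean distance between two feature vectors.
double squaredDistance(const Feature& a, const Feature& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// True when f lies within epsilon of at least one seed, i.e. inside the
// union of epsilon-balls around the selected "water" pixels.
bool nearAnySeed(const Feature& f, const std::vector<Feature>& seeds,
                 double epsilon) {
    assert(epsilon >= 0.0);
    const double eps2 = epsilon * epsilon;  // compare squares, avoids sqrt
    for (const Feature& s : seeds) {
        if (squaredDistance(f, s) <= eps2) return true;
    }
    return false;
}
```

Running `nearAnySeed` on the feature vector of every pixel gives a binary mask. It will still include sky pixels if their features are close to the seeds, which is why the connectivity or horizon constraint mentioned above is needed on top.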

Calculating the precision of homography on 2D plane

I am trying to find a way to parametrize the precision of my homography calculation. I would like to obtain a value that describes the precision of the homography calculation for a measurement taken at a certain position.
I have successfully calculated the homography (with cv::findHomography) and I can use it to map a point on my camera image onto a 2D map (using cv::perspectiveTransform). Now I want to track these objects on my 2D map, and to do this I want to take into account that objects at the back of my camera image have a less precise position on my 2D map than objects that are all the way at the front.
I have looked at the following example on this website that mentions plane fitting but I don't really understand how to fill the matrices correctly using this method. The visualisation of the result does seem to fit my needs. Is there any way to do this with standard OpenCV functions?
EDIT:
Thanks Francesco for your recommendations. But, I think I am looking for something different than your answer. I am not looking to test the precision of the homography itself, but the relation between the density of measurements in one real camera view and the actual size on a map I create. I want to know that when I am 1 pixel off on my detection in the camera image, how many meters this will be on my map at this point.
I can of course calculate this by taking some pixels around my measurement on my camera image and using the homography to see how many meters on my map they represent, but I don't want to do this every time I apply the homography. What I would like is a formula that gives me the relation between pixels in my image and pixels on my map, so I can take this into account for my tracking on the map.
What you are looking for is called "predictive error bars" or "prediction uncertainty". You should definitely consult a good introductory book on estimation theory for details (e.g. this one). But briefly, the predictive uncertainty is the probability that...
A certain pixel p in image 1 is the mapping H(p') of a pixel p' in image 2 under the homography H...
Given the uncertainty in H which is due to the errors in the matched pairs (q0, q0'), (q1, q1'), ..., that have been used to estimate H, ...
But assuming the model is correct, that is, that the true map between images 1 and 2 is, in fact, a homography (although the estimated parameters of the homography itself may be affected by errors).
In order to estimate this probability distribution you'll need a model for the errors in the measurements, and a model for how they propagate through the (homography) model.
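One simple propagation model, and probably the closest thing to the "formula" asked for in the edit, is a first-order one: differentiate the mapping at the measured pixel and look at how much a one-pixel error is stretched. The sketch below is plain C++ and only covers errors in the pixel position; it treats H itself as exact, which is an assumption. The `Mat3` alias, the `Jacobian` struct and the function names are made up for illustration. H is the 3x3 matrix in the row-major layout that cv::findHomography returns.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

using Mat3 = std::array<std::array<double, 3>, 3>;

// Partial derivatives of the map coordinates (X, Y) with respect to the
// image coordinates (x, y).
struct Jacobian {
    double dXdx, dXdy, dYdx, dYdy;
};

// For X = (h00 x + h01 y + h02) / w and Y = (h10 x + h11 y + h12) / w with
// w = h20 x + h21 y + h22, the quotient rule gives the entries below.
Jacobian homographyJacobian(const Mat3& H, double x, double y) {
    const double w = H[2][0] * x + H[2][1] * y + H[2][2];
    assert(std::abs(w) > 1e-12);  // point must not map to infinity
    const double X = (H[0][0] * x + H[0][1] * y + H[0][2]) / w;
    const double Y = (H[1][0] * x + H[1][1] * y + H[1][2]) / w;
    return {(H[0][0] - X * H[2][0]) / w, (H[0][1] - X * H[2][1]) / w,
            (H[1][0] - Y * H[2][0]) / w, (H[1][1] - Y * H[2][1]) / w};
}

// Largest singular value of the 2x2 Jacobian: the worst-case displacement on
// the map (in map units) caused by a one-pixel error in the image.
double maxMapUnitsPerPixel(const Jacobian& J) {
    const double frob2 = J.dXdx * J.dXdx + J.dXdy * J.dXdy +
                         J.dYdx * J.dYdx + J.dYdy * J.dYdy;
    const double det = J.dXdx * J.dYdy - J.dXdy * J.dYdx;
    const double disc = std::sqrt(std::max(0.0, frob2 * frob2 - 4.0 * det * det));
    return std::sqrt(0.5 * (frob2 + disc));
}
```

If the pixel error has covariance S, the same Jacobian J gives the first-order map covariance J S J^T, which can be fed to the tracker as a position-dependent measurement noise. Accounting for the uncertainty in H itself requires the fuller treatment described above.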