Dense SIFT (dsift) and VLFeat

I want to ask two questions about dense SIFT (dsift) and VLFeat:
Is there any material that explains dsift in detail? I have seen many sources saying that "dense SIFT is SIFT applied to a dense grid", but what does this mean exactly? Can it be described in more detail? I have read the source code dsift.c and dsift.h in VLFeat and the technical details about dsift, but there are many things I cannot understand. Existing papers usually focus on applications of dsift.
I use VLFeat in my C program and it works fine, but when I customize the parameters with vl_dsift_set_geometry it goes wrong. Because I do not know how dsift works, I do not know how to set binSizeX/Y and numBinX/Y properly. I read "patch size 76" in a paper. Does a patch refer to a 4*4 grid? I am somewhat confused by the terms bin, patch and grid. So my question is: with a patch size of 76, how do I set binSizeX/Y and numBinX/Y (image size 256*256)?

In SIFT, the first step is to detect keypoints. Keypoint detection is performed at multiple scales.
The next step is to describe each keypoint to generate its descriptor.
The distribution of the keypoints over the image is not uniform; it depends on where keypoints happen to be detected.
In dense SIFT there is no keypoint detection: SIFT descriptors are computed on a regular grid at pre-specified points and at a single pre-specified scale. This is not useful if you are matching objects that may appear at different scales.
There is also the PHOW variant, which is a combination of dense SIFT and SIFT. Instead of computing SIFT at pre-specified locations and a pre-specified scale, features are computed at pre-specified locations but at several scales. In PHOW, all SIFT features computed at the same point (at different scales) are combined into a single feature at that location.
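Regarding the second question: as far as I understand VLFeat's dense SIFT (worth double-checking against dsift.h), a descriptor consists of numBinX x numBinY spatial bins of binSizeX x binSizeY pixels each, so the patch (the descriptor support) spans roughly numBinX * binSizeX pixels along each axis. Under that assumption, a patch size of 76 with the standard 4x4 spatial grid gives a bin size of 19. A minimal sketch of the arithmetic (plain Python, no library calls; the grid step is a free parameter chosen here just for illustration):

```python
# Hedged sketch: assumes the dsift patch along x spans numBinX * binSizeX pixels
# (and likewise for y), which is my reading of VLFeat's geometry, not official doc.
patch_size = 76                      # "patch size 76" from the paper
num_bin_x = num_bin_y = 4            # standard SIFT 4x4 spatial geometry
bin_size_x = bin_size_y = patch_size // num_bin_x   # 76 / 4 = 19 pixels per bin

image_size = 256
step = 4                             # sampling step of the dense grid (your choice)
# Number of descriptor centers that fit along one image dimension:
frames_per_dim = (image_size - patch_size) // step + 1
print(bin_size_x, frames_per_dim)    # -> 19 46
```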

Related

Averaging SIFT features to do pose estimation

I have created a point cloud of an irregular (non-planar) complex object using SfM. Each one of those 3D points was viewed in more than one image, so it has multiple (SIFT) features associated with it.
Now, I want to solve for the pose of this object in a new, different set of images using a PnP algorithm matching the features detected in the new images with the features associated with the 3D points in the point cloud.
So my question is: which descriptor do I associate with the 3D point to get the best results?
So far I've come up with a number of possible solutions...
Average all of the descriptors associated with the 3D point (taken from the SfM pipeline) and use that "mean descriptor" to do the matching in PnP. This approach seems a bit far-fetched to me - I don't know enough about feature descriptors (specifically SIFT) to comment on the merits and drawbacks of this approach.
"Pin" all of the descriptors calculated during the SfM pipeline to their associated 3D point. During PnP, you would essentially have duplicate points to match with (one duplicate for each descriptor). This is obviously computationally intensive.
Find the "central" viewpoint that the feature appears in (from the SfM pipeline) and use the descriptor from this view for PnP matching. So if the feature appears in images taken at -30, 10, and 40 degrees (measured from the surface normal), use the descriptor from the 10-degree image. This, to me, seems like the most promising solution.
Is there a standard way of doing this? I haven't been able to find any research or advice online regarding this question, so I'm really just curious if there is a best solution, or if it is dependent on the object/situation.
The descriptors that are used for matching in most SLAM or SFM systems are rotation and scale invariant (and, to some extent, robust to intensity changes). That is why we are able to match them from different viewpoints in the first place. So, in general, it doesn't make much sense to try to use them all, average them, or use the ones from a particular image. If the matching in your SFM was done correctly, the descriptors of the reprojections of a 3D point from your point cloud in any of its observations should be very close, so you can use any of them. [1]
Also, it seems to me that you are trying to directly match the 2D points to the 3D points. From a computational point of view, I think this is not a very good idea: by matching 2D points with 3D ones, you lose the spatial information of the images and have to search for matches in a brute-force manner, which in turn can introduce noise. But if you do your matching from image to image and then propagate the results to the 3D points, you will be able to enforce priors (if you roughly know where you are, e.g. from an IMU, or if you know that your images are close), restrict the neighborhood where you look for matches in your images, etc. Additionally, once you have computed your pose and refined it, you will need to add more points, no? How will you do that if you haven't done any 2D/2D matching, but only 2D/3D matching?
Now, the way to implement this usually depends on your application (how much covisibility or baseline you have between the poses from your SFM, etc). As an example, let's call your candidate image I_0 and the images from your SFM I_1, ..., I_n. First, match between I_0 and I_1. Now, assume q_0 is a 2D point from I_0 that has successfully been matched to q_1 from I_1, which corresponds to some 3D point Q. To ensure consistency, consider the reprojection of Q in I_2 and call it q_2. Match I_0 and I_2. Does the point to which q_0 is matched in I_2 fall close to q_2? If yes, keep the 2D/3D match between q_0 and Q, and so on.
I don't have enough information about your data and your application, but I think that depending on your constraints (real-time or not, etc), you could come up with some variation of the above. The key idea anyway is, as I said previously, to try to match from frame to frame and then propagate to the 3d case.
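As a rough illustration of that consistency check, here is a minimal Python/OpenCV sketch. The file names, the point3d_of_kp1 map coming from your SFM, the intrinsics K/dist, the known pose (R2, t2) of I_2 and the 3-pixel threshold are all assumptions for the example, not part of any particular pipeline:

```python
import cv2
import numpy as np

# Assumed inputs: images for I_0, I_1, I_2; the SFM map point3d_of_kp1
# (keypoint index in I_1 -> 3D point, or None); intrinsics K, dist; and the
# known pose (rotation R2, translation t2) of I_2 from the reconstruction.
sift = cv2.SIFT_create()
imgs = [cv2.imread(p, cv2.IMREAD_GRAYSCALE) for p in ('I0.jpg', 'I1.jpg', 'I2.jpg')]
(kp0, des0), (kp1, des1), (kp2, des2) = [sift.detectAndCompute(im, None) for im in imgs]

matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
matches_01 = matcher.match(des0, des1)                            # I_0 <-> I_1
matches_02 = {m.queryIdx: m for m in matcher.match(des0, des2)}   # I_0 <-> I_2

pts2d, pts3d = [], []
for m in matches_01:
    Q = point3d_of_kp1.get(m.trainIdx)       # 3D point behind q_1, if any
    if Q is None:
        continue
    # Reproject Q into I_2 and check that the independent I_0 <-> I_2 match agrees.
    q2_proj, _ = cv2.projectPoints(np.float32([Q]), cv2.Rodrigues(R2)[0], t2, K, dist)
    m2 = matches_02.get(m.queryIdx)
    if m2 is not None and np.linalg.norm(np.float32(kp2[m2.trainIdx].pt) - q2_proj.ravel()) < 3.0:
        pts2d.append(kp0[m.queryIdx].pt)
        pts3d.append(Q)

# Solve PnP only on the 2D/3D correspondences that survived the check.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(np.float32(pts3d), np.float32(pts2d), K, dist)
```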
Edit: Thank you for your clarifications in the comments. Here are a few thoughts (feel free to correct me):
Let's consider a SIFT descriptor s_0 from I_0, and let F(s_1, ..., s_n) denote your aggregated descriptor (which could be an average or a concatenation of the SIFT descriptors s_i in their corresponding images I_i, etc). Then, when matching s_0 with F, you will only want to use the subset of the s_i that belong to images with viewpoints close to I_0 (because of the 30-degree problem you mention, although I think it should be 50 degrees). That means you have to attribute a weight to each s_i that depends on the pose of your query I_0. You obviously can't do that when constructing F, so you have to do it when matching. However, you don't have a strong prior on the pose (otherwise, I assume you wouldn't be needing PnP), so you can't really determine this weight. Therefore I think there are two conclusions/options here:
SIFT descriptors are not adapted to the task. You can try coming up with a perspective-invariant descriptor. There is some literature on the subject.
Try to keep some visual information in the form of "Key-frames", as in many SLAM systems. It wouldn't make sense to keep all of your images anyway, just keep a few that are well distributed (pose-wise) in each area, and use those to propagate 2d matches to the 3d case.
If you only match between the 2d point of your query and 3d descriptors without any form of consistency check (as the one I proposed earlier), you will introduce a lot of noise...
tl;dr I would keep some images.
[1] Since you say that you obtain your 3D reconstruction from an SFM pipeline, some of the observations are probably considered inliers and some outliers (indicated by a boolean flag). If they are outliers, just ignore them; if they are inliers, then they are the result of matching and triangulation, and their positions have been refined multiple times, so you can trust any of their descriptors.

2D object detection with only a single training image

The vision system is given a single training image (e.g. a piece of 2D artwork) and is asked whether that piece of artwork is present in newly captured photos. The newly captured photos can contain many other objects, and when the artwork is present it must be facing up, but it may be occluded.
The pose space is x,y,rotation and scale. The artwork may be highly symmetric or not.
What is the current state of the art for handling this kind of problem?
I have tried or considered the following options, but there are problems with all of them. If my reasoning is invalid, please correct me.
Deep learning (R-CNN/YOLO): a lot of labeled data is needed, which means a lot of human labor for each new piece of artwork.
Traditional machine learning (SVM, random forest): same as above.
SIFT/SURF/ORB + RANSAC or voting: when the artwork is symmetric, most of the matched features are incorrect, and a lot of time is needed in the RANSAC/voting stage.
Generalized Hough transform: the state space is too large for the voting table. A pyramid can be applied, but it is difficult to choose universal thresholds for different kinds of artwork to decide when to proceed down the pyramid.
Chamfer matching: the state space is too large, so too much time is needed to search across it.
Object detection requires a lot of labeled data of the same class to generalize well, and in your setting it would be impossible to train a network with only a single instance.
I assume that in your case online object trackers could work; at least give them a try. There are some convolutional object trackers that work very well, such as Siamese CNNs. The code is open source on GitHub, and you can watch this video to see its performance.
Online object tracking: Given the initialized state (e.g., position and size) of a target object in a frame of a video, the goal of tracking is to estimate the states of the target in the subsequent frames. (source)
You can try a traditional feature-based image processing approach, which might give true positive template matches with decent accuracy.
Given the template image as in the question:
First, dilate the image to join all very closely spaced connected components.
Find the convex hull of the connected object obtained above; this gives you a polygon.
Use the polygon's edge-length information, e.g. the (max length / min length) ratio, as a feature of the template.
Also compute the pixel density inside the polygon as a second feature.
We now have 2 features.
Scene image feature vector:
Similarly, in the scene image, apply dilation followed by connected-component identification, define a convex hull (polygon) around each connected object, and compute a feature vector for each object (edge info, pixel density).
Now search for the template feature vector among the scene objects' feature vectors using the minimum feature distance (and use an upper distance threshold to avoid false positive matches).
This should give the true positive matches if the object is present in the scene image.
Exception: this method would not work for occluded objects.
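A minimal OpenCV (Python) sketch of the two-feature idea above, assuming thresholded/binary inputs; the kernel size, file names and distance threshold are illustrative:

```python
import cv2
import numpy as np

def hull_features(binary_blob):
    """Dilate, take the convex hull of the largest contour, and return
    [max_edge / min_edge ratio of the hull polygon, pixel density inside the hull]."""
    dilated = cv2.dilate(binary_blob, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    hull = cv2.convexHull(max(contours, key=cv2.contourArea))
    pts = hull[:, 0, :].astype(np.float32)
    edges = np.linalg.norm(np.roll(pts, -1, axis=0) - pts, axis=1)   # polygon edge lengths
    mask = np.zeros_like(binary_blob)
    cv2.drawContours(mask, [hull], -1, 255, -1)                      # filled hull mask
    density = cv2.countNonZero(cv2.bitwise_and(binary_blob, mask)) / cv2.contourArea(hull)
    return np.array([edges.max() / max(edges.min(), 1e-6), density])

# Compare the template features against each scene object's features and accept
# the closest one only if it is under a distance threshold (threshold illustrative).
template_feat = hull_features(cv2.imread('template_binary.png', cv2.IMREAD_GRAYSCALE))
scene_feat = hull_features(cv2.imread('scene_object_binary.png', cv2.IMREAD_GRAYSCALE))
is_match = np.linalg.norm(template_feat - scene_feat) < 0.5
```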

Building a simple image search using TensorFlow

I need to implement a simple image search in my app using TensorFlow.
The requirements are these:
The dataset contains around a million images, all of the same size, each containing one unique object and only that object.
The search parameter is an image taken with a phone camera of some object that is potentially in the dataset.
I've managed to extract the image from the camera picture and straighten it to a rectangular form, and as a result a reverse image search engine like TinEye was able to find a match.
Now I want to reproduce that indexer using TensorFlow to create a model based on my dataset (making each image's file name a unique index).
Could anyone point me to tutorials/code that would explain how to achieve such a thing without diving too deep into computer vision terminology?
Much appreciated!
The Wikipedia article on TinEye says that Perceptual Hashing will yield results similar to TinEye's. They reference this detailed description of the algorithm. But TinEye refuses to comment.
The biggest issue with the Perceptual Hashing approach is that while it's efficient for identifying the same image (subject to skews, contrast changes, etc.), it's not great at identifying a completely different image of the same object (e.g. the front of a car vs. the side of a car).
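If you want to try the perceptual-hashing route first, here is a minimal sketch using the third-party imagehash Python package (the file names are placeholders):

```python
from PIL import Image
import imagehash  # pip install ImageHash

# DCT-based perceptual hash; subtracting two hashes gives their Hamming distance
# (0 = identical, larger = more different). Pick a decision threshold empirically.
h_query = imagehash.phash(Image.open('query.jpg'))
h_candidate = imagehash.phash(Image.open('candidate.jpg'))
print(h_query - h_candidate)
```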
TensorFlow has great support for deep neural nets which might give you better results. Here's a high level description of how you might use a deep neural net in TensorFlow to solve this problem:
Start with a pre-trained NN (such as GoogLeNet) or train one yourself on a dataset like ImageNet. Now we're given a new picture we're trying to identify. Feed that into the NN. Look at the activations of a fairly deep layer in the NN. This vector of activations is like a 'fingerprint' for the image. Find the picture in your database with the closest fingerprint. If it's sufficiently close, it's probably the same object.
The intuition behind this approach is that unlike Perceptual Hashing, the NN is building up a high-level representation of the image including identifying edges, shapes, and important colors. For example, the fingerprint of an apple might include information about its circular shape, red color, and even its small stem.
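A minimal TensorFlow/Keras sketch of this fingerprint idea (InceptionV3 is just one convenient pretrained choice here, and the file names are placeholders):

```python
import numpy as np
import tensorflow as tf

# Pretrained CNN without the classification head; global average pooling turns
# the last convolutional feature map into a fixed-length embedding per image.
model = tf.keras.applications.InceptionV3(weights='imagenet', include_top=False, pooling='avg')

def fingerprint(path):
    img = tf.keras.preprocessing.image.load_img(path, target_size=(299, 299))
    x = tf.keras.applications.inception_v3.preprocess_input(
        tf.keras.preprocessing.image.img_to_array(img))
    return model.predict(x[np.newaxis], verbose=0)[0]

# Index the dataset once (file name -> embedding), then answer a query by
# nearest neighbor in embedding space.
index = {name: fingerprint(name) for name in ('item_001.jpg', 'item_002.jpg')}
query = fingerprint('camera_photo.jpg')
best = min(index, key=lambda name: np.linalg.norm(index[name] - query))
```

For a million images you would replace the linear scan with an approximate nearest-neighbor index, but the fingerprinting step stays the same.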
You could also try something like this 2012 paper on image retrieval which uses a slew of hand-picked features such as SIFT, regional color moments and object contour fragments. This is probably a lot more work and it's not what TensorFlow is best at.
UPDATE
The OP has provided an example pair of images from his application:
Here are the results of using the demo on the pHash.org website on that pair of similar images as well as on a pair of completely dissimilar images.
Comparing the two images provided by the OP:
RADISH (radial hash): pHash determined your images are not similar with PCC = 0.518013
DCT hash: pHash determined your images are not similar with hamming distance = 32.000000.
Marr/Mexican hat wavelet: pHash determined your images are not similar with normalized hamming distance = 0.480903.
Comparing one of his images with a random image from my machine:
RADISH (radial hash): pHash determined your images are not similar with PCC = 0.690619.
DCT hash: pHash determined your images are not similar with hamming distance = 27.000000.
Marr/Mexican hat wavelet: pHash determined your images are not similar with normalized hamming distance = 0.519097.
Conclusion
We'll have to test more images to really know, but so far pHash does not seem to be doing very well: with the default thresholds it doesn't consider the similar images to be similar, and for one algorithm it actually rates a completely random image as more similar.
https://github.com/wuzhenyusjtu/VisualSearchServer
It is a simple implementation of similar-image search using TensorFlow and the InceptionV3 model. The code implements two components: a server that handles image search, and a simple indexer that does nearest-neighbor matching on the extracted pool3 features.

Clustering feature Space - SURF descriptors with Adaptive MeanShift

I didn't find anything on the internet. There have been some recent papers about clustering feature-space descriptors (such as those from SIFT/SURF) using the mean shift algorithm.
Does anybody have any links or any code/library/tips for actually clustering SURF descriptors? (MATLAB/C++)
I've already tried the 1D mean shift (which works perfectly on the locations of the points) and also some other mean shift implementations that were available, though all of them seem to have problems with higher dimensions.
Thanks in advance!
Why are you using a 1D clustering algorithm on a high-dimensional dataset? Mean-shift segmentation is an unsupervised classification task, while SIFT and SURF are used to find keypoints in an image. There is only one mean shift; there are alternatives such as CAMShift, but they are mostly independent of mean shift. SURF and mean shift are likewise independent algorithms, so you will not find an implementation that couples them unless it is tailored to a specific application.
Furthermore, SIFT commonly employs a 128-dimensional EoH-based descriptor (similar in dimensionality to the extended SURF descriptor) for a given keypoint. If you also account for the local position (x, y) of each keypoint, you have a 130-dimensional feature space, not a 1D one.
If you wish to categorise the edge information in an image, you should first localise the keypoints using SIFT or SURF, then use a concatenated vector of the EoH descriptor and the keypoint position as the input to the clustering algorithm. If you search Google or the MathWorks File Exchange for an N-dimensional mean shift implementation, you will find one; the process is the same as for a 1D dataset, so there is no gain in hard-coding it for the 1D case. You will also find that MATLAB's Computer Vision toolbox already contains a SURF implementation.
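As a minimal illustration of that pipeline (in Python with scikit-learn rather than MATLAB/C++; SIFT stands in for SURF, which requires the opencv-contrib build, and the file name and quantile are placeholders):

```python
import cv2
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

img = cv2.imread('scene.jpg', cv2.IMREAD_GRAYSCALE)   # placeholder file name

detector = cv2.SIFT_create()                          # 128-D descriptors per keypoint
keypoints, descriptors = detector.detectAndCompute(img, None)

# Concatenate descriptor and keypoint position -> 130-D feature space, as above.
# (You may want to rescale positions and descriptors to comparable ranges first.)
positions = np.float32([kp.pt for kp in keypoints])
features = np.hstack([descriptors, positions])

# estimate_bandwidth picks a kernel width from the data; tune quantile as needed.
bandwidth = estimate_bandwidth(features, quantile=0.1)
labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(features)
print('clusters found:', labels.max() + 1)
```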
Mean-Shift: http://www.mathworks.co.uk/matlabcentral/fileexchange/10161-mean-shift-clustering
SURF: http://www.mathworks.co.uk/help/vision/examples/object-detection-in-a-cluttered-scene-using-point-feature-matching.html
The C++ and MATLAB SIFT implementations are referenced in the original paper and on its website (A. Vedaldi, "An implementation of SIFT detector and descriptor", 2004).
SIFT: http://www.robots.ox.ac.uk/~vedaldi/code/sift.html
Original SURF paper: http://www.vision.ee.ethz.ch/~surf/eccv06.pdf
Original SIFT paper: http://www.robots.ox.ac.uk/~vedaldi/assets/sift/sift.pdf

Detecting a cross in an image with OpenCV

I'm trying to detect a shape (a cross) in my input video stream with the help of OpenCV. Currently I'm thresholding to get a binary image of my cross, which works pretty well. Unfortunately, my algorithm for deciding whether the extracted blob is a cross or not doesn't perform very well. As you can see in the image below, not all corners are detected under certain perspectives.
I'm using findContours() and approxPolyDP() to get an approximation of my contour. If I detect 12 corners/vertices in this approximated curve, the blob is assumed to be a cross.
Is there a better way to solve this problem? I thought about SIFT, but the algorithm has to run in real time, and I've read that SIFT is not really suitable for real-time use.
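For reference, a minimal sketch of the contour/vertex check described above (the approximation tolerance is illustrative, not the exact value in use):

```python
import cv2

binary = cv2.imread('thresholded_frame.png', cv2.IMREAD_GRAYSCALE)  # placeholder
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

for cnt in contours:
    # Polygonal approximation; epsilon (here 2% of the perimeter) is illustrative.
    approx = cv2.approxPolyDP(cnt, 0.02 * cv2.arcLength(cnt, True), True)
    if len(approx) == 12:        # a cross has 12 corners
        print('cross candidate found')
```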
I have a couple of suggestions that might provide some interesting results although I am not certain about either.
If the cross is always near the center of your image and always lies on a planar surface, you could try to find a homography between the camera and the plane on which the cross lies. This would enable you to transform a sample image of the cross (at a selection of different in-plane rotations) into the coordinate system of the visualized cross. You could then generate templates to match against the image, using some simple pixel-agreement tests to determine whether you have a match.
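A rough sketch of that idea, assuming you already have four correspondences between the template corners and their locations on the plane in the camera image (plane_corners and the file names are hypothetical inputs, e.g. obtained from a calibration pattern):

```python
import cv2
import numpy as np

template = cv2.imread('cross_template.png', cv2.IMREAD_GRAYSCALE)    # placeholder
frame = cv2.imread('thresholded_frame.png', cv2.IMREAD_GRAYSCALE)    # placeholder
h, w = template.shape

# Hypothetical: where the template's four corners land on the plane in the frame.
plane_corners = np.float32([[120, 80], [320, 90], [310, 260], [110, 250]])

src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
H, _ = cv2.findHomography(src, plane_corners)
warped = cv2.warpPerspective(template, H, (frame.shape[1], frame.shape[0]))

# Simple pixel-agreement score between the warped template and the binary frame.
overlap = cv2.countNonZero(cv2.bitwise_and(warped, frame))
score = overlap / max(cv2.countNonZero(warped), 1)
```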
Alternatively, you could try to train a Haar-based classifier to recognize the cross. This type of classifier is often used in face detection: it detects oriented edges in images and classifies faces by the relative positions of several oriented edges. It has good classification accuracy on faces and is extremely fast. Although I cannot vouch for its accuracy in this particular situation, it might provide good results for simple shapes such as a cross.
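If you go the Haar route, detection with a cascade trained offline (e.g. with OpenCV's opencv_traincascade tool) would look roughly like this; the XML file name and parameters are placeholders:

```python
import cv2

cascade = cv2.CascadeClassifier('cross_cascade.xml')    # hypothetical trained cascade
frame = cv2.imread('frame.png', cv2.IMREAD_GRAYSCALE)   # placeholder

# Multi-scale sliding-window detection, the same way it is used for faces.
detections = cascade.detectMultiScale(frame, scaleFactor=1.1, minNeighbors=5, minSize=(24, 24))
for (x, y, w, h) in detections:
    cv2.rectangle(frame, (x, y), (x + w, y + h), 255, 2)
```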
Computing the convex hull and then taking advantage of the convexity defects might work.
All crosses should have four convexity defects, making up four sets of two points, or four vectors. Furthermore, if your shape is a cross, these four vectors will have two pairs of supplementary angles.
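A minimal sketch of the convexity-defect count (the depth threshold and file name are illustrative):

```python
import cv2
import numpy as np

binary = cv2.imread('blob.png', cv2.IMREAD_GRAYSCALE)      # thresholded blob, placeholder
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnt = max(contours, key=cv2.contourArea)

hull_idx = cv2.convexHull(cnt, returnPoints=False)         # hull as contour indices
defects = cv2.convexityDefects(cnt, hull_idx)              # rows: start, end, farthest, depth

# Keep only significant defects (depth is fixed-point, in 1/256-pixel units)
# and expect exactly four of them for a cross.
significant = [] if defects is None else [d for d in defects[:, 0] if d[3] / 256.0 > 10]
is_cross_candidate = len(significant) == 4
```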