Averaging SIFT features to do pose estimation - computer-vision

I have created a point cloud of an irregular (non-planar) complex object using SfM. Each one of those 3D points was viewed in more than one image, so it has multiple (SIFT) features associated with it.
Now, I want to solve for the pose of this object in a new, different set of images using a PnP algorithm matching the features detected in the new images with the features associated with the 3D points in the point cloud.
So my question is: which descriptor do I associate with the 3D point to get the best results?
So far I've come up with a number of possible solutions...
Average all of the descriptors associated with the 3D point (taken from the SfM pipeline) and use that "mean descriptor" to do the matching in PnP. This approach seems a bit far-fetched to me - I don't know enough about feature descriptors (specifically SIFT) to comment on the merits and downfalls of this approach.
"Pin" all of the descriptors calculated during the SfM pipeline to their associated 3D point. During PnP, you would essentially have duplicate points to match with (one duplicate for each descriptor). This is obviously intensive.
Find the "central" viewpoint that the feature appears in (from the SfM pipeline) and use the descriptor from this view for PnP matching. So if the feature appears in images taken at -30, 10, and 40 degrees ( from surface normal), use the descriptor from the 10 degree image. This, to me, seems like the most promising solution.
Is there a standard way of doing this? I haven't been able to find any research or advice online regarding this question, so I'm really just curious if there is a best solution, or if it is dependent on the object/situation.

The descriptors that are used for matching in most SLAM or SFM systems are rotation and scale invariant (and to some extent, robust to intensity changes). That is why we are able to match them from different view points in the first place. So, in general it doesn't make much sense to try to use them all, average them, or use the ones from a particular image. If the matching in your SFM was done correctly, the descriptors of the reprojection of a 3d point from your point cloud in any of its observations should be very close, so you can use any of them 1.
Also, it seems to me that you are trying to directly match the 2d points to the 3d points. From a computational point of view, I think this is not a very good idea, because by matching 2d points with 3d ones, you lose the spatial information of the images and have to search for matches in a brute force manner. This in turn can introduce noise. But, if you do your matching from image to image and then propagate the results to the 3d points, you will be able to enforce priors (if you roughly know where you are, i.e. from an IMU, or if you know that your images are close), you can determine the neighborhood where you look for matches in your images, etc. Additionally, once you have computed your pose and refined it, you will need to add more points, no? How will you do it if you haven't done any 2d/2d matching, but just 2d/3d matching?
Now, the way to implement that usually depends on your application (how much covisibility or baseline you have between the poses from you SFM, etc). As an example, let's note your candidate image I_0, and let's note the images from your SFM I_1, ..., I_n. First, match between I_0 and I_1. Now, assume q_0 is a 2d point from I_0 that has successfully been matched to q_1 from I_1, which corresponds to some 3d point Q. Now, to ensure consistency, consider the reprojection of Q in I_2, and call it q_2. Match I_0 and I_2. Does the point to which q_0 is match in I_2 fall close to q_2? If yes, keep the 2d/3d match between q_0 and Q, and so on.
I don't have enough information about your data and your application, but I think that depending on your constraints (real-time or not, etc), you could come up with some variation of the above. The key idea anyway is, as I said previously, to try to match from frame to frame and then propagate to the 3d case.
Edit: Thank you for your clarifications in the comments. Here are a few thoughts (feel free to correct me):
Let's consider a SIFT descriptor s_0 from I_0, and let's note F(s_1,...,s_n) your aggregated descriptor (that can be an average or a concatenation of the SIFT descriptors s_i in their corresponding I_i, etc). Then when matching s_0 with F, you will only want to use a subset of the s_i that belong to images that have close viewpoints to I_0 (because of the 30deg problem that you mention, although I think it should be 50deg). That means that you have to attribute a weight to each s_i that depends on the pose of your query I_0. You obviously can't do that when constructing F, so you have to do it when matching. However, you don't have a strong prior on the pose (otherwise, I assume you wouldn't be needing PnP). As a result, you can't really determine this weight. Therefore I think there are two conclusions/options here:
SIFT descriptors are not adapted to the task. You can try coming up with a perspective-invariant descriptor. There is some literature on the subject.
Try to keep some visual information in the form of "Key-frames", as in many SLAM systems. It wouldn't make sense to keep all of your images anyway, just keep a few that are well distributed (pose-wise) in each area, and use those to propagate 2d matches to the 3d case.
If you only match between the 2d point of your query and 3d descriptors without any form of consistency check (as the one I proposed earlier), you will introduce a lot of noise...
tl;dr I would keep some images.
1 Since you say that you obtain your 3d reconstruction from an SFM pipline, some of them are probably considered inliers and some are outliers (indicated by a boolean flag). If they are outliers, just ignore them, if they are inliers, then they are the result of matching and triangulation, and their position has been refined multiple times, so you can trust any of their descriptors.

Related

2D object detection with only a single training image

The vision system is given a single training image (e.g. a piece of 2D artwork ) and it is asked whether the piece of artwork is present in the newly captured photos. The newly captured photos can contain a lot of any other object and when the artwork is presented, it must face up but may be occluded.
The pose space is x,y,rotation and scale. The artwork may be highly symmetric or not.
What is the latest state of the art handling this kind of problem?
I have tried/considered the following options but there are some problems in all of them. If my argument is invalid please correct me.
deep learning(rcnn/yolo): a lot of labeled data are needed means a lot of human labor is needed for each new pieces of artwork.
traditional machine learning(SVM,Random forest): same as above
sift/surf/orb + ransac or voting: when the artwork is symmetric, the features matched are mostly incorrect. A lot of time is needed in the ransac/voting stage.
generalized hough transform: the state space is too large for the voting table. Pyramid can be applied but it is difficult to choose some universal thresholds for different kinds of artwork to proceed down the pyramid.
chamfer matching: the state space is too large. Too much time is needed in searching across the state space.
Object detection requires a lot of labeled data of the same class to generalize well, and in your setting it would be impossible to train a network with only single instance.
I assume that in your case online object trackers can work, at least give it a try. There are some convolutional object trackers that work great like Siamese CNNs. The code is open source at github, and you can watch this video to see its performance.
Online object tracking: Given the initialized state (e.g., position
and size) of a target object in a frame of a video, the goal
of tracking is to estimate the states of the target in the subsequent
frames.-source-
You can try using traditional feature based image processing algorithm which might give true positive template matches up to a descent accuracy.
Given template image as in the question:
First dilate the image to join all very closely spaced connected
components.
Find the convex hull of the connected object obtained above,This will give you a polygon.
Use above polygon edge length information like (max-length/min-length) ratio as feature of the template.
Also find the pixel density in the polygon as second feature.
We have 2 features now.
Scene image feature vector:
Similarly Again in the scene image use dilation followed by connected components identification, define convex hull(polygon) around each connected objects and define feature vector for each object(edge info, pixel density).
Now as usual search for template feature vector in the scene image feature vectors data with minimum feature distance(also use certain upper level distance threshold value to avoid false positive matches).
This should give the true positive matches if available in the scene image.
Exception: This method would not work for occluded objects.

Detecting/correcting Photo Warping via Point Correspondences

I realize there are many cans of worms related to what I'm asking, but I have to start somewhere. Basically, what I'm asking is:
Given two photos of a scene, taken with unknown cameras, to what extent can I determine the (relative) warping between the photos?
Below are two images of the 1904 World's Fair. They were taken at different levels on the wireless telegraph tower, so the cameras are more or less vertically in line. My goal is to create a model of the area (in Blender, if it matters) from these and other photos. I'm not looking for a fully automated solution, e.g., I have no problem with manually picking points and features.
Over the past month, I've taught myself what I can about projective transformations and epipolar geometry. For some pairs of photos, I can do pretty well by finding the fundamental matrix F from point correspondences. But the two below are causing me problems. I suspect that there's some sort of warping - maybe just an aspect ratio change, maybe more than that.
My process is as follows:
I find correspondences between the two photos (the red jagged lines seen below).
I run the point pairs through Matlab (actually Octave) to find the epipoles. Currently, I'm using Peter Kovesi's
Peter's Functions for Computer Vision.
In Blender, I set up two cameras with the images overlaid. I orient the first camera based on the vanishing points. I also determine the focal lengths from the vanishing points. I orient the second camera relative to the first using the epipoles and one of the point pairs (below, the point at the top of the bandstand).
For each point pair, I project a ray from each camera through its sample point, and mark the closest covergence of the pair (in light yellow below). I realize that this leaves out information from the fundamental matrix - see below.
As you can see, the points don't converge very well. The ones from the left spread out the further you go horizontally from the bandstand point. I'm guessing that this shows differences in the camera intrinsics. Unfortunately, I can't find a way to find the intrinsics from an F derived from point correspondences.
In the end, I don't think I care about the individual intrinsics per se. What I really need is a way to apply the intrinsics to "correct" the images so that I can use them as overlays to manually refine the model.
Is this possible? Do I need other information? Obviously, I have little hope of finding anything about the camera intrinsics. There is some obvious structural info though, such as which features are orthogonal. I saw a hint somewhere that the vanishing points can be used to further refine or upgrade the transformations, but I couldn't find anything specific.
Update 1
I may have found a solution, but I'd like someone with some knowledge of the subject to weigh in before I post it as an answer. It turns out that Peter's Functions for Computer Vision has a function for doing a RANSAC estimate of the homography from the sample points. Using m2 = H*m1, I should be able to plot the mapping of m1 -> m2 over top of the actual m2 points on the second image.
The only problem is, I'm not sure I believe what I'm seeing. Even on an image pair that lines up pretty well using the epipoles from F, the mapping from the homography looks pretty bad.
I'll try to capture an understandable image, but is there anything wrong with my reasoning?
A couple answers and suggestions (in no particular order):
A homography will only correctly map between point correspondences when either (a) the camera undergoes a pure rotation (no translation) or (b) the corresponding points are all co-planar.
The fundamental matrix only relates uncalibrated cameras. The process of recovering a camera's calibration parameters (intrinsics) from unknown scenes, known as "auto-calibration" is a rather difficult problem. You'd need these parameters (focal length, principal point) to correctly reconstruct the scene.
If you have (many) more images of this scene, you could try using a system such as Visual SFM: http://ccwu.me/vsfm/ It will attempt to automatically solve the Structure From Motion problem, including point matching, auto-calibration and sparse 3D reconstruction.

OpenCV detect image against a image set

I would like to know how I can use OpenCV to detect on my VideoCamera a Image. The Image can be one of 500 images.
What I'm doing at the moment:
- (void)viewDidLoad
{
[super viewDidLoad];
// Do any additional setup after loading the view.
self.videoCamera = [[CvVideoCamera alloc] initWithParentView:imageView];
self.videoCamera.delegate = self;
self.videoCamera.defaultAVCaptureDevicePosition = AVCaptureDevicePositionBack;
self.videoCamera.defaultAVCaptureSessionPreset = AVCaptureSessionPresetHigh;
self.videoCamera.defaultAVCaptureVideoOrientation = AVCaptureVideoOrientationPortrait;
self.videoCamera.defaultFPS = 30;
self.videoCamera.grayscaleMode = NO;
}
-(void)viewDidAppear:(BOOL)animated{
[super viewDidAppear:animated];
[self.videoCamera start];
}
#pragma mark - Protocol CvVideoCameraDelegate
#ifdef __cplusplus
- (void)processImage:(cv::Mat&)image;
{
// Do some OpenCV stuff with the image
cv::Mat image_copy;
cvtColor(image, image_copy, CV_BGRA2BGR);
// invert image
//bitwise_not(image_copy, image_copy);
//cvtColor(image_copy, image, CV_BGR2BGRA);
}
#endif
The images that I would like to detect are 2-5kb small. Few got text on them but others are just signs. Here a example:
Do you guys know how I can do that?
There are several things in here. I will break down your problem and point you towards some possible solutions.
Classification: Your main task consists on determining if a certain image belongs to a class. This problem by itself can be decomposed in several problems:
Feature Representation You need to decide how you are gonna model your feature, i.e. how are you going to represent each image in a feature space so you can train a classifier to separate those classes. The feature representation by itself is already a big design decision. One could (i) calculate the histogram of the images using n bins and train a classifier or (ii) you could choose a sequence of random patches comparison such as in a random forest. However, after the training, you need to evaluate the performance of your algorithm to see how good your decision was.
There is a known problem called overfitting, which is when you learn too well that you can not generalize your classifier. This can usually be avoided with cross-validation. If you are not familiar with the concept of false positive or false negative, take a look in this article.
Once you define your feature space, you need to choose an algorithm to train that data and this might be considered as your biggest decision. There are several algorithms coming out every day. To name a few of the classical ones: Naive Bayes, SVM, Random Forests, and more recently the community has obtained great results using Deep learning. Each one of those have their own specific usage (e.g. SVM ares great for binary classification) and you need to be familiar with the problem. You can start with simple assumptions such as independence between random variables and train a Naive Bayes classifier to try to separate your images.
Patches: Now you mentioned that you would like to recognize the images on your webcam. If you are going to print the images and display in a video, you need to handle several things. it is necessary to define patches on your big image (input from the webcam) in which you build a feature representation for each patch and classify in the same way you did in the previous step. For doing that, you could slide a window and classify all the patches to see if they belong to the negative class or to one of the positive ones. There are other alternatives.
Scale: Considering that you are able to detect the location of images in the big image and classify it, the next step is to relax the toy assumption of fixes scale. To handle a multiscale approach, you could image pyramid which pretty much allows you to perform the detection in multiresolution. Alternative approaches could consider keypoint detectors, such as SIFT and SURF. Inside SIFT, there is an image pyramid which allows the invariance.
Projection So far we assumed that you had images under orthographic projection, but most likely you will have slight perspective projections which will make the whole previous assumption fail. One naive solution for that would be for instance detect the corners of the white background of your image and rectify the image before building the feature vector for classification. If you used SIFT or SURF, you could design a way of avoiding explicitly handling that. Nevertheless, if your input is gonna be just squares patches, such as in ARToolkit, I would go for manual rectification.
I hope I might have given you a better picture of your problem.
I would recommend using SURF for that, because pictures can be on different distances form your camera, i.e changing the scale. I had one similar experiment and SURF worked just as expected. But SURF has very difficult adjustment (and expensive operations), you should try different setups before you get the needed results.
Here is a link: http://docs.opencv.org/modules/nonfree/doc/feature_detection.html
youtube video (in C#, but can give an idea): http://www.youtube.com/watch?v=zjxWpKCQqJc
I might not be qualified enough to answer this problem. Last time I seriously use OpenCV it was still 1.1. But just some thought on it, and hope it would help (currently I am interested in DIP and ML).
I think it will probably an easier task if you only need to classify an image, if the image is just one from (or very similar to) your 500 images. For this you could use SVM or some neural network (Felix already gave an excellent enumeration on that).
However, your problem seems to be that you need to first find this candidate image in your webcam, the location of which you have little clue beforehand. (let us know whether it is so. I think it is important.)
If so, the harder problem is the detection/localization of your candidate image.
I don't have a general solution for that. The first thing I would do is to see if there is some common feature in your 500 images (e.g., whether all of them enclosed by a red circle, or, half of them have circle and half of them have rectangle). If this can be done, the problem will be simpler (it would be similar to face detection problem, which have good solution).
In other words, this means that you first classify the 500 images to a few groups with common feature (by human), and detect the group first, then scale and use above mentioned technique to classify them into fine result. In this way, it will be more computationally acceptable than trying to detect 500 images one by one.
BTW, this ppt would help to give a visual clue of what is going on for feature extraction and image matching http://courses.cs.washington.edu/courses/cse455/09wi/Lects/lect6.pdf.
Detect vs recognize: detecting the image is just finding it on the background and from your comments I realized you may have your sings surrounded by the background. It might facilitate your algorithm if you can somehow crop your signs from the background (detect) before trying to recognize them. Recognizing is a next stage that presumes you can classify the cropped image correctly as the one seen before.
If you need real time speed and scale/rotation invariance neither SIFT no SURF will do this fast. Nowadays you can do much better if you shift the burden of image processing to a learning stage as was done by Lepitit. In short, he subjected each pattern to a bunch of affine transformations and trained a binary classification tree to recognize each point correctly by doing a lot of binary comparison tests. Trees are extremely fast and a way to go not to mention that most of the processing is done offline. This method is also more robust to off-plane rotations than SIFT or SURF. You will also learn about tree classification which may facilitate you last processing stage.
Finally a recognition stage is based not only on the number of matches but also on their geometric consistency. Since your signs look flat I suggest finding either affine or homography transformation that has most inliers when calculated between matched points.
Looking at your code though I realized that you may not follow any of these recommendations. It may be a good starting point for you to read about decision trees and then play with some sample code (see mushroom.cpp in the above mentioned link)

Target Detection - Algorithm suggestions

I am trying to do image detection in C++. I have two images:
Image Scene: 1024x786
Person: 36x49
And I need to identify this particular person from the scene. I've tried to use Correlation but the image is too noisy and therefore doesn't give correct/accurate results.
I've been thinking/researching methods that would best solve this task and these seem the most logical:
Gaussian filters
Convolution
FFT
Basically, I would like to move the noise around the images, so then I can use Correlation to find the person more effectively.
I understand that an FFT will be hard to implement and/or may be slow especially with the size of the image I'm using.
Could anyone offer any pointers to solving this? What would the best technique/algorithm be?
In Andrew Ng's Machine Learning class we did this exact problem using neural networks and a sliding window:
train a neural network to recognize the particular feature you're looking for using data with tags for what the images are, using a 36x49 window (or whatever other size you want).
for recognizing a new image, take the 36x49 rectangle and slide it across the image, testing at each location. When you move to a new location, move the window right by a certain number of pixels, call it the jump_size (say 5 pixels). When you reach the right-hand side of the image, go back to 0 and increment the y of your window by jump_size.
Neural networks are good for this because the noise isn't a huge issue: you don't need to remove it. It's also good because it can recognize images similar to ones it has seen before, but are slightly different (the face is at a different angle, the lighting is slightly different, etc.).
Of course, the downside is that you need the training data to do it. If you don't have a set of pre-tagged images then you might be out of luck - although if you have a Facebook account you can probably write a script to pull all of yours and your friends' tagged photos and use that.
A FFT does only make sense when you already have sort the image with kd-tree or a hierarchical tree. I would suggest to map the image 2d rgb values to a 1d curve and reducing some complexity before a frequency analysis.
I do not have an exact algorithm to propose because I have found that target detection method depend greatly on the specific situation. Instead, I have some tips and advices. Here is what I would suggest: find a specific characteristic of your target and design your code around it.
For example, if you have access to the color image, use the fact that Wally doesn't have much green and blue color. Subtract the average of blue and green from the red image, you'll have a much better starting point. (Apply the same operation on both the image and the target.) This will not work, though, if the noise is color-dependent (ie: is different on each color).
You could then use correlation on the transformed images with better result. The negative point of correlation is that it will work only with an exact cut-out of the first image... Not very useful if you need to find the target to help you find the target! Instead, I suppose that an averaged version of your target (a combination of many Wally pictures) would work up to some point.
My final advice: In my personal experience of working with noisy images, spectral analysis is usually a good thing because the noise tend to contaminate only one particular scale (which would hopefully be a different scale than Wally's!) In addition, correlation is mathematically equivalent to comparing the spectral characteristic of your image and the target.

Detecting curves in OpenCV

I am just starting to use OpenCV to detect specific curves in an image. First, I want to verify if there is a curve, and next, I would like to identify the type of curve according to vertical or horizontal convex or concave curve. Is there an available function in OpenCV? If not, can you give me some ideas about how can I possibly write such a function? Thanks! By the way, I'm using C++.
Template matching is not a robust way to solve this problem (its like looking at an object from a small pinhole) and edge detectors don't necessarily return you the true edges in the image; false edges such as those due to shadows are returned too. Further, you have to deal with the problem of incomplete edges and other problems that scales up with the complexity of the scene in your image.
The problem you posed, in general, is a very challenging one and, except for toy examples, there are no good solutions.
A rough attempt could be to first try to detect plausible edges using an edge detector (e.g. the canny edge detector suggested). Next, use RANSAC to try to fit a subset of the points in the detected edges to your curve model.
For e.g. let's say you are trying to detect a curve of the following form f(x) = ax^2 + bx + c. RANSAC will basically try to find from among the points in the detected edges, a subset of them that would best fit this curve model. To detect different curves, change f(x) accordingly and run RANSAC for each of them. You can then try to determine if the curve represented by f(x) really exists in your image using some heuristic applied to from the points that were assigned to it by RANSAC (e.g. if too few points were fitted to the model it is likely that the curve is not there. But how to determine a good threshold for the number of points?). You model will get more complex when you have to account for allowable transformation such as rotation etc.
The problem with this approach is you are basically trying fit what you think should be in the image to the points and sometimes, even if what you are looking for is not there, it will return you the "best possible" fit. For e.g. you have a whole bunch of points detected from a concentric circle. If you try to detect straight lines from these points, RANSAC will return you the best fit line! In fact, it could give you many different lines from different runs depending on which points it selected during its random initialization stage.
For more details on how to use RANSAC on this sort of problem, have a look at RANSAC for Dummies by Marco Zuliani. He also has a nice MATLAB toolbox to accompany this tech report, which you can probably port to the language of your choice.
Unless you know what you background looks like, or if you are in control of it e.g. by forcing a clean background, this is a very difficult problem to solve.