Optical flow vs keypoint matching: what are the differences? - computer-vision

I've spent a few months studying and experimenting with keypoint detection, description and matching. Lately, I've also been looking into the concepts behind augmented reality, specifically "markerless" recognition and pose estimation.
Luckily, I've found that the previous concepts are still widely used in this setting. A common pipeline for creating a basic augmented reality is the following, without going into detail about each algorithm involved:
While capturing a video, at every frame...
Get some keypoints and create their descriptors
Find some matches between these points and the ones inside a previously saved "marker" (like a photo)
If there are enough matches, estimate the pose of the visible object and play with it
That is, a very simplified procedure used, for example, by this student(?) project.
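For concreteness, here is a minimal sketch of that pipeline using OpenCV (my own illustration, not code from the linked project; the choice of ORB, the file names and the thresholds are all assumptions):

#include <opencv2/opencv.hpp>
#include <vector>

int main()
{
    cv::Mat marker = cv::imread("marker.png", cv::IMREAD_GRAYSCALE);
    cv::Mat frame  = cv::imread("frame.png",  cv::IMREAD_GRAYSCALE);

    cv::Ptr<cv::ORB> orb = cv::ORB::create(1000);
    std::vector<cv::KeyPoint> kpM, kpF;
    cv::Mat descM, descF;
    orb->detectAndCompute(marker, cv::noArray(), kpM, descM);     // step 1: keypoints + descriptors
    orb->detectAndCompute(frame,  cv::noArray(), kpF, descF);

    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true); // step 2: match against the marker
    std::vector<cv::DMatch> matches;
    matcher.match(descM, descF, matches);

    if (matches.size() >= 10) {                                   // step 3: enough matches -> pose
        std::vector<cv::Point2f> ptsM, ptsF;
        for (const cv::DMatch& m : matches) {
            ptsM.push_back(kpM[m.queryIdx].pt);
            ptsF.push_back(kpF[m.trainIdx].pt);
        }
        cv::Mat H = cv::findHomography(ptsM, ptsF, cv::RANSAC, 3.0);
        // With the camera intrinsics, H (or cv::solvePnP on the marker plane)
        // gives the marker pose used for rendering the augmentation.
    }
    return 0;
}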
Now the question: during my research, I've also come across another method called "optical flow". I'm still at the beginning of my study of it, but first I would like to know how different it is from the previous method. Specifically:
What are the main concepts behind it? Does it use a "subset" of the algorithms roughly described before?
What are the main differences in terms of computational cost, performance, stability and accuracy? (I know this may be too general a question)
Which of the two is more commonly used in commercial AR tools? (junaio, Layar, ...)
Thanks for your cooperation.

Optical flow (OF) is a method built around the so-called "brightness constancy assumption". You assume that pixels - more specifically, their intensities (up to some delta) - do not change, only shift. You then solve this equation:
I(x,y,t) = I(x+dx, y+dy, t+dt).
The first-order Taylor expansion is:
I(x+dx, y+dy, t+dt) = I(x,y,t) + I_x * dx + I_y * dy + I_t * dt.
Combining the two gives the optical flow constraint I_x * dx + I_y * dy + I_t * dt = 0, which you solve for dx and dy - the shift of every pixel (with an extra assumption, e.g. local constancy as in Lucas-Kanade, since a single equation has two unknowns).
Optical flow is mainly used for tracking and odometry.
upd.: if applied not to the whole image but to a patch, optical flow is almost the same as the Kanade-Lucas-Tomasi (KLT) tracker.
The difference between this method and feature-based methods is density. With feature points you only get the displacement of the feature points themselves, while (dense) optical flow estimates it for the whole image.
The drawback is that vanilla OF works only for small displacements. To handle larger ones, one can downscale the image, calculate OF on it, and propagate the result back up - the "coarse-to-fine" method.
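As an illustration (not part of the original answer), here is a minimal OpenCV sketch of sparse pyramidal Lucas-Kanade tracking; the image pyramid is exactly the coarse-to-fine idea, and the camera index and parameters below are assumptions:

#include <opencv2/opencv.hpp>
#include <vector>

int main()
{
    cv::VideoCapture cap(0);                       // default camera, an assumption
    cv::Mat frame, gray, prevGray;
    std::vector<cv::Point2f> prevPts, nextPts;

    cap >> frame;
    cv::cvtColor(frame, prevGray, cv::COLOR_BGR2GRAY);
    // Pick corners to track; a dense method (e.g. cv::calcOpticalFlowFarneback)
    // would instead give a displacement for every pixel.
    cv::goodFeaturesToTrack(prevGray, prevPts, 500, 0.01, 10);

    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        std::vector<unsigned char> status;
        std::vector<float> err;
        // maxLevel=3 builds a 4-level pyramid, so displacements roughly 2^3 times
        // larger than the base window handles can still be tracked (coarse-to-fine).
        cv::calcOpticalFlowPyrLK(prevGray, gray, prevPts, nextPts, status, err,
                                 cv::Size(21, 21), 3);
        // ... use the (prevPts[i], nextPts[i]) pairs where status[i] == 1 ...
        prevGray = gray.clone();
        prevPts = nextPts;                         // re-detect corners periodically in practice
    }
    return 0;
}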
One can change the "brightness constancy assumption" to, e.g., a "descriptor constancy assumption" and solve the same equation with descriptor values instead of raw intensities. SIFT Flow is an example of this.
Unfortunately, I don't know much about commercial augmented reality solutions and cannot answer the last question.

Optical flow computation is slow, whereas keypoint detection has become very fast in recent years. Commercial augmented reality solutions require real-time performance, so it is hard to apply optical-flow-based techniques (unless you use a good GPU). AR systems mostly use feature-based techniques. Most of the time their aim is to recover the 3D geometry of the scene, which can be robustly estimated from a set of points. Other differences have been mentioned by old-ufo.

Related

Averaging SIFT features to do pose estimation

I have created a point cloud of an irregular (non-planar) complex object using SfM. Each one of those 3D points was viewed in more than one image, so it has multiple (SIFT) features associated with it.
Now, I want to solve for the pose of this object in a new, different set of images using a PnP algorithm matching the features detected in the new images with the features associated with the 3D points in the point cloud.
So my question is: which descriptor do I associate with the 3D point to get the best results?
So far I've come up with a number of possible solutions...
Average all of the descriptors associated with the 3D point (taken from the SfM pipeline) and use that "mean descriptor" to do the matching in PnP. This approach seems a bit far-fetched to me - I don't know enough about feature descriptors (specifically SIFT) to comment on the merits and drawbacks of this approach.
"Pin" all of the descriptors calculated during the SfM pipeline to their associated 3D point. During PnP, you would essentially have duplicate points to match with (one duplicate for each descriptor). This is obviously intensive.
Find the "central" viewpoint that the feature appears in (from the SfM pipeline) and use the descriptor from this view for PnP matching. So if the feature appears in images taken at -30, 10, and 40 degrees (from the surface normal), use the descriptor from the 10-degree image. This, to me, seems like the most promising solution.
Is there a standard way of doing this? I haven't been able to find any research or advice online regarding this question, so I'm really just curious if there is a best solution, or if it is dependent on the object/situation.
The descriptors that are used for matching in most SLAM or SFM systems are rotation and scale invariant (and, to some extent, robust to intensity changes). That is why we are able to match them from different viewpoints in the first place. So, in general it doesn't make much sense to try to use them all, average them, or use the ones from a particular image. If the matching in your SFM was done correctly, the descriptors of the reprojection of a 3d point from your point cloud in any of its observations should be very close, so you can use any of them [1].
Also, it seems to me that you are trying to directly match the 2d points to the 3d points. From a computational point of view, I think this is not a very good idea, because by matching 2d points with 3d ones, you lose the spatial information of the images and have to search for matches in a brute force manner. This in turn can introduce noise. But, if you do your matching from image to image and then propagate the results to the 3d points, you will be able to enforce priors (if you roughly know where you are, e.g. from an IMU, or if you know that your images are close), you can determine the neighborhood where you look for matches in your images, etc. Additionally, once you have computed your pose and refined it, you will need to add more points, no? How will you do it if you haven't done any 2d/2d matching, but just 2d/3d matching?
Now, the way to implement that usually depends on your application (how much covisibility or baseline you have between the poses from your SFM, etc). As an example, let's call your candidate image I_0, and the images from your SFM I_1, ..., I_n. First, match between I_0 and I_1. Now, assume q_0 is a 2d point from I_0 that has successfully been matched to q_1 from I_1, which corresponds to some 3d point Q. Now, to ensure consistency, consider the reprojection of Q in I_2, and call it q_2. Match I_0 and I_2. Does the point to which q_0 is matched in I_2 fall close to q_2? If yes, keep the 2d/3d match between q_0 and Q, and so on.
I don't have enough information about your data and your application, but I think that depending on your constraints (real-time or not, etc), you could come up with some variation of the above. The key idea anyway is, as I said previously, to try to match from frame to frame and then propagate to the 3d case.
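To make this concrete, here is a rough sketch (my own simplification, not the answerer's code) of "match 2d/2d against one reference view, propagate to the 3d points, then run robust PnP". The SfmImage structure and all names are hypothetical placeholders for whatever your SfM pipeline stores:

#include <opencv2/opencv.hpp>
#include <map>
#include <vector>

struct SfmImage {
    std::vector<cv::KeyPoint> kps;      // keypoints detected during SfM
    cv::Mat descriptors;                // one row per keypoint
    std::map<int, int> kpToPoint3d;     // keypoint index -> index into the cloud
};

// Estimate the pose of a query view against the SfM cloud by matching 2d/2d
// against one reference view and propagating the matches to the 3d points.
bool estimatePose(const std::vector<cv::KeyPoint>& queryKps,
                  const cv::Mat& queryDesc,
                  const SfmImage& ref,
                  const std::vector<cv::Point3f>& cloud,
                  const cv::Mat& K, const cv::Mat& dist,
                  cv::Mat& rvec, cv::Mat& tvec)
{
    cv::BFMatcher matcher(cv::NORM_L2);  // NORM_HAMMING for binary descriptors
    std::vector<cv::DMatch> matches;
    matcher.match(queryDesc, ref.descriptors, matches);

    std::vector<cv::Point2f> imagePts;
    std::vector<cv::Point3f> objectPts;
    for (const cv::DMatch& m : matches) {
        auto it = ref.kpToPoint3d.find(m.trainIdx);
        if (it == ref.kpToPoint3d.end()) continue;   // keypoint has no 3d point
        imagePts.push_back(queryKps[m.queryIdx].pt);
        objectPts.push_back(cloud[it->second]);
    }
    if (objectPts.size() < 6) return false;

    // RANSAC plays the role of a consistency check here; a second reference
    // view I_2 could additionally be used to pre-filter matches as described above.
    std::vector<int> inliers;
    return cv::solvePnPRansac(objectPts, imagePts, K, dist, rvec, tvec,
                              false, 100, 4.0, 0.99, inliers);
}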
Edit: Thank you for your clarifications in the comments. Here are a few thoughts (feel free to correct me):
Let's consider a SIFT descriptor s_0 from I_0, and let's call F(s_1,...,s_n) your aggregated descriptor (which could be an average or a concatenation of the SIFT descriptors s_i in their corresponding I_i, etc). Then, when matching s_0 with F, you will only want to use the subset of the s_i that belong to images with viewpoints close to I_0 (because of the 30deg problem that you mention, although I think it should be 50deg). That means you have to attribute a weight to each s_i that depends on the pose of your query I_0. You obviously can't do that when constructing F, so you have to do it when matching. However, you don't have a strong prior on the pose (otherwise, I assume you wouldn't be needing PnP). As a result, you can't really determine this weight. Therefore I think there are two conclusions/options here:
SIFT descriptors are not adapted to the task. You can try coming up with a perspective-invariant descriptor. There is some literature on the subject.
Try to keep some visual information in the form of "Key-frames", as in many SLAM systems. It wouldn't make sense to keep all of your images anyway, just keep a few that are well distributed (pose-wise) in each area, and use those to propagate 2d matches to the 3d case.
If you only match between the 2d point of your query and 3d descriptors without any form of consistency check (as the one I proposed earlier), you will introduce a lot of noise...
tl;dr I would keep some images.
[1] Since you say that you obtain your 3d reconstruction from an SFM pipeline, some of the observations are probably considered inliers and some outliers (indicated by a boolean flag). If they are outliers, just ignore them; if they are inliers, then they are the result of matching and triangulation, and their position has been refined multiple times, so you can trust any of their descriptors.

OpenCV detect image against an image set

I would like to know how I can use OpenCV to detect an image in my video camera feed. The image can be one of 500 images.
What I'm doing at the moment:
- (void)viewDidLoad
{
    [super viewDidLoad];
    // Do any additional setup after loading the view.
    self.videoCamera = [[CvVideoCamera alloc] initWithParentView:imageView];
    self.videoCamera.delegate = self;
    self.videoCamera.defaultAVCaptureDevicePosition = AVCaptureDevicePositionBack;
    self.videoCamera.defaultAVCaptureSessionPreset = AVCaptureSessionPresetHigh;
    self.videoCamera.defaultAVCaptureVideoOrientation = AVCaptureVideoOrientationPortrait;
    self.videoCamera.defaultFPS = 30;
    self.videoCamera.grayscaleMode = NO;
}

- (void)viewDidAppear:(BOOL)animated
{
    [super viewDidAppear:animated];
    [self.videoCamera start];
}

#pragma mark - Protocol CvVideoCameraDelegate

#ifdef __cplusplus
- (void)processImage:(cv::Mat&)image
{
    // Do some OpenCV stuff with the image
    cv::Mat image_copy;
    cvtColor(image, image_copy, CV_BGRA2BGR);

    // invert image
    //bitwise_not(image_copy, image_copy);
    //cvtColor(image_copy, image, CV_BGR2BGRA);
}
#endif
The images that I would like to detect are small, 2-5 KB. A few have text on them, but the others are just signs. Here is an example:
Do you guys know how I can do that?
There are several things in here. I will break down your problem and point you towards some possible solutions.
Classification: Your main task consists of determining whether a certain image belongs to a class. This problem by itself can be decomposed into several sub-problems:
Feature representation: You need to decide how you are going to model your features, i.e. how you are going to represent each image in a feature space so you can train a classifier to separate the classes. The feature representation by itself is already a big design decision. You could (i) compute a histogram of each image using n bins and train a classifier on it, or (ii) use a sequence of random patch comparisons, as in a random forest. Either way, after training you need to evaluate the performance of your algorithm to see how good your decision was.
There is a well-known problem called overfitting, which is when the classifier fits the training data so closely that it cannot generalize. This can usually be avoided with cross-validation. If you are not familiar with the concepts of false positives and false negatives, take a look at this article.
Once you define your feature space, you need to choose an algorithm to train on that data, and this might be considered your biggest decision. New algorithms come out all the time. To name a few classical ones: Naive Bayes, SVMs, and Random Forests; more recently the community has obtained great results with deep learning. Each of those has its own typical usage (e.g. SVMs are great for binary classification), and you need to be familiar with your problem. You can start with simple assumptions, such as independence between random variables, and train a Naive Bayes classifier to try to separate your images.
Patches: Now, you mentioned that you would like to recognize the images on your webcam. If you are going to print the images and show them to the camera, you need to handle several things. It is necessary to define patches on your big image (the input from the webcam), build a feature representation for each patch, and classify it in the same way you did in the previous step. To do that, you could slide a window and classify all the patches to see whether they belong to the negative class or to one of the positive ones. There are other alternatives.
Scale: Assuming you are able to detect the location of the images in the big image and classify them, the next step is to relax the toy assumption of fixed scale. To handle multiple scales, you could use an image pyramid, which pretty much allows you to perform the detection at multiple resolutions. Alternative approaches use keypoint detectors such as SIFT and SURF; inside SIFT there is an image pyramid which provides the scale invariance.
Projection: So far we assumed that your images are seen under orthographic projection, but most likely you will have slight perspective distortion, which breaks the previous assumptions. One naive solution would be, for instance, to detect the corners of the white background of your image and rectify the image before building the feature vector for classification. If you use SIFT or SURF, you can avoid handling that explicitly. Nevertheless, if your input is going to be just square patches, as in ARToolKit, I would go for manual rectification.
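As a concrete alternative sketch (my addition, not part of this answer): instead of a sliding-window classifier, you can match keypoints of each camera frame against descriptors precomputed for the reference signs, which sidesteps most of the scale and projection issues above. This assumes OpenCV 3.x or newer; file names and thresholds are placeholders:

#include <opencv2/opencv.hpp>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    cv::Ptr<cv::ORB> orb = cv::ORB::create(1000);
    cv::BFMatcher matcher(cv::NORM_HAMMING);

    // Precompute descriptors for every reference image once, offline.
    std::vector<std::string> refNames = {"sign_one.png", "sign_two.png"};
    std::vector<cv::Mat> refDescs;
    for (const std::string& name : refNames) {
        cv::Mat ref = cv::imread(name, cv::IMREAD_GRAYSCALE);
        std::vector<cv::KeyPoint> kps;
        cv::Mat desc;
        orb->detectAndCompute(ref, cv::noArray(), kps, desc);
        refDescs.push_back(desc);
    }

    cv::Mat frame = cv::imread("camera_frame.png", cv::IMREAD_GRAYSCALE);
    std::vector<cv::KeyPoint> frameKps;
    cv::Mat frameDesc;
    orb->detectAndCompute(frame, cv::noArray(), frameKps, frameDesc);

    // Score each reference by the number of "good" matches (Lowe ratio test).
    int best = -1, bestScore = 0;
    for (size_t i = 0; i < refDescs.size(); ++i) {
        std::vector<std::vector<cv::DMatch>> knn;
        matcher.knnMatch(refDescs[i], frameDesc, knn, 2);
        int good = 0;
        for (const auto& m : knn)
            if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance) ++good;
        if (good > bestScore) { bestScore = good; best = static_cast<int>(i); }
    }
    if (best >= 0 && bestScore > 15)
        std::cout << "Detected " << refNames[best] << std::endl;
    return 0;
}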
I hope I might have given you a better picture of your problem.
I would recommend using SURF for that, because the pictures can be at different distances from your camera, i.e. the scale changes. I did a similar experiment and SURF worked as expected. But SURF is tricky to tune (and its operations are expensive), so you should try different setups before you get the results you need.
Here is a link: http://docs.opencv.org/modules/nonfree/doc/feature_detection.html
youtube video (in C#, but can give an idea): http://www.youtube.com/watch?v=zjxWpKCQqJc
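For illustration only (my addition), the SURF parameters this answer refers to look like the following; SURF lives in the opencv_contrib xfeatures2d module, so its availability is an assumption about your build:

#include <opencv2/opencv.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

int main()
{
    cv::Mat img = cv::imread("sign.png", cv::IMREAD_GRAYSCALE);
    // hessianThreshold trades keypoint count for speed; nOctaves/nOctaveLayers
    // control the scale range; upright=true skips rotation invariance (faster).
    cv::Ptr<cv::xfeatures2d::SURF> surf =
        cv::xfeatures2d::SURF::create(/*hessianThreshold=*/400,
                                      /*nOctaves=*/4, /*nOctaveLayers=*/3,
                                      /*extended=*/false, /*upright=*/false);
    std::vector<cv::KeyPoint> kps;
    cv::Mat desc;
    surf->detectAndCompute(img, cv::noArray(), kps, desc);
    return 0;
}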
I might not be qualified enough to answer this problem; the last time I seriously used OpenCV it was still 1.1. But here are some thoughts, which I hope will help (currently I am interested in DIP and ML).
I think it will probably be an easier task if you only need to classify an image, i.e. if the image is one of (or very similar to) your 500 images. For this you could use an SVM or some neural network (Felix already gave an excellent overview of that).
However, your problem seems to be that you first need to find this candidate image in your webcam feed, and you have little clue about its location beforehand. (Let us know whether that is the case; I think it is important.)
If so, the harder problem is the detection/localization of your candidate image.
I don't have a general solution for that. The first thing I would do is to see whether there is some common feature in your 500 images (e.g., whether all of them are enclosed by a red circle, or half of them contain a circle and half a rectangle). If so, the problem becomes simpler (it would be similar to the face detection problem, which has good solutions).
In other words, you would first group the 500 images into a few classes that share a common feature (done by hand), detect the group first, then rescale and use the techniques mentioned above to classify the image into the fine-grained result. This will be much more computationally acceptable than trying to detect 500 images one by one.
BTW, this ppt gives a visual idea of what goes on in feature extraction and image matching: http://courses.cs.washington.edu/courses/cse455/09wi/Lects/lect6.pdf.
Detect vs. recognize: detecting the image is just finding it against the background, and from your comments I gather your signs are surrounded by background. It would help your algorithm if you could somehow crop the signs out of the background (detection) before trying to recognize them. Recognition is the next stage, which presumes you can correctly classify the cropped image as one seen before.
If you need real-time speed plus scale/rotation invariance, neither SIFT nor SURF will be fast enough. Nowadays you can do much better if you shift the burden of image processing to a learning stage, as was done by Lepetit. In short, he subjected each pattern to a bunch of affine transformations and trained binary classification trees to recognize each point correctly via many binary comparison tests. Trees are extremely fast and the way to go, not to mention that most of the processing is done offline. The method is also more robust to off-plane rotations than SIFT or SURF. You will also learn about tree classification, which may help with your last processing stage.
Finally, the recognition stage is based not only on the number of matches but also on their geometric consistency. Since your signs look flat, I suggest finding the affine or homography transformation that has the most inliers when computed between matched points.
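A small sketch of that consistency check (my own code, hedged): after descriptor matching, fit a homography with RANSAC and accept the recognition only if enough matches agree with it:

#include <opencv2/opencv.hpp>
#include <vector>

bool geometricallyConsistent(const std::vector<cv::Point2f>& refPts,
                             const std::vector<cv::Point2f>& framePts,
                             int minInliers = 12)
{
    if (refPts.size() < 4) return false;          // a homography needs 4+ correspondences
    std::vector<unsigned char> mask;
    cv::Mat H = cv::findHomography(refPts, framePts, cv::RANSAC, 3.0, mask);
    if (H.empty()) return false;
    int inliers = cv::countNonZero(mask);         // matches consistent with one plane
    return inliers >= minInliers;
}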
Looking at your code, though, I realize you may not be following any of these recommendations yet. A good starting point would be to read about decision trees and then play with some sample code (see mushroom.cpp in the above-mentioned link).

What's the math behind such animation trajectories? - c++

What's the math behind something like this? C++ perspective.
More examples on this MSDN page here.
UPDATE: I was asked for a more concrete question. What's the math/animation theory behind Penner's tweens? How do you come up with those formulas? What math principles are they based on?
Me and math, we are not BFFs! I'm working on a multi-FLOAT value animator for a UI thing I'm writing, and I was wondering what the math for generating such a trajectory looks like from a native C++ programmer's point of view.
I've Googled and found code, but I'm also looking for a bit of theory from a programming perspective... not just code or pure math. I can whip together the code I need from what I found online, but I'd like to understand it in the process. Like this site, which lets one experiment with an easing function generator.
I could also use the Windows Animation Manager (and I might if things get bloody), but that operates on a single float. And just animating RGB requires animating each FLOAT by itself. It leads to huge code-bloat... very bad.
If anyone has some hints, I would very much appreciate it. I'm looking mainly for theory from a programming perspective. The end goal is to write a bunch of different animation algorithms that can animate a set of FLOATs from their initial values to their target values over a period of time, or at a given speed, and so on.
The plan is not just to get the code written, but also to understand what goes on behind it. And then maybe get creative with this animations... unless these prove to be some rigid standard math functions.
So think of the requirements for a tweening function.
The function should represent a valid smooth motion between two positions/states. For those who haven't read the relevant section of the book, this means that f should be a continuous and differentiable function such that f(0) == 0 and f(1) == 1; actual motions are constructed using this as a primitive.
"Ease" (in the animation tweening sense) means "derivative is zero"; this gives the effect that the motion starts and/or ends with zero velocity (i.e. a standstill). So "ease-in" means f'(0) == 0 and "ease-out" means f'(1) == 0.
Everything else is based on aesthetic considerations.
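A small sketch of those constraints in code (my own example, not from this answer): each easing function maps [0,1] to [0,1] with f(0)=0 and f(1)=1, the "eased" end(s) have zero derivative, and tweening a whole set of FLOATs is then just one interpolation per component:

#include <vector>

// f(0)=0, f(1)=1, f'(0)=0  -> ease-in
float easeInCubic(float t)  { return t * t * t; }
// f(0)=0, f(1)=1, f'(1)=0  -> ease-out
float easeOutCubic(float t) { float u = 1.0f - t; return 1.0f - u * u * u; }
// zero derivative at both ends -> ease-in-out ("smoothstep", 3t^2 - 2t^3)
float easeInOut(float t)    { return t * t * (3.0f - 2.0f * t); }

// Animate any number of floats with one shared easing curve; t runs from 0 to 1.
void tween(const std::vector<float>& from, const std::vector<float>& to,
           float t, float (*ease)(float), std::vector<float>& out)
{
    float e = ease(t);
    out.resize(from.size());
    for (size_t i = 0; i < from.size(); ++i)
        out[i] = from[i] + (to[i] - from[i]) * e;
}

Getting "creative" then just means picking a different f with the same boundary conditions (bounces, overshoots, etc.).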
Cubic curves (e.g. Bezier/Hermite splines) are popular partly because they let you control both the position and the tangent (speed) at both ends of the curve, but also because they are close to the natural shape that a flexible beam adopts if you constrain its position at a few points. The cubic shape minimises the internal stress of a flexed beam. (Unsurprisingly, such wooden beams are known to boat designers and other drafters as "splines", which is where we get the word.)
Historically, hand-drawn cartoon animators have always specified their tweens by feel, based on experience. Key animators draw a chart (called a "timing chart"; look it up on your favourite image search engine) on the side of their key drawings, which tells the inbetweeners how the intermediate cels should be timed.
Camera motion (pan, zoom, rotate), however, was a different matter. Layout/animation artists specified the start and the end of the motion (using a field chart), the number of frames over which the motion would happen, instructions on easing, and anything else the layout/animation team felt important (e.g. if you had to "linger").
The actual motions needed to be calculated; the audience would notice if one frame of a rotation was out even by a couple of hundredths of a degree. Doing these calculations was part of the job of the camera department.
There's a wonderful book called "Basic Animation Stand Techniques" by Brian Salt which dates from back in the days of physical animation cameras, and describes in some detail the sort of thing they had to do, and to what extent. I recommend it if you're at all interested in this stuff.
Math is math is math.
A good tutorial on Riemann Sum will demonstrate the concept.
In fundamental programming, you have a math equation that generates a Y value (height) for a given X (time). Periodically, like once a second for example, you plug in a new X (time) value and get the height back.
The more often you evaluate this function, the better the resolution (this is where the diagrams of a Riemann sum and calculus come in). The best you will get is an approximation to the curve that looks like stair steps.
In embedded systems, there are not a lot of resources for evaluating a function like this very frequently. The curve can instead be approximated with line segments; the more segments, the better the approximation. So one method is to break the curve into line segments and, for a given x, use the linear formula of the appropriate segment. Evaluating a line usually takes fewer resources than evaluating a higher-degree equation.
Such curves are often generated from the equations of physics, so not only do you need to brush up on math, you should also brush up on physics.
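A sketch of that piecewise-linear idea (my addition): sample the expensive curve once into a table, then evaluate it with cheap linear interpolation:

#include <vector>

struct CurveLUT {
    std::vector<float> samples;                  // y-values at evenly spaced x in [0,1]

    // Sample any callable f over [0,1] into n segments (n+1 samples).
    template <typename F>
    static CurveLUT build(F f, int n) {
        CurveLUT lut;
        lut.samples.resize(n + 1);
        for (int i = 0; i <= n; ++i)
            lut.samples[i] = f(static_cast<float>(i) / n);
        return lut;
    }

    // Evaluate at x in [0,1] by interpolating between the two nearest samples.
    float eval(float x) const {
        float pos = x * (samples.size() - 1);
        int i = static_cast<int>(pos);
        if (i >= static_cast<int>(samples.size()) - 1) return samples.back();
        float frac = pos - i;
        return samples[i] + (samples[i + 1] - samples[i]) * frac;
    }
};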
Otherwise you can search the web for libraries that handle trajectories.
As we traverse closer to the hardware, a timer can be used to call a method that evaluates the trajectory function for the given X. The timer helps produce a more accurate time value.
Search the web for "curve fit algorithm", "Bresenham algorithm", "graphics collision detection algorithms"

Calibrating camera with a glass-covered checkerboard

I need to find the intrinsic calibration parameters of a single camera. To do this I take several images of a checkerboard pattern from different angles and then use calibration software.
To make the calibration pattern as flat as possible, I print it on paper and cover it with 3 mm glass. Obviously, the image of the pattern is modified by the glass, because glass has a different refraction coefficient than air.
The extrinsic parameters will be distorted by the glass, because the checkerboard is not in the place we see it in. However, if the thickness of the glass and the refraction coefficients of glass and air are known, it seems possible to recover the extrinsic parameters.
So, the questions are:
Can extrinsic parameters be calculated, and if yes, then how? (This is not necessary right now, just an interesting theoretical question)
Are intrinsic calibration parameters obtained from these images equivalent to ones obtained from a usual calibration procedure (without cover glass)?
With the glass, the calibration parameters as reported by the GML Camera Calibration Toolbox (based on OpenCV) become much more accurate. (Does that make any sense at all?) But this approach has a small drawback - unwanted reflections, especially from light sources.
I commend you on choosing a very flat support (which is what I recommend myself here). But, forgive me for asking the obvious question, why did you cover the pattern with the glass?
Since the point of the exercise is to ensure the target's planarity and nothing else, you might as well glue the paper sheet to the support by the side opposite to the pattern and avoid all this trouble. Yes, in time the pattern will get dirty and worn and need replacement. So you just scrape it off and replace it: printing checkerboards is cheap.
If, for whatever reasons, you are stuck with the glass in the front, I recommend doing first a back-of-the-envelope calculation of the expected ray deflection due to the glass refraction, and check if it is actually measurable by your apparatus. Given the nominal focal length in mm of the lens you are using and the physical width and pixel density of the sensor, you can easily work it out at the image center, assuming an "extreme" angle of rotation of the target w.r.t the focal axis (say, 45 deg), and a nominal distance. To a first approximation, you may model the pattern as "painted" on the glass, and so ignore the first refraction and only consider the glass-to-air one.
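As an illustration of that back-of-the-envelope check (my own code; every constant below is an assumption to replace with your actual setup): the lateral shift of a ray through a plane-parallel plate is s = t * sin(theta) * (1 - cos(theta) / sqrt(n^2 - sin^2(theta))), and a rough image-space figure follows by scaling with the magnification f/Z and the pixel pitch:

#include <cmath>
#include <cstdio>

int main()
{
    const double PI     = 3.14159265358979323846;
    const double t_mm   = 3.0;                  // glass thickness
    const double n      = 1.5;                  // refractive index of glass (assumed)
    const double theta  = 45.0 * PI / 180.0;    // "extreme" target tilt w.r.t. the focal axis
    const double f_mm   = 8.0;                  // nominal focal length (assumed)
    const double Z_mm   = 500.0;                // camera-to-target distance (assumed)
    const double pix_mm = 0.0022;               // pixel pitch, ~2.2 um (assumed)

    // Plane-parallel-plate lateral displacement of the apparent pattern position.
    double s = std::sin(theta);
    double shift_mm = t_mm * s * (1.0 - std::cos(theta) / std::sqrt(n * n - s * s));

    // Rough image-space effect at the image center: scale by magnification f/Z,
    // then convert millimeters on the sensor to pixels.
    double shift_px = shift_mm * (f_mm / Z_mm) / pix_mm;
    std::printf("object-space shift %.3f mm -> about %.2f px on the sensor\n",
                shift_mm, shift_px);
    return 0;
}

With these (assumed) numbers the shift comes out around a millimeter in object space, i.e. several pixels, which is why the check is worth doing before deciding whether to model the glass.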
If the above calculation suggests that the effect is measurable (deflection >= 1 pixel), you will need to add the glass to your scene model and solve for its parameters in the bundle adjustment phase, along with the intrinsics and extrinsics. To begin with, I'd use two parameters, thickness and refraction coefficient, and assume both faces are really planar and parallel. It will just make the computation of the corner projections in the cost function a little more complicated, as you'll have to take the ray deflection into account.
Given the extra complexity of the cost function, I'd definitely write the model's code to use Automatic Differentiation (AD).
If you really want to go through this exercise, I'd recommend writing the solver on top of Google Ceres bundle adjuster, which supports AD, among many nice things.

Which deconvolution algorithm is best suited for removing motion blur from text?

I'm using OpenCV to process pictures taken with a mobile phone. The pictures contain text, and they have small amounts of motion blur, which I need to remove.
What would be the most viable algorithm to use? So far I have tested Lucy-Richardson and Wiener deconvolution, but they did not yield satisfactory results.
I agree with @TheJuice: your problem lies in the PSF estimation. Usually, to be able to do this from a single frame, several assumptions need to be made about the factors leading to the blur (motion of the object, type of motion of the sensor, etc.).
You can find some pointers, especially on the monodimensional case, here. They use a filtering method that keeps mostly the correlation coming from the blur, discarding the spatial correlation of the original image, and use this to deduce the motion direction and thence the PSF. For small blurs you might be able to consider the motion as constant; otherwise you will have to use a more complex accelerated motion model.
Unfortunately, mobile phone blur is often a compound of CCD integration and non-linear motion (translation perpendicular to the line of sight, yaw from wrist motion, and rotation around the wrist), so Yitzhaky and Kopeika's method will probably only yield acceptable results in a minority of cases. I know there are methods to deal with that ("depth awareness" and others) but I have never had occasion to deal with them.
You can preview the results using photo recovery software such as Focus Magic; while they do not employ YK estimator (motion description is left to you), the remaining workflow is necessarily very similar. If your pictures are amenable to Focus Magic recovery, then probably YK method will work. If they are not (or not enough, or not enough of them to be worthwhile), then there's no point even trying to implement it.
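For reference, here is a hedged sketch of Wiener deconvolution with an assumed linear-motion PSF in OpenCV (my own code; the PSF length and angle are hard-coded guesses that you would normally obtain from an estimator such as Yitzhaky-Kopeika, and the file names are placeholders):

#include <opencv2/opencv.hpp>
#include <cmath>

// Build a normalized linear-motion PSF of the given length (pixels) and angle (degrees),
// drawn as a centered line segment on a full-size float image.
static cv::Mat motionPSF(cv::Size size, int length, double angleDeg)
{
    cv::Mat psf = cv::Mat::zeros(size, CV_32F);
    cv::Point c(size.width / 2, size.height / 2);
    double a = angleDeg * CV_PI / 180.0;
    cv::Point p1(cvRound(c.x - 0.5 * length * std::cos(a)), cvRound(c.y - 0.5 * length * std::sin(a)));
    cv::Point p2(cvRound(c.x + 0.5 * length * std::cos(a)), cvRound(c.y + 0.5 * length * std::sin(a)));
    cv::line(psf, p1, p2, cv::Scalar(1.0), 1, cv::LINE_AA);
    return psf / cv::sum(psf)[0];
}

// Wiener filter in the frequency domain: F = G * conj(H) / (|H|^2 + K).
static cv::Mat wienerDeconv(const cv::Mat& blurred, const cv::Mat& psf, double K)
{
    cv::Mat G, H;
    cv::dft(blurred, G, cv::DFT_COMPLEX_OUTPUT);
    cv::dft(psf, H, cv::DFT_COMPLEX_OUTPUT);

    cv::Mat planes[2];
    cv::split(H, planes);
    cv::Mat denom = planes[0].mul(planes[0]) + planes[1].mul(planes[1]) + K;  // |H|^2 + K

    cv::Mat numer;
    cv::mulSpectrums(G, H, numer, 0, true);   // G * conj(H)
    cv::split(numer, planes);
    planes[0] /= denom;
    planes[1] /= denom;
    cv::merge(planes, 2, numer);

    cv::Mat result;
    cv::idft(numer, result, cv::DFT_SCALE | cv::DFT_REAL_OUTPUT);
    return result;
}

int main()
{
    cv::Mat img = cv::imread("blurred_text.png", cv::IMREAD_GRAYSCALE);
    img.convertTo(img, CV_32F, 1.0 / 255.0);

    cv::Mat psf = motionPSF(img.size(), /*length=*/15, /*angleDeg=*/0.0);
    cv::Mat restored = wienerDeconv(img, psf, /*K=*/0.01);

    // Because the PSF was centered, the restored image is circularly shifted by half
    // the image size; for a quick visual check of the PSF guess this is acceptable.
    cv::normalize(restored, restored, 0, 1, cv::NORM_MINMAX);
    cv::Mat out8u;
    restored.convertTo(out8u, CV_8U, 255.0);
    cv::imwrite("restored.png", out8u);
    return 0;
}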
Motion blur is a difficult problem to overcome. The best results are gained when
The speed of the camera relative to the scene is known
You have many pictures of the blurred object which you can correlate.
You do have one major advantage in that you are looking at text (which normally constitutes high-contrast features). If you only apply deconvolution to high-contrast areas of your image (I know the theory often says to exclude high contrast), you should get results that may let you recognise characters better. A combination of sharpening/blurring filters as pre-/post-processing may also help.
I remember being impressed with this paper previously. Perhaps an adaptation of their implementation would be worth a go.
I think the estimation of your point-spread function is likely to be more important than the algorithm used. It depends on the kind of motion blur you're trying to remove; linear motion is likely to be the easiest, but it is unlikely to be the kind you're dealing with: I imagine it's non-linear blur caused by hand movement during the exposure.
You cannot eliminate motion blur. The information is lost forever. What you are dealing with is a CCD that is recording multiple real objects to a single pixel, smearing them together. In other words if the pixel reads 56, you cannot magically determine that the actual reading should have been 37 at time 1, and 62 at time 2, and 43 at time 3.
Another way to look at this: imagine you have 5 pictures. You then use photoshop to blend the pictures together, averaging the value of each pixel. Can you now somehow from the blended picture tell what the original 5 pictures were? No, you cannot, because you do not have the information to do that.