Optical Flow: What exactly is the temporal derivative? - computer-vision

I'm trying to understand what the meaning of a temporal derivative is in an image. While I understand the brightness constancy equation, I don't understand why taking the difference between two images gives me the temporal derivative.
Taking the difference between two frames gives me the difference in pixel intensity per pixel between the two, but how is that the same as asking how much the image changed over a certain span of time?

The temporal derivative dI/dt of the image I(x,y,t) is the rate of change of the image over time at a particular position. As you noted, this is the difference in pixel intensity between the two frames. Considering a single pixel at (x,y), the finite difference approximation to the derivative is
f_d = ( I(x,y,t+delta) - I(x,y,t) ) / delta so that f_d -> dI/dt as delta -> 0.
In this case delta is simply set to one. So we are approximating the image derivative (with respect to time) by the difference between adjacent frames.
One aspect that may be confusing is how that relates to the movement of objects in the image. If you have some physics background, for instance, you might think about the difference between Eulerian and Lagrangian frames of reference: in the more intuitive Lagrangian viewpoint, you consider an object moving by tracking it over the pixels (space) in which it moves, e.g. watching a cat as it hops over a fence. The Eulerian view, which is closer to what we do in optical flow, is to track what happens at a single pixel, and never take our eyes off of it. As the cat passes over that area of (pixel) space, the pixel's values will change, and then go back to "normal" when it's gone.
These two views are in some sense equivalent, but may be useful in difference situations. In computer vision, tracking an object is hard, while computing these Eulerian-like temporal derivatives is easy. Ideally, we could track the cat: consider a point p(t)=(x_p(t),y_p(t)) on say its head, then compute dp/dt and figure out p(t) for all t, and use that for downstream processing. Unfortunately, this is hard, so instead we hope that brightness constancy is usually locally true, and use the optical flow to estimate dp/dt. Of course, dI/dt often does not correspond well to dp/dt (this is why the brightness constancy is an assumption). For instance, consider a light moving around a stationary sphere: dI/dt will be large, but dp/dt will be zero.

The difference between subsequent frames is the finite difference approximation to the temporal derivative.
Proper units would be obtained if the value were divided by the time between frames (i.e. multiplied by the frames per second value).

Related

Linear Interpolation and Object Collision

I have a physics engine that uses AABB testing to detect object collisions and an animation system that does not use linear interpolation. Because of this, my collisions act erratically at times, especially at high speeds. Here is a glaringly obvious problem in my system...
For the sake of demonstration, assume a frame in our animation system lasts 1 second and we are given the following scenario at frame 0.
At frame 1, the collision of the objects will not bet detected, because c1 will have traveled past c2 on the next draw.
Although I'm not using it, I have a bit of a grasp on how linear interpolation works because I have used linear extrapolation in this project in a different context. I'm wondering if linear interpolation will solve the problems I'm experiencing, or if I will need other methods as well.
There is a part of me that is confused about how linear interpolation is used in the context of animation. The idea is that we can achieve smooth animation at low frame rates. In the above scenario, we cannot simply just set c1 to be centered at x=3 in frame 1. In reality, they would have collided somewhere between frame 0 and frame 1. Does linear interpolation automatically take care of this and allow for precise AABB testing? If not, what will it solve and what other methods should I look into to achieve smooth and precise collision detection and animation?
The phenomenon you are experiencing is called tunnelling, and is a problem inherent to discrete collision detection architectures. You are correct in feeling that linear interpolation may have something to do with the solution as it can allow you to, within a margin of error (usually), predict the path of an object between frames, but this is just one piece of a much larger solution. The terminology I've seen associated with these types of solutions is "Continuous Collision Detection". The topic is large and gets quite complex, and there are books that discuss it, such as Real Time Collision Detection and other online resources.
So to answer your question: no, linear interpolation on its own won't solve your problems*. Unless you're only dealing with circles or spheres.
What to Start Thinking About
The way the solutions look and behave are dependant on your design decisions and are generally large. So just to point in the direction of the solution, the fundamental idea of continuous collision detection is to figure out: How far between the early frame and the later frame does the collision happen, and in what position and rotation are the two objects at this point. Then you must calculate the configuration the objects will be in at the later frame time in response to this. Things get very interesting addressing these problems for anything other than circles in two dimensions.
I haven't implemented this but I've seen described a solution where you march the two candidates forward between the frames, advancing their position with linear interpolation and their orientation with spherical linear interpolation and checking with discrete algorithms whether they're intersecting (Gilbert-Johnson-Keerthi Algorithm). From here you continue to apply discrete algorithms to get the smallest penetration depth (Expanding Polytope Algorithm) and pass that and the remaining time between the frames, along to a solver to get how the objects look at your later frame time. This doesn't give an analytic answer but I don't have knowledge of an analytic answer for generalized 2 or 3D cases.
If you don't want to go down this path, your best weapon in the fight against complexity is assumptions: If you can assume your high velocity objects can be represented as a point things get easier, if you can assume the orientation of the objects doesn't matter (circles, spheres) things get easier, and it keeps going and going. The topic is beyond interesting and I'm still on the path of learning it, but it has provided some of the most satisfying moments in my programming period. I hope these ideas get you on that path as well.
Edit: Since you specified you're working on a billiard game.
First we'll check whether discrete or continuous is needed
Is any amount of tunnelling acceptable in this game? Not in billiards
no.
What is the speed at which we will see tunnelling? Using a 0.0285m
radius for the ball (standard American) and a 0.01s physics step, we
get 2.85m/s as the minimum speed that collisions start giving bad
response. I'm not familiar with the speed of billiard balls but that
number feels too low.
So just checking on every frame if two of the balls are intersecting is not enough, but we don't need to go completely continuous. If we use interpolation to subdivide each frame we can increase the velocity needed to create incorrect behaviour: With 2 subdivisions we get 5.7m/s, which is still low; 3 subdivisions gives us 8.55m/s, which seems reasonable; and 4 gives us 11.4m/s which feels higher than I imagine billiard balls are moving. So how do we accomplish this?
Discrete Collisions with Frame Subdivisions using Linear Interpolation
Using subdivisions is expensive so it's worth putting time into candidate detection to use it only where needed. This is another problem with a bunch of fun solutions, and unfortunately out of scope of the question.
So you have two candidate circles which will very probably collide between the current frame and the next frame. So in pseudo code the algorithm looks like:
dt = 0.01
subdivisions = 4
circle1.next_position = circle1.position + (circle1.velocity * dt)
circle2.next_position = circle2.position + (circle2.velocity * dt)
for i from 0 to subdivisions:
temp_c1.position = interpolate(circle1.position, circle1.next_position, (i + 1) / subdivisions)
temp_c2.position = interpolate(circle2.position, circle2.next_position, (i + 1) / subdivisions)
if intersecting(temp_c1, temp_c2):
intersection confirmed
no intersection
Where the interpolate signature is interpolate(start, end, alpha)
So here you have interpolation being used to "move" the circles along the path they would take between the current and the next frame. On a confirmed intersection you can get the penetration depth and pass the delta time (dt / subdivisions), the two circles, the penetration depth and the collision points along to a resolution step that determines how they should respond to the collision.

OpenCV - PCA analysis on BW image - area vs shape - is there already an implementation?

The shape of an object is detected on a bw image. The object is a black continuous shape, the background is white.
We use PCA (http://docs.opencv.org/3.1.0/d1/dee/tutorial_introduction_to_pca.html) to get the object direction and align the object. Currently the shape itself (the points on the contour) is the input to the opencv PCA implementation. This usually works very well. But from time to time there is small dirt on the object border, causing the shape to pass around the dirt. This causes more points and more weight on one side, slightly turning the object.
Idea: Instead of the contour, we use the area of the object as input for our PCA analysis. The issue there, to check all points on if they are inside the contour and then use them for PCA slows the application down. This part will be about 52352 times slower.
New Approach: We take random points in the image, check if they are inside the shape and if so, use them for our PCA. We have to see if we can get the consistent quality needed from this approach.
Is there already a similar implementation in opencv which is using the area instead of the shape?
Another approach would be to put a mesh over the object and use the mesh points inside the object for PCA.
Is there already something similar available one can just use or does one quickly need to implement something like this?
Going for straight lines around the object isn't an option.
Given that we have received very limited information about your problem (posting images would help a lot) and you do not seem to know the probability density function of the noise, your best bet is to consider the noise to be Gaussian.
As such, and following your intuition, my suggested approach is to take a few (by a few I mean statistically relevant but not raising the computation time that much) random points that lie inside the object and compute the PCA.
Repeat this procedure in an iterative loop and store somewhere the resulting rotation angles you get from the application of the PCA to the object shape.
Stop once you have enough point, compute the mean of the rotation angles: this is a decent estimate of the true angle. Compute also the standard deviation to get a measure of the quality of your estimation. By "enough points" you can consider that ~30 points is usually considered to be "enough" for being representative of the underlying population according to the central limit theorem.
If you want, you can improve on this approach in many ways, for example doing robust estimation of the true angle once you have collected enough points. It all depends on the data you have at hand...take my suggestion just as a starting point.
There are few parameters that you could change, in which may improve your system.
First is the threshold you use to binarize your image. I don't know what your application is about, but you could use other color systems, or normalize your image by cromacity, and after that, apply the new threshold.
Other aspect is to exclude shapes (contours) that have bigger or smaller area that what you are expecting.
To add up, you may use a blur filter before detect contours.
I don't know how the noise looks but when you say "small dirt" I think it might be only some few pixels that is a lot smaller then the object it self, but it might be attached to the object. To reduce this noise it might be possible to perform an opening (morphology) on the binary image.
http://docs.opencv.org/2.4/doc/tutorials/imgproc/opening_closing_hats/opening_closing_hats.html

OpenCV Image stiching when camera parameters are known

We have pictures taken from a plane flying over an area with 50% overlap and is using the OpenCV stitching algorithm to stitch them together. This works fine for our version 1. In our next iteration we want to look into a few extra things that I could use a few comments on.
Currently the stitching algorithm estimates the camera parameters. We do have camera parameters and a lot of information available from the plane about camera angle, position (GPS) etc. Would we be able to benefit anything from this information in contrast to just let the algorithm estimate everything based on matched feature points?
These images are taken in high resolution and the algorithm takes up quite amount of RAM at this point, not a big problem as we just spin large machines up in the cloud. But I would like to in our next iteration to get out the homography from down sampled images and apply it to the large images later. This will also give us more options to manipulate and visualize other information on the original images and be able to go back and forward between original and stitched images.
If we in question 1 is going to take apart the stitching algorithm to put in the known information, is it just using the findHomography method to get the info or is there better alternatives to create the homography when we actually know the plane position and angles and the camera parameters.
I got a basic understanding of opencv and is fine with c++ programming so its not a problem to write our own customized stitcher, but the theory is a bit rusty here.
Since you are using homographies to warp your imagery, I assume you are capturing areas small enough that you don't have to worry about Earth curvature effects. Also, I assume you don't use an elevation model.
Generally speaking, you will always want to tighten your (homography) model using matched image points, since your final output is a stitched image. If you have the RAM and CPU budget, you could refine your linear model using a max likelihood estimator.
Having a prior motion model (e.g. from GPS + IMU) could be used to initialize the feature search and match. With a good enough initial estimation of the feature apparent motion, you could dispense with expensive feature descriptor computation and storage, and just go with normalized crosscorrelation.
If I understand correctly, the images are taken vertically and overlap by a known amount of pixels, in that case calculating homography is a bit overkill: you're just talking about a translation matrix, and using more powerful algorithms can only give you bad conditioned matrixes.
In 2D, if H is a generalised homography matrix representing a perspective transformation,
H=[[a1 a2 a3] [a4 a5 a6] [a7 a8 a9]]
then the submatrixes R and T represent rotation and translation, respectively, if a9==1.
R= [[a1 a2] [a4 a5]], T=[[a3] [a6]]
while [a7 a8] represents the stretching of each axis. (All of this is a bit approximate since when all effects are present they'll influence each other).
So, if you known the lateral displacement, you can create a 3x3 matrix having just a3, a6 and a9=1 and pass it to cv::warpPerspective or cv::warpAffine.
As a criteria of matching correctness you can, f.e., calculate a normalized diff between pixels.

Opencv - How to differentiate jitter from panning?

I'm working on a video stabilizer using Opencv in C++.
At this time of the project I'm correctly able to find the translation between two consecutive frames with 3 different technique (Optical Flow, Phase Correlation, BFMatcher on points of interests).
To obtain a stabilized image I add up all the translation vector (from consecutive frame) to one, which is used in warpAffine function to correct the output image.
I'm having good result on fixed camera but result on camera in translation are really bad : the image disappear from the screen.
I think I have to distinguish jitter movement that I want to remove from panning movement that I want to keep. But I'm open to others solutions.
Actually the whole problem is a bit more complex than you might have thought in the beginning. Let's look a it this way: when you move your camera through the world, things that move close to the camera move faster than the ones in the background - so objects at different depths change their relative distance (look at your finder while moving the head and see how it points to different things). This means the image actually transforms and does not only translate (move in x or y) - so how do you want to accompensate for that? What you you need to do is to infer how much the camera moved (translation along x,y and z) and how much it rotated (with the angles of yaw, pan and tilt). This is a not very trivial task but openCV comes with a very nice package: http://opencv.willowgarage.com/documentation/camera_calibration_and_3d_reconstruction.html
So I recommend you to read as much on Homography(http://en.wikipedia.org/wiki/Homography), camera models and calibration as possible and then think what you actually want to stabilize for and if it is only for the rotation angles, the task is much simpler than if you would also like to stabilize for translational jitters.
If you don't want to go fancy and neglect the third dimension, I suggest that you average the optic flow, high-pass filter it and compensate this movement with a image translation into the oposite direction. This will keep your image more or less in the middle of the frame and only small,fast changes will be counteracted.
I would suggest you the possible approaches (in complexity order):
apply some easy-to-implement IIR low pass filter on the translation vectors before applying the stabilization. This will separate the high frequency (jitter) from the low frequency (panning)
same idea, a bit more complex, use Kalman filtering to track a motion with constant velocity or acceleration. You can use OpenCV's Kalman filter for that.
A bit more tricky, put a threshold on the motion amplitude to decide between two states (moving vs static camera) and filter the translation or not.
Finaly, you can use some elaborate technique from machine Learning to try to identify the user's desired motion (static, panning, etc.) and filter or not the motion vectors used for the stabilization.
Just a threshold is not a low pass filter.
Possible low pass filters (that are easy to implement):
there is the well known averaging, that is already a low-pass filter whose cutoff frequency depends on the number of samples that go into the averaging equation (the more samples the lower the cutoff frequency).
One frequently used filter is the exponential filter (because it forgets the past with an exponential rate decay). It is simply computed as x_filt(k) = a*x_nofilt(k) + (1-a)x_filt(k-1) with 0 <= a <= 1.
Another popular filter (and that can be computed beyond order 1) is the Butterworth filter.
Etc Low pass filters on Wikipedia, IIR filters...

finding Image shift

How to find shift and rotation between same two images using programming languages vb.net or C++ or C#?
The problem you state is called motion detection (or motion compensation) and is one of the most important problems in image and video processing at the moment. No easy "here are ten lines of code that will do it" solution exists except for some really trivial cases.
Even your seemingly trivial case is quite a difficult one because a rotation by an unknown angle could cause slight pixel-by-pixel changes that can't be easily detected without specifically tailored algorithms used for motion detection.
If the images are very similar such that the camera is only slightly moved and rotated then the problem could be solved without using highly complex techniques.
What I would do, in that case, is use a motion tracking algorithm to get the optical flow of the image sequence which is a "map" which approximates how a pixel has "moved" from image A to B. OpenCV which is indeed a very good library has functions that does this: CalcOpticalFlowLK and CalcOpticalFlowPyrLK.
The tricky bit is going from the optical flow to total rotation of the image. I would start by heavily low pass filter the optical flow to get a smoother map to work with.
Then you need to use some logic to test if the image is only shifted or rotated. If it is only shifted then the entire map should be one "color", i.e. all flow vectors point in the same direction.
If there has been a rotation then the vectors will point in different direction depending on the rotation.
If the input images are not as nice as the above method requires, then I would look into feature descriptors to find how a specific object in the first image is located within the second. This will however be much harder.
There is no short answer. You could try to use free OpenCV library for finding relationship between two images.
The two operations, rotation and translation can be determined in either order. It's far easier to first detect rotation, because you can then compensate for that. Once both images are oriented the same, the translation becomes a matter of simmple correlation.
Finding the relative rotation of an image is best done by determining the local gradients. For every neighborhood (e.g. 3x3 pixels), treat the greyvalue as a function z(x,y), fit a plane through the 9 pixels, and determine the slope or gradient of that plane. Now average the gradient you found over the entire image, or at least the center of it. Your two images will produce different averages. Part of that is because for non-90 degree rotations the images won't overlap fully, but in general the difference in average gradients is the rotation between the two.
Once you've rotated back one image, you can determine a correlation. This is a fairly standard operation; you're essentially determining for each possible offset how well the two images overlap. This will give you an estimate for the shift.
Once you've got both, you can refine your rotation angle estimate by rotating back the translation, shifting the second image, and determining the average gradient only over the pixels common to both images.
If the images are exactly the same, it should be fairly easy to extract some feature points - for example using SIFT - and match the features of both images. You can then use any two of the matching features to find the rotation and translation. The translation is just the difference between two matching feature points. The you compensate for the translation in one image and get the rotation angle as the angle formed by the three remaining points.