Horn-Schunck optical flow calculations - computer-vision

I was studying the Horn-Schunck method for calculating optical flow in videos. My code is in C, which means I am implementing all of the algorithms from scratch, including gray-scaling the image, computing derivatives, etc. I am not able to completely absorb the essence of the method. The final flow matrix that I get should contain displacement vectors for each pixel, right? Meaning, for each pixel, the value in the flow matrix indicates the amount by which it is displaced in the next image.
How does this work out when all my pixel values are between 0 and 255, all my calculations are done on these pixel values, and the resulting output gives displacement in, say, a 1920 x 1080 image?

The result of your method would be a matrix with two channels, or two matrices: one for the u (or dx) direction/displacement and the other for the v (or dy) direction/displacement. That means you have a vector field
[u(x,y), v(x,y)] = optical flow at each position (x,y) in your image
The values of this vector field have floating-point precision, e.g. u(0,0) = 0.2, v(0,0) = 0.13. Consequently, in one part of your code you have to transform the gray values of your input image into floating-point values. This is mostly done when you are computing the gradients, e.g. with the Sobel operator. The OpenCV library has a Horn-Schunck implementation. Reading the code takes some time, but you can be assured that it is a very efficient way to implement this method.

Related

Optical Flow: What exactly is the temporal derivative?

I'm trying to understand what the meaning of a temporal derivative is in an image. While I understand the brightness constancy equation, I don't understand why taking the difference between two images gives me the temporal derivative.
Taking the difference between two frames gives me the difference in pixel intensity per pixel between the two, but how is that the same as asking how much the image changed over a certain span of time?
The temporal derivative dI/dt of the image I(x,y,t) is the rate of change of the image over time at a particular position. As you noted, this is the difference in pixel intensity between the two frames. Considering a single pixel at (x,y), the finite difference approximation to the derivative is
f_d = ( I(x,y,t+delta) - I(x,y,t) ) / delta so that f_d -> dI/dt as delta -> 0.
In this case delta is simply set to one. So we are approximating the image derivative (with respect to time) by the difference between adjacent frames.
One aspect that may be confusing is how that relates to the movement of objects in the image. If you have some physics background, for instance, you might think about the difference between Eulerian and Lagrangian frames of reference: in the more intuitive Lagrangian viewpoint, you consider an object moving by tracking it over the pixels (space) in which it moves, e.g. watching a cat as it hops over a fence. The Eulerian view, which is closer to what we do in optical flow, is to track what happens at a single pixel, and never take our eyes off of it. As the cat passes over that area of (pixel) space, the pixel's values will change, and then go back to "normal" when it's gone.
These two views are in some sense equivalent, but may be useful in different situations. In computer vision, tracking an object is hard, while computing these Eulerian-like temporal derivatives is easy. Ideally, we could track the cat: consider a point p(t)=(x_p(t),y_p(t)) on, say, its head, then compute dp/dt, figure out p(t) for all t, and use that for downstream processing. Unfortunately, this is hard, so instead we hope that brightness constancy is usually locally true, and use the optical flow to estimate dp/dt. Of course, dI/dt often does not correspond well to dp/dt (this is why brightness constancy is an assumption). For instance, consider a light moving around a stationary sphere: dI/dt will be large, but dp/dt will be zero.
The difference between subsequent frames is the finite difference approximation to the temporal derivative.
Proper units would be obtained if the value were divided by the time between frames (i.e. multiplied by the frames per second value).
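As a tiny sketch of the above (a helper of mine, not from any library): with delta fixed at one frame the temporal derivative is just the per-pixel frame difference, optionally scaled by the frame rate to get intensity change per second.

```cpp
// Per-pixel temporal derivative of two consecutive frames. With delta = 1
// frame this is just the difference; multiply by fps for units per second.
#include <cstdint>
#include <vector>

std::vector<float> temporalDerivative(const std::vector<uint8_t>& prev,
                                      const std::vector<uint8_t>& next,
                                      float fps = 1.0f)
{
    std::vector<float> dIdt(prev.size());
    for (size_t i = 0; i < prev.size(); ++i)
        dIdt[i] = (float(next[i]) - float(prev[i])) * fps;
    return dIdt;
}
```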

Calculating the precision of homography on 2D plane

I am trying to find a way to parametrize the precision of my homography calculation. I would like to obtain a value that describes the precision of the homography calculation for a measurement taken at a certain position.
I currently have successfully calculated the homography (with cv::findHomography) and I can use it to map a point on my camera image onto a 2D map (using cv::perspectiveTransform). Now I want to track these objects on my 2D map, and to do this I want to take into account that objects in the back of my camera image have a less precise position on my 2D map than objects that are all the way in the front.
I have looked at the following example on this website that mentions plane fitting but I don't really understand how to fill the matrices correctly using this method. The visualisation of the result does seem to fit my needs. Is there any way to do this with standard OpenCV functions?
EDIT:
Thanks Francesco for your recommendations. But, I think I am looking for something different than your answer. I am not looking to test the precision of the homography itself, but the relation between the density of measurements in one real camera view and the actual size on a map I create. I want to know that when I am 1 pixel off on my detection in the camera image, how many meters this will be on my map at this point.
I can of course calculate this by taking some pixels around my measurement in my camera image and then using the homography to see how many meters this represents on my map, every time I do a homography, but I don't want to calculate this every time. What I would like is a formula that tells me the relation between pixels in my image and pixels on my map, so I can take this into account for my tracking on the map.
What you are looking for is called "predictive error bars" or "prediction uncertainty". You should definitely consult a good introductory book on estimation theory for details (e.g. this one). But briefly, the predictive uncertainty is the probability that...
A certain pixel p in image 1 is the mapping H(p') of a pixel p' in image 2 under the homography H, ...
Given the uncertainty in H which is due to the errors in the matched pairs (q0, q0'), (q1, q1'), ..., that have been used to estimate H, ...
But assuming the model is correct, that is, that the true map between images 1 and 2 is, in fact, a homography (although the estimated parameters of the homography itself may be affected by errors).
In order to estimate this probability distribution you'll need a model for the errors in the measurements, and a model for how they propagate through the (homography) model.
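If what you are after in your EDIT is just the local pixels-to-meters factor rather than the full uncertainty model, one rough option (a sketch under my own assumptions, not a substitute for the error analysis above) is to push a one-pixel offset through your estimated homography with cv::perspectiveTransform and measure how far it moves on the map:

```cpp
// Rough sketch: local scale of the homography H at image point p, estimated
// by transforming p and two points one pixel away and measuring the map
// distance. The returned pair is "map units per pixel" in x and in y at p.
#include <cmath>
#include <vector>
#include <opencv2/core.hpp>

cv::Point2d localScale(const cv::Mat& H, const cv::Point2f& p)
{
    std::vector<cv::Point2f> src = { p,
                                     cv::Point2f(p.x + 1.0f, p.y),
                                     cv::Point2f(p.x, p.y + 1.0f) };
    std::vector<cv::Point2f> dst;
    cv::perspectiveTransform(src, dst, H);

    const double sx = std::hypot(dst[1].x - dst[0].x, dst[1].y - dst[0].y);
    const double sy = std::hypot(dst[2].x - dst[0].x, dst[2].y - dst[0].y);
    return cv::Point2d(sx, sy);
}
```

Evaluating this once on a coarse grid of image positions and caching the results would give a lookup you can reuse, instead of recomputing it for every detection.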

Gaussian Blur with FFT Questions

I have a current implementation of Gaussian Blur using regular convolution. It is efficient enough for small kernels, but once the kernels size gets a little bigger, the performance takes a hit. So, I am thinking to implement the convolution using FFT. I've never had any experience with FFT related image processing so I have a few questions.
Is a 2D FFT based convolution also separable into two 1D convolutions ?
If true, does it go like this - 1D FFT on every row, and then 1D FFT on every column, then multiply with the 2D kernel and then inverse transform of every column and the inverse transform of every row? Or do I have to multiply with a 1D kernel after each 1D FFT Transform?
Now, I understand that the kernel size should be the same size as the image (or the row, in the 1D case). But how will it affect the edges? Do I have to pad the image edges with zeros? If so, should the kernel size be equal to the image size before or after padding?
Also, this is a C++ project, and I plan on using kissFFT, since this is a commercial project. You are welcome to suggest any better alternatives. Thank you.
EDIT: Thanks for the responses, but I have a few more questions.
I see that the imaginary part of the input image will be all zeros. But will the output imaginary part also be zeros? Do I have to multiply the Gaussian kernel with both the real and imaginary parts?
I have instances of the same image to be blurred at different scales, i.e. the same image is scaled to different sizes and blurred with different kernel sizes. Do I have to perform an FFT every time I scale the image, or can I use the same FFT?
Lastly, if I wanted to visualize the FFT, I understand that a log scale has to be applied to the FFT. But I am really lost on which part should be used to visualize the FFT: the real part or the imaginary part?
Also, for an image of size 512x512, what will be the size of the real and imaginary parts? Will they be the same length?
Thank you again for your detailed replies.
The 2-D FFT is separable and you are correct in how to perform it, except that you must multiply by the 2-D FFT of the 2-D kernel. If you are using kissfft, an easier way to perform the 2-D FFT is to just use kiss_fftnd in the tools directory of the kissfft package. This will do multi-dimensional FFTs.
The kernel does not have to be any particular size. If the kernel is smaller than the image, you just need to zero-pad it up to the image size before performing the 2-D FFT. You should also zero-pad the image edges, since the convolution being performed by multiplication in the frequency domain is actually circular convolution, and the results wrap around at the edges.
So to summarize (given that the image size is M x N):
1. come up with a 2-D kernel of any size (U x V)
2. zero-pad the kernel up to (M+U-1) x (N+V-1)
3. take the 2-D FFT of the kernel
4. zero-pad the image up to (M+U-1) x (N+V-1)
5. take the 2-D FFT of the image
6. multiply the FFT of the kernel by the FFT of the image
7. take the inverse 2-D FFT of the result
8. trim off the garbage at the edges
If you are performing the same filter multiple times on different images, you don't have to perform 1-3 every time.
Note: The kernel size will have to be rather large for this to be faster than direct computation of convolution. Also, did you implement your direct convolution taking advantage of the fact that a 2-D gaussian filter is separable (see this a few paragraphs into the "Mechanics" section)? That is, you can perform the 2-D convolution as 1-D convolutions on the rows and then the columns. I have found this to be faster than most FFT-based approaches unless the kernels are quite large.
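To make steps 1-8 concrete, here is a rough sketch using kiss_fftnd from the kissfft tools directory (the packing, the normalization, and the omitted error handling are my assumptions; check them against your build). It assumes the image and kernel have already been zero-padded to the same P x Q size:

```cpp
// Sketch of FFT-based convolution with kiss_fftnd. img and ker are row-major
// real inputs already zero-padded to P x Q (P >= M+U-1, Q >= N+V-1).
// kissfft does not normalize, so the inverse transform is divided by P*Q.
#include <cstdlib>
#include <vector>
#include "kiss_fftnd.h"

std::vector<float> fftConvolve(const std::vector<float>& img,
                               const std::vector<float>& ker,
                               int P, int Q)
{
    const int dims[2] = { P, Q };
    const int total = P * Q;

    std::vector<kiss_fft_cpx> a(total), b(total), A(total), B(total), C(total), c(total);
    for (int i = 0; i < total; ++i) {
        a[i].r = img[i];  a[i].i = 0.0f;   // real image, zero imaginary part
        b[i].r = ker[i];  b[i].i = 0.0f;
    }

    kiss_fftnd_cfg fwd = kiss_fftnd_alloc(dims, 2, 0, nullptr, nullptr);
    kiss_fftnd_cfg inv = kiss_fftnd_alloc(dims, 2, 1, nullptr, nullptr);

    kiss_fftnd(fwd, a.data(), A.data());   // 2-D FFT of the padded image
    kiss_fftnd(fwd, b.data(), B.data());   // 2-D FFT of the padded kernel

    for (int i = 0; i < total; ++i) {      // complex point-wise multiply
        C[i].r = A[i].r * B[i].r - A[i].i * B[i].i;
        C[i].i = A[i].r * B[i].i + A[i].i * B[i].r;
    }

    kiss_fftnd(inv, C.data(), c.data());   // inverse 2-D FFT

    std::vector<float> out(total);
    for (int i = 0; i < total; ++i)
        out[i] = c[i].r / float(total);    // real part; imaginary part ~ 0
    free(fwd);
    free(inv);
    return out;                            // still P x Q; trim the borders (step 8)
}
```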
Response to Edit
If the input is real, the output will still be complex except for rare circumstances. The FFT of your gaussian kernel will also be complex, so the multiply must be a complex multiplication. When you perform the inverse FFT, the output should be real since your input image and kernel are real. The output will be returned in a complex array, but the imaginary components should be zero or very small (floating point error) and can be discarded.
If you are using the same image, you can reuse the image FFT, but you will need to zero pad based on your biggest kernel size. You will have to compute the FFTs of all of the different kernels.
For visualization, the magnitude of the complex output should be used. The log scale just helps to visualize smaller components of the output when larger components would drown them out in a linear scale. The Decibel scale is often used and is given by either 20*log10(abs(x)) or 10*log10(x*x') which are equivalent. (x is the complex fft output and x' is the complex conjugate of x).
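For example, the per-bin value you would actually plot is something like this (a small helper of mine; the epsilon only avoids taking the log of zero):

```cpp
// Magnitude of one complex FFT bin in decibels: 20*log10(|x|).
#include <cmath>

inline float magnitudeDb(float re, float im)
{
    const float mag = std::sqrt(re * re + im * im);
    return 20.0f * std::log10(mag + 1e-12f);
}
```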
The input and output of the FFT will be the same size. Also the real and imaginary parts will be the same size since one real and one imaginary value form a single sample.
Remember that convolution in space is equivalent to multiplication in frequency domain. This means that once you perform FFT of both image and mask (kernel), you only have to do point-by-point multiplication, and then IFFT of the result. Having said that, here are a few words of caution.
You probably know that in digital signal processing we often end up with circular convolution, not linear convolution. This happens because the DFT (and the FFT, which is its computationally efficient variant) assumes that your signal is periodic. When you filter your signal in this manner -- suppose your image is N x M pixels -- the pixel at (1, m) is treated as the neighbor of the pixel at (N, m) for some m < M; your signal virtually wraps around onto itself. This means that your Gaussian mask will be averaging pixels on the far right with pixels on the far left, and the same goes for top and bottom. This might or might not be desired, but in general one has to deal with edge artifacts anyway. It is, however, much easier to forget about this issue when dealing with FFT multiplication, because the problem stops being apparent. There are many ways to take care of this problem. The best way is to simply pad your image with zeros and remove the extra pixels later.
A very neat thing about using a Gaussian filter in the frequency domain is that you never really have to take its FFT. It is a well-known fact that the Fourier transform of a Gaussian is a Gaussian (some technical details here). All you would have to do then is pad your image with zeros (both top and bottom), generate a Gaussian in the frequency domain, multiply them together and take the IFFT. Then you're done.
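A sketch of that last idea (my own construction: it samples the continuous transform exp(-2*pi^2*sigma^2*f^2) at the DFT frequencies of a padded P x Q transform, so it is an approximation that assumes the padded size is large compared to sigma, and it gives a unit-gain low-pass):

```cpp
// Gaussian built directly in the frequency domain for a P x Q transform.
// G is real, so multiply it into both the real and imaginary FFT parts.
#include <cmath>
#include <vector>

std::vector<float> gaussianFreqDomain(int P, int Q, float sigma)
{
    const float pi = 3.14159265358979323846f;
    const float c = 2.0f * pi * pi * sigma * sigma;
    std::vector<float> G(static_cast<size_t>(P) * Q);
    for (int k = 0; k < P; ++k) {
        // bins above Nyquist correspond to negative frequencies (wrap-around)
        const float fy = float(k <= P / 2 ? k : k - P) / float(P);
        for (int l = 0; l < Q; ++l) {
            const float fx = float(l <= Q / 2 ? l : l - Q) / float(Q);
            G[static_cast<size_t>(k) * Q + l] = std::exp(-c * (fx * fx + fy * fy));
        }
    }
    return G;
}
```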
Hope this helps.

CUDA cufftPlan2d plan size question

I'm studying the code behind the convolutionFFT2D example of the Nvidia CUDA sdk, but I don't get the point of this line:
cufftPlan2d(&fftPlan, fftH, fftW/2, CUFFT_C2C);
Apparently this initializes a complex plan for the FFT to run with, but I don't see the point of dividing the plan width by 2.
Just to be precise: the fftH and fftW are rounded values for imageX+kernelX+1 and imageY+kernelY+1 dimensions (just for speed reasons). I know that in the frequency domain you usually have a positive component and a symmetric negative component of the same frequency.. but this sounds like cutting half of my image data away..
Can someone explain this to me a little better? I've never used a FFT (I just know the theory behind a fourier transformation)
When you perform a real-to-complex FFT, half the frequency-domain data is redundant due to symmetry. This is only the case in one axis of a 2D FFT, though. You can think of a 2D FFT as two 1D FFT operations: the first operates on all the rows, and for a real-valued image this gives you complex row values. In the second stage you apply a 1D FFT to every column, but since the row values are now complex, this is a complex-to-complex FFT with no redundancy in the output. Hence you only need width / 2 points in the horizontal axis, but you still need height points in the vertical axis.
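For comparison, the more direct way to exploit that same symmetry (not the packing trick the SDK sample uses) is a real-to-complex plan, where cuFFT itself stores only the non-redundant half of each row:

```cpp
// R2C sketch: for an fftH x fftW real image cuFFT keeps fftW/2 + 1 complex
// values per row, because the remaining columns are the conjugate-symmetric
// half. d_image and d_spectrum are assumed to be device buffers.
#include <cufft.h>

void forwardR2C(cufftReal* d_image, cufftComplex* d_spectrum,
                int fftH, int fftW)
{
    cufftHandle plan;
    // nx is the slower-varying (row) dimension, ny the faster-varying one
    cufftPlan2d(&plan, fftH, fftW, CUFFT_R2C);
    cufftExecR2C(plan, d_image, d_spectrum);   // fftH * (fftW/2 + 1) outputs
    cufftDestroy(plan);
}
```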

Convert Polar Image to a Cartesian Image

I am attempting to convert an image in polar coordinates (axes are angle x radius) to an image in cartesian coordinates (axes are x and y).
This is simple enough in Matlab using pcolor(), but the issue is that I must do this in a mex file (C++ interface to Matlab). This seems easy enough, except that Matlab ONLY uses array containers, so I can't think of a clever or elegant way of doing this.
I do have access to the image dimensions, and I can imagine a very messy way of repackaging the input image array as a matrix in C++ and carrying out the conversion, but this would be messy and problematic.
Also, I need to be able to interpolate gaps between points in the xy plane.
Any ideas?
This is reasonably standard in image processing, particularly in registration. However, it takes some thought and isn't "obvious". It wasn't obvious to me the first time either.
I'm assuming you have two images, in different "domains", in your case a source image in polar coordinates and a target image in Cartesian coordinates. I'm assuming you know the region in the target image you want to populate.
The standard thing to do in image processing is to loop over the coordinates in the known area of the target image that you want to populate. For each of these positions (x,y), you compute the conversion to polar; it's probably r = sqrt(x*x+y*y) and theta = atan2(y,x) or something like that. Then you sample the source image at that polar position, with interpolation.
Among choices of interpolation are:
Nearest neighbor - you just round to the nearest r and theta and choose the value of that.
Bilinear
Bi-cubic
...
Of course you should take care of boundary conditions and what happens if your r and theta go out of your image.
This procedure is also similar (looping over the target image, sampling from the source image, and doing lookups based on the reverse transform) for all kinds of coordinate transformations. The nice thing is that you don't leave holes where your source image is relevant. A bare-bones sketch of that loop is below.
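Here is the sketch, with nearest-neighbour sampling. The layout assumptions are mine: the polar source is an nAngles x nRadii matrix stored column-major as MATLAB passes it to a mex function, the output is written row-major here, and the result is centred on the middle of the target image.

```cpp
// Inverse-mapping loop: for every Cartesian output pixel, compute (r, theta)
// and look up the nearest sample in the polar source image.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<double> polarToCartesian(const double* polar,   // nAngles x nRadii
                                     int nAngles, int nRadii,
                                     int W, int H)           // output size
{
    std::vector<double> cart(static_cast<size_t>(W) * H, 0.0);
    const double cx = 0.5 * W, cy = 0.5 * H;
    const double rMax = std::min(cx, cy);
    const double twoPi = 2.0 * 3.14159265358979323846;

    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            const double dx = x - cx, dy = y - cy;
            const double r = std::sqrt(dx * dx + dy * dy);
            double theta = std::atan2(dy, dx);
            if (theta < 0.0) theta += twoPi;          // map to [0, 2*pi)

            // nearest-neighbour lookup in the polar source image
            const int ri = static_cast<int>(std::round(r / rMax * (nRadii - 1)));
            const int ai = static_cast<int>(std::round(theta / twoPi * (nAngles - 1)));
            if (ri < 0 || ri >= nRadii) continue;     // outside the source: leave 0
            if (ai < 0 || ai >= nAngles) continue;

            // column-major: element (ai, ri) of an nAngles x nRadii matrix
            cart[static_cast<size_t>(y) * W + x] =
                polar[static_cast<size_t>(ri) * nAngles + ai];
        }
    }
    return cart;
}
```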
Hope this helps with the image part.
As for the mex part, here's some links:
Mex tutorial
Mex tutorial
Can you be more specific about what you need about the mex part?