Opencv - How to differentiate jitter from panning? - c++

I'm working on a video stabilizer using Opencv in C++.
At this time of the project I'm correctly able to find the translation between two consecutive frames with 3 different technique (Optical Flow, Phase Correlation, BFMatcher on points of interests).
To obtain a stabilized image I add up all the translation vector (from consecutive frame) to one, which is used in warpAffine function to correct the output image.
I'm having good result on fixed camera but result on camera in translation are really bad : the image disappear from the screen.
I think I have to distinguish jitter movement that I want to remove from panning movement that I want to keep. But I'm open to others solutions.

Actually the whole problem is a bit more complex than you might have thought in the beginning. Let's look a it this way: when you move your camera through the world, things that move close to the camera move faster than the ones in the background - so objects at different depths change their relative distance (look at your finder while moving the head and see how it points to different things). This means the image actually transforms and does not only translate (move in x or y) - so how do you want to accompensate for that? What you you need to do is to infer how much the camera moved (translation along x,y and z) and how much it rotated (with the angles of yaw, pan and tilt). This is a not very trivial task but openCV comes with a very nice package: http://opencv.willowgarage.com/documentation/camera_calibration_and_3d_reconstruction.html
So I recommend you to read as much on Homography(http://en.wikipedia.org/wiki/Homography), camera models and calibration as possible and then think what you actually want to stabilize for and if it is only for the rotation angles, the task is much simpler than if you would also like to stabilize for translational jitters.
If you don't want to go fancy and neglect the third dimension, I suggest that you average the optic flow, high-pass filter it and compensate this movement with a image translation into the oposite direction. This will keep your image more or less in the middle of the frame and only small,fast changes will be counteracted.

I would suggest you the possible approaches (in complexity order):
apply some easy-to-implement IIR low pass filter on the translation vectors before applying the stabilization. This will separate the high frequency (jitter) from the low frequency (panning)
same idea, a bit more complex, use Kalman filtering to track a motion with constant velocity or acceleration. You can use OpenCV's Kalman filter for that.
A bit more tricky, put a threshold on the motion amplitude to decide between two states (moving vs static camera) and filter the translation or not.
Finaly, you can use some elaborate technique from machine Learning to try to identify the user's desired motion (static, panning, etc.) and filter or not the motion vectors used for the stabilization.
Just a threshold is not a low pass filter.
Possible low pass filters (that are easy to implement):
there is the well known averaging, that is already a low-pass filter whose cutoff frequency depends on the number of samples that go into the averaging equation (the more samples the lower the cutoff frequency).
One frequently used filter is the exponential filter (because it forgets the past with an exponential rate decay). It is simply computed as x_filt(k) = a*x_nofilt(k) + (1-a)x_filt(k-1) with 0 <= a <= 1.
Another popular filter (and that can be computed beyond order 1) is the Butterworth filter.
Etc Low pass filters on Wikipedia, IIR filters...

Related

OpenCV Image stiching when camera parameters are known

We have pictures taken from a plane flying over an area with 50% overlap and is using the OpenCV stitching algorithm to stitch them together. This works fine for our version 1. In our next iteration we want to look into a few extra things that I could use a few comments on.
Currently the stitching algorithm estimates the camera parameters. We do have camera parameters and a lot of information available from the plane about camera angle, position (GPS) etc. Would we be able to benefit anything from this information in contrast to just let the algorithm estimate everything based on matched feature points?
These images are taken in high resolution and the algorithm takes up quite amount of RAM at this point, not a big problem as we just spin large machines up in the cloud. But I would like to in our next iteration to get out the homography from down sampled images and apply it to the large images later. This will also give us more options to manipulate and visualize other information on the original images and be able to go back and forward between original and stitched images.
If we in question 1 is going to take apart the stitching algorithm to put in the known information, is it just using the findHomography method to get the info or is there better alternatives to create the homography when we actually know the plane position and angles and the camera parameters.
I got a basic understanding of opencv and is fine with c++ programming so its not a problem to write our own customized stitcher, but the theory is a bit rusty here.
Since you are using homographies to warp your imagery, I assume you are capturing areas small enough that you don't have to worry about Earth curvature effects. Also, I assume you don't use an elevation model.
Generally speaking, you will always want to tighten your (homography) model using matched image points, since your final output is a stitched image. If you have the RAM and CPU budget, you could refine your linear model using a max likelihood estimator.
Having a prior motion model (e.g. from GPS + IMU) could be used to initialize the feature search and match. With a good enough initial estimation of the feature apparent motion, you could dispense with expensive feature descriptor computation and storage, and just go with normalized crosscorrelation.
If I understand correctly, the images are taken vertically and overlap by a known amount of pixels, in that case calculating homography is a bit overkill: you're just talking about a translation matrix, and using more powerful algorithms can only give you bad conditioned matrixes.
In 2D, if H is a generalised homography matrix representing a perspective transformation,
H=[[a1 a2 a3] [a4 a5 a6] [a7 a8 a9]]
then the submatrixes R and T represent rotation and translation, respectively, if a9==1.
R= [[a1 a2] [a4 a5]], T=[[a3] [a6]]
while [a7 a8] represents the stretching of each axis. (All of this is a bit approximate since when all effects are present they'll influence each other).
So, if you known the lateral displacement, you can create a 3x3 matrix having just a3, a6 and a9=1 and pass it to cv::warpPerspective or cv::warpAffine.
As a criteria of matching correctness you can, f.e., calculate a normalized diff between pixels.

Calibrating camera with a glass-covered checkerboard

I need to find intrinsic calibration parameters of a single. To do this I take several images of checkerboard patten from different angles and then use calibration software.
To make the calibration pattern as flat as possible, I print it on a paper and cover with a 3mm glass. Obviously image of the pattern is modified by glass, because it has a different refraction coefficient compared to air.
Extrinsic parameters will be distorted by the glass. This is because checkerboard is not in place we see it in. However, if thickness of the glass and refraction coefficients of glass and air are known, it seems to be possible to recover extrinsic parameters.
So, the questions are:
Can extrinsic parameters be calculated, and if yes, then how? (This is not necessary right now, just an interesting theoretical question)
Are intrinsic calibration parameters obtained from these images equivalent to ones obtained from a usual calibration procedure (without cover glass)?
By using a glass, calibration parameters as reported by GML Camera Calibration Toolbox (based on OpenCV), become much more accurate. (Does it make any sense at all?) But this approach has a little drawback - unwanted reflections, especially from light sources.
I commend you on choosing a very flat support (which is what I recommend myself here). But, forgive me for asking the obvious question, why did you cover the pattern with the glass?
Since the point of the exercise is to ensure the target's planarity and nothing else, you might as well glue the side opposite to the pattern of the paper sheet and avoid all this trouble. Yes, in time the pattern will get dirty and worn and need replacement. So you just scrape it off and replace it: printing checkerboards is cheap.
If, for whatever reasons, you are stuck with the glass in the front, I recommend doing first a back-of-the-envelope calculation of the expected ray deflection due to the glass refraction, and check if it is actually measurable by your apparatus. Given the nominal focal length in mm of the lens you are using and the physical width and pixel density of the sensor, you can easily work it out at the image center, assuming an "extreme" angle of rotation of the target w.r.t the focal axis (say, 45 deg), and a nominal distance. To a first approximation, you may model the pattern as "painted" on the glass, and so ignore the first refraction and only consider the glass-to-air one.
If the above calculation suggests that the effect is measurable (deflection >= 1 pixel), you will need to add the glass to your scene model and solve for its parameters in the bundle adjustment phase, along with the intrinsics and extrinsics. To begin with, I'd use two parameters, thickness and refraction coefficient, and assume both faces are really planar and parallel. It will just make the computation of the corner projections in the cost function a little more complicated, as you'll have to take the ray deflection into account.
Given the extra complexity of the cost function, I'd definitely write the model's code to use Automatic Differentiation (AD).
If you really want to go through this exercise, I'd recommend writing the solver on top of Google Ceres bundle adjuster, which supports AD, among many nice things.

Which deconvolution algorithm is best suited for removing motion blur from text?

I'm using OpenCV to process pictures taken with a mobile phone. The pictures contain text, and they have small amounts of motion blur, which I need to remove.
What would be the most viable algorithm to use? I have tested so far Lucy-Richardson and Weiner deconvolution, but they did not yield satisfactory results.
Agree with #TheJuice, your problem lies in the PSF estimation. Usually to be able to do this from a single frame, several assumptions need to be made about the factors leading to the blur (motion of object, type of motion of the sensor, etc.).
You can find some pointers, especially on the monodimensional case, here. They use a filtering method that leaves mostly correlation from the blur, discarding spatial correlation of original image, and use this to deduce motion direction and thence the PSF. For small blurs you might be able to consider the motion as constant; otherwise you will have to use a more complex accelerated motion model.
Unfortunately, mobile phone blur is often a compound of CCD integration and non-linear motion (translation perpendicular to line of sight, yaw from wrist motion, and rotation around the wrist), so Yitzhaky and Kopeika's method will probably only yield acceptable results in a minority of cases. I know there are methods to deal with that ("depth awareness" and other) but I have never had occasion of dealing with them.
You can preview the results using photo recovery software such as Focus Magic; while they do not employ YK estimator (motion description is left to you), the remaining workflow is necessarily very similar. If your pictures are amenable to Focus Magic recovery, then probably YK method will work. If they are not (or not enough, or not enough of them to be worthwhile), then there's no point even trying to implement it.
Motion blur is a difficult problem to overcome. The best results are gained when
The speed of the camera relative to the scene is known
You have many pictures of the blurred object which you can correlate.
You do have one major advantage in that you are looking at text (which normally constitutes high contrast features). If you only apply deconvolution to high contrast (I know that the theory is often to exclude high contrast) areas of your image you should get results which may enable you to better recognise characters. Also a combination of sharpening/blurring filters pre/post processing may help.
I remember being impressed with this paper previously. Perhaps an adaption on their implementation would be worth a go.
I think the estimation of your point-spread function is likely to be more important than the algorithm used. It depends on the kind of motion blur you're trying to remove, linear motion is likely to be the easiest but is unlikely to be the kind you're trying to remove: i imagine it's non-linear caused by hand movement during the exposure.
You cannot eliminate motion blur. The information is lost forever. What you are dealing with is a CCD that is recording multiple real objects to a single pixel, smearing them together. In other words if the pixel reads 56, you cannot magically determine that the actual reading should have been 37 at time 1, and 62 at time 2, and 43 at time 3.
Another way to look at this: imagine you have 5 pictures. You then use photoshop to blend the pictures together, averaging the value of each pixel. Can you now somehow from the blended picture tell what the original 5 pictures were? No, you cannot, because you do not have the information to do that.

What is the best way to divide a video into scenes (segments)

I've been requested to take a given video, probably a simple cartoon, and return an array of its scenes.
I need to use the opencv libary in order to do it, and the result format is irrelevent (i.e. I can return the timespans of each scene or actualy split the video).
Any help would be appriciated.
Thanks
Technically, a scene is a group shots which are successively taken together at a single location. A shot is a basic narrative element of the video which is composed of a number of frames that are presented from a continuous viewpoint.
Automatically dividing a video into its shots is called the shot boundary detection problem in which the basic idea is identifying consecutive frames that form a transition from one shot to another.
Identifying transitions generally involve calculating a similarity value between two frames. This value can be calculated using low level image features such as color, edge or motion. A simple similarity metric could be:
s(f1, f2) = sum(i in all pixel locations)(abs(ficolor(i) - f2color(i))) / N
where f1 and f2 represent two distinct video frames and N represents number pixels in those frames. This is the average first order (Manhattan) pixel color distance between two frames.
Say you have a video composed of frames { f1, f2 ... fM } and you have calculated this distance between neighboring frames. A simple decision measure could be labeling a transition from fa to fb as a shot boundary if s(fa, fb) is below a certain threshold.
A successful shot boundary detector uses distances of second order (or more) such as Euclidean distance or Pearson correlation coefficient and utilizes a combination of different features instead of using only one, say color.
Usually, a camera or object movement breaks the pixel correspondence between frames. Using frequencies of low level details with the help of histograms will be a cure here.
Also, performing decision making over more than two frames helps in finding smooth transitions where one shot dissolves into or replaces another for a duration. Deciding for a group of frames also help us in identifying false transitions caused by light flashes or fast moving cameras.
For your problem, please start from basic approaches like comparing RGB colors and edge responses between video frames. Analyze your results and data together and try to adapt new features, distance metrics and decision making methods for better performance.
The best way of segmenting a video into shots will vary depending on your data. Machine learning approaches like probabilistically modeling frame transitions with Gaussian mixture models or classification through support vector machines are expected to perform better than hand selected thresholds. However it is important that you learn the basics before effectively choosing input features.
Automatically finding shot boundaries will be sufficent to divide your video into meaningful parts. Dividing your video into scenes, on the other hand, is considered a harder semantic problem. Nevertheless, shot segmentation is the first step to it.
There is a huge research area focusing on this. Search for papers, you can find various algorithms described in detail.
Here some examples:
Scene detection in Hollywood movies and TV shows
A framework for video scene boundary detection
Video summarization and scene detection by graph modeling
There are plenty more out there, just google.

finding Image shift

How to find shift and rotation between same two images using programming languages vb.net or C++ or C#?
The problem you state is called motion detection (or motion compensation) and is one of the most important problems in image and video processing at the moment. No easy "here are ten lines of code that will do it" solution exists except for some really trivial cases.
Even your seemingly trivial case is quite a difficult one because a rotation by an unknown angle could cause slight pixel-by-pixel changes that can't be easily detected without specifically tailored algorithms used for motion detection.
If the images are very similar such that the camera is only slightly moved and rotated then the problem could be solved without using highly complex techniques.
What I would do, in that case, is use a motion tracking algorithm to get the optical flow of the image sequence which is a "map" which approximates how a pixel has "moved" from image A to B. OpenCV which is indeed a very good library has functions that does this: CalcOpticalFlowLK and CalcOpticalFlowPyrLK.
The tricky bit is going from the optical flow to total rotation of the image. I would start by heavily low pass filter the optical flow to get a smoother map to work with.
Then you need to use some logic to test if the image is only shifted or rotated. If it is only shifted then the entire map should be one "color", i.e. all flow vectors point in the same direction.
If there has been a rotation then the vectors will point in different direction depending on the rotation.
If the input images are not as nice as the above method requires, then I would look into feature descriptors to find how a specific object in the first image is located within the second. This will however be much harder.
There is no short answer. You could try to use free OpenCV library for finding relationship between two images.
The two operations, rotation and translation can be determined in either order. It's far easier to first detect rotation, because you can then compensate for that. Once both images are oriented the same, the translation becomes a matter of simmple correlation.
Finding the relative rotation of an image is best done by determining the local gradients. For every neighborhood (e.g. 3x3 pixels), treat the greyvalue as a function z(x,y), fit a plane through the 9 pixels, and determine the slope or gradient of that plane. Now average the gradient you found over the entire image, or at least the center of it. Your two images will produce different averages. Part of that is because for non-90 degree rotations the images won't overlap fully, but in general the difference in average gradients is the rotation between the two.
Once you've rotated back one image, you can determine a correlation. This is a fairly standard operation; you're essentially determining for each possible offset how well the two images overlap. This will give you an estimate for the shift.
Once you've got both, you can refine your rotation angle estimate by rotating back the translation, shifting the second image, and determining the average gradient only over the pixels common to both images.
If the images are exactly the same, it should be fairly easy to extract some feature points - for example using SIFT - and match the features of both images. You can then use any two of the matching features to find the rotation and translation. The translation is just the difference between two matching feature points. The you compensate for the translation in one image and get the rotation angle as the angle formed by the three remaining points.