What is the best way to divide a video into scenes (segments) - c++

I've been requested to take a given video, probably a simple cartoon, and return an array of its scenes.
I need to use the opencv libary in order to do it, and the result format is irrelevent (i.e. I can return the timespans of each scene or actualy split the video).
Any help would be appriciated.
Thanks

Technically, a scene is a group shots which are successively taken together at a single location. A shot is a basic narrative element of the video which is composed of a number of frames that are presented from a continuous viewpoint.
Automatically dividing a video into its shots is called the shot boundary detection problem in which the basic idea is identifying consecutive frames that form a transition from one shot to another.
Identifying transitions generally involve calculating a similarity value between two frames. This value can be calculated using low level image features such as color, edge or motion. A simple similarity metric could be:
s(f1, f2) = sum(i in all pixel locations)(abs(ficolor(i) - f2color(i))) / N
where f1 and f2 represent two distinct video frames and N represents number pixels in those frames. This is the average first order (Manhattan) pixel color distance between two frames.
Say you have a video composed of frames { f1, f2 ... fM } and you have calculated this distance between neighboring frames. A simple decision measure could be labeling a transition from fa to fb as a shot boundary if s(fa, fb) is below a certain threshold.
A successful shot boundary detector uses distances of second order (or more) such as Euclidean distance or Pearson correlation coefficient and utilizes a combination of different features instead of using only one, say color.
Usually, a camera or object movement breaks the pixel correspondence between frames. Using frequencies of low level details with the help of histograms will be a cure here.
Also, performing decision making over more than two frames helps in finding smooth transitions where one shot dissolves into or replaces another for a duration. Deciding for a group of frames also help us in identifying false transitions caused by light flashes or fast moving cameras.
For your problem, please start from basic approaches like comparing RGB colors and edge responses between video frames. Analyze your results and data together and try to adapt new features, distance metrics and decision making methods for better performance.
The best way of segmenting a video into shots will vary depending on your data. Machine learning approaches like probabilistically modeling frame transitions with Gaussian mixture models or classification through support vector machines are expected to perform better than hand selected thresholds. However it is important that you learn the basics before effectively choosing input features.
Automatically finding shot boundaries will be sufficent to divide your video into meaningful parts. Dividing your video into scenes, on the other hand, is considered a harder semantic problem. Nevertheless, shot segmentation is the first step to it.

There is a huge research area focusing on this. Search for papers, you can find various algorithms described in detail.
Here some examples:
Scene detection in Hollywood movies and TV shows
A framework for video scene boundary detection
Video summarization and scene detection by graph modeling
There are plenty more out there, just google.

Related

Calculating dispersion in Human Tracking

I am currently trying to track human heads from a CCTV. I am currently using colour histogram and LBP histogram comparison to check the affinity between bounding boxes. However sometimes these are not enough.
I was reading through a paper in the following link : paper where dispersion metric is described. However I still cannot clearly get it. For example I cannot understand what pi,j is referring to in the equation. Can someone kindly & clearly explain how I can find dispersion between bounding boxes in separate frames please?
You assistance is much appreciated :)
This paper tackles the tracking problem using a background model, as most CCTV tracking methods do. The BG model produces a foreground mask, and the aforementioned p_ij relates to this mask after some morphology. Specifically, they try to separate foreground blobs into components, based on thresholds on allowed 'gaps' in FG mask holes. The end result of this procedure is a set of binary masks, one for each hypothesized object. These masks are then used for tracking using spatial and temporal consistency. In my opinion, this is an old fashioned way of processing video sequences, only relevant if you're limited in processing power and the scenes are not crowded.
To answer your question, if O is the mask related to one of the hypothesized objects, then p_ij is the binary pixel in the (i,j) location within the mask. Thus, c_x and c_y are the center of mass of the binary shape, and the dispersion is simply the average distance from the center of mass for the shape (it is larger for larger objects. This enforces scale consistency in tracking, but in a very weak manner. You can do much better if you have a calibrated camera.

OpenCV Image stiching when camera parameters are known

We have pictures taken from a plane flying over an area with 50% overlap and is using the OpenCV stitching algorithm to stitch them together. This works fine for our version 1. In our next iteration we want to look into a few extra things that I could use a few comments on.
Currently the stitching algorithm estimates the camera parameters. We do have camera parameters and a lot of information available from the plane about camera angle, position (GPS) etc. Would we be able to benefit anything from this information in contrast to just let the algorithm estimate everything based on matched feature points?
These images are taken in high resolution and the algorithm takes up quite amount of RAM at this point, not a big problem as we just spin large machines up in the cloud. But I would like to in our next iteration to get out the homography from down sampled images and apply it to the large images later. This will also give us more options to manipulate and visualize other information on the original images and be able to go back and forward between original and stitched images.
If we in question 1 is going to take apart the stitching algorithm to put in the known information, is it just using the findHomography method to get the info or is there better alternatives to create the homography when we actually know the plane position and angles and the camera parameters.
I got a basic understanding of opencv and is fine with c++ programming so its not a problem to write our own customized stitcher, but the theory is a bit rusty here.
Since you are using homographies to warp your imagery, I assume you are capturing areas small enough that you don't have to worry about Earth curvature effects. Also, I assume you don't use an elevation model.
Generally speaking, you will always want to tighten your (homography) model using matched image points, since your final output is a stitched image. If you have the RAM and CPU budget, you could refine your linear model using a max likelihood estimator.
Having a prior motion model (e.g. from GPS + IMU) could be used to initialize the feature search and match. With a good enough initial estimation of the feature apparent motion, you could dispense with expensive feature descriptor computation and storage, and just go with normalized crosscorrelation.
If I understand correctly, the images are taken vertically and overlap by a known amount of pixels, in that case calculating homography is a bit overkill: you're just talking about a translation matrix, and using more powerful algorithms can only give you bad conditioned matrixes.
In 2D, if H is a generalised homography matrix representing a perspective transformation,
H=[[a1 a2 a3] [a4 a5 a6] [a7 a8 a9]]
then the submatrixes R and T represent rotation and translation, respectively, if a9==1.
R= [[a1 a2] [a4 a5]], T=[[a3] [a6]]
while [a7 a8] represents the stretching of each axis. (All of this is a bit approximate since when all effects are present they'll influence each other).
So, if you known the lateral displacement, you can create a 3x3 matrix having just a3, a6 and a9=1 and pass it to cv::warpPerspective or cv::warpAffine.
As a criteria of matching correctness you can, f.e., calculate a normalized diff between pixels.

Opencv - How to differentiate jitter from panning?

I'm working on a video stabilizer using Opencv in C++.
At this time of the project I'm correctly able to find the translation between two consecutive frames with 3 different technique (Optical Flow, Phase Correlation, BFMatcher on points of interests).
To obtain a stabilized image I add up all the translation vector (from consecutive frame) to one, which is used in warpAffine function to correct the output image.
I'm having good result on fixed camera but result on camera in translation are really bad : the image disappear from the screen.
I think I have to distinguish jitter movement that I want to remove from panning movement that I want to keep. But I'm open to others solutions.
Actually the whole problem is a bit more complex than you might have thought in the beginning. Let's look a it this way: when you move your camera through the world, things that move close to the camera move faster than the ones in the background - so objects at different depths change their relative distance (look at your finder while moving the head and see how it points to different things). This means the image actually transforms and does not only translate (move in x or y) - so how do you want to accompensate for that? What you you need to do is to infer how much the camera moved (translation along x,y and z) and how much it rotated (with the angles of yaw, pan and tilt). This is a not very trivial task but openCV comes with a very nice package: http://opencv.willowgarage.com/documentation/camera_calibration_and_3d_reconstruction.html
So I recommend you to read as much on Homography(http://en.wikipedia.org/wiki/Homography), camera models and calibration as possible and then think what you actually want to stabilize for and if it is only for the rotation angles, the task is much simpler than if you would also like to stabilize for translational jitters.
If you don't want to go fancy and neglect the third dimension, I suggest that you average the optic flow, high-pass filter it and compensate this movement with a image translation into the oposite direction. This will keep your image more or less in the middle of the frame and only small,fast changes will be counteracted.
I would suggest you the possible approaches (in complexity order):
apply some easy-to-implement IIR low pass filter on the translation vectors before applying the stabilization. This will separate the high frequency (jitter) from the low frequency (panning)
same idea, a bit more complex, use Kalman filtering to track a motion with constant velocity or acceleration. You can use OpenCV's Kalman filter for that.
A bit more tricky, put a threshold on the motion amplitude to decide between two states (moving vs static camera) and filter the translation or not.
Finaly, you can use some elaborate technique from machine Learning to try to identify the user's desired motion (static, panning, etc.) and filter or not the motion vectors used for the stabilization.
Just a threshold is not a low pass filter.
Possible low pass filters (that are easy to implement):
there is the well known averaging, that is already a low-pass filter whose cutoff frequency depends on the number of samples that go into the averaging equation (the more samples the lower the cutoff frequency).
One frequently used filter is the exponential filter (because it forgets the past with an exponential rate decay). It is simply computed as x_filt(k) = a*x_nofilt(k) + (1-a)x_filt(k-1) with 0 <= a <= 1.
Another popular filter (and that can be computed beyond order 1) is the Butterworth filter.
Etc Low pass filters on Wikipedia, IIR filters...

Which deconvolution algorithm is best suited for removing motion blur from text?

I'm using OpenCV to process pictures taken with a mobile phone. The pictures contain text, and they have small amounts of motion blur, which I need to remove.
What would be the most viable algorithm to use? I have tested so far Lucy-Richardson and Weiner deconvolution, but they did not yield satisfactory results.
Agree with #TheJuice, your problem lies in the PSF estimation. Usually to be able to do this from a single frame, several assumptions need to be made about the factors leading to the blur (motion of object, type of motion of the sensor, etc.).
You can find some pointers, especially on the monodimensional case, here. They use a filtering method that leaves mostly correlation from the blur, discarding spatial correlation of original image, and use this to deduce motion direction and thence the PSF. For small blurs you might be able to consider the motion as constant; otherwise you will have to use a more complex accelerated motion model.
Unfortunately, mobile phone blur is often a compound of CCD integration and non-linear motion (translation perpendicular to line of sight, yaw from wrist motion, and rotation around the wrist), so Yitzhaky and Kopeika's method will probably only yield acceptable results in a minority of cases. I know there are methods to deal with that ("depth awareness" and other) but I have never had occasion of dealing with them.
You can preview the results using photo recovery software such as Focus Magic; while they do not employ YK estimator (motion description is left to you), the remaining workflow is necessarily very similar. If your pictures are amenable to Focus Magic recovery, then probably YK method will work. If they are not (or not enough, or not enough of them to be worthwhile), then there's no point even trying to implement it.
Motion blur is a difficult problem to overcome. The best results are gained when
The speed of the camera relative to the scene is known
You have many pictures of the blurred object which you can correlate.
You do have one major advantage in that you are looking at text (which normally constitutes high contrast features). If you only apply deconvolution to high contrast (I know that the theory is often to exclude high contrast) areas of your image you should get results which may enable you to better recognise characters. Also a combination of sharpening/blurring filters pre/post processing may help.
I remember being impressed with this paper previously. Perhaps an adaption on their implementation would be worth a go.
I think the estimation of your point-spread function is likely to be more important than the algorithm used. It depends on the kind of motion blur you're trying to remove, linear motion is likely to be the easiest but is unlikely to be the kind you're trying to remove: i imagine it's non-linear caused by hand movement during the exposure.
You cannot eliminate motion blur. The information is lost forever. What you are dealing with is a CCD that is recording multiple real objects to a single pixel, smearing them together. In other words if the pixel reads 56, you cannot magically determine that the actual reading should have been 37 at time 1, and 62 at time 2, and 43 at time 3.
Another way to look at this: imagine you have 5 pictures. You then use photoshop to blend the pictures together, averaging the value of each pixel. Can you now somehow from the blended picture tell what the original 5 pictures were? No, you cannot, because you do not have the information to do that.

Detect if images are different in real-time

I am working on a microscope that streams live images via a built-in video camera to a PC, where further image processing can be performed on the streamed image. Any processing done on the streamed image must be done in "real-time" (minimal frames dropped).
We take the average of a series of static images to counter random noise from the camera to improve the output of some of our image processing routines.
My question is: how do I know if the image is no longer static - either the sample under inspection has moved or rotated/camera zoom-in or out - so I can reset the image series used for averaging?
I looked through some of the threads, and some ideas that seemed interesting:
Note: using Windows, C++ and Intel IPP. With IPP the image is a byte array (Ipp8u).
1. Hash the images, and compare the hashes (normal hash or perceptual hash?)
2. Use normalized cross correlation (IPP has many variations - which to use?)
Which do you guys think is suitable for my situation (speed)?
If you camera doesn't shake, you can, as inVader said, subtract images. Then a sum of absolute values of all pixels of the difference image is sometimes enough to tell if images are the same or different. However, if your noise, lighting level, etc... varies, this will not give you a good enough S/N ratio.
And in noizy conditions normal hashes are even more useless.
The best would be to identify that some features of your object has changed, like it's boundary (if it's regular) or it's mass center (if it's irregular). If you have a boundary position, you'll need to analyze just one line of pixels, perpendicular to that boundary, to tell that boundary has moved.
Mass center position may be a subject to frequent false-negative responses, but adding a total mass and/or moment of inertia may help.
If the camera shakes, you may have to align images before comparing (depending on comparison method and required accuracy, a single pixel misalignment might be huge), and that's where cross-correlation helps.
And further, you doesn't have to analyze each image. You can skip one, and if the next differs, discard both of them. Here you have twice as much time to analyze an image.
And if you are averaging images, you might just define an optimal amount of images you need and compare just the first and the last image in the sequence.
So, simplest thing to try would be to take subsequent images, subtract them from each other and have a look at the difference. Then define some rules including local and global thresholds for the difference in which two images are considered equal. Simple subtraction of bitmap/array data, looking for maxima and calculating the average differnce across the whole thing should be ne problem to do in real time.
If there are varying light conditions or something moving in a predictable way(like a door opening and closing), then something more powerful, albeit slower, like gaussian mixture models for background modeling, might be worth looking into, click here. It is quite compute intensive, but can be parallelized pretty easily.
Motion detection algorithms is what is used.
http://www.codeproject.com/Articles/10248/Motion-Detection-Algorithms
http://www.codeproject.com/Articles/22243/Real-Time-Object-Tracker-in-C
First of all I would take a series of images at a slow fps rate and downsample those images to make them smaller, not too much but enough to speed up the process.
Now you have several options:
You could make a sum of absolute differences of the two images by subtracting them and use a threshold to value if the image has changed.
If you want to speed it up even further I would suggest doing a progressive SAD using a small kernel and moving from the top of the image to the bottom. You can value the complessive amount of differences during the process and eventually stop when you are satisfied.