How do H.264 or video encoders in general compute the residual image of two frames? - compression

I have been trying to understand how video encoding works for modern encoders, in particular H264.
It is very often mentioned in documentation that residual frames are created from the differences between the current P-frame and the last I-frame (assuming the following frames are not used in the prediction). I understand that a YUV color space is used (maybe YV12), and that one image is "subtracted" from the other, and then the residual is formed.
What I don't understand is how exactly this subtraction works. I don't think it is an absolute value of the difference, because that would be ambiguous. What is the per-pixel formula to obtain this difference?

Subtraction is just one small step in video encoding; the core principle behind most modern video encoding is motion estimation, followed by motion compensation. Basically, the process of motion estimation generates vectors that show offsets between macroblocks in successive frames. However, there's always a bit of error in these vectors.
So what happens is the encoder will output both the vector offsets, and the "residual" is what's left. The residual is not simply the difference between two frames; it's the difference between the two frames after motion compensation is taken into account. See the "Motion compensated difference" image in the Wikipedia article on motion compensation for a clear illustration of this; note that the motion-compensated difference is drastically smaller than the "dumb" residual.
Here's a decent PDF that goes over some of the basics.
A few other notes:
Yes, YUV is always used, and typically most encoders work in YV12 or some other chroma-subsampled format.
Subtraction will have to happen on the Y, U and V frames separately (think of them as three separate channels, all of which need to be encoded; then it becomes pretty clear how subtraction has to happen). Motion estimation may or may not happen on the Y, U and V planes; sometimes encoders only do it on the Y (luminance) values to save a bit of CPU at the expense of quality. A sketch of the per-pixel subtraction follows below.
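To make the per-pixel arithmetic concrete, here is a minimal sketch (my own illustration, not the actual H.264 code path) of how the residual for one 16x16 block of one plane could be formed once a motion vector has been chosen. The block size, layout and function name are assumptions; real encoders also clamp vectors, interpolate at sub-pixel positions, and transform/quantize the residual afterwards.

#include <cstdint>

// Hypothetical sketch: residual for one 16x16 block of one plane (Y, U or V),
// given a motion vector (mvx, mvy) chosen by motion estimation.
// Per-pixel formula: residual(x, y) = current(x, y) - reference(x + mvx, y + mvy).
// The result is signed (no absolute value); it is kept as int16_t because the
// next stage (transform + quantization) works on signed values.
void blockResidual(const uint8_t* cur, const uint8_t* ref,
                   int stride,          // bytes per row of the plane
                   int bx, int by,      // top-left corner of the block
                   int mvx, int mvy,    // motion vector, integer pixels
                   int16_t* residual)   // out: 16*16 signed differences
{
    for (int y = 0; y < 16; ++y) {
        for (int x = 0; x < 16; ++x) {
            int curPix = cur[(by + y) * stride + (bx + x)];
            int refPix = ref[(by + y + mvy) * stride + (bx + x + mvx)];
            residual[y * 16 + x] = static_cast<int16_t>(curPix - refPix);
        }
    }
}

The decoder reverses this by adding the stored residual back onto the motion-compensated prediction, which is why the sign must be kept rather than an absolute value. The same subtraction is repeated on the U and V planes, with the vector scaled to account for chroma subsampling.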

Related

OpenCV: Detecting seizure-inducing lights in a video?

I have been working on an algorithm which can detect seizure-inducing strobe lights in a video.
Currently, my code returns virtually every frame as capable of causing a seizure (3Hz flashes).
My code calculates the relative luminance of each pixel and counts how many times the luminance goes up then down (or down then up) by more than 10% within any given second.
Is there a way to do this without comparing every individual pixel within a second of each other, and that returns only the correct frames?
An example of what I am trying to emulate: https://trace.umd.edu/peat
The common approach to solving this type of problem is to convert the frames to grayscale and then construct a cube containing frames from a 1 to 3 second time interval. From this cube, you can extract the time-varying characteristics of either individual pixels (noisy) or blocks (recommended). The resulting 1D curves can first be observed manually to see if they actually show the 3Hz variation that you are looking for (sometimes these variations are lost or distorted because of the camera's auto-exposure settings). If you can see it, then you should be able to use an FFT to isolate and detect it automatically.
Convert the image to grayscale. Break the image up into blocks, maybe 16x16 or 64x64 or larger (experiment to see what works). Take the average luminance of each block over a minimum of 2/3 seconds. Create a wave of luminance over time. Do an FFT on this wave and look for a minimum energy threshold around 3Hz.
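As a rough illustration of the FFT idea (my own sketch, not the PEAT algorithm), here is how the energy near 3Hz could be measured for one block's luminance-over-time signal. Only one frequency is of interest, so a single DFT bin is evaluated directly instead of a full FFT; the threshold is a placeholder to tune against known seizure-inducing clips.

#include <cmath>
#include <vector>

// Hypothetical sketch: `luma` holds the mean grayscale value of one block for
// each frame in a 1-3 second window, sampled at `fps` frames per second.
// Evaluate the single DFT bin at `targetHz` and return its normalized energy.
double energyAtFrequency(const std::vector<double>& luma, double fps, double targetHz)
{
    const double kPi = 3.14159265358979323846;
    double re = 0.0, im = 0.0;
    for (std::size_t i = 0; i < luma.size(); ++i) {
        const double phase = 2.0 * kPi * targetHz * (static_cast<double>(i) / fps);
        re += luma[i] * std::cos(phase);
        im -= luma[i] * std::sin(phase);
    }
    const double n = static_cast<double>(luma.size());
    return (re * re + im * im) / (n * n);   // normalized bin energy
}

// A block would be flagged when its 3Hz energy exceeds a tuned threshold
// (the value 25.0 here is purely a placeholder).
bool looksLikeStrobe(const std::vector<double>& luma, double fps)
{
    return energyAtFrequency(luma, fps, 3.0) > 25.0;
}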

DCT based Video Encoding Process

I am having some issues that I am hoping you will be able to clarify. I have taught myself a video encoding process similar to MPEG-2. The process is as follows:
Split an RGBA image into 4 separate channel data memory blocks, so an array of all R values, a separate array of all G values, etc.
Take the array and grab an 8x8 block of pixel data, then transform it using the Discrete Cosine Transform (DCT).
Quantize this 8x8 block using a pre-calculated quantization matrix.
Zigzag encode the output of the quantization step. So I should get a trail of consecutive numbers.
Run Length Encode (RLE) the output from the zigzag algorithm.
Huffman-code the data after the RLE stage, substituting values from a pre-computed Huffman table.
Go back to step 2 and repeat for each channel until all of the channel data has been encoded.
My first question: do I need to convert the RGBA values to YUV+A (YCbCr+A) values for the process to work, or can it continue using RGBA? I ask because the RGBA->YUVA conversion is a heavy workload that I would like to avoid if possible.
Next question: should the RLE store runs for just 0s, or can that be extended to all the values in the array? See the examples below:
440000000111 == [2,4][7,0][3,1] // RLE for all values
or
440000000111 == 44[7,0]111 // RLE for 0's only
The final question is what a single symbol would be in the Huffman stage. Would a symbol to be replaced be a value like 2 or 4, or would a symbol be the run/level pair [2,4], for example?
Thanks for taking the time to read and help me out here. I have read many papers and watched many YouTube videos, which have aided my understanding of the individual algorithms but not how they all link together to form the encoding process in code.
(this seems more like JPEG than MPEG-2 - video formats are more about compressing differences between frames, rather than just image compression)
If you work in RGB rather than YUV, you're probably not going to get the same compression ratio and/or quality, but you can do that if you want. Colour-space conversion is hardly a heavy workload compared to the rest of the algorithm.
Typically in this sort of application you RLE the zeros, because that's the element that you get a lot of repetitions of (and hopefully also a good number at the end of each block, which can be replaced with a single marker value), whereas other coefficients are not so repetitive. But if you expect repetitions of other values, I guess YMMV.
And yes, you can encode the RLE pairs as single symbols in the Huffman encoding.
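To illustrate the run/level idea, here is a minimal sketch (my own naming, not the MPEG-2 VLC tables) that walks the zigzagged coefficients, counts the zeros before each nonzero value, and emits (run, level) pairs; each pair is then what would be looked up as a single symbol in the Huffman table.

#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: zero-run RLE over a zigzagged 8x8 coefficient block.
// Each pair is (number of zeros preceding the value, the value itself).
// A (0, 0) pair is used here as an end-of-block marker for the trailing zeros.
std::vector<std::pair<int, int>> runLevelEncode(const int16_t (&zigzag)[64])
{
    std::vector<std::pair<int, int>> pairs;
    int run = 0;
    for (int i = 0; i < 64; ++i) {
        if (zigzag[i] == 0) {
            ++run;                       // extend the current zero run
        } else {
            pairs.emplace_back(run, zigzag[i]);
            run = 0;
        }
    }
    pairs.emplace_back(0, 0);            // end-of-block marker
    return pairs;
}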
1) Yes, you'll want to convert to YUV. To achieve higher compression ratios, you need to take advantage of the human eye's ability to "overlook" significant loss in color. Typically, you'll keep your Y plane at the same resolution (presumably the A plane as well), but downsample the U and V planes by 2x2. E.g. if you're doing 640x480, the Y is 640x480 and the U and V planes are 320x240. Also, you might choose different quantization for the U/V planes. The cost of this conversion is small compared to the DCT or DFT.
2) You don't have to RLE it; you could just Huffman-code it directly.
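For reference, here is a minimal per-pixel sketch of the RGB -> YCbCr step and of the 2x2 chroma downsample (my own code, using the common full-range BT.601 coefficients; real codecs differ in range, matrix and rounding, and usually use fixed-point arithmetic).

#include <cstdint>

// Hypothetical sketch: full-range BT.601 RGB -> YCbCr for one pixel.
void rgbToYCbCr(uint8_t r, uint8_t g, uint8_t b,
                uint8_t& y, uint8_t& cb, uint8_t& cr)
{
    y  = static_cast<uint8_t>( 0.299 * r + 0.587 * g + 0.114 * b);
    cb = static_cast<uint8_t>(-0.169 * r - 0.331 * g + 0.500 * b + 128.0);
    cr = static_cast<uint8_t>( 0.500 * r - 0.419 * g - 0.081 * b + 128.0);
}

// 4:2:0 subsampling of a chroma plane: average each 2x2 group of samples.
// width and height are assumed even here for brevity.
void downsample420(const uint8_t* chroma, int width, int height, uint8_t* out)
{
    for (int y = 0; y < height; y += 2) {
        for (int x = 0; x < width; x += 2) {
            int sum = chroma[y * width + x]       + chroma[y * width + x + 1]
                    + chroma[(y + 1) * width + x] + chroma[(y + 1) * width + x + 1];
            out[(y / 2) * (width / 2) + (x / 2)] = static_cast<uint8_t>(sum / 4);
        }
    }
}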

Detect if images are different in real-time

I am working on a microscope that streams live images via a built-in video camera to a PC, where further image processing can be performed on the streamed image. Any processing done on the streamed image must be done in "real-time" (minimal frames dropped).
We take the average of a series of static images to counter random noise from the camera to improve the output of some of our image processing routines.
My question is: how do I know when the image is no longer static (either the sample under inspection has moved or rotated, or the camera has zoomed in or out) so that I can reset the image series used for averaging?
I looked through some of the threads, and some ideas that seemed interesting:
Note: using Windows, C++ and Intel IPP. With IPP the image is a byte array (Ipp8u).
1. Hash the images, and compare the hashes (normal hash or perceptual hash?)
2. Use normalized cross correlation (IPP has many variations - which to use?)
Which do you guys think is suitable for my situation (speed)?
If your camera doesn't shake, you can, as inVader said, subtract images. Then the sum of absolute values of all pixels of the difference image is sometimes enough to tell if the images are the same or different. However, if your noise, lighting level, etc. varies, this will not give you a good enough S/N ratio.
And in noisy conditions, normal hashes are even more useless.
The best approach would be to identify that some feature of your object has changed, like its boundary (if it's regular) or its center of mass (if it's irregular). If you have a boundary position, you'll need to analyze just one line of pixels, perpendicular to that boundary, to tell that the boundary has moved.
The center-of-mass position may be subject to frequent false-negative responses, but adding a total mass and/or moment of inertia may help.
If the camera shakes, you may have to align the images before comparing (depending on the comparison method and required accuracy, a single-pixel misalignment might be huge), and that's where cross-correlation helps.
Furthermore, you don't have to analyze each image. You can skip one, and if the next one differs, discard both of them. That gives you twice as much time to analyze an image.
And if you are averaging images, you might just define an optimal number of images you need and compare just the first and the last image in the sequence.
So, the simplest thing to try would be to take subsequent images, subtract them from each other and have a look at the difference. Then define some rules, including local and global thresholds for the difference, under which two images are considered equal. Simple subtraction of bitmap/array data, looking for maxima and calculating the average difference across the whole thing, should be no problem to do in real time.
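Since the question mentions Ipp8u buffers, here is a plain C++ sketch of that global/local check (my own thresholds, purely placeholders to tune against the camera's noise floor; IPP's absolute-difference and norm primitives could replace the inner loop for speed, but check the IPP reference for the exact function names).

#include <cstdint>
#include <cstdlib>

// Hypothetical sketch: decide whether two same-sized grayscale frames differ,
// using the mean absolute difference (global rule) plus the worst per-pixel
// difference (local rule). Both thresholds are placeholders.
bool framesDiffer(const uint8_t* a, const uint8_t* b,
                  int width, int height, int stride,
                  double meanThreshold = 2.0, int pixelThreshold = 40)
{
    long long sum = 0;
    int worst = 0;
    for (int y = 0; y < height; ++y) {
        const uint8_t* rowA = a + y * stride;
        const uint8_t* rowB = b + y * stride;
        for (int x = 0; x < width; ++x) {
            const int d = std::abs(rowA[x] - rowB[x]);
            sum += d;
            if (d > worst) worst = d;    // track the largest local difference
        }
    }
    const double mean = static_cast<double>(sum) / (static_cast<double>(width) * height);
    return mean > meanThreshold || worst > pixelThreshold;
}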
If there are varying light conditions or something moving in a predictable way (like a door opening and closing), then something more powerful, albeit slower, like Gaussian mixture models for background modeling, might be worth looking into. It is quite compute-intensive, but can be parallelized pretty easily.
Motion detection algorithms are what is typically used here:
http://www.codeproject.com/Articles/10248/Motion-Detection-Algorithms
http://www.codeproject.com/Articles/22243/Real-Time-Object-Tracker-in-C
First of all, I would take a series of images at a low fps and downsample those images to make them smaller; not by too much, but enough to speed up the process.
Now you have several options:
You could compute a sum of absolute differences (SAD) of the two images by subtracting them, and use a threshold to decide whether the image has changed.
If you want to speed it up even further, I would suggest doing a progressive SAD using a small kernel and moving from the top of the image to the bottom. You can evaluate the cumulative amount of difference during the process and stop early once you are satisfied.
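A rough sketch of that early-exit idea (my own band height and threshold, both placeholders): accumulate the SAD band by band from the top of the downsampled frames and stop as soon as the running total already exceeds the "changed" threshold.

#include <cstdint>
#include <cstdlib>

// Hypothetical sketch: progressive SAD over horizontal bands, top to bottom.
// Returns true (changed) as soon as the accumulated difference crosses the
// threshold, so the rest of the image is never visited.
bool changedEarlyExit(const uint8_t* a, const uint8_t* b,
                      int width, int height,
                      long long threshold, int bandHeight = 16)
{
    long long sad = 0;
    for (int y0 = 0; y0 < height; y0 += bandHeight) {
        const int y1 = (y0 + bandHeight < height) ? y0 + bandHeight : height;
        for (int y = y0; y < y1; ++y)
            for (int x = 0; x < width; ++x)
                sad += std::abs(a[y * width + x] - b[y * width + x]);
        if (sad > threshold)
            return true;                 // already clearly different: stop early
    }
    return false;
}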

Why does JPEG compression process the image in 8x8 blocks?

Why does JPEG compression process the image in 8x8 blocks instead of applying the Discrete Cosine Transform to the whole image?
8 X 8 was chosen after numerous experiments with other sizes.
The conclusions of experiments are:
1. Matrices of sizes greater than 8x8 are harder to do mathematical operations on (like transforms, etc.), are not supported by hardware, or take a longer time.
2. Matrices of sizes less than 8x8 don't have enough information to continue along the pipeline. This results in bad quality of the compressed image.
Because that would take "forever" to decode. I don't remember fully now, but I think you need at least as many coefficients as there are pixels in the block. If you code the whole image as a single block, I think you need to iterate through all the DCT coefficients for every pixel.
I'm not very good at big O calculations but I guess the complexity would be O("forever"). ;-)
For modern video codecs I think they've started using 16x16 blocks instead.
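To put rough numbers on that: a naive 2D DCT costs about one multiply-add per (pixel, coefficient) pair, so coding a region as a single block grows with the square of its pixel count. A back-of-the-envelope sketch (my own arithmetic; separable and fast DCTs improve the constants but not the overall trend):

#include <cstdio>

// Hypothetical back-of-the-envelope: roughly N*N multiply-adds for a naive
// DCT of an N-pixel block. Compare 8x8 blocks tiling a 640x480 image with
// treating the whole image as one block.
int main()
{
    const long long imgPixels   = 640LL * 480LL;            // 307,200
    const long long blockPixels = 8LL * 8LL;                 // 64
    const long long numBlocks   = imgPixels / blockPixels;   // 4,800

    const long long opsPerBlock = blockPixels * blockPixels; // 4,096
    const long long opsTiled    = opsPerBlock * numBlocks;   // ~19.7 million
    const long long opsOneBlock = imgPixels * imgPixels;     // ~94.4 billion

    std::printf("8x8 tiling:               %lld multiply-adds\n", opsTiled);
    std::printf("whole image as one block: %lld multiply-adds\n", opsOneBlock);
    return 0;
}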
One good reason is that images (or at least the kind of images humans like to look at) have a high degree of information correlation locally, but not globally.
Every relatively smooth patch of skin, or piece of sky or grass or wall eventually ends in a sharp edge and is replaced by something entirely different. This means you still need a high frequency cutoff in order to represent the image adequately rather than just blur it out.
Now, because Fourier-like transforms such as the DCT "jumble" all the spatial information, you wouldn't be able to throw away any intermediate coefficients, nor the high-frequency components "you don't like".
There are of course other ways to try to discard visual noise and reconstruct edges at the same time, by preserving high-frequency components only when needed, or by doing some iterative reconstruction of the image at finer levels of detail. You might want to look into scale-space representations and wavelet transforms.

Video upsampling with C/C++

I want to upsample an array of OpenCV images captured from a webcam, or the corresponding float arrays (pixel values don't need to be discrete integers). Unfortunately, the upsampling ratio is not always an integer, so I cannot figure out how to do it with simple linear interpolation.
Is there an easier way or a library to do this?
Well, I don't know of a library to do framerate scaling.
But I can tell you that the most appropriate way to do it yourself is by just dropping or doubling frames.
Blending pictures by simple linear pixel interpolation will not improve quality; playback will still look jerky, and now blurry as well.
To properly interpolate frame rates, much more complicated algorithms are needed.
Modern TVs have built-in hardware for that, and video editing software such as After Effects has functions that do it.
These algorithms are able to create in-between pictures by motion analysis. But that is beyond the scope of a small problem's solution.
So either go on searching for an existing library you can use or do it by just dropping/doubling frames.
The ImageMagick MagickWand library will resize images using proper filtering algorithms - see the MagickResizeImage() function (and use the Sinc filter).
I am not 100% familiar with video capture, so I'm not sure what you mean by "pixel values don't need to be discrete integers". Does this mean the color information per pixel may not be integers?
I am assuming that by "the upsampling ratio is not always integer", you mean that you will upsample from one resolution to another, but you might not be doubling or tripling. For example, instead of 640x480 -> 1280x960, you may be doing, 640x480 -> 800x600.
A simple algorithm might be:
For each pixel in the larger grid
Scale the x/y values to lie between 0,1 (divide x by width, y by height)
Scale the x/y values by the width/height of the smaller grid -> xSmaller, ySmaller
Determine the four pixels that contain your point, via floating point floor/ceiling functions
Get the x/y values of where the point lies within that rectangle, between 0 and 1 (subtract the floor values from xSmaller, ySmaller) -> xInterp, yInterp
Start with black, and add your four colors, scaled by the xInterp/yInterp factors for each
You can make this faster for multiple frames by creating a lookup table to map pixels -> xInterp/yInterp values
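Here is a minimal sketch of the bilinear scheme described in the steps above, for a single grayscale/float channel (my own code, not a library routine; it skips the lookup-table optimization and simply clamps at the borders). It works for non-integer ratios such as 640x480 -> 800x600.

#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch: map each destination pixel back into the source grid,
// find the four surrounding source pixels, and blend them by the fractional
// offsets (xInterp, yInterp).
std::vector<float> resizeBilinear(const std::vector<float>& src,
                                  int srcW, int srcH, int dstW, int dstH)
{
    std::vector<float> dst(static_cast<std::size_t>(dstW) * dstH);
    auto clampX = [&](int v) { return v < 0 ? 0 : (v >= srcW ? srcW - 1 : v); };
    auto clampY = [&](int v) { return v < 0 ? 0 : (v >= srcH ? srcH - 1 : v); };

    for (int y = 0; y < dstH; ++y) {
        for (int x = 0; x < dstW; ++x) {
            // Position of this destination pixel in source coordinates.
            const float sx = (x + 0.5f) * srcW / dstW - 0.5f;
            const float sy = (y + 0.5f) * srcH / dstH - 0.5f;

            const int x0 = static_cast<int>(std::floor(sx));
            const int y0 = static_cast<int>(std::floor(sy));
            const float xInterp = sx - x0;      // fractional offsets
            const float yInterp = sy - y0;

            const float p00 = src[clampY(y0)     * srcW + clampX(x0)];
            const float p10 = src[clampY(y0)     * srcW + clampX(x0 + 1)];
            const float p01 = src[clampY(y0 + 1) * srcW + clampX(x0)];
            const float p11 = src[clampY(y0 + 1) * srcW + clampX(x0 + 1)];

            // Blend the four neighbours.
            const float top    = p00 + (p10 - p00) * xInterp;
            const float bottom = p01 + (p11 - p01) * xInterp;
            dst[static_cast<std::size_t>(y) * dstW + x] = top + (bottom - top) * yInterp;
        }
    }
    return dst;
}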
I am sure there are much better algorithms out there than linear interpolation (bilinear, and many more). This seems like the sort of thing you'd want optimized at the processor level.
Use libswscale from the ffmpeg project. It is the most optimized and supports a number of different resampling algorithms.