DCT based Video Encoding Process - c++

I am having some issues that I am hoping you will be able to clarify. I have taught myself a video encoding process similar to MPEG-2. The process is as follows:
1. Split an RGBA image into 4 separate channel data memory blocks: an array of all R values, a separate array of all G values, and so on.
2. Take the array and grab a block of 8x8 pixel data, then transform it using the Discrete Cosine Transform (DCT).
3. Quantize this 8x8 block using a pre-calculated quantization matrix.
4. Zigzag encode the output of the quantization step, so I should get a trail of consecutive numbers.
5. Run Length Encode (RLE) the output from the zigzag algorithm.
6. Huffman code the data after the RLE stage, using substitution of values from a pre-computed Huffman table.
7. Go back to step 2 and repeat until all of the channel's data has been encoded.
8. Go back to step 2 and repeat for each remaining channel.
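For concreteness, here is a minimal, unoptimised sketch of what steps 2 and 3 might look like for one 8x8 tile of a single channel. The level shift by 128 and the caller-supplied quantization matrix are illustrative assumptions, not a tuned setup.

    // Sketch of steps 2-3 for one 8x8 tile of a single channel.
    // The 128 level shift is assumed; quant[][] entries must be non-zero.
    #include <cmath>
    #include <cstdint>

    void forwardDctAndQuantize(const uint8_t block[8][8],
                               const int quant[8][8],
                               int out[8][8])
    {
        const double pi = 3.14159265358979323846;
        for (int u = 0; u < 8; ++u) {
            for (int v = 0; v < 8; ++v) {
                double sum = 0.0;
                for (int x = 0; x < 8; ++x)
                    for (int y = 0; y < 8; ++y)
                        sum += (block[x][y] - 128.0)                   // level shift to centre on zero
                             * std::cos((2 * x + 1) * u * pi / 16.0)
                             * std::cos((2 * y + 1) * v * pi / 16.0);
                const double cu = (u == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
                const double cv = (v == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
                const double coeff = 0.25 * cu * cv * sum;             // standard 8x8 DCT-II scaling
                out[u][v] = static_cast<int>(std::lround(coeff / quant[u][v]));
            }
        }
    }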
My first question is: do I need to convert the RGBA values to YUV+A (YCbCr+A) values for the process to work, or can it continue using RGBA? I ask because the RGBA->YUVA conversion is a heavy workload that I would like to avoid if possible.
Next question: should the RLE store runs for just 0s, or can that be extended to all the values in the array? See the examples below:
440000000111 == [2,4][7,0][3,1] // RLE for all values
or
440000000111 == 44[7,0]111 // RLE for 0's only
The final question is: what would a single symbol be with regard to the Huffman stage? Would a symbol to be replaced be a single value like 2 or 4, or would a symbol be the run/level pair [2,4], for example?
Thanks for taking the time to read and help me out here. I have read many papers and watched many YouTube videos, which have aided my understanding of the individual algorithms but not how they all link together to form the encoding process in code.

(this seems more like JPEG than MPEG-2 - video formats are more about compressing differences between frames, rather than just image compression)
If you work in RGB rather than YUV, you're probably not going to get the same compression ratio and/or quality, but you can do that if you want. Colour-space conversion is hardly a heavy workload compared to the rest of the algorithm.
Typically in this sort of application you RLE the zeros, because that's the element you get a lot of repetitions of (and hopefully also a good run at the end of each block, which can be replaced with a single end-of-block marker), whereas the other coefficients are not so repetitive. If you expect repetitions of other values as well, your mileage may vary.
And yes, you can encode the RLE pairs as single symbols in the Huffman encoding.
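As an illustration of both points, a rough sketch of zero-run RLE over an already zigzag-scanned block, where each (run, level) pair would then be one Huffman symbol. The {0,0} end-of-block marker and the separate treatment of the DC coefficient are common conventions assumed here, not taken from the question.

    // Zero-run RLE over 64 zigzag-ordered coefficients. Each RunLevel pair is
    // later treated as a single symbol by the Huffman stage.
    #include <vector>

    struct RunLevel { int run; int level; };       // run of zeros, then one non-zero value
    static const RunLevel kEndOfBlock = { 0, 0 };  // marker that replaces the trailing zeros

    std::vector<RunLevel> runLengthEncode(const int zz[64])
    {
        std::vector<RunLevel> out;
        int run = 0;
        for (int i = 1; i < 64; ++i) {             // index 0 is the DC coefficient,
            if (zz[i] == 0) {                      // commonly coded separately
                ++run;
            } else {
                out.push_back({ run, zz[i] });
                run = 0;
            }
        }
        out.push_back(kEndOfBlock);                // whatever zeros remain collapse into EOB
        return out;
    }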

1) Yes you'll want to convert to YUV... to achieve higher compression ratios, you need to take advantage of the human eye's ability to "overlook" significant loss in color. Typically, you'll keep your Y plane the same resolution (presumably the A plane as well), but downsample the U and V planes by 2x2. E.g. if you're doing 640x480, the Y is 640x480 and the U and V planes are 320x240. Also, you might choose different quantization for the U/V planes. The cost for this conversion is small compared to DCT or DFT.
2) You don't have to RLE it, you could just Huffman Code it directly.
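To make point 1 concrete, here is a sketch of the conversion plus 2x2 chroma downsampling. The full-range BT.601-style coefficients and the assumption of an even width and height are mine, not from the answer; pick whatever matches your target format.

    // RGBA -> full-resolution Y and A planes, plus 2x2-averaged Cb/Cr planes.
    // Assumes w and h are even.
    #include <cstdint>
    #include <vector>

    void rgbaToYCbCrA420(const uint8_t* rgba, int w, int h,
                         std::vector<uint8_t>& Y,  std::vector<uint8_t>& A,
                         std::vector<uint8_t>& Cb, std::vector<uint8_t>& Cr)
    {
        Y.resize(w * h);  A.resize(w * h);
        Cb.resize((w / 2) * (h / 2));  Cr.resize((w / 2) * (h / 2));

        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                const uint8_t* p = rgba + 4 * (y * w + x);
                double r = p[0], g = p[1], b = p[2];
                Y[y * w + x] = static_cast<uint8_t>(0.299 * r + 0.587 * g + 0.114 * b + 0.5);
                A[y * w + x] = p[3];                       // alpha kept at full resolution
            }
        }
        // Subsample chroma 2x2: average the four pixels of each block.
        for (int y = 0; y < h / 2; ++y) {
            for (int x = 0; x < w / 2; ++x) {
                double cb = 0.0, cr = 0.0;
                for (int dy = 0; dy < 2; ++dy)
                    for (int dx = 0; dx < 2; ++dx) {
                        const uint8_t* p = rgba + 4 * ((2 * y + dy) * w + (2 * x + dx));
                        double r = p[0], g = p[1], b = p[2];
                        cb += -0.168736 * r - 0.331264 * g + 0.5      * b;
                        cr +=  0.5      * r - 0.418688 * g - 0.081312 * b;
                    }
                int cbVal = static_cast<int>(cb / 4.0 + 128.0 + 0.5);
                int crVal = static_cast<int>(cr / 4.0 + 128.0 + 0.5);
                Cb[y * (w / 2) + x] = static_cast<uint8_t>(cbVal < 0 ? 0 : (cbVal > 255 ? 255 : cbVal));
                Cr[y * (w / 2) + x] = static_cast<uint8_t>(crVal < 0 ? 0 : (crVal > 255 ? 255 : crVal));
            }
        }
    }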

Related

Why JPEG compression processes image by 8x8 blocks?

Why JPEG compression processes image by 8x8 blocks instead of applying Discrete Cosine Transform to the whole image?
8x8 was chosen after numerous experiments with other sizes.
The conclusions of the experiments were:
1. Matrices larger than 8x8 are harder to do mathematical operations on (like transforms), are not supported by hardware, or take a longer time.
2. Matrices smaller than 8x8 don't have enough information to continue along the pipeline, which results in bad quality of the compressed image.
Because that would take "forever" to decode. I don't remember fully now, but I think you need at least as many coefficients as there are pixels in the block. If you code the whole image as a single block, I think you need to iterate through all the DCT coefficients for every pixel.
I'm not very good at big O calculations but I guess the complexity would be O("forever"). ;-)
For modern video codecs I think they've started using 16x16 blocks instead.
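A back-of-the-envelope estimate of that "forever" (my numbers, assuming a naive non-separable DCT where every coefficient sums over every pixel of its block):

    % Naive DCT: each of the N coefficients of an N-pixel block sums over all
    % N pixels, so one block costs on the order of N^2 multiply-adds.
    \mathrm{cost}_{\text{8x8 tiles}} = \frac{WH}{64}\cdot 64^2 = 64\,WH
    \qquad\text{vs.}\qquad
    \mathrm{cost}_{\text{one block}} = (WH)^2

For a 640x480 frame that is roughly 2x10^7 versus 9x10^10 multiply-adds; separable and fast DCTs shrink both numbers, but the gap stays enormous.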
One good reason is that images (or at least the kind of images humans like to look at) have a high degree of information correlation locally, but not globally.
Every relatively smooth patch of skin, or piece of sky or grass or wall eventually ends in a sharp edge and is replaced by something entirely different. This means you still need a high frequency cutoff in order to represent the image adequately rather than just blur it out.
Now, because Fourier-like transforms such as the DCT "jumble" all the spatial information, you wouldn't be able to throw away any intermediate coefficients either, nor the high-frequency components "you don't like".
There are of course other ways to try to discard visual noise and reconstruct edges at the same time, by preserving high-frequency components only when needed, or doing some iterative reconstruction of the image at finer levels of detail. You might want to look into scale-space representation and wavelet transforms.

How does H.264 or video encoders in general compute the residual image of two frames?

I have been trying to understand how video encoding works for modern encoders, in particular H264.
It is very often mentioned in documentation that residual frames are created from the differences between the current P-frame and the last I-frame (assuming the following frames are not used in the prediction). I understand that a YUV color space is used (maybe YV12), and that one image is "subtracted" from the other and then the residual is formed.
What I don't understand is how exactly this subtraction works. I don't think it is an absolute value of the difference, because that would be ambiguous. What is the per-pixel formula to obtain this difference?
Subtraction is just one small step in video encoding; the core principle behind most modern video encoding is motion estimation, followed by motion compensation. Basically, the process of motion estimation generates vectors that show offsets between macroblocks in successive frames. However, there's always a bit of error in these vectors.
So what happens is the encoder will output both the vector offsets and the "residual", which is what's left. The residual is not simply the difference between two frames; it's the difference between the two frames after motion compensation is taken into account. See the "Motion compensated difference" image in the Wikipedia article on motion compensation for a clear illustration of this; note that the motion-compensated difference is drastically smaller than the "dumb" residual.
Here's a decent PDF that goes over some of the basics.
A few other notes:
Yes, YUV is always used, and typically most encoders work in YV12 or some other chroma subsampled format
Subtraction will have to happen on the Y, U and V frames separately (think of them as three separate channels, all of which need to be encoded--then it becomes pretty clear how subtraction has to happen). Motion estimation may or may not happen on Y, U and V planes; sometimes encoders only do it on the Y (the luminance) values to save a bit of CPU at the expense of quality.
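As a per-pixel illustration of the above, here is a sketch for one 16x16 luma macroblock with a known integer motion vector. It assumes the vector keeps the block inside the reference frame; real encoders also handle sub-pixel vectors and edge padding.

    // residual = current pixel - pixel at the motion-shifted position in the
    // reference frame. The result is a signed difference, not an absolute value.
    #include <cstdint>

    void residualForMacroblock(const uint8_t* cur, const uint8_t* ref,
                               int stride, int mbX, int mbY,   // macroblock origin
                               int mvx, int mvy,               // motion vector
                               int16_t residual[16][16])
    {
        for (int y = 0; y < 16; ++y) {
            for (int x = 0; x < 16; ++x) {
                int curPix = cur[(mbY + y) * stride + (mbX + x)];
                int refPix = ref[(mbY + y + mvy) * stride + (mbX + x + mvx)];
                residual[y][x] = static_cast<int16_t>(curPix - refPix);
            }
        }
    }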

DCT compression

How does the DCT (Discrete Cosine Transform) help to compress sound (or any wave-like data)? According to the DCT transform there are N input values and N output values as a result. Where is the compression achieved and how?
The DCT does not compress. The size of the DCT output is the same as the size of the input signal. What the DCT does, however, is compact the energy of the signal. Roughly speaking, you end up with a small subset of big coefficients and a lot of small coefficients in the frequency domain. This situation is perfect for an entropy encoder that can remove the redundancies in the DCT output, thus providing compression.
Think about the sequence 1,2,3,4,5,.. It will not compress using LZ (zip) at all because there is zero repetition. Now encode the sequence as differences: 1,1,1,1,1,... Zip will compress it 99% now. Every algorithm detects a certain pattern well. DCT helps to encode the data into a format that is well compressible.
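A tiny demonstration of that energy-compaction point: a naive 1-D DCT-II of a smooth ramp puts most of the magnitude into the first couple of coefficients, leaving the rest small enough to quantize away cheaply.

    // Naive 1-D DCT-II of a smooth 8-sample ramp; prints the coefficients.
    #include <cmath>
    #include <cstdio>

    int main()
    {
        const int N = 8;
        const double pi = 3.14159265358979323846;
        double x[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };      // smooth input
        for (int k = 0; k < N; ++k) {
            double sum = 0.0;
            for (int n = 0; n < N; ++n)
                sum += x[n] * std::cos(pi / N * (n + 0.5) * k);
            std::printf("X[%d] = %8.3f\n", k, sum);    // X[0] and X[1] dominate,
        }                                              // higher terms shrink quickly
        return 0;
    }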
IMO it's an analysis of repetitions of certain values in the input (wave), presented in the form of frequencies (frequency + amplitude + repetition). For example, if you have a lot of low-frequency content in the audio (bass), a DCT will output many similar values at low frequencies (think of an equalizer band). This can be exploited by any compression algorithm. Also, a DCT is lossless and reversible.

Video upsampling with C/C++

I want to upsample an array of OpenCV images captured from a webcam, or the corresponding float arrays (pixel values don't need to be discrete integers). Unfortunately the upsampling ratio is not always an integer, so I cannot figure out how to do it with simple linear interpolation.
Is there an easier way or a library to do this?
Well, I don't know of a library to do frame-rate scaling.
But I can tell you that the most appropriate way to do it yourself is by just dropping or doubling frames.
Blending pictures by simple linear pixel interpolation will not improve quality; playback will still look jerky, and now also blurry.
To properly interpolate frame rates, much more complicated algorithms are needed.
Modern TVs have built-in hardware for that, and video editing software such as After Effects has functions that do it.
These algorithms are able to create in-between pictures by motion analysis, but that is beyond the scope of a small problem solution.
So either keep searching for an existing library you can use, or do it by just dropping/doubling frames.
The ImageMagick MagickWand library will resize images using proper filtering algorithms - see the MagickResizeImage() function (and use the Sinc filter).
I am not 100% familiar with video capture, so I'm not sure what you mean by "pixel values don't need to be discrete integer". Does this mean the color information per pixel may not be integers?
I am assuming that by "the upsampling ratio is not always integer", you mean that you will upsample from one resolution to another, but you might not be doubling or tripling. For example, instead of 640x480 -> 1280x960, you may be doing, 640x480 -> 800x600.
A simple algorithm might be, for each pixel in the larger grid:
1. Scale the x/y values to lie between 0 and 1 (divide x by width, y by height).
2. Scale the x/y values by the width/height of the smaller grid -> xSmaller, ySmaller.
3. Determine the four pixels that contain your point, via floating point floor/ceiling functions.
4. Get the x/y values of where the point lies within that rectangle, between 0 and 1 (subtract the floor values from xSmaller, ySmaller) -> xInterp, yInterp.
5. Start with black, and add your four colors, each scaled by the appropriate xInterp/yInterp factors.
You can make this faster for multiple frames by creating a lookup table to map pixels -> xInterp/yInterp values.
I am sure there are much better algorithms out there than linear interpolation (bilinear, and many more). This seems like the sort of thing you'd want optimized at the processor level.
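For reference, a single-channel sketch of the bilinear scheme outlined above, assuming 8-bit grayscale input; for RGB or RGBA you would repeat the blend per channel.

    // Bilinear resize of an 8-bit single-channel image to an arbitrary size.
    #include <cmath>
    #include <cstdint>
    #include <vector>

    std::vector<uint8_t> resizeBilinear(const std::vector<uint8_t>& src,
                                        int srcW, int srcH, int dstW, int dstH)
    {
        std::vector<uint8_t> dst(dstW * dstH);
        auto clamp = [](int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); };
        for (int y = 0; y < dstH; ++y) {
            for (int x = 0; x < dstW; ++x) {
                // Map the destination pixel centre back into source coordinates.
                double sx = (x + 0.5) * srcW / dstW - 0.5;
                double sy = (y + 0.5) * srcH / dstH - 0.5;
                int x0 = static_cast<int>(std::floor(sx));
                int y0 = static_cast<int>(std::floor(sy));
                double fx = sx - x0, fy = sy - y0;
                int x1 = clamp(x0 + 1, 0, srcW - 1), y1 = clamp(y0 + 1, 0, srcH - 1);
                x0 = clamp(x0, 0, srcW - 1);  y0 = clamp(y0, 0, srcH - 1);
                // Blend the four neighbours by their fractional weights.
                double top    = src[y0 * srcW + x0] * (1 - fx) + src[y0 * srcW + x1] * fx;
                double bottom = src[y1 * srcW + x0] * (1 - fx) + src[y1 * srcW + x1] * fx;
                dst[y * dstW + x] = static_cast<uint8_t>(top * (1 - fy) + bottom * fy + 0.5);
            }
        }
        return dst;
    }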
Use libswscale from the ffmpeg project. It is the most optimized and supports a number of different resampling algorithms.
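A minimal sketch of that approach; the header names, pixel-format constants and function signatures below are written from memory against a recent FFmpeg, so verify them against your installed version.

    // Resize a packed RGBA buffer with libswscale (bicubic filtering).
    extern "C" {
    #include <libswscale/swscale.h>
    #include <libavutil/pixfmt.h>
    }
    #include <cstdint>

    bool scaleRgba(const uint8_t* src, int srcW, int srcH,
                   uint8_t* dst, int dstW, int dstH)
    {
        SwsContext* ctx = sws_getContext(srcW, srcH, AV_PIX_FMT_RGBA,
                                         dstW, dstH, AV_PIX_FMT_RGBA,
                                         SWS_BICUBIC, nullptr, nullptr, nullptr);
        if (!ctx) return false;
        const uint8_t* srcPlanes[1] = { src };
        uint8_t*       dstPlanes[1] = { dst };
        int srcStride[1] = { 4 * srcW };
        int dstStride[1] = { 4 * dstW };
        sws_scale(ctx, srcPlanes, srcStride, 0, srcH, dstPlanes, dstStride);
        sws_freeContext(ctx);
        return true;
    }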

How can I scale down an array of raw rgb data on a 16 bit display

I have an array of raw RGB data for a 16-bit display with dimensions of 320 * 480. The size of the array is 320*480*4 = 614400.
I would like to know how I can scale this down to 80 * 120 without losing image quality.
I found this link about scaling an image in a 2D array, but how can I apply that to my array for a 16-bit display? It is not a 2D array (because it has 16-bit color).
Image scaling and rotating in C/C++
Thank you.
If you are scaling down a big image to a smaller one, you WILL lose image quality.
The question, then, is how to minimize that loss.
There are many algorithms that do this, each with strengths and weaknesses.
Typically you will apply some sort of filter to your image, such as Bilinear or Nearest Neighbor. Here is a discussion of such filters in the context of ImageMagick.
Also, if the output is going to be less than 16 bits per pixel, you need to do some form of Color Quantization.
I assume that you mean a 16 bit rgb display, not a display that has each color (red, green, and blue) as 16 bits. I also assume you know how your r, g, and b values are encoded in that 16 bit space, because there are two possibilities.
So, assuming you know how to split your color space up, you can now use a series of byte arrays to represent your data. The tricky decision is whether to go with byte arrays, because there is a body of algorithms that can already do the work on those arrays, but that will cost you a few extra bits per byte that you may not be able to spare; or to keep everything crammed into the 16-bit format and then do the work on the appropriate bits of each 16-bit pixel. Only you can really answer that question; if you have the memory, I'd opt for the byte-array approach, because it's probably faster and you'll get a little extra precision to make the images look smooth(er) in the end.
Given those assumptions, the question is really answered by how much time you have on your device. If you have a very fast device, you can implement Lanczos resampling. If you have a less fast device, bicubic interpolation also works very well. If you have an even slower device, bilinear interpolation is your friend.
If you really have no speed, I'd do the rescaling down in some external application, like photoshop, and save a series of bitmaps that you load as you need them.
There are plenty of methods of scaling down images, but none can guarantee not losing "quality". Ultimately information is lost during the rescaling process.
You have 16-bit colors = 2 bytes, but in your calculations you use a multiplier of 4.
Maybe you don't need to reduce the image size at all?
In general it is impossible to scale a raster image without losing quality. Some algorithms can scale with almost no visible quality loss.
Since you are scaling down by a factor of 4, each 4x4 block of pixels in your original image will correspond to a single pixel in your output image. You can then loop through each 4x4 block in the original image and then reduce this to a single pixel. A simple way (perhaps not the best way) to do this reduction could be to take the average or median of the RGB components.
You should note that you cannot do image scaling without losing image quality unless, within every block of the original image, all the pixels are exactly the same color (which is unlikely).
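For illustration, a sketch of that block-averaging reduction, assuming the source buffer really is 4 bytes per pixel (RGBA8888, as the 320*480*4 size suggests) and that the result is packed into 16-bit RGB565 for the display at the end.

    // Downscale by 4 in each dimension: average each 4x4 block of RGBA pixels
    // and pack the result into RGB565. Assumes srcW and srcH are multiples of 4.
    #include <cstdint>

    void downscale4x4Average(const uint8_t* src, int srcW, int srcH,   // e.g. 320x480 RGBA
                             uint16_t* dst)                            // (srcW/4) x (srcH/4) RGB565
    {
        int dstW = srcW / 4, dstH = srcH / 4;
        for (int y = 0; y < dstH; ++y) {
            for (int x = 0; x < dstW; ++x) {
                unsigned r = 0, g = 0, b = 0;
                for (int dy = 0; dy < 4; ++dy)
                    for (int dx = 0; dx < 4; ++dx) {
                        const uint8_t* p = src + 4 * ((y * 4 + dy) * srcW + (x * 4 + dx));
                        r += p[0];  g += p[1];  b += p[2];
                    }
                r /= 16;  g /= 16;  b /= 16;                 // average of the 4x4 block
                dst[y * dstW + x] = static_cast<uint16_t>(   // pack into RGB565
                    ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
            }
        }
    }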