How does the DCT (Discrete Cosine Transform) help to compress sound (or any wave-like data)? The DCT takes N input values and produces N output values, so where is the compression achieved, and how?
The DCT does not compress. The size of the DCT output is the same as the size of the input signal. What the DCT does, however, is compact the energy of the signal. Roughly speaking, you end up with a small subset of big coefficients and a lot of small coefficients in the frequency domain. This situation is perfect for an entropy encoder that can remove the redundancies in the DCT output, thus providing compression.
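Roughly, here's what that looks like in Python (a sketch only, assuming numpy and scipy are available; the test signal is an arbitrary smooth example):

    import numpy as np
    from scipy.fft import dct

    # A smooth, wave-like signal: the DCT gives the same number of outputs as inputs.
    n = 256
    t = np.arange(n)
    signal = np.sin(2 * np.pi * 3 * t / n) + 0.5 * np.sin(2 * np.pi * 7 * t / n)

    coeffs = dct(signal, norm='ortho')   # N in, N out; no compression yet

    # Energy compaction: a handful of coefficients hold almost all the energy,
    # which is what the entropy coder later exploits.
    energy = np.sort(coeffs ** 2)[::-1]
    cumulative = np.cumsum(energy) / energy.sum()
    print("coefficients for 99% of the energy:", np.searchsorted(cumulative, 0.99) + 1, "of", n)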
Think about the sequence 1, 2, 3, 4, 5, ... It will not compress with LZ (zip) at all, because there is no repetition. Now encode the sequence as differences: 1, 1, 1, 1, 1, ... Now zip will compress it by 99%. Every algorithm detects a certain kind of pattern well; the DCT helps to encode the data into a format that compresses well.
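A quick sketch of that in Python (zlib standing in for zip):

    import zlib
    import numpy as np

    values = np.arange(1, 100001, dtype=np.int32)      # 1, 2, 3, 4, 5, ...
    deltas = np.diff(values, prepend=values[:1])        # 1, 1, 1, 1, 1, ...

    print("raw bytes compressed to:   ", len(zlib.compress(values.tobytes())))
    print("delta bytes compressed to: ", len(zlib.compress(deltas.tobytes())))  # far smaller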
IMO it's an analysis of the repetition of certain values in the input (wave), presented in the form of frequencies (frequency + amplitude + repetition). For example, if the audio contains a lot of low-frequency content (bass), the DCT will output many similar values at the low frequencies (think of an equalizer band). This can be exploited by any compression algorithm. A DCT is also lossless and reversible.
After applying a Fourier transform to a signal, the energy of a single sine wave is often spread out across multiple bins (aka smearing). Have a look at the right side of the image below for an illustration:
I want to extract a list of peak frequencies. Just finding the highest bin is easy. But after that smearing becomes a problem.
I would like to have a heuristic which tells me if the magnitude of a specific bin is possibly the result of smearing or if there has to be another peak frequency in order to explain the signal. (It is better if I miss some than have false positives)
My naive approach would be to just calculate a few thousand examples and take the maximum of these to get an envelope curve so that any smearing is likely below that envelope.
But is there a better way to do this?
The FFT result of any rectangularly windowed pure unmodulated sinusoid is a Sinc function. This Sinc (sin(pi*x)/(pi*x)) function is zero at every bin except one (the peak magnitude) only when the frequency of the input sinusoid has exactly an integer number of periods in the FFT width.
For all other frequencies that are not at the exact center of an FFT result bin, if you know that exact frequency and magnitude (which won't be in any single FFT bin), you can calculate all the other bins by sampling the Sinc function.
And, of course, if the input sinusoid isn't perfectly pure, but modulated in any way (amplitude, frequency or phase), this modulation will produce various sidebands in the FFT result as well.
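As a small illustration of the point above (a numpy sketch; the frequency, amplitude and bin range are arbitrary), you can compare the measured FFT magnitudes of an off-bin sinusoid with the magnitudes predicted by sampling sin(pi*x)/(pi*x) at each bin's offset from the true frequency:

    import numpy as np

    N = 1024
    f0 = 37.3                    # true frequency in bins, deliberately not an integer
    amp = 1.0
    n = np.arange(N)
    x = amp * np.cos(2 * np.pi * f0 * n / N)     # rectangular window (no explicit windowing)

    measured = np.abs(np.fft.rfft(x))

    # Predicted bin magnitudes from the Sinc model: amp*N/2 * |sinc(k - f0)|,
    # ignoring the small contribution of the negative-frequency image.
    k = np.arange(len(measured))
    predicted = amp * N / 2 * np.abs(np.sinc(k - f0))

    for b in range(34, 42):
        print(f"bin {b:3d}  measured {measured[b]:8.2f}  sinc model {predicted[b]:8.2f}")

Bins whose measured magnitude is close to the model's prediction are plausibly just smearing from the known peak; a bin well above the prediction suggests another component.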
I am having some issues that I am hoping you will be able to clarify. I have taught myself a video encoding process similar to MPEG-2. The process is as follows:
1. Split an RGBA image into 4 separate channel data memory blocks: an array of all the R values, a separate array of all the G values, etc.
2. Take a block of 8x8 pixel data from a channel array and transform it using the Discrete Cosine Transform (DCT).
3. Quantize this 8x8 block using a pre-calculated quantization matrix.
4. Zigzag encode the output of the quantization step, so I end up with a trail of consecutive numbers.
5. Run Length Encode (RLE) the output of the zigzag scan.
6. Huffman code the data after the RLE stage, substituting values from a pre-computed Huffman table.
7. Go back to step 2 and repeat until every block in every channel has been encoded (a rough sketch of steps 2-4 follows this list).
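For reference, this is roughly what I mean by steps 2-4 for a single 8x8 block (a Python/numpy sketch only; the quantization matrix here is the standard JPEG luminance table, used purely as a stand-in):

    import numpy as np
    from scipy.fft import dctn

    # Example quantization matrix (JPEG luminance table), used here only for illustration.
    QUANT = np.array([
        [16, 11, 10, 16, 24, 40, 51, 61],
        [12, 12, 14, 19, 26, 58, 60, 55],
        [14, 13, 16, 24, 40, 57, 69, 56],
        [14, 17, 22, 29, 51, 87, 80, 62],
        [18, 22, 37, 56, 68, 109, 103, 77],
        [24, 35, 55, 64, 81, 104, 113, 92],
        [49, 64, 78, 87, 103, 121, 120, 101],
        [72, 92, 95, 98, 112, 100, 103, 99],
    ])

    def zigzag_indices(n=8):
        # Standard zigzag order: walk the anti-diagonals, alternating direction.
        return sorted(((r, c) for r in range(n) for c in range(n)),
                      key=lambda rc: (rc[0] + rc[1],
                                      rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

    def encode_block(block):                      # block: 8x8 array of one channel, 0..255
        shifted = block.astype(np.float64) - 128  # level shift, as JPEG does
        coeffs = dctn(shifted, norm='ortho')      # step 2: 2-D DCT
        quantized = np.round(coeffs / QUANT).astype(np.int32)  # step 3: quantize
        return [quantized[r, c] for r, c in zigzag_indices()]  # step 4: zigzag scan

    block = np.full((8, 8), 130, dtype=np.uint8)  # flat block: only DC survives quantization
    print(encode_block(block))                    # [1, 0, 0, 0, ...]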
First question: do I need to convert the RGBA values to YUV+A (YCbCr+A) for the process to work, or can it continue using RGBA? I ask because the RGBA->YUVA conversion is a heavy workload that I would like to avoid if possible.
Next question: should the RLE store runs for just 0s, or can that be extended to all the values in the array? See the examples below:
440000000111 == [2,4][7,0][3,1] // RLE for all values
or
440000000111 == 44[7,0]111 // RLE for 0's only
The final question is what a single symbol would be in the Huffman stage. Would a symbol be a single value like 2 or 4, or would it be the run-level pair [2,4], for example?
Thanks for taking the time to read and help me out here. I have read many papers and watched many YouTube videos, which have aided my understanding of the individual algorithms but not of how they all link together to form the encoding process in code.
(this seems more like JPEG than MPEG-2 - video formats are more about compressing differences between frames, rather than just image compression)
If you work in RGB rather than YUV, you're probably not going to get the same compression ratio and/or quality, but you can do that if you want. Colour-space conversion is hardly a heavy workload compared to the rest of the algorithm.
Typically in this sort of application you RLE the zeros, because that's the value you get long runs of (and hopefully also a good run at the end of each block, which can be replaced with a single end-of-block marker), whereas the other coefficients are not so repetitive. If you expect repetitions of other values, though, YMMV.
And yes, you can encode the RLE pairs as single symbols in the huffman encoding.
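For instance, here's a rough sketch of a zero-run RLE producing (run, value) pairs, with each pair then treated as one Huffman symbol (the 'EOB' marker and the exact pair format are just illustrative choices, loosely in the spirit of JPEG's run/level coding):

    from collections import Counter

    def rle_zeros(coeffs):
        """Encode a zigzag-scanned coefficient list as (zero_run, value) pairs."""
        pairs, run = [], 0
        for v in coeffs:
            if v == 0:
                run += 1
            else:
                pairs.append((run, v))
                run = 0
        if run:
            pairs.append(('EOB', 0))   # trailing zeros collapse into one end-of-block marker
        return pairs

    # The example from the question, plus two trailing zeros to show the end-of-block case.
    coeffs = [4, 4, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
    pairs = rle_zeros(coeffs)
    print(pairs)            # [(0, 4), (0, 4), (7, 1), (0, 1), (0, 1), ('EOB', 0)]

    # Each pair is a single symbol for the Huffman stage; count them to build the table.
    print(Counter(pairs))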
1) Yes, you'll want to convert to YUV... to achieve higher compression ratios, you need to take advantage of the human eye's ability to "overlook" significant loss in color. Typically, you'll keep your Y plane at full resolution (presumably the A plane as well), but downsample the U and V planes by 2x2. E.g. if you're doing 640x480, the Y plane is 640x480 and the U and V planes are 320x240. Also, you might choose different quantization for the U/V planes. The cost of this conversion is small compared to the DCT or DFT. (A rough sketch of the conversion follows point 2 below.)
2) You don't have to RLE it, you could just Huffman Code it directly.
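To illustrate point 1, here's a rough sketch of the conversion and 2x2 chroma downsampling in Python (BT.601 full-range coefficients used as an example; the exact constants depend on which standard you follow):

    import numpy as np

    def rgb_to_ycbcr_420(rgb):
        """rgb: (H, W, 3) floats in 0..255. Returns full-res Y and 2x2-subsampled Cb, Cr."""
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y  =  0.299 * r + 0.587 * g + 0.114 * b
        cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
        cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128
        # Downsample chroma by averaging each 2x2 block (H and W assumed even here).
        h, w = y.shape
        cb = cb.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        cr = cr.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        return y, cb, cr

    frame = np.random.randint(0, 256, (480, 640, 3)).astype(np.float64)
    y, cb, cr = rgb_to_ycbcr_420(frame)
    print(y.shape, cb.shape, cr.shape)   # (480, 640) (240, 320) (240, 320)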
Why does JPEG compression process the image in 8x8 blocks instead of applying the Discrete Cosine Transform to the whole image?
8x8 was chosen after numerous experiments with other sizes.
The conclusions of those experiments were:
1. Matrices larger than 8x8 are harder to do mathematical operations on (transforms etc.), are not well supported by hardware, or take longer to process.
2. Matrices smaller than 8x8 don't contain enough information to continue along the pipeline, which results in poor quality of the compressed image.
Because that would take "forever" to decode. I don't remember the details now, but I think you need at least as many coefficients as there are pixels in the block. If you code the whole image as a single block, I think you need to iterate through all the DCT coefficients for every pixel.
I'm not very good at big O calculations but I guess the complexity would be O("forever"). ;-)
For modern video codecs I think they've started using 16x16 blocks instead.
One good reason is that images (or at least the kind of images humans like to look at) have a high degree of information correlation locally, but not globally.
Every relatively smooth patch of skin, or piece of sky or grass or wall eventually ends in a sharp edge and is replaced by something entirely different. This means you still need a high frequency cutoff in order to represent the image adequately rather than just blur it out.
Now, because Fourier-like transforms such as the DCT "jumble" all the spatial information, you wouldn't be able to throw away any intermediate coefficients either, nor the high-frequency components "you don't like".
There are of course other ways to try to discard visual noise and reconstruct edges at the same time, by preserving high-frequency components only when needed, or by doing some iterative reconstruction of the image at finer levels of detail. You might want to look into scale-space representations and wavelet transforms.
I have a current implementation of Gaussian blur using regular convolution. It is efficient enough for small kernels, but once the kernel size gets a little bigger, the performance takes a hit. So I am thinking of implementing the convolution using FFT. I've never had any experience with FFT-related image processing, so I have a few questions.
Is a 2D FFT-based convolution also separable into two 1D convolutions?
If so, does it go like this: 1D FFT on every row, then 1D FFT on every column, then multiply with the 2D kernel, and then the inverse transform of every column and the inverse transform of every row? Or do I have to multiply with a 1D kernel after each 1D FFT?
Now I understand that the kernel should be the same size as the image (or a row, in the 1D case). But how will this affect the edges? Do I have to pad the image edges with zeros? If so, should the kernel size equal the image size before or after padding?
Also, this is a C++ project, and I plan on using kissFFT, since this is a commercial project. You are welcome to suggest any better alternatives. Thank you.
EDIT: Thanks for the responses, but I have a few more questions.
I see that the imaginary part of the input image will be all zeros. But will the imaginary part of the output also be zeros? Do I have to multiply the Gaussian kernel with both the real and imaginary parts?
I have instances of the same image to be blurred at different scales, i.e. the same image is scaled to different sizes and blurred at different kernel sizes. Do I have to perform a FFT every time I scale the image or can I use the same FFT?
Lastly, if I wanted to visualize the FFT, I understand that a log scale has to be applied. But I am really lost on which part should be used to visualize the FFT: the real part or the imaginary part?
Also for an image of size 512x512, what will be the size of real and imaginary parts. Will they be the same length?
Thank you again for your detailed replies.
The 2-D FFT is separable and you are correct in how to perform it, except that you must multiply by the 2-D FFT of the 2-D kernel. If you are using kissfft, an easier way to perform the 2-D FFT is to just use kiss_fftnd in the tools directory of the kissfft package. This will do multi-dimensional FFTs.
The kernel does not have to be any particular size. If the kernel is smaller than the image, you just need to zero-pad it up to the image size before performing the 2-D FFT. You should also zero-pad the image edges, since the convolution being performed by multiplication in the frequency domain is actually circular convolution, and the results wrap around at the edges.
So to summarize (given that the image size is M x N):
1. come up with a 2-D kernel of any size (U x V)
2. zero-pad the kernel up to (M+U-1) x (N+V-1)
3. take the 2-D FFT of the kernel
4. zero-pad the image up to (M+U-1) x (N+V-1)
5. take the 2-D FFT of the image
6. multiply the FFT of the kernel by the FFT of the image
7. take the inverse 2-D FFT of the result
8. trim off the garbage at the edges
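Here is a rough sketch of those steps in Python (numpy rather than kissfft, purely to show the flow; the kissfft calls map onto the same structure):

    import numpy as np

    def fft_convolve2d(image, kernel):
        M, N = image.shape
        U, V = kernel.shape
        P, Q = M + U - 1, N + V - 1            # padded size so the convolution is linear, not circular

        K = np.fft.rfft2(kernel, s=(P, Q))     # steps 2-3: zero-pad the kernel and take its 2-D FFT
        I = np.fft.rfft2(image, s=(P, Q))      # steps 4-5: zero-pad the image and take its 2-D FFT
        full = np.fft.irfft2(I * K, s=(P, Q))  # steps 6-7: multiply, then inverse 2-D FFT

        # Step 8: trim back to the original image size (output centred on the kernel).
        r0, c0 = U // 2, V // 2
        return full[r0:r0 + M, c0:c0 + N]

    x = np.arange(-7, 8)
    g = np.exp(-x ** 2 / (2 * 3.0 ** 2))
    g /= g.sum()
    kernel = np.outer(g, g)                    # 15x15 Gaussian, sigma = 3
    blurred = fft_convolve2d(np.random.rand(512, 512), kernel)
    print(blurred.shape)                       # (512, 512)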
If you are performing the same filter multiple times on different images, you don't have to perform 1-3 every time.
Note: The kernel size will have to be rather large for this to be faster than direct computation of convolution. Also, did you implement your direct convolution taking advantage of the fact that a 2-D gaussian filter is separable (see this a few paragraphs into the "Mechanics" section)? That is, you can perform the 2-D convolution as 1-D convolutions on the rows and then the columns. I have found this to be faster than most FFT-based approaches unless the kernels are quite large.
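For comparison, the separable direct convolution mentioned above looks roughly like this (a sketch assuming scipy.ndimage):

    import numpy as np
    from scipy.ndimage import convolve1d

    def separable_gaussian_blur(image, sigma):
        radius = int(3 * sigma)
        x = np.arange(-radius, radius + 1)
        g = np.exp(-x ** 2 / (2 * sigma ** 2))
        g /= g.sum()
        # Two 1-D passes (rows, then columns) instead of one full 2-D convolution.
        tmp = convolve1d(image, g, axis=1, mode='nearest')
        return convolve1d(tmp, g, axis=0, mode='nearest')

    blurred = separable_gaussian_blur(np.random.rand(512, 512), sigma=3.0)
    print(blurred.shape)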
Response to Edit
If the input is real, the output will still be complex except for rare circumstances. The FFT of your gaussian kernel will also be complex, so the multiply must be a complex multiplication. When you perform the inverse FFT, the output should be real since your input image and kernel are real. The output will be returned in a complex array, but the imaginary components should be zero or very small (floating point error) and can be discarded.
If you are using the same image, you can reuse the image FFT, but you will need to zero pad based on your biggest kernel size. You will have to compute the FFTs of all of the different kernels.
For visualization, the magnitude of the complex output should be used. The log scale just helps to visualize smaller components of the output when larger components would drown them out in a linear scale. The Decibel scale is often used and is given by either 20*log10(abs(x)) or 10*log10(x*x') which are equivalent. (x is the complex fft output and x' is the complex conjugate of x).
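For example (a small numpy snippet):

    import numpy as np

    X = np.fft.fft2(np.random.rand(512, 512))        # complex output, same 512x512 size as the input
    magnitude_db = 20 * np.log10(np.abs(X) + 1e-12)   # small epsilon avoids log(0)
    centred = np.fft.fftshift(magnitude_db)           # put DC in the middle for display
    print(centred.shape)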
The input and output of the FFT will be the same size. Also the real and imaginary parts will be the same size since one real and one imaginary value form a single sample.
Remember that convolution in space is equivalent to multiplication in frequency domain. This means that once you perform FFT of both image and mask (kernel), you only have to do point-by-point multiplication, and then IFFT of the result. Having said that, here are a few words of caution.
You probably know that in digital signal processing we often use circular convolution, not linear convolution. This happens because the DFT (and the FFT, which is its computationally efficient variant) assumes your signal is periodic. When you filter your signal this way -- suppose your image is N x M pixels -- the pixel at (1, m) is treated as the neighbor of the pixel at (N, m) for any m < M; your signal virtually wraps around onto itself. This means that your Gaussian mask will be averaging pixels on the far right with pixels on the far left, and the same goes for top and bottom. This might or might not be desired, but in general one has to deal with edge artifacts anyway. It is, however, much easier to forget about this issue when dealing with FFT multiplication, because the problem stops being apparent. There are many ways to take care of it; the best is to simply pad your image with zeros and remove the extra pixels later.
A very neat thing about using a Gaussian filter in the frequency domain is that you never actually have to take its FFT. It is a well-known fact that the Fourier transform of a Gaussian is a Gaussian (some technical details here). All you have to do then is pad your image with zeros (both top and bottom), generate a Gaussian in the frequency domain, multiply them together and take the IFFT. Then you're done.
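A sketch of that shortcut (assuming numpy; note it skips the zero-padding for brevity, so the wrap-around caveat above still applies):

    import numpy as np

    def gaussian_blur_freq(image, sigma):
        h, w = image.shape
        fy = np.fft.fftfreq(h)[:, None]          # cycles per pixel along rows
        fx = np.fft.fftfreq(w)[None, :]          # cycles per pixel along columns
        # Fourier transform of a unit-area Gaussian of std sigma is another Gaussian:
        # G(f) = exp(-2 * pi^2 * sigma^2 * f^2), so it can be generated directly.
        G = np.exp(-2 * np.pi ** 2 * sigma ** 2 * (fx ** 2 + fy ** 2))
        return np.real(np.fft.ifft2(np.fft.fft2(image) * G))

    blurred = gaussian_blur_freq(np.random.rand(512, 512), sigma=3.0)
    print(blurred.shape)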
Hope this helps.
I have been trying to understand how video encoding works for modern encoders, in particular H264.
It is very often mentioned in documentation that residual frames are created from the differences between the current P-frame and the last I-frame (assuming the following frames are not used in the prediction). I understand that a YUV color space is used (maybe YV12), and that one image is "subtracted" from the other to form the residual.
What I don't understand is how exactly this subtraction works. I don't think it is the absolute value of the difference, because that would be ambiguous. What is the per-pixel formula to obtain this difference?
Subtraction is just one small step in video encoding; the core principle behind most modern video encoding is motion estimation, followed by motion compensation. Basically, the process of motion estimation generates vectors that show offsets between macroblocks in successive frames. However, there's always a bit of error in these vectors.
So what happens is the encoder outputs both the vector offsets and the "residual", which is what's left over. The residual is not simply the difference between two frames; it's the difference between the two frames after motion estimation is taken into account. See the "Motion compensated difference" image in the Wikipedia article on motion compensation for a clear illustration of this -- note that the motion-compensated difference is drastically smaller than the "dumb" residual.
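To make the per-pixel arithmetic concrete, here's a toy sketch (nothing like a real encoder; the block size, search range and frames are all made up): find the best matching block in the reference frame, then take the residual as the plain signed difference between the current block and that prediction.

    import numpy as np

    def best_match(ref, cur_block, top, left, search=8):
        # Exhaustive block search around (top, left); returns the best motion vector (dy, dx).
        bh, bw = cur_block.shape
        best_mv, best_cost = (0, 0), np.inf
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + bh > ref.shape[0] or x + bw > ref.shape[1]:
                    continue
                cost = np.abs(ref[y:y + bh, x:x + bw].astype(np.int16) - cur_block).sum()
                if cost < best_cost:
                    best_cost, best_mv = cost, (dy, dx)
        return best_mv

    # ref, cur: single-plane (e.g. Y) frames as uint8 arrays; here cur is just a shifted copy.
    ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
    cur = np.roll(ref, (2, 3), axis=(0, 1))
    top, left, B = 16, 16, 16
    cur_block = cur[top:top + B, left:left + B].astype(np.int16)
    dy, dx = best_match(ref, cur_block, top, left)
    prediction = ref[top + dy:top + dy + B, left + dx:left + dx + B].astype(np.int16)
    residual = cur_block - prediction     # signed per-pixel difference; values can be negative
    print((dy, dx), residual.min(), residual.max())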
Here's a decent PDF that goes over some of the basics.
A few other notes:
Yes, YUV is always used, and typically most encoders work in YV12 or some other chroma subsampled format
Subtraction will have to happen on the Y, U and V frames separately (think of them as three separate channels, all of which need to be encoded--then it becomes pretty clear how subtraction has to happen). Motion estimation may or may not happen on Y, U and V planes; sometimes encoders only do it on the Y (the luminance) values to save a bit of CPU at the expense of quality.