Gaussian Blur with FFT Questions - c++

I have a current implementation of Gaussian Blur using regular convolution. It is efficient enough for small kernels, but once the kernels size gets a little bigger, the performance takes a hit. So, I am thinking to implement the convolution using FFT. I've never had any experience with FFT related image processing so I have a few questions.
Is a 2D FFT based convolution also separable into two 1D convolutions ?
If true, does it go like this - 1D FFT on every row, and then 1D FFT on every column, then multiply with the 2D kernel and then inverse transform of every column and the inverse transform of every row? Or do I have to multiply with a 1D kernel after each 1D FFT Transform?
Now I understand that the kernel size should be the same size as the image (row in case of 1D). But how will it affect the edges? Do I have to pad the image edges with zeros? If so the kernel size should be equal to the image size before or after padding?
Also, this is a C++ project, and I plan on using kissFFT, since this is a commercial project. You are welcome to suggest any better alternatives. Thank you.
EDIT: Thanks for the responses, but I have a few more questions.
I see that the imaginary part of the input image will be all zeros. But will the output imaginary part will also be zeros? Do I have to multiply the Gaussian kernel to both real and imaginary parts?
I have instances of the same image to be blurred at different scales, i.e. the same image is scaled to different sizes and blurred at different kernel sizes. Do I have to perform a FFT every time I scale the image or can I use the same FFT?
Lastly, If I wanted to visualize the FFT, I understand that a log filter has to be applied to the FFT. But I am really lost on which part should be used to visualize FFT? The real part or the imaginary part.
Also for an image of size 512x512, what will be the size of real and imaginary parts. Will they be the same length?
Thank you again for your detailed replies.

The 2-D FFT is seperable and you are correct in how to perform it except that you must multiply by the 2-D FFT of the 2D kernel. If you are using kissfft, an easier way to perform the 2-D FFT is to just use kiss_fftnd in the tools directory of the kissfft package. This will do multi-dimensional FFTs.
The kernel size does not have to be any particular size. If the kernel is smaller than the image, you just need to zero-pad up to the image size before performing the 2-D FFT. You should also zero pad the image edges since the convoulution being performed by multiplication in the frequency domain is actually circular convolution and results wrap around at the edges.
So to summarize (given that the image size is M x N):
come up with a 2-D kernel of any size (U x V)
zero-pad the kernel up to (M+U-1) x (N+V-1)
take the 2-D fft of the kernel
zero-pad the image up to (M+U-1) x (N+V-1)
take the 2-D FFT of the image
multiply FFT of kernel by FFT of image
take inverse 2-D FFT of result
trim off garbage at edges
If you are performing the same filter multiple times on different images, you don't have to perform 1-3 every time.
Note: The kernel size will have to be rather large for this to be faster than direct computation of convolution. Also, did you implement your direct convolution taking advantage of the fact that a 2-D gaussian filter is separable (see this a few paragraphs into the "Mechanics" section)? That is, you can perform the 2-D convolution as 1-D convolutions on the rows and then the columns. I have found this to be faster than most FFT-based approaches unless the kernels are quite large.
Response to Edit
If the input is real, the output will still be complex except for rare circumstances. The FFT of your gaussian kernel will also be complex, so the multiply must be a complex multiplication. When you perform the inverse FFT, the output should be real since your input image and kernel are real. The output will be returned in a complex array, but the imaginary components should be zero or very small (floating point error) and can be discarded.
If you are using the same image, you can reuse the image FFT, but you will need to zero pad based on your biggest kernel size. You will have to compute the FFTs of all of the different kernels.
For visualization, the magnitude of the complex output should be used. The log scale just helps to visualize smaller components of the output when larger components would drown them out in a linear scale. The Decibel scale is often used and is given by either 20*log10(abs(x)) or 10*log10(x*x') which are equivalent. (x is the complex fft output and x' is the complex conjugate of x).
The input and output of the FFT will be the same size. Also the real and imaginary parts will be the same size since one real and one imaginary value form a single sample.

Remember that convolution in space is equivalent to multiplication in frequency domain. This means that once you perform FFT of both image and mask (kernel), you only have to do point-by-point multiplication, and then IFFT of the result. Having said that, here are a few words of caution.
You probably know that in digital signal processing, we often use circular convolution, not linear convolution. This happens because of curious periodicity. What this means in simple terms is that DFT (and FFT which is its computationally efficient variant) assumes that you signal is periodic, and when you filter your signal in such manner -- suppose your image is N x M pixels -- that it takes pixel at (1,m) to the the neighbor or pixel at (N, m) for some m<M. You signal virtually wraps around onto itself. This means that your Gaussian mask will be averaging pixels on the far right with pixels on the far left, and same goes for top and bottom. This might or might not be desired, but in general one has to deal with edging artifacts anyway. It is however much easier to forget about this issue when dealing with FFT multiplication because the problem stops being apparent. There are many ways to take care of this problem. The best way is to simply pad your image with zeros and remove the extra pixels later.
A very neat thing about using a Gaussian filter in frequency domain is that you never really have to take its FFT. It si a well-know fact that Fourier transform of a Gaussian is a Gaussian (some technical details here). All you would have to do then is pad you image with zeros (both top and bottom), generate a Gaussian in the frequency domain, multiply them together and take IFFT. Then you're done.
Hope this helps.

Related

PCA in thousands of dimensions

I have vectors in 4096 VLAD codes, each one of them representing an image.
I have to run PCA on them without reducing their dimension on learning datasets with less than 4096 images (e.g., the Holiday dataset with <2k images.) to obtain the rotation matrix A.
In this paper (which also explains why I need to run this PCA without dimensionality reduction) they solved the problem with this approach:
For efficient computation of A, we compute at most the first 1,024 eigenvectors by eigendecomposition of the covariance matrix, and
the remaining orthogonal complements up to D-dimensional are filled using Gram-Schmidt orthogonalization.
Now, how do I implement this using C++ libraries? I'm using OpenCV, but cv::PCA seems not to offer such a strategy. Is there any way to do this?

Predict smearing of FFT

After applying a Fourier transform to a signal, the energy of a single sine wave is often spread out across multiple bins (aka smearing). Have a look at the right side of the image below for an illustration:
I want to extract a list of peak frequencies. Just finding the highest bin is easy. But after that smearing becomes a problem.
I would like to have a heuristic which tells me if the magnitude of a specific bin is possibly the result of smearing or if there has to be another peak frequency in order to explain the signal. (It is better if I miss some than have false positives)
My naive approach would be to just calculate a few thousand examples and take the maximum of these to get an envelope curve so that any smearing is likely below that envelope.
But is there a better way to do this?
The FFT result of any rectangularly windowed pure unmodulated sinusoid is a Sinc function. This Sinc (sin(pix)/(pix)) function is only zero for all bins except one (at the peak magnitude) when the frequency of the input sinusoid has exactly an integer number of periods in the FFT width.
For all other frequencies that are not at the exact center of an FFT result bin, if you know that exact frequency and magnitude (which won't be in any single FFT bin), you can calculate all the other bins by sampling the Sinc function.
And, of course, if the input sinusoid isn't perfectly pure, but modulated in any way (amplitude, frequency or phase), this modulation will produce various sidebands in the FFT result as well.

What is the difference between dense SIFT and HoG?

I am new to Computer Vision. I am studying Dense SIFT and HOG. For dense SIFT, the algorithm just considers every point as an interesting point and computes its gradient vector. HOG is another way to describe an image with a gradient vector.
I think Dense SIFT is a special case for HOG. In HoG, if we set the bin size to 8, for each window there are 4 blocks, for each block, there are 4 cells and the block stride is the same as the block size, we can still get a 128 dim vector for this window. And we can set any window stride to slide the window to detect the whole image. If the window stride of both these two algorithms is the same, they can get identical results.
I am not sure whether I am correct. Can anyone help me?
SIFT descriptor chooses a 16x16 and then divides it into 4x4 windows. Over each of these 4 windows it computes a Histogram of Oriented gradients. While computing this histogram, it also performs an interpolation between neighboring angles. Once you have all the 4x4 windows, it uses a gaussian of half the window size, centered at the center of the 16x16 block to weight the values in the whole 16x16 descriptor.
HoG on the other hand only computes a simple histogram of oriented gradients as the name says.
I feel that SIFT is more suited in describing the importance of a point, due to the gaussian weighting involved, while HoG does not have such a bias. Due to this reason, (ideally) HoG should be better suited at classification of images over dense SIFT, if all feature vectors are concatenated into one huge vector (this is my opinion, may not be true)

Horn-Schunck optical flow calculations

I was studying the Horn-schunck method for calculating optical flow in videos. My code is in C, which would mean i am implementing all of the algorithms from scratch including gray-scaling the image, computing derivatives etc. I am not able to completely absorb the essence of the method. The final flow matrix that I get would contain displacement vectors for each pixel, right? Meaning for each pixel, the value in the flow matrix would indicate the amount by which it is displaced in the next image.
How does this work out when I have all pixel values between 0-255, all my calculations are done on these pixel values and the resulting output gives displacement in, say, a 1920 X 1080 image.
The result of your method would be a matrix with two channels or two matrix, one for the u (or dx) direction/displacement and the other for the v (or dy) direction/displacement. That means you have a vector field
[u(x,y) v(x,y] = optical flow for each position (x,y) in your image
this vector field (the values of this field) have floating precision. That means e.g. u(0,0) = 0.2 v(0,0) = 0.13. Consequently in one part of your coud you have transform the gray-values of your input image into floating values. This is mostly done when you are compution the gradients e.g. with the sobel operator. The OpenCV library has a Horn-Schunk implementation. While reading the code takes some time, but you can be assured that this a very efficient way to implement this method.

CUDA cufftPlan2d plan size question

I'm studying the code behind the convolutionFFT2D example of the Nvidia CUDA sdk, but I don't get the point of this line:
cufftPlan2d(&fftPlan, fftH, fftW/2, CUFFT_C2C);
Apparently this initializes a complex plane for the FFT to be running in, but I don't see the point of dividing the plan width by 2.
Just to be precise: the fftH and fftW are rounded values for imageX+kernelX+1 and imageY+kernelY+1 dimensions (just for speed reasons). I know that in the frequency domain you usually have a positive component and a symmetric negative component of the same frequency.. but this sounds like cutting half of my image data away..
Can someone explain this to me a little better? I've never used a FFT (I just know the theory behind a fourier transformation)
When you perform a real to complex FFT half the frequency domain data is redundant due to symmetry. This is only the case in one axis of a 2D FFT though. You can think of a 2D FFT as two 1D FFT operations, the first operates on all the rows, and for a real valued image this will give you complex row values. In the second stage you apply a 1D FFT to every column, but since the row values are now complex this will be a complex to complex FFT with no redundancy in the output. Hence you only need width / 2 points in the horizontal axis, but you still need height pointe in the vertical axis.