CUDA CUFFT/IFFT Different results if i pad my data?


I have a signal that I FFT, convolve with itself, and then IFFT back to the time domain. The signal is 8192 samples long. If I pad the signal to 16384 (N*2) and perform the operations, I get the correct output. Is the padding necessary, though? When I stick with 8192 and use C2C FFT transforms, the data looks similar all the way up to the IFFT. (With 8192 the spectrum only contains every 2nd point of the 16384-point data.)
I've run this through MATLAB and got the same results, so I suspect it has more to do with the maths than the implementation. As I'm doing this in CUDA, any advice is welcome. I don't mind having to pad the data in some way if necessary, but the data is fine up to the point where I do the IFFT.
N.B. I know I'm not doing all the computation on the GPU; this was simply to remove errors and let me see what the code was doing.
Link to code: http://pasted.co/e2e5e625
[Image: the output I should get]
[Image: the output I get if I don't pad]

I have a signal that I am doing an FFT to, doing an convolution with itself and then an IFFT back to the time domain.
Looking at your code, you are not doing a "convolution with itself" in the frequency domain but rather a multiplication by itself.
The entire sequence of operations (FFT, multiplication, IFFT) would correspond to computing the circular convolution of the signal with itself in the time domain. The circular convolution would only be equivalent to a linear convolution if the signal is first padded to a length at least 2*N-1 (which happens to be the minimum size required to store all linear convolution coefficients after the IFFT).
You may use a smaller FFT size (i.e. smaller than 2*N-1, but at least N) to compute a linear convolution by using the Overlap-add method.
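The difference between the two kinds of convolution can be sketched directly in the time domain, without any FFT. This is only an illustration under stated assumptions: the helper names `circular_convolve`, `linear_convolve` and `zero_pad` are made up for this example and come neither from the question's code nor from cuFFT.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Circular convolution of two sequences of equal length n. This is what
// FFT -> element-wise multiply -> IFFT computes for length-n transforms.
std::vector<double> circular_convolve(const std::vector<double>& a,
                                      const std::vector<double>& b) {
    assert(a.size() == b.size() && !a.empty());
    const std::size_t n = a.size();
    std::vector<double> out(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            out[(i + j) % n] += a[i] * b[j];  // output index wraps around
    return out;
}

// Linear convolution: a.size() + b.size() - 1 outputs, no wrap-around.
std::vector<double> linear_convolve(const std::vector<double>& a,
                                    const std::vector<double>& b) {
    assert(!a.empty() && !b.empty());
    std::vector<double> out(a.size() + b.size() - 1, 0.0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            out[i + j] += a[i] * b[j];
    return out;
}

// Append zeros until the sequence has length n.
std::vector<double> zero_pad(std::vector<double> x, std::size_t n) {
    assert(n >= x.size());
    x.resize(n, 0.0);
    return x;
}
```

For x = {1, 2, 3, 4}, the linear self-convolution has 7 coefficients, while the unpadded circular version folds the last three onto the first three. Padding x to length 7 (that is, 2*N-1) before calling `circular_convolve` reproduces the linear result.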

Related

What is the fastest way to perform FFT on a large file?

I am working on a C++ project that needs to perform FFT on a large 2D raster data (10 to 100 GB). In particular, the performance is quite bad when applying FFT for each column, whose elements are not contiguous in memory (placed with a stride of the width of the data).
Currently, I'm doing this. Since the data does not fit in the memory, I read several columns, namely n columns, into the memory with its orientation transposed (so that a column in the file becomes a row in the memory) and apply FFT with an external library (MKL). I read (fread) n pixels, move on to the next row (fseek as much as width - n), read n pixels, jump to the next row, and so on. When the operation (FFT) is done with the column chunk, I write it back to the file in the same manner. I write n pixels, jump to the next row, and so on. This way of reading and writing file takes too much time, so I want to find some way of boosting it.
I have considered transposing the whole file beforehand, but the entire process includes both row-major and column-major FFT operations and transposing will not benefit.
I'd like to hear any experiences or idea about this kind of column-major operations on a large data. Any suggestions related particularly to FFT or MKL will help as well.

Why not work with both the transposed and the non-transposed data at the same time? That doubles the memory requirement, but it may be worth it.

Consider switching to a Hadamard Transformation. As a complete IPS, the transform offers no multiplications, since all of the coefficients in the transform are plus or minus one. If you need the resultant transform in a fourier basis, a matrix multiplication will change bases.

Optimized image convolution algorithm

I am working on implementing Image convolution in C++, and I already have a naive working code based on the given pseudo code:
for each image row in input image:
    for each pixel in image row:
        set accumulator to zero
        for each kernel row in kernel:
            for each element in kernel row:
                if element position corresponding* to pixel position then
                    multiply element value corresponding* to pixel value
                    add result to accumulator
                endif
        set output image pixel to accumulator
As this can be a big bottleneck with big images and kernels, I was wondering whether there is some other approach to make things faster, even with additional input information such as a sparse image or kernel, an already known kernel, etc.
I know this can be parallelized, but it's not doable in my case.

if element position corresponding* to pixel position then
I presume this test is meant to avoid a multiplication by 0. Skip the test! Multiplying by 0 is way faster than the delays caused by a conditional jump.
The other alternative (and it's always better to post actual code rather than pseudo-code, here you have me guessing at what you implemented!) is that you're testing for out-of-bounds access. That is terribly expensive also. It is best to break up your loops so that you don't need to do this testing for the majority of the pixels:
for (row = 0; row < k/2; ++row) {
// inner loop over kernel rows is adjusted so it only loops over part of the kernel
}
for (row = k/2; row < nrows-k/2; ++row) {
// inner loop over kernel rows is unrestricted
}
for (row = nrows-k/2; row < nrows; ++row) {
// inner loop over kernel rows is adjusted
}
Of course, the same applies to loops over columns, leading to 9 repetitions of the inner loop over kernel values. It's ugly but way faster.
To avoid the code repetition you can create a larger image, copy the image data over, padded with zeros on all sides. The loops now do not need to worry about accessing out-of-bounds, you have much simpler code.
Next, a certain class of kernel can be decomposed into 1D kernels. For example, the well-known Sobel kernel results from the convolution of [1,2,1] and [1,0,-1]T. For a 3x3 kernel this is not a huge deal, but for larger kernels it is. In general, for an NxN kernel, you go from N^2 to 2N operations per pixel.
In particular, the Gaussian kernel is separable. This is a very important smoothing filter that can also be used for computing derivatives.
Besides the obvious computational cost saving, the code is also much simpler for these 1D convolutions. The 9 repeated blocks of code we had earlier become 3 for a 1D filter. The same code for the horizontal filter can be re-used for the vertical one.
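A minimal sketch of the separable approach, assuming pixels outside the image count as zero and that kernels are applied without flipping (correlation form). The names `Image`, `filter_1d` and `filter_2d` are hypothetical, and the bounds test is kept inside the inner loop only for brevity; the loop splitting or padding described above would remove it.

```cpp
#include <cassert>
#include <vector>

using Image = std::vector<std::vector<double>>;

// Apply an odd-length 1D kernel along rows (horizontal == true) or along
// columns (horizontal == false). Pixels outside the image count as zero.
Image filter_1d(const Image& in, const std::vector<double>& k, bool horizontal) {
    assert(!in.empty() && !in[0].empty() && k.size() % 2 == 1);
    const int rows = static_cast<int>(in.size());
    const int cols = static_cast<int>(in[0].size());
    const int r = static_cast<int>(k.size()) / 2;
    Image out(rows, std::vector<double>(cols, 0.0));
    for (int y = 0; y < rows; ++y)
        for (int x = 0; x < cols; ++x)
            for (int t = -r; t <= r; ++t) {
                const int yy = horizontal ? y : y + t;
                const int xx = horizontal ? x + t : x;
                if (yy >= 0 && yy < rows && xx >= 0 && xx < cols)
                    out[y][x] += k[t + r] * in[yy][xx];
            }
    return out;
}

// Reference: direct 2D filtering with the kernel K[i][j] = col[i] * row[j].
Image filter_2d(const Image& in, const std::vector<double>& col,
                const std::vector<double>& row) {
    assert(!in.empty() && !in[0].empty());
    assert(col.size() % 2 == 1 && row.size() % 2 == 1);
    const int rows = static_cast<int>(in.size());
    const int cols = static_cast<int>(in[0].size());
    const int rc = static_cast<int>(col.size()) / 2;
    const int rr = static_cast<int>(row.size()) / 2;
    Image out(rows, std::vector<double>(cols, 0.0));
    for (int y = 0; y < rows; ++y)
        for (int x = 0; x < cols; ++x)
            for (int s = -rc; s <= rc; ++s)
                for (int t = -rr; t <= rr; ++t) {
                    const int yy = y + s, xx = x + t;
                    if (yy >= 0 && yy < rows && xx >= 0 && xx < cols)
                        out[y][x] += col[s + rc] * row[t + rr] * in[yy][xx];
                }
    return out;
}
```

With col = {1, 2, 1} and row = {-1, 0, 1}, `filter_1d(filter_1d(img, row, true), col, false)` gives the same image as `filter_2d(img, col, row)`, using 3 + 3 instead of 9 multiplications per pixel.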
Finally, as already mentioned in MBo's answer, you can compute the convolution through the DFT. The DFT can be computed using the FFT in O(MN log MN) (for an image of size MxN). This requires padding the kernel to the size of the image, transforming both to the Fourier domain, multiplying them together, and inverse-transforming the result. 3 transforms in total. Whether this is more efficient than the direct computation depends on the size of the kernel and whether it is separable or not.

For small kernel sizes the simple method might be faster. Also note that separable kernels (the Gaussian kernel, for example), as mentioned, allow filtering by rows and then by columns, resulting in O(N^2 * M) complexity.
For other cases, there is fast convolution based on the FFT (Fast Fourier Transform). Its complexity is O(N^2*logN) (where N is the image size and M the kernel size), compared to O(N^2*M^2) for the naive implementation.
Of course, there are some peculiarities in applying this technique, for example edge effects, but one needs to account for them in the naive implementation too (to a lesser degree, though).
FI = FFT(Image)
FK = FFT(Kernel)
Prod = FI * FK (element-by-element complex multiplication)
Conv(I, K) = InverseFFT(Prod)
Note that you can use a fast library intended for image filtering; for example, OpenCV can apply a kernel to a 1024x1024 image in 5-30 milliseconds.

One way to speed this up, depending on the target platform, might be to collect the distinct values in the kernel, store in memory one copy of the image for each distinct value, multiply each copy by its kernel value, and at the end shift, sum and divide all the image copies into one image. This could be done on a graphics processor, for example, where memory is ample and which is better suited to this kind of tight repetitive processing. The copies of the image need to be able to hold values that overflow the pixel range, or you could use floating point values.

Is it possible to use a value in d[x] register as address in vld?

I have an image of size M x N, and each pixel is 14 bits (all of them are stored in 16-bit integers, but the 2 least significant bits are not used). I want to map each pixel to an 8-bit value using a mapping function that is simply an array of 16384 values. I perform this image tone mapping in pure C++ as follows:
for (int i = 0; i < imageSize; i++)
{
    resultImage[i] = mappingArray[image[i]];
}
However, I want to optimize this operation using ARM NEON intrinsics. Since there are 32 (correct me if I'm wrong) NEON (dx) registers, I cannot use the VTBL instruction for a lookup table larger than 8x32 = 256 elements. Moreover, there is another discussion on Stack Overflow about using a lookup table larger than 32 bytes:
ARM NEON: How to implement a 256bytes Look Up table
How can I manage to optimize such simple looking operation? I think of using pixels of the image as address parameter of VLD function just as something like the following:
VLD1.8 {d1},[d0] ??
Is it possible? Or how can I handle this?

The optimization in the other example works by holding an entire lookup table in registers. You simply cannot do this: your table is 16384 bytes (2^14 -> 2^8), and that is way, way more than you have in register space.
Hence, your table will reside in L1 cache. The obvious C++ code:
unsigned char mappingArray[16384];
fill(mappingArray);
for(int i=0;i<imageSize;i++)
{
resultImage[i] = mappingArray[image[i]>>2];
}
will probably compile straight to the most efficient code. The problem isn't how you get things in registers. The problem is that you need memory access to your input image, mapping table and output image.
If speed was a problem, I'd solve this by aggressively trimming the table to perhaps 128 entries, and using linear interpolation on the next few bits.
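A scalar sketch of that idea, assuming the input has already been shifted down to a 14-bit value and that the mapping curve is smooth enough for linear interpolation to be acceptable. The names `kSegments`, `kShift`, `make_coarse_table` and `map_interpolated` are hypothetical. The coarse table holds 129 entries so that the last segment has an end point.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kSegments = 128;  // coarse table resolution (top 7 bits)
constexpr int kShift = 7;       // remaining low bits of the 14-bit input

// Sample the full 16384-entry table at every 128th entry. The extra last
// entry reuses full[16383] as an approximate end point for segment 127.
std::array<std::uint8_t, kSegments + 1>
make_coarse_table(const std::vector<std::uint8_t>& full) {
    assert(full.size() == 16384);
    std::array<std::uint8_t, kSegments + 1> coarse{};
    for (int s = 0; s < kSegments; ++s) coarse[s] = full[s << kShift];
    coarse[kSegments] = full.back();
    return coarse;
}

// Map a 14-bit value: look up the two neighbouring coarse entries and
// interpolate linearly with the low 7 bits, rounding to nearest.
std::uint8_t map_interpolated(const std::array<std::uint8_t, kSegments + 1>& coarse,
                              std::uint16_t v) {
    assert(v < 16384);
    const int seg = v >> kShift;
    const int frac = v & ((1 << kShift) - 1);
    const int lo = coarse[seg];
    const int hi = coarse[seg + 1];
    return static_cast<std::uint8_t>((lo * (128 - frac) + hi * frac + 64) >> kShift);
}
```

In the loop from the answer this would be called as `map_interpolated(coarse, image[i] >> 2)`. Whether it beats the plain table lookup depends on the cache behaviour of the target, so it should be measured rather than assumed.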

Given a large look-up table, the normal process is to look very closely at it to figure out (or find on the internet) the algorithm to compute each entry. If that algorithm turns out to be simple enough then you might find that it's faster to perform the calculations in parallel rather than to perform scalar table look-ups.
Alternatively, based on the shape of the data you can try to find approximations which are up to requirements but which are easier to compute.
For example, you can use VTBL on the top three or four bits of the input, and linear interpolation on the rest. But this only works if the curve is smooth enough that linear interpolation is an adequate approximation.
A common operation which matches the parameters stated is linear to sRGB conversion; in which case you're looking at raising each input to the power of 5/12. That's a bit hairy, but you might still be able to get some performance gain if you don't need to be too accurate.

Inverse Fourier transform with FFTW3

I am using C++ function to find inverse Fourier transform.
int inYSize = 170; int inXSize = 2280;
float* outData = new float[inYSize*inXSize];
fftwf_plan mReverse = fftwf_plan_dft_c2r_2d(inYSize, inXSize, (fftwf_complex*)temp, outData, FFTW_ESTIMATE);
fftwf_execute(mReverse);
My input is 2D array temp with complex numbers. All the elements have real value 1 and imaginary 0.
So I am expecting InverseFFT of such an array should be 2D array with real values. Output array should have SPIKE at 0,0 and rest all values 0. But I am getting all different values in the output array even after normalizing with total size of an array. What could be the reason?

FFTW is not that trivial to deal with when it comes to multidimensional DFT and Complex to Real transform.
When doing a C2R transform of logical size MxN (row-major), the second dimension of the complex input is cut roughly in half because of the Hermitian symmetry: the input holds only M x (N/2+1) complex values, while the real output holds the full M x N values. outData is therefore sized correctly; what needs checking is that temp is laid out with rows of N/2+1 complex values rather than N.
More info about this tortuous matter : http://www.fftw.org/doc/One_002dDimensional-DFTs-of-Real-Data.html
"Good Guy Advice" : Use only the C2C "easier" way of doing things, take the modulus of the output if you don't know how to process the results, but don't waste your time on n-D Complex to Real transforms.
Because of limited precision, because of the numerical implementation of the DFT, because of unsubordinated drunk bits, you can get values that are not 0 even if they are very small. This is the normal behavior of a FFT algorithm.
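A small 1D sanity check of both points, written as a naive O(n^2) inverse DFT rather than with FFTW. It is only a sketch: `inverse_dft` is a hypothetical helper, and unlike FFTW it divides by n, so the spike comes out as 1 instead of n.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive inverse DFT, normalized by 1/n (FFTW leaves this scaling to the caller).
std::vector<std::complex<double>>
inverse_dft(const std::vector<std::complex<double>>& X) {
    assert(!X.empty());
    const std::size_t n = X.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<std::complex<double>> x(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            // exp(+2*pi*i*j*k/n); reducing j*k modulo n keeps the angle small
            const double angle = two_pi * static_cast<double>((j * k) % n)
                                 / static_cast<double>(n);
            sum += X[j] * std::polar(1.0, angle);
        }
        x[k] = sum / static_cast<double>(n);
    }
    return x;
}
```

Feeding it a spectrum in which every bin is 1 + 0i gives a value of 1 at index 0. The other outputs are zero in exact arithmetic, but in floating point they typically come out as tiny rounding residue rather than exact zeros.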
Besides carefully reading the user manual (http://www.fftw.org/doc/), even if it's a real pain (I lost a few days with this library just getting a 3D transform to work and understanding how the data was scaled), you should try a C2C 1D transform before going to C2C 2D and C2R 2D, just to be sure you have some idea of what you're doing.
What's the inverse FFT of a constant plane where every bin of the "frequency plane" is filled with a one? The forward FFT of a unit SPIKE at (0,0) is exactly that constant plane, so the inverse of the constant plane is the spike at (0,0), with height M*N before normalization (FFTW does not normalize). Your expectation is therefore right, provided the input really uses the half-size M x (N/2+1) complex layout that a C2R transform expects.
Do not hesitate to add precision to your question, and good luck with FFTW

With this little information it is hard to tell. What i could imagine would be that you get spectral leakage due to the window selection (See This Wikipedia article for details about leakage).
What you could do is try using another windowing function to reduce leakage or redefine your windowing size.

Calling multiple kernels, global memory performances - CUDA

I have four CUDA kernels working on matrices in the following way:
convolution<<<>>>(A,B);
multiplybyElement1<<<>>>(B);
multiplybyElement2<<<>>>(A);
multiplybyElement3<<<>>>(C);
// A + B + C with CUBLAS' cublasSaxpy
Every kernel except the first (the convolution) basically multiplies each element of a matrix by a fixed value hardcoded in constant memory (to speed things up).
Should I join these kernels into a single one by calling something like
multiplyBbyX_AbyY_CbyZ<<<>>>(B,A,C)
?
Global memory should already be on the device so probably that would not help, but I'm not totally sure

If I understood correctly, you're asking if you should merge the three "multiplybyElement" kernels into one, where each of those kernels reads an entire (different) matrix, multiplying each element by a constant, and storing the new scaled matrix.
Given that these kernels will be memory bandwidth bound (practically no computation, just one multiply for every element) there is unlikely to be any benefit from merging the kernels unless your matrices are small, in which case you would be making inefficient use of the GPU since the kernels will execute in series (same stream).

If merging the kernels means that you can do only one pass over the memory, then you may see a 3x speedup.
Can you multiply up the fixed values up front and then do a single multiply in a single kernel?
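As a rough illustration of the single-pass idea, here is a plain C++ stand-in for what the body of a merged kernel could compute. It is a sketch under assumptions not stated in the question: the function name `scale_and_sum` and the factors `ka`, `kb`, `kc` are hypothetical, the three matrices are taken to be flattened arrays of equal size, and the final A + B + C sum is folded into the same loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One pass over memory: each element of a, b and c is read once, scaled by
// its own constant, and the sum is written once. In a CUDA kernel the loop
// index would be replaced by the thread's global index.
std::vector<float> scale_and_sum(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 const std::vector<float>& c,
                                 float ka, float kb, float kc) {
    assert(a.size() == b.size() && b.size() == c.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = ka * a[i] + kb * b[i] + kc * c[i];
    return out;
}
```

Compared with three scaling passes followed by two additions, this reads each input once and writes one output, which is where any gain for a bandwidth-bound operation would come from. If the scaled A, B and C are also needed separately, the saving shrinks accordingly.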
