Differences between FFTW and CUFFT output - c++

In the char I have posted below, I am comparing the results from an IFFT run in FFTW and CUFFT.
What are the possible reasons this is coming out different? Is it really THAT much round off error?
Here is the relevant code snippet:
cufftHandle plan;
cufftComplex *d_data;
cufftComplex *h_data;
cudaMalloc((void**)&d_data, sizeof(cufftComplex)*W);
complex<float> *temp = (complex<float>*)fftwf_malloc(sizeof(fftwf_complex) * W);
h_data = (cufftComplex *)malloc(sizeof(cufftComplex)*W);
memset(h_data, 0, W*sizeof(cufftComplex));
/* Create a 1D FFT plan. */
cufftPlan1d(&plan, W, CUFFT_C2C, 1);
if (!reader->getData(rowBuff, row))
return 0;
// copy from read buffer to our FFT input buffer
memcpy(indata, rowBuff, fCols * sizeof(complex<float>));
for(int c = 0; c < W; c++)
h_data[c] = make_cuComplex(indata[c].real(), indata[c].imag());
cutilSafeCall(cudaMemcpy(d_data, h_data, W* sizeof(cufftComplex), cudaMemcpyHostToDevice));
cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);
cutilSafeCall(cudaMemcpy(h_data, d_data,W * sizeof(cufftComplex), cudaMemcpyDeviceToHost));
for(int c = 0; c < W; c++)
temp[c] =(cuCrealf(h_data[c]), cuCimagf(h_data[c]));
//execute ifft plan on "indata"
fftwf_execute(ifft);
...
//dump out abs() values of the first 50 temp and outdata values. Had to convert h_data back to a normal complex
ifft was defined like so:
ifft = fftwf_plan_dft_1d(freqCols, reinterpret_cast<fftwf_complex*>(indata),
reinterpret_cast<fftwf_complex*>(outdata),
FFTW_BACKWARD, FFTW_ESTIMATE);
and to generate the graph I dumped out h_data and outdata after the fftw_execute
W is the width of the row of the image I am processing.
See anything glaringly obvious?

So it looks like CUFFT is returning a real and imaginary part, and FFTW only the real. The cuCabsf() function that comes iwth the CUFFT complex library causes this to give me a multiple of sqrt(2) when I have both parts of the complex
As an aside - I never have been able to get exactly matching results in the intermediate steps between FFTW and CUFFT. If you do both the IFFT and FFT though, you should get something close.

Related

FFT and FFTShift of matlab in FFTW library c++

what is the exact equivalent of this MATLAB line code in C++ and using FFTW?
fftshift(fft(x,4096)));
note: X is an array of 4096 double data.
now I use these lines of code in c++ and FFTW to compute fft
int n = 4096
fftw_complex *x;
fftw_complex *y;
x = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * n);
y = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * n);
for (int i=0; i<n; i++)
{
x[i][REAL] = MyDoubleData[i];
x[i][IMAG] = 0;
}
fftw_plan plan = fftw_plan_dft_1d(n, x, y, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
fftw_cleanup();
It is just equivalent of FFT function in MATLAB.
Is there any equivalent function for FftShift in FFTW library?
The FFTW function calls you've provided would be the equivalent of fft(x,4096). If x is real, matlab knows to give you the conjugate symmetric FFT (I think). If you want to do this with FFTW, you need to use the r2c and c2r functions (real-to-complex/complex-to-real).
You have to do the shift yourself. You can do direct substitution (poor performance, but should be intuitive)
for (int i=0; i<n; i++)
{
fftw_complex tmp;
int src = i;
int dst = (i + n/2 - 1) % n;
tmp=y[src];
y[src]=x[dst];
y[dst]=tmp;
}
Alternatively use a couple memcpy's (and/or memmove's) or modify your input data
The output of the fftw is stored base on following format of frequencies sequence:
[0....N-1]
Where N is number of frequencies and is oven.
And fftshift change it to:
[-(N-1)/2,..., 0..., (N-1)/2]
but you should note that fftw output has equivalent as:
[0,.., N-1] is same as [0,...,(N-1)/2,-(N-1)/2,...,-1]
This means that in DFT, frequency -i is same as N-i.

CUDA cudaMemCpy doesn't appear to copy despite CudaSuccess

I'm just starting with CUDA and this is my very first project. I've done a search for this issue and while I've noticed other people have had similar problems, none of the suggestions seemed relevant to my specific issue or have helped in my case.
As an exercise, I'm trying to write an n-body simulation using CUDA. At this stage I'm not interested whether my specific implementation is efficient or not, I'm just looking for something that works and I can refine it later. I'll also need to update the code later, once it's working, to work on my SLI configuration.
Here's a brief outline of the process:
Create X and Y position, velocity, acceleration vectors.
Create same vectors on GPU and copy values across
In a loop: (i) calculate acceleration for the iteration, (ii) apply acceleration to velocities and positions, and (iii) copy positions back to host for display.
(Display not implemented yet. I'll do this later)
Don't worry about the acceleration calculation function for now, here is the update function:
__global__ void apply_acc(double* pos_x, double* pos_y, double* vel_x, double* vel_y, double* acc_x, double* acc_y, int N)
{
int i = threadIdx.x;
if (i < N);
{
vel_x[i] += acc_x[i];
vel_y[i] += acc_y[i];
pos_x[i] += vel_x[i];
pos_y[i] += vel_y[i];
}
}
And here's some of the code in the main method:
cudaError t;
t = cudaMalloc(&d_pos_x, N * sizeof(double));
t = cudaMalloc(&d_pos_y, N * sizeof(double));
t = cudaMalloc(&d_vel_x, N * sizeof(double));
t = cudaMalloc(&d_vel_y, N * sizeof(double));
t = cudaMalloc(&d_acc_x, N * sizeof(double));
t = cudaMalloc(&d_acc_y, N * sizeof(double));
t = cudaMemcpy(d_pos_x, pos_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_pos_y, pos_y, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_vel_x, vel_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_vel_y, vel_y, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_acc_x, acc_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_acc_y, acc_y, N * sizeof(double), cudaMemcpyHostToDevice);
while (true)
{
calc_acc<<<1, N>>>(d_pos_x, d_pos_y, d_vel_x, d_vel_y, d_acc_x, d_acc_y, N);
apply_acc<<<1, N>>>(d_pos_x, d_pos_y, d_vel_x, d_vel_y, d_acc_x, d_acc_y, N);
t = cudaMemcpy(pos_x, d_pos_x, N * sizeof(double), cudaMemcpyDeviceToHost);
t = cudaMemcpy(pos_y, d_pos_y, N * sizeof(double), cudaMemcpyDeviceToHost);
std::cout << pos_x[0] << std::endl;
}
Every loop, cout writes the same value, whatever random value it was set to when the position arrays were original created. If I change the code in apply_acc to something like:
__global__ void apply_acc(double* pos_x, double* pos_y, double* vel_x, double* vel_y, double* acc_x, double* acc_y, int N)
{
int i = threadIdx.x;
if (i < N);
{
pos_x[i] += 1.0;
pos_y[i] += 1.0;
}
}
then it still gives the same value, so either apply_acc isn't being called or the cudaMemcpy isn't copying the data back.
All the cudaMalloc and cudaMemcpy calls return cudaScuccess.
Here's a PasteBin link to the complete code. It should be fairly simple to follow as there's a lot of repetition for the various arrays.
Like I said, I've never written CUDA code before, and I wrote this based on the #2 CUDA example video from NVidia where the guy writes the parallel array addition code. I'm not sure if it makes any difference, but I'm using 2x GTX970's with the latest NVidia drivers and CUDA 7.0 RC, and I chose not to install the bundled drivers when installing CUDA as they were older than what I had.
This won't work:
const int N = 100000;
...
calc_acc<<<1, N>>>(...);
apply_acc<<<1, N>>>(...);
The second parameter of a kernel launch config (<<<...>>>) is the threads per block parameter. It is limited to either 512 or 1024 depending on how you are compiling. These kernels will not launch, and the type of error this produces needs to be caught by using correct CUDA error checking. Simply looking at the return values of subsequent CUDA API functions will not indicate the presence of this type of error (which is why you are seeing cudaSuccess subsequently).
Regarding the concept itself, I suggest you learn more about CUDA thread and block hierarchy. To launch a large number of threads, you need to use both parameters (i.e. niether of the first two parameters should be 1) of the kernel launch config. This is usually advisable from a performance perspective as well.

How to get the real and imaginary parts of a complex matrix separately in CUDA?

I'm trying to get the fft of a 2D array. The input is a NxM real matrix, therefore the output matrix is also a NxM matrix (2xNxM output matrix which is complex is saved in a NxM matrix using the property Hermitian symmetry).
So i want to know whether there is method to extract in cuda to extract real and complex matrices separately ? In opencv split function does the duty. So I'm looking for a similar function in cuda, but I couldn't find it yet.
Given below is my complete code
#define NRANK 2
#define BATCH 10
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cufft.h>
#include <stdio.h>
#include <iostream>
#include <vector>
using namespace std;
int main()
{
const size_t NX = 4;
const size_t NY = 5;
// Input array - host side
float b[NX][NY] ={
{0.7943 , 0.6020 , 0.7482 , 0.9133 , 0.9961},
{0.3112 , 0.2630 , 0.4505 , 0.1524 , 0.0782},
{0.5285 , 0.6541 , 0.0838 , 0.8258 , 0.4427},
{0.1656 , 0.6892 , 0.2290 , 0.5383 , 0.1067}
};
// Output array - host side
float c[NX][NY] = { 0 };
cufftHandle plan;
cufftComplex *data; // Holds both the input and the output - device side
int n[NRANK] = {NX, NY};
// Allocated memory and copy from host to device
cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*(NY/2+1));
for(int i=0; i<NX; ++i){
// Uses this because my actual array is a dynamically allocated.
// but here I've replaced it with a static 2D array to make it simple.
cudaMemcpy(reinterpret_cast<float*>(data) + i*NY, b[i], sizeof(float)*NY, cudaMemcpyHostToDevice);
}
// Performe the fft
cufftPlanMany(&plan, NRANK, n,NULL, 1, 0,NULL, 1, 0,CUFFT_R2C,BATCH);
cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(plan, (cufftReal*)data, data);
cudaThreadSynchronize();
cudaMemcpy(c, data, sizeof(float)*NX*NY, cudaMemcpyDeviceToHost);
// Here c is a NxM matrix. I want to split it to 2 seperate NxM matrices with each
// having the complex and real component of the output
// Here c is in
cufftDestroy(plan);
cudaFree(data);
return 0;
}
EDIT
As suggested by JackOLanter, I modified the code as below. But still the problem is not solved.
float real_vec[NX][NY] = {0}; // host vector, real part
float imag_vec[NX][NY] = {0}; // host vector, imaginary part
cudaError cudaStat1 = cudaMemcpy2D (real_vec, sizeof(real_vec[0]), data, sizeof(data[0]),NY*sizeof(float2), NX, cudaMemcpyDeviceToHost);
cudaError cudaStat2 = cudaMemcpy2D (imag_vec, sizeof(imag_vec[0]),data + 1, sizeof(data[0]),NY*sizeof(float2), NX, cudaMemcpyDeviceToHost);
The error i get is 'invalid pitch argument error'. But i can't understand why. For the destination I use a pitch size of 'float' while for the source i use size of 'float2'
Your question and your code do not make much sense to me.
You are performing a batched FFT, but it seems you are not foreseeing enough memory space neither for the input, nor for the output data;
The output of cufftExecR2C is a NX*(NY/2+1) float2 matrix, which can be interpreted as a NX*(NY+2) float matrix. Accordingly, you are not allocating enough space for c (which is only NX*NY float) for the last cudaMemcpy. You would need still one complex memory location for the continuous component of the output;
Your question does not seem to be related to the cufftExecR2C command, but is much more general: how can I split a complex NX*NY matrix into 2 NX*NY real matrices containing the real and imaginary parts, respectively.
If I correctly interpret your question, then the solution proposed by #njuffa at
Copying data to “cufftComplex” data struct?
could be a good clue to you.
EDIT
In the following, a small example on how "assembling" and "disassembling" the real and imaginary parts of complex vectors when copying them from/to host to/from device. Please, add your own CUDA error checking.
#include <stdio.h>
#define N 16
int main() {
// Declaring, allocating and initializing a complex host vector
float2* b = (float2*)malloc(N*sizeof(float2));
printf("ORIGINAL DATA\n");
for (int i=0; i<N; i++) {
b[i].x = (float)i;
b[i].y = 2.f*(float)i;
printf("%f %f\n",b[i].x,b[i].y);
}
printf("\n\n");
// Declaring and allocating a complex device vector
float2 *data; cudaMalloc((void**)&data, sizeof(float2)*N);
// Copying the complex host vector to device
cudaMemcpy(data, b, N*sizeof(float2), cudaMemcpyHostToDevice);
// Declaring and allocating space on the host for the real and imaginary parts of the complex vector
float* cr = (float*)malloc(N*sizeof(float));
float* ci = (float*)malloc(N*sizeof(float));
/*******************************************************************/
/* DISASSEMBLING THE COMPLEX DATA WHEN COPYING FROM DEVICE TO HOST */
/*******************************************************************/
float* tmp_d = (float*)data;
cudaMemcpy2D(cr, sizeof(float), tmp_d, 2*sizeof(float), sizeof(float), N, cudaMemcpyDeviceToHost);
cudaMemcpy2D(ci, sizeof(float), tmp_d+1, 2*sizeof(float), sizeof(float), N, cudaMemcpyDeviceToHost);
printf("DISASSEMBLED REAL AND IMAGINARY PARTS\n");
for (int i=0; i<N; i++)
printf("cr[%i] = %f; ci[%i] = %f\n",i,cr[i],i,ci[i]);
printf("\n\n");
/******************************************************************************/
/* REASSEMBLING THE REAL AND IMAGINARY PARTS WHEN COPYING FROM HOST TO DEVICE */
/******************************************************************************/
cudaMemcpy2D(tmp_d, 2*sizeof(float), cr, sizeof(float), sizeof(float), N, cudaMemcpyHostToDevice);
cudaMemcpy2D(tmp_d + 1, 2*sizeof(float), ci, sizeof(float), sizeof(float), N, cudaMemcpyHostToDevice);
// Copying the complex device vector to host
cudaMemcpy(b, data, N*sizeof(float2), cudaMemcpyHostToDevice);
printf("REASSEMBLED DATA\n");
for (int i=0; i<N; i++)
printf("%f %f\n",b[i].x,b[i].y);
printf("\n\n");
getchar();
return 0;
}

CUFFT output not aligned the same as FFTW output

I am doing a 1D FFT. I have the same input data as would go in FFTW, however, the return from CUFFT does not seem to be "aligned" the same was FFTW is. That is, In my FFTW code, I could calculate the center of the zero padding, then do some shifting to "left-align" all my data, and have trailing zeros.
In CUFFT, the result from the FFT is data that looks like it is the same, however, the zeros are not "centered" in the output, so the rest of my algorithm breaks. (The shifting to left-align the data still has a "gap" in it after the bad shift).
Can anyone give me any insight? I thought it had something to do with those compatibility flags, but even with cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_FFTW_ALL); I am still getting a bad result.
Below is a plot of the magnitude of the data from the first row. The data on the left is the output of the inverse CUFFT, and the output on the right is the output of the inverse FFTW.
Thanks!
Here is the setup code for the FFTW and CUFFT plans
ifft = fftwf_plan_dft_1d(freqCols, reinterpret_cast<fftwf_complex*>(indata),
reinterpret_cast<fftwf_complex*>(outdata),
FFTW_BACKWARD, FFTW_ESTIMATE);
CUFFT:
cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_FFTW_ALL);
cufftPlan1d(&plan, width, CUFFT_C2C, height);
and executing code:
fftwf_execute(ifft);
CUFFT:
cufftExecC2C(plan, d_image, d_image, CUFFT_INVERSE); //in place inverse
Completed some test code:
complex<float> *input = (complex<float>*)fftwf_malloc(sizeof(fftwf_complex) * 100);
complex<float> *output = (complex<float>*)fftwf_malloc(sizeof(fftwf_complex) * 100);
fftwf_plan ifft;
ifft = fftwf_plan_dft_1d(100, reinterpret_cast<fftwf_complex*>(input),
reinterpret_cast<fftwf_complex*>(output),
FFTW_BACKWARD, FFTW_ESTIMATE);
cufftComplex *inplace = (cufftComplex *)malloc(100*sizeof(cufftComplex));
cufftComplex *d_inplace;
cudaMalloc((void **)&d_inplace,100*sizeof(cufftComplex));
for(int i = 0; i < 100; i++)
{
inplace[i] = make_cuComplex(cos(.5*M_PI*i),sin(.5*M_PI*i));
input[i] = complex<float>(cos(.5*M_PI*i),sin(.5*M_PI*i));
}
cutilSafeCall(cudaMemcpy(d_inplace, inplace, 100*sizeof(cufftComplex), cudaMemcpyHostToDevice));
cufftHandle plan;
cufftPlan1d(&plan, 100, CUFFT_C2C, 1);
cufftExecC2C(plan, d_inplace, d_inplace, CUFFT_INVERSE);
cutilSafeCall(cudaMemcpy(inplace, d_inplace, 100*sizeof(cufftComplex), cudaMemcpyDeviceToHost));
fftwf_execute(ifft);
When I dumped the output from both of these FFT calls, it did look the same. I am not exactly sure what I was looking at though. The data had a value of 100 in the 75th row. Is that correct?
It looks like you may have swapped the real and imaginary components of your complex data in the input to one of the IFFTs. This swap will change an even function to an odd function in the time domain.

Violation access in time compilation (0xC0000005)

The process I want to do is to make the FFT to an image (stored in “imagen”) , and then, multiply it with a filter ‘H’, after that, the inverse FFT will be done also.
The code is shown below:
int ancho;
int alto;
ancho=ui.imageframe->imagereader->GetBufferedRegion().GetSize()[0]; //ancho=widht of the image
alto=ui.imageframe->imagereader->GetBufferedRegion().GetSize()[1]; //alto=height of the image
double *H ;
H =matrix2D_H(ancho,alto,eta,sigma); // H is calculated
// We want to get: F= fft(f) ; H*F ; f'=ifft(H*F)
// Inicialization of the neccesary elements for the calculation of the fft
fftw_complex *out;
fftw_plan p;
int N= (ancho/2+1)*alto; //number of points of the image
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*N);
double *in = (double*) imagen.GetPointer(); // conversion of itk.smartpointer --> double*
p = fftw_plan_dft_r2c_2d(ancho, alto, in, out, FFTW_ESTIMATE); // FFT planning
fftw_execute(p); // FFT calculation
/* Multiplication of the Output of the FFT with the Filter H*/
int a = alto;
int b = ancho/2 +1; // The reason for the second dimension to have this value is that when the FFT calculation of a real image is performed only the non-redundants outputs are calculated, that’s the reason for the output of the FFT and the filter ‘H’ to be equal.
// Matrix point-by-point multiplicaction: [axb]*[axb]
fftw_complex* res ; // result will be stored here
res = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*a*b);
res = multiply_matrix_2D(out,H, a, b);
The problem is located here, in the loop inside the function ‘multiply_matrix_2D’:
fftw_complex* prueba_r01::multiply_matrix_2D(fftw_complex* out, double* H, int M ,int N){
/* The matrix out[MxN] or [n0x(n1/2)+1] is the image after the FFT , and the out_H[MxN] is the filter in the frequency domain,
both are multiplied POINT TO POINT, it has to be called twice, one for the imaginary part and another for the normal part
*/
fftw_complex *H_cast;
H_cast = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*M*N);
H_cast= reinterpret_cast<fftw_complex*> (H); // casting from double* to fftw_complex*
fftw_complex *res; // the result of the multiplication will be stored here
res = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*M*N);
//Loop for calculating the matrix point-to-point multiplication
for (int x = 0; x<M ; x++){
for (int y = 0; y<N ; y++){
res[x*N+y][0] = out[x*N+y][0]*(H_cast[x*N+y][0]+H_cast[x*N+y][1]);
res[x*N+y][1] = out[x*N+y][1]*(H_cast[x*N+y][0]+H_cast[x*N+y][1]);
}
}
fftw_free(H_cast);
return res;
}
With the values of x = 95 and y = 93 being M = 191 and N = 96;
Uncontroled exception at 0x004273ab in prueba_r01.exe: 0xC0000005 acess infraction reading 0x01274000.
imagen http://img846.imageshack.us/img846/4585/accessviolationproblem.png
Where a lot of values of the variables are in red, and for translation issue: H_cast[][1] has in the value box : “Error30CXX0000 : impossible to evaluate the expression”.
I will really appreciate any kind of help with this please!!
Antonio
This part of the code
H_cast = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*M*N);
H_cast= reinterpret_cast<fftw_complex*> (H); // casting from double* to fftw_complex*
first allocates a new buffer for H_cast and then immediately sets it to point to the original H instead. It doesn't copy the data, just the pointer.
At the end of the function some buffer is free'd
fftw_free(H_cast);
which seems to free the data pointed to by H and not the buffer allocated in the function.
When getting back to the caller, the H there is lost!
There is an FFT class inside of ITK that can use fftw (USE_FFTW) from cmake for configuration. This class describes how to reference the ITK raw buffer memory from fftw.
PS: The upcoming ITKv4 has greatly improved the fftw compatibility.