I implemented an FFTW (fftw.org) example that uses fast Fourier transforms.
This is the code.
I load an image and convert it from uint8_t to double (this part works fine):
string bmpFileNameImage = "files/testDummyFFTWWithWisdom/onechannel_image.bmp";
BMPImage bmpImage(bmpFileNameImage);
vector<uint8_t> image = bmpImage.copyBits();
toDouble(image,pixelColors,256,256, 1);
int width = bmpImage.width();
int height = bmpImage.height();
I use wisdom files to improve performance:
FILE * file = fopen("wisdom.fftw", "r");
if (file) {
///* fftw variables */
fftw_complex *out;
double *wisdomInput = (double *) fftw_malloc(sizeof(double)*width*2*(height/2 +1 ));
const fftw_plan forward =fftw_plan_dft_r2c_2d(width,height, wisdomInput,reinterpret_cast<fftw_complex *>(wisdomInput),FFTW_PATIENT);
const fftw_plan inverse = fftw_plan_dft_c2r_2d(width, height,reinterpret_cast<fftw_complex *>(wisdomInput),wisdomInput, FFTW_PATIENT);
file = fopen("wisdom.fftw", "w");
if (file) {
Finally, I execute the FFTW plans. I get an access violation in the first
call (fftw_execute_dft_r2c) and I don't know why. I read this tutorial:
I allocate with (ny/2+1) as explained there, so I don't understand why it is not working. I am testing different sizes.
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * width *(height / 2 + 1));
double *result =(double *)fftw_malloc(width * (height+2) * sizeof(double));
This is the corrected code.
It had a few mistakes:
It was reading a wrong wisdom.fftw file (left over from an old test). Now it always creates a new fftw_plan and a new file.
I misunderstood how the FFTW library treats in-place and out-of-place parameters. I had to change the mallocs to include the correct padding for in-place transforms (I added +2 in the malloc calls).
To restore the image, I had to divide by its size (width * height, as in the code below), as explained in this link.
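The padding rule for in-place real transforms can be sketched without linking FFTW. The helpers below are hypothetical (paddedRowLength, paddedBufferLength and unpadAndNormalize are not FFTW functions). They assume a row-major buffer in which n0 counts rows and n1 is the contiguous dimension, which is the order fftw_plan_dft_r2c_2d(n0, n1, ...) expects. Each row then needs 2*(n1/2+1) doubles, which is n1+2 for even n1 and n1+1 for odd n1, and the round trip is scaled by n0*n1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Doubles per row of an in-place r2c buffer whose contiguous dimension
// holds n1 real samples: room for n1/2+1 complex values.
std::size_t paddedRowLength(int n1) {
    assert(n1 > 0);
    return 2 * (static_cast<std::size_t>(n1) / 2 + 1);
}

// Total doubles to allocate for n0 rows of n1 real samples each.
std::size_t paddedBufferLength(int n0, int n1) {
    assert(n0 > 0);
    return static_cast<std::size_t>(n0) * paddedRowLength(n1);
}

// Copies a padded buffer (as left by an inverse transform) into a tightly
// packed n0 x n1 array, skipping the padding and dividing by n0*n1.
std::vector<double> unpadAndNormalize(const std::vector<double>& padded,
                                      int n0, int n1) {
    assert(padded.size() == paddedBufferLength(n0, n1));
    const std::size_t stride = paddedRowLength(n1);
    const double scale = 1.0 / (static_cast<double>(n0) * n1);
    std::vector<double> packed(static_cast<std::size_t>(n0) * n1);
    for (int r = 0; r < n0; ++r) {
        for (int c = 0; c < n1; ++c) {
            packed[static_cast<std::size_t>(r) * n1 + c] =
                padded[static_cast<std::size_t>(r) * stride + c] * scale;
        }
    }
    return packed;
}
```

For a 256x256 image this gives 258 doubles per row, which matches the +2 mentioned above. When width and height differ, it matters which of the two is passed as n1.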
/* load image */
string bmpFileNameImage = "files/polyp.bmp";
BMPImage bmpImage(bmpFileNameImage);
int width = bmpImage.width();
int height = bmpImage.height();
vector<double> pixelColors;
vector<uint8_t> image = bmpImage.copyBits();
//get one channel from the image
//We don't reuse old wisdom.fftw... It can be corrupt
FILE * file = fopen("wisdom.fftw", "r");
if (file) {
} */
double *wisdomInput = (double *) fftw_malloc(sizeof(double)*height*(width+2));
const fftw_plan forward =fftw_plan_dft_r2c_2d(width,height,wisdomInput,reinterpret_cast<fftw_complex *>(wisdomInput),FFTW_PATIENT);
const fftw_plan inverse = fftw_plan_dft_c2r_2d(width,height,reinterpret_cast<fftw_complex *>(wisdomInput),wisdomInput, FFTW_PATIENT);
double *bitsColors =(double *)fftw_malloc((width) * height * sizeof(double));
for (int y = 0; y < height; y++) {
for (int x = 0; x < width+2; x++) {
if (x < width) {
int currentIndex = ((y * width) + (x));
bitsColors[currentIndex] = (static_cast<double>(result[y * (width+2) + x])) / (height*width);
fftw_free (wisdomInput);
fftw_free (out);
fftw_free (result);
fftw_free (bitsColors);
What are you doing here? The array is already a pointer.
Change it to fftw_execute_dft_r2c(forward,pixelColors[0],out); it should work now.
Maybe the problem is here (http://www.fftw.org/doc/New_002darray-Execute-Functions.html):
[...] that the following conditions are met:
The input and output arrays are the same (in-place) or different (out-of-place) if the plan was originally created to be in-place or
out-of-place, respectively.
In the plan you are using in-place transformation parameters (with bad allocation, BTW, since:
double *wisdomInput = (double *) fftw_malloc(sizeof(double)*width*2*(height/2 +1 ));
should be:
double *wisdomInput = (double *) fftw_malloc(sizeof(fftw_complex)*width*2*(height/2 +1 ));
to be suitable for output too).
But you're calling fftw_execute_dft_r2c function with out-of-place parameters.
So I have this piece of code:
if(channels == 3)
    type = CV_32FC3;
else
    type = CV_32FC1;
cv::Mat M(rows,cols,type);
std::cout<<"Cols:"<<cols<<" ColsMat:"<<M.cols<<std::endl;
float * source_data = (float*) M.data;
// copying the data into the corresponding pixel
for (int r = 0; r < rows; r++)
{
    float* source_row = source_data + (r * rows * channels);
    for (int c = 0; c < cols ; c++)
    {
        float* source_pixel = source_row + (c * channels);
        for (int ch = 0; ch < channels; ch++)
        {
            std::cout<<"Row:"<<r<<" Col:"<<c<<" Channel:"<<ch<<std::endl;
            std::cout<<"Type check: "<<typeid(T_M(0,r,c,ch)).name()<<std::endl;
            float* source_value = source_pixel + ch;
            *source_value = T_M(0, r, c, ch);
        }
    }
}
T_M is an Eigen::Tensor
I first thought that I got the error from T_M but it isn't the case.
I tried accessing *source_value and I am mostly sure that is the source of the error.
Funny thing is that I don't get the error in the end or the beginning. I get the seg fault around the middle.
For example, with rows: 915, cols: 793, and channels:1
I get the error at Row:829 Col:729 Channel:0.
I can't figure out the source of this error.
You compute your row pointer incorrectly; it should use cols instead of rows:
float* source_row = source_data + (r * cols * channels);
In general, you must be very careful when you use a flat representation of a matrix; it is really error-prone.
The answer from Jean-François Fabre will work if the matrix is continuous. If you can't be sure about that (e.g. if the matrix is provided by someone else, or if you use submatrices), you should use the step (width step) member to compute the row pointer:
float* source_row = (float*)(M.data + r*M.step);
This automatically accounts for the number of channels, padding, etc.
Even simpler is to use the row pointer function directly:
float* source_row = (float*)(M.ptr(r));
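The arithmetic behind the crash can be checked with a small stand-in for the matrix buffer. FlatImage and firstOutOfBoundsRow below are hypothetical names, not OpenCV API. The sketch assumes a row stride (stepElems) counted in elements, whereas cv::Mat::step is counted in bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an image buffer with optional row padding.
struct FlatImage {
    int rows;
    int cols;
    int channels;
    std::size_t stepElems;      // elements per row, padding included
    std::vector<float> data;

    FlatImage(int r, int c, int ch, std::size_t padElems = 0)
        : rows(r), cols(c), channels(ch),
          stepElems(static_cast<std::size_t>(c) * ch + padElems),
          data(stepElems * static_cast<std::size_t>(r), 0.0f) {}

    // Row start computed from the stride, never from rows.
    float* rowPtr(int r) {
        assert(r >= 0 && r < rows);
        return data.data() + static_cast<std::size_t>(r) * stepElems;
    }

    float& at(int r, int c, int ch) {
        assert(c >= 0 && c < cols && ch >= 0 && ch < channels);
        return rowPtr(r)[static_cast<std::size_t>(c) * channels + ch];
    }
};

// First row whose start offset, computed the buggy way (r * rows * channels),
// lies outside a continuous rows x cols x channels buffer.
// Returns rows if every row start stays inside.
int firstOutOfBoundsRow(int rows, int cols, int channels) {
    const std::size_t total =
        static_cast<std::size_t>(rows) * cols * channels;
    for (int r = 0; r < rows; ++r) {
        if (static_cast<std::size_t>(r) * rows * channels >= total) {
            return r;
        }
    }
    return rows;
}
```

With rows = 915 and cols = 793, the buggy row start first leaves the buffer at row 793. A crash somewhat later (row 829 in the question) is plausible, because out-of-bounds writes only fault once they reach unmapped memory.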
I have a problem with the DFT algorithm of the FFTW library.
All I want to do is transform a certain pattern forward and backward to get the input pattern back; later on there will, of course, be some sort of filtering between the two transformations.
So, what my program does at the moment is:
Create a test signal
Filter or "window" the test signal with a value of 1.0 or 0.5
Copy the test signal to a fftw_complex data type
Perform a forward and backward dft
Calculate the magnitude, which is called phase here
Copy and adjust data for display purposes, and finally display the images via OpenCV
My problem is that when I use no filtering, my backward-transformed image is wrapped somehow and I can't calculate the correct magnitude, which should be identical to my input image / test signal.
When I set the filter/"window" to a value of 0.5, the backward transformation works fine, but my input image is just half as bright as it should be.
The following image illustrates my problem: (from top left to bottom right)
1. Input signal, 2. Real part of backward transformation, 3. Magnitude calculated from the backward-transformed data, 4. Input signal multiplied by 0.5, 5. Real part of backward transformation, 6. Magnitude calculated from the backward-transformed data.
Does anybody have an idea why the DFT behaves that way? It's kind of strange...
My code currently looks like this:
/***** parameters **************************************************************************/
int imSize = 256;
int imN = imSize * imSize;
char* interferogram = new char[imN];
double* spectrumReal = new double[imN];
double* spectrumImaginary = new double[imN];
double* outputReal = new double[imN];
double* outputImaginary = new double[imN];
double* phase = new double[imN];
char* spectrumRealChar = new char[imN];
char* spectrumImaginaryChar = new char[imN];
char* outputRealChar = new char[imN];
char* outputImaginaryChar = new char[imN];
char* phaseChar = new char[imN];
Mat interferogramMat = Mat(imSize, imSize, CV_8U, interferogram);
Mat spectrumRealCharMat = Mat(imSize, imSize, CV_8U, spectrumRealChar);
Mat spectrumImaginaryCharMat = Mat(imSize, imSize, CV_8U, spectrumImaginaryChar);
Mat outputRealCharMat = Mat(imSize, imSize, CV_8U, outputRealChar);
Mat outputImaginaryCharMat = Mat(imSize, imSize, CV_8U, outputImaginaryChar);
Mat phaseCharMat = Mat(imSize, imSize, CV_8U, phaseChar);
/***** compute interferogram ****************************************************************/
fill_n(interferogram, imN, 0);
double value = 0;
double window = 0;
for (int y = 0; y < imSize; y++)
for (int x = 0; x < imSize; x++)
value = 127.5 + 127.5 * cos((2*PI) / 10000 * (pow(double(x - imSize/2), 2) + pow(double(y - imSize/2), 2)));
window = 1;
value *= window;
interferogram[y * imSize + x] = (unsigned char)value;
/***** create fftw arrays and plans *********************************************************/
fftw_complex* input;
fftw_complex* spectrum;
fftw_complex* output;
fftw_plan p_fw;
fftw_plan p_bw;
input = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * imN);
spectrum = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * imN);
output = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * imN);
p_fw = fftw_plan_dft_2d(imSize, imSize, input, spectrum, FFTW_FORWARD, FFTW_ESTIMATE);
p_bw = fftw_plan_dft_2d(imSize, imSize, spectrum, output, FFTW_BACKWARD, FFTW_ESTIMATE);
/***** copy data ****************************************************************************/
for (int i = 0; i < imN; i++)
input[i][0] = double(interferogram[i]) / 255.;
input[i][1] = 0.;
spectrum[i][0] = 0.;
spectrum[i][1] = 0.;
output[i][0] = 0.;
output[i][1] = 0.;
/***** FPS algorithm ************************************************************************/
for (int i = 0; i < imN; i++)
phase[i] = sqrt(pow(output[i][0], 2) + pow(output[i][1], 2));
/***** copy data ****************************************************************************/
for (int i = 0; i < imN; i++)
spectrumReal[i] = spectrum[i][0];
spectrumImaginary[i] = spectrum[i][1];
outputReal[i] = output[i][0] / imN;
outputImaginary[i] = output[i][1];
SaveCharImage(interferogram, imN, "01_interferogram_512px_8bit.raw");
SaveDoubleImage(spectrumReal, imN, "02_spectrum_real_512px_64bit.raw");
SaveDoubleImage(spectrumImaginary, imN, "03_spectrum_imaginary_512px_64bit.raw");
SaveDoubleImage(outputReal, imN, "03_output_real_512px_64bit.raw");
DoubleToCharArray(spectrumReal, spectrumRealChar, imSize);
DoubleToCharArray(spectrumImaginary, spectrumImaginaryChar, imSize);
DoubleToCharArray(outputReal, outputRealChar, imSize);
DoubleToCharArray(outputImaginary, outputImaginaryChar, imSize);
DoubleToCharArray(phase, phaseChar, imSize);
/***** show images **************************************************************************/
imshow("interferogram", interferogramMat);
imshow("spectrum real", spectrumRealCharMat);
imshow("spectrum imaginary", spectrumImaginaryCharMat);
imshow("out real", outputRealCharMat);
imshow("out imaginary", outputImaginaryCharMat);
imshow("phase", phaseCharMat);
int key = waitKey(0);
Here are some lines of your code :
char* interferogram = new char[imN];
double value = 0;
double window = 0;
for (int y = 0; y < imSize; y++)
for (int x = 0; x < imSize; x++)
value = 127.5 + 127.5 * cos((2*PI) / 10000 * (pow(double(x - imSize/2), 2) + pow(double(y - imSize/2), 2)));
window = 1;
value *= window;
interferogram[y * imSize + x] = (unsigned char)value;
The problem is that char is signed on most platforms and then ranges from -128 to 127, while unsigned char ranges from 0 to 255. In interferogram[y * imSize + x] = (unsigned char)value;, the result is implicitly converted back to char.
This does not affect the output if window=0.5, but it does if window=1, since value then exceeds 127. This is exactly the problem that you noticed in your question!
It does not affect the first displayed image since CV_8U corresponds to unsigned char: the interferogram buffer is therefore read back as unsigned char*. Take a look at "Can I turn unsigned char into char and vice versa?" to learn more about conversions between char and unsigned char.
The problem occurs at input[i][0] = double(interferogram[i]) / 255.; : if window=1, interferogram[i] may be negative, and input[i][0] then becomes negative.
Change all char to unsigned char and it should solve the problem.
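The effect can be reproduced in isolation. The two helpers below are hypothetical and use signed char explicitly, so the behaviour does not depend on whether plain char is signed on a given compiler. Converting a value above 127 to signed char is implementation-defined before C++20, but on the usual two's-complement platforms 200 becomes -56.

```cpp
#include <cassert>

// Mimics the question: the pixel is computed as unsigned char but kept in a
// signed 8-bit element before being normalized to [0, 1].
double normalizedViaSignedChar(double value) {
    assert(value >= 0.0 && value < 256.0);
    signed char stored =
        static_cast<signed char>(static_cast<unsigned char>(value));
    return static_cast<double>(stored) / 255.0;
}

// The suggested fix: keep the pixel in an unsigned 8-bit element.
double normalizedViaUnsignedChar(double value) {
    assert(value >= 0.0 && value < 256.0);
    unsigned char stored = static_cast<unsigned char>(value);
    return static_cast<double>(stored) / 255.0;
}
```

Values up to 127 give the same result in both functions, which is why the window of 0.5 hides the problem.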
You may also change
outputReal[i] = output[i][0] / imN;
outputImaginary[i] = output[i][1];
to
outputReal[i] = output[i][0];
outputImaginary[i] = output[i][1];
Calls to fftw seem to be fine.
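Whichever scaling is chosen, it helps to remember that FFTW computes unnormalized transforms, so a forward transform followed by a backward one returns the input multiplied by the number of points. The sketch below shows this with a naive O(n^2) DFT written only for illustration (naiveDft and roundTrip are hypothetical helpers, not FFTW calls); the sign argument follows the FFTW_FORWARD (-1) and FFTW_BACKWARD (+1) convention.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Signal = std::vector<std::complex<double>>;

// Naive unnormalized DFT; sign = -1 for forward, +1 for backward.
Signal naiveDft(const Signal& in, int sign) {
    const double pi = std::acos(-1.0);
    const std::size_t n = in.size();
    Signal out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = sign * 2.0 * pi *
                                 static_cast<double>(k * j) /
                                 static_cast<double>(n);
            sum += in[j] * std::complex<double>(std::cos(angle),
                                                std::sin(angle));
        }
        out[k] = sum;
    }
    return out;
}

// Forward then backward transform, with real and imaginary parts both
// divided by n so that the input is recovered.
Signal roundTrip(const Signal& in) {
    Signal back = naiveDft(naiveDft(in, -1), +1);
    for (auto& v : back) {
        v /= static_cast<double>(in.size());
    }
    return back;
}
```

If the magnitude is computed from the backward output, it needs the same scaling as the real and imaginary parts to be comparable with the input.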
I need each thread to write and read a private location in global memory. Below I post a working code showing my problem. In the following, I'll list the main variables and structures involved.
srcArr_h (host) --> srcArr_d (device) : array of random floats in the range [0, COLORLEVELS] with dimensions given by ARRDIM
auxD (device) : array of dimension ARRDIM * ARRDIM holding the final result in device
auxH (host) : array of dimension ARRDIM * ARRDIM holding the final result in host
c_glob_d (device) : array that reserves a private location of COLORLEVELS floats for each thread, with size given by num_threads * COLORLEVELS
idx (device) : identification number of current thread
My problem: in the kernel, I update c_glob[idx] for each value ic (ic ∈ [0, COLORLEVELS]), i.e. c_glob[idx][ic]. I use c_glob[idx][COLORLEVELS] to compute the final result g0 stored in auxD. The final results are wrong: the values copied to auxH are at least one order of magnitude bigger than expected, or even weird numbers suggesting that the operation overflows.
Help: what am I doing wrong? How can I make each thread write and read its own private location in global memory? Right now I'm debugging with ARRDIM = 512, but my goal is to make it work for ARRDIM ~ 10^4 (thus creating a c_glob array for 10^4*10^4 threads). I guess I will have issues with the total number of threads allowed per run, so I was wondering if you could suggest any other solution to my problem.
Thank you.
#include <string>
#include <stdint.h>
#include <iostream>
#include <stdio.h>
#include "cuPrintf.cu"
using namespace std;
#define ARRDIM 512
__global__ void gpuKernel
float *sa, float *aux,
size_t memPitchAux, int w,
float *c_glob
float sc_loc[COLORLEVELS];
float g0=0.0f;
int tidx = blockIdx.x * blockDim.x + threadIdx.x;
int tidy = blockIdx.y * blockDim.y + threadIdx.y;
int idx = tidy * memPitchAux/4 + tidx;
for(int ic=0; ic<COLORLEVELS; ic++)
sc_loc[ic] = ((float)(ic*ic));
for(int is=0; is<COLORLEVELS; is++)
int ic = fabs(sa[tidy*w +tidx]);
c_glob[tidy * COLORLEVELS + tidx + ic] += 1.0f;
for(int ic=0; ic<COLORLEVELS; ic++)
g0 += c_glob[tidy * COLORLEVELS + tidx + ic]*sc_loc[ic];
aux[idx] = g0;
int main(int argc, char* argv[])
* array src host and device
int heightSrc = ARRDIM;
int widthSrc = ARRDIM;
float *srcArr_h, *srcArr_d;
size_t nBytesSrcArr = sizeof(float)*heightSrc * widthSrc;
srcArr_h = (float *)malloc(nBytesSrcArr); // Allocate array on host
cudaMalloc((void **) &srcArr_d, nBytesSrcArr); // Allocate array on device
cudaMemset((void*)srcArr_d,0,nBytesSrcArr); // set to zero
int totArrElm = heightSrc*widthSrc;
for(int ic=0; ic<totArrElm; ic++)
srcArr_h[ic] = (float)(rand() % COLORLEVELS);
cudaMemcpy( srcArr_d, srcArr_h,nBytesSrcArr,cudaMemcpyHostToDevice);
* auxiliary buffer auxD to save final results
float *auxD;
size_t auxDPitch;
cudaMemset2D(auxD, auxDPitch, 0, widthSrc*sizeof(float), heightSrc);
* auxiliary buffer auxH allocation + initialization on host
size_t auxHPitch;
auxHPitch = widthSrc*sizeof(float);
float *auxH = (float *) malloc(heightSrc*auxHPitch);
* kernel launch specs
int thpb_x = 16;
int thpb_y = 16;
int blpg_x = (int) widthSrc/thpb_x;
int blpg_y = (int) heightSrc/thpb_y;
int num_threads = blpg_x * thpb_x + blpg_y * thpb_y;
* c_glob: array that reserves a private location of COLORLEVELS floats for each thread
int cglob_w = COLORLEVELS;
int cglob_h = num_threads;
float *c_glob_d;
size_t c_globDPitch;
cudaMemset2D(c_glob_d, c_globDPitch, 0, cglob_w*sizeof(float), cglob_h);
* kernel launch
dim3 dimBlock(thpb_x,thpb_y, 1);
dim3 dimGrid(blpg_x,blpg_y,1);
gpuKernel<<<dimGrid,dimBlock>>>(srcArr_d,auxD, auxDPitch, widthSrc, c_glob_d);
auxHPitch, heightSrc,
float min = auxH[0];
float max = auxH[0];
float f;
string str;
for(int i=0; i<widthSrc*heightSrc; i++)
if(min > auxH[i])
min = auxH[i];
if(max < auxH[i])
max = auxH[i];
You showed neither the whole code nor a reduced version that reproduces your problem, so it has not been possible to test and verify the possible solution below.
I think you have spotted the source of the problem: multiple threads are trying to write to the same memory locations in parallel. This situation leads to race conditions. For an example, see the fourth slide of the presentation "CUDA C: race conditions, atomics, locks, mutex, and warps".
Race conditions have a brute-force solution: atomic functions. They are described in Section B.12 of the CUDA C Programming Guide. So you can try to fix your problem by replacing the line
c[ic] += 1.0f;
with an atomic update, for example atomicAdd(&c[ic], 1.0f);
You will pay for this fix in performance: atomic operations serialize the conflicting updates to avoid race conditions.
I call atomic functions a brute-force solution because, by properly rethinking the implementation, you may be able to avoid them. But that is impossible to say for now, given the few details you provided.
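The overlap between threads can be seen from the index arithmetic alone, so the sketch below is plain host-side C++ rather than a kernel. questionIndex reproduces the expression used in the question's kernel. privateIndex and privateSlotCount are hypothetical helpers showing one possible layout in which every thread owns colorLevels consecutive floats. The value 256 for colorLevels is an assumption, since the definition of COLORLEVELS is not shown.

```cpp
#include <cassert>
#include <cstddef>

// Index expression from the question's kernel: neighbouring threads
// reach the same slots.
std::size_t questionIndex(int tidx, int tidy, int ic, int colorLevels) {
    return static_cast<std::size_t>(tidy) * colorLevels + tidx + ic;
}

// Hypothetical private layout: thread (tidx, tidy) of a width-wide grid
// owns slots [base, base + colorLevels).
std::size_t privateIndex(int tidx, int tidy, int ic,
                         int width, int colorLevels) {
    assert(tidx >= 0 && tidx < width);
    assert(ic >= 0 && ic < colorLevels);
    const std::size_t thread =
        static_cast<std::size_t>(tidy) * width + tidx;
    return thread * colorLevels + ic;
}

// Number of floats the private layout needs: one block per thread,
// i.e. the product of the grid dimensions, not their sum.
std::size_t privateSlotCount(int width, int height, int colorLevels) {
    return static_cast<std::size_t>(width) * height * colorLevels;
}
```

In the question, thread (0,0) with ic = 1 and thread (1,0) with ic = 0 both map to slot 1, and num_threads is computed as a sum (blpg_x * thpb_x + blpg_y * thpb_y), which gives 1024 for the 512x512 launch instead of 262144 threads. Under the assumed 256 levels, the private layout for ARRDIM = 10^4 would need about 2.56 * 10^10 floats (roughly 100 GB), so a per-thread array in local memory or a restructured algorithm is likely to be needed at that size.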
I am having trouble understanding a particular area of code in the Steinberg VST Synth example.
In this function:
void VstXSynth::processReplacing (float** inputs, float** outputs, VstInt32 sampleFrames)
float* out1 = outputs[0];
float* out2 = outputs[1];
if (noteIsOn)
float baseFreq = freqtab[currentNote & 0x7f] * fScaler;
float freq1 = baseFreq + fFreq1; // not really linear...
float freq2 = baseFreq + fFreq2;
float* wave1 = (fWaveform1 < .5) ? sawtooth : pulse;
float* wave2 = (fWaveform2 < .5) ? sawtooth : pulse;
float wsf = (float)kWaveSize;
float vol = (float)(fVolume * (double)currentVelocity * midiScaler);
VstInt32 mask = kWaveSize - 1;
if (currentDelta > 0)
if (currentDelta >= sampleFrames) // future
currentDelta -= sampleFrames;
memset (out1, 0, currentDelta * sizeof (float));
memset (out2, 0, currentDelta * sizeof (float));
out1 += currentDelta;
out2 += currentDelta;
sampleFrames -= currentDelta;
currentDelta = 0;
// loop
while (--sampleFrames >= 0)
// this is all very raw, there is no means of interpolation,
// and we will certainly get aliasing due to non-bandlimited
// waveforms. don't use this for serious projects...
(*out1++) = wave1[(VstInt32)fPhase1 & mask] * fVolume1 * vol;
(*out2++) = wave2[(VstInt32)fPhase2 & mask] * fVolume2 * vol;
fPhase1 += freq1;
fPhase2 += freq2;
memset (out1, 0, sampleFrames * sizeof (float));
memset (out2, 0, sampleFrames * sizeof (float));
The way I understand the function is that if a MIDI note is currently on, we need to copy our wave table into the outputs array to pass back to the VST host. What I don't understand specifically is what the code in the if (currentDelta > 0) conditional block is doing. It seems like it's just writing zeros to the output arrays...
A full version of the file can be found at http://pastebin.com/SdAXkRyW
The incoming MIDI NoteOn event can have an offset relative to the start of the buffers you receive (called deltaFrames). currentDelta keeps track of when the note should start playing relative to the start of the buffers received.
So if currentDelta >= sampleFrames, the note should not play in this cycle (it lies in the future), and the function exits early.
If currentDelta is within range of this cycle, the output is cleared up to the moment the note should produce sound (memset), and the pointers are advanced so that the buffers appear to begin right where the sound should start; the length (sampleFrames) is adjusted accordingly.
Then, in the loop, the sound is produced.
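The same bookkeeping can be sketched on a single buffer, outside the VST SDK. renderWithDelta is a hypothetical helper: it writes silence up to the note start and a constant level afterwards, standing in for the wavetable lookup. Unlike the excerpt above, this sketch also clears the buffer on the early-exit path, which is a choice made here and not something taken from the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Fills `out` for one processing cycle and returns the number of frames
// that actually carry sound. `currentDelta` is the note's start offset in
// frames and is updated for the next cycle.
int renderWithDelta(std::vector<float>& out, int& currentDelta, float level) {
    int sampleFrames = static_cast<int>(out.size());
    float* p = out.data();

    if (currentDelta > 0) {
        if (currentDelta >= sampleFrames) {
            // The note starts in a later buffer: consume this cycle's frames.
            currentDelta -= sampleFrames;
            std::fill(out.begin(), out.end(), 0.0f);
            return 0;
        }
        // Silence before the note, then move the write position to its start.
        std::fill(p, p + currentDelta, 0.0f);
        p += currentDelta;
        sampleFrames -= currentDelta;
        currentDelta = 0;
    }

    std::fill(p, p + sampleFrames, level);
    return sampleFrames;
}
```

Calling it with an 8-frame buffer and currentDelta = 10 produces no sound and leaves currentDelta at 2; the next call then writes 2 frames of silence followed by 6 frames of sound.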
Hope it helps.
What I want to do is compute the FFT of an image (stored in "imagen"), multiply it by a filter 'H', and then compute the inverse FFT.
The code is shown below:
int ancho;
int alto;
ancho=ui.imageframe->imagereader->GetBufferedRegion().GetSize()[0]; //ancho=width of the image
alto=ui.imageframe->imagereader->GetBufferedRegion().GetSize()[1]; //alto=height of the image
double *H ;
H =matrix2D_H(ancho,alto,eta,sigma); // H is calculated
// We want to get: F= fft(f) ; H*F ; f'=ifft(H*F)
// Initialization of the necessary elements for the calculation of the FFT
fftw_complex *out;
fftw_plan p;
int N= (ancho/2+1)*alto; // number of complex values in the FFT output
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*N);
double *in = (double*) imagen.GetPointer(); // conversion of itk.smartpointer --> double*
p = fftw_plan_dft_r2c_2d(ancho, alto, in, out, FFTW_ESTIMATE); // FFT planning
fftw_execute(p); // FFT calculation
/* Multiplication of the Output of the FFT with the Filter H*/
int a = alto;
int b = ancho/2 +1; // The second dimension has this size because the FFT of a real image only produces the non-redundant outputs; this is why the FFT output and the filter 'H' have the same size.
// Point-by-point matrix multiplication: [axb]*[axb]
fftw_complex* res ; // result will be stored here
res = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*a*b);
res = multiply_matrix_2D(out,H, a, b);
The problem is located here, in the loop inside the function 'multiply_matrix_2D':
fftw_complex* prueba_r01::multiply_matrix_2D(fftw_complex* out, double* H, int M ,int N){
/* The matrix out[MxN], i.e. [n0 x (n1/2+1)], is the image after the FFT, and out_H[MxN] is the filter in the frequency domain;
both are multiplied POINT BY POINT. It has to be called twice, once for the imaginary part and once for the real part. */
fftw_complex *H_cast;
H_cast = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*M*N);
H_cast= reinterpret_cast<fftw_complex*> (H); // casting from double* to fftw_complex*
fftw_complex *res; // the result of the multiplication will be stored here
res = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*M*N);
//Loop for calculating the matrix point-to-point multiplication
for (int x = 0; x<M ; x++){
for (int y = 0; y<N ; y++){
res[x*N+y][0] = out[x*N+y][0]*(H_cast[x*N+y][0]+H_cast[x*N+y][1]);
res[x*N+y][1] = out[x*N+y][1]*(H_cast[x*N+y][0]+H_cast[x*N+y][1]);
return res;
The error occurs at x = 95 and y = 93, with M = 191 and N = 96:
Unhandled exception at 0x004273ab in prueba_r01.exe: 0xC0000005: Access violation reading location 0x01274000.
imagen http://img846.imageshack.us/img846/4585/accessviolationproblem.png
In the debugger, many of the variable values are shown in red and, translated, the value box for H_cast[][1] reads: "CXX0030: Error: expression cannot be evaluated".
I would really appreciate any kind of help!
This part of the code
H_cast = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*M*N);
H_cast= reinterpret_cast<fftw_complex*> (H); // casting from double* to fftw_complex*
first allocates a new buffer for H_cast and then immediately sets it to point to the original H instead. It doesn't copy the data, just the pointer.
At the end of the function, some buffer is freed
which seems to free the data pointed to by H and not the buffer allocated in the function.
Back in the caller, the H there is lost!
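A minimal sketch of the multiplication without the cast is given below. It assumes that the filter holds exactly one real gain per frequency bin (M*N doubles), which cannot be confirmed because matrix2D_H is not shown. Under that assumption, reading H through an fftw_complex* touches 2*M*N doubles and runs past the end of the buffer about halfway through the loop, which would be consistent with the failure at x = 95 of M = 191. applyRealFilter is a hypothetical helper that uses std::complex<double>, a type the FFTW documentation describes as layout-compatible with fftw_complex.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Multiplies each complex frequency bin by the matching real filter gain.
// Both real and imaginary parts are scaled in a single pass.
std::vector<std::complex<double>> applyRealFilter(
        const std::vector<std::complex<double>>& spectrum,
        const std::vector<double>& filter) {
    // One gain per bin: the sizes must match exactly.
    assert(spectrum.size() == filter.size());
    std::vector<std::complex<double>> result(spectrum.size());
    for (std::size_t i = 0; i < spectrum.size(); ++i) {
        result[i] = spectrum[i] * filter[i];
    }
    return result;
}
```

Because the result is returned by value and no pointer is reassigned, nothing has to be freed inside the function and the caller's filter stays valid.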
ITK includes an FFT class that can use FFTW (enable USE_FFTW in the CMake configuration). This class shows how to reference the ITK raw buffer memory from FFTW.
PS: The upcoming ITKv4 has greatly improved the fftw compatibility.