3D Convolution with Intel MKL - c++

I have written a C/C++ code which uses Intel MKL to compute the 3D convolution of an array which has about 300×200×200 elements. I want to apply a kernel which is either 3×3×3 or 5×5×5. Both the 3D input array and the kernel have real values.
This 3D array is stored as a 1D array of type double in a columnwise fashion. Similarly the kernel is of type double and is saved columnwise. For example,
for( int k = 0; k < nk; k++ ) // Loop through the height.
for( int j = 0; j < nj; j++ ) // Loop through the rows.
for( int i = 0; i < ni; i++ ) // Loop through the columns.
{
ijk = i + ni * j + ni * nj * k;
my3Darray[ ijk ] = 1.0;
}
For the computation of convolution, I want to perform not-in-place FFT on the input array and the kernel and prevent them from getting modified (I need to use them later in my code) and then do the backward computation in-place.
When I compare the result obtained from my code with the one obtained by MATLAB they are very different. Could someone kindly help me fix the issue? What is missing in my code?
Here is the MATLAB code I used:
a = ones( 10, 10, 10 );
kernel = ones( 3, 3, 3 );
aconvolved = convn( a, kernel, 'same' );
Here is my C/C++ code:
#include <stdio.h>
#include "mkl.h"
void Conv3D(
double *in, double *ker, double *out,
int nRows, int nCols, int nHeights)
{
int NI = nRows;
int NJ = nCols;
int NK = nHeights;
double *in_fft = new double [NI*NJ*NK];
double *ker_fft = new double [NI*NJ*NK];
DFTI_DESCRIPTOR_HANDLE fft_desc = 0;
MKL_LONG sizes[] = { NK, NJ, NI };
MKL_LONG strides[] = { 0, NJ*NI, NI, 1 };
DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_REAL, 3, sizes );
DftiSetValue ( fft_desc, DFTI_PLACEMENT , DFTI_NOT_INPLACE); // Out-of-place computation.
DftiSetValue ( fft_desc, DFTI_INPUT_STRIDES , strides );
DftiSetValue ( fft_desc, DFTI_OUTPUT_STRIDES, strides );
DftiSetValue ( fft_desc, DFTI_BACKWARD_SCALE, 1.0 / ( (double)NI * NJ * NK ) ); // Floating-point division; 1/NI/NJ/NK is integer division and yields 0.
DftiCommitDescriptor( fft_desc );
DftiComputeForward ( fft_desc, in , in_fft );
DftiComputeForward ( fft_desc, ker, ker_fft );
for (long long i = 0; i < (long long)NI*NJ*NK; ++i )
out[i] = in_fft[i]*ker_fft[i];
// In-place computation.
DftiSetValue ( fft_desc, DFTI_PLACEMENT, DFTI_INPLACE );
DftiCommitDescriptor( fft_desc );
DftiComputeBackward ( fft_desc, out );
DftiFreeDescriptor ( &fft_desc );
delete[] in_fft;
delete[] ker_fft;
}
int main(int argc, char* argv[])
{
int n = 10;
int nkernel = 3;
double *a = new double [n*n*n]; // This array is real.
double *aconvolved = new double [n*n*n]; // The convolved array is also real.
double *kernel = new double [nkernel*nkernel*nkernel]; // kernel is real.
// Fill the array with some 'real' numbers.
for( int i = 0; i < n*n*n; i++ )
a[ i ] = 1.0;
// Fill the kernel with some 'real' numbers.
for( int i = 0; i < nkernel*nkernel*nkernel; i++ )
kernel[ i ] = 1.0;
// Calculate the convolution.
Conv3D( a, kernel, aconvolved, n, n, n );
printf("Convolved:\n");
for( int i = 0; i < n*n*n; i++ )
printf( "%15.8f\n", aconvolved[i] );
delete[] a;
delete[] kernel;
delete[] aconvolved;
return 0;
}

You can't reverse the FFT with real-valued frequency data (just the magnitude). A forward FFT needs to output complex data. This is done by setting the DFTI_FORWARD_DOMAIN setting to DFTI_COMPLEX.
DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_COMPLEX, 3, sizes );
Doing this implicitly sets the backward domain to complex too.
You will also need a complex data type. Probably something like,
MKL_Complex16* in_fft = new MKL_Complex16[NI*NJ*NK];
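This means you will have to perform a full complex multiplication, which mixes the real and imaginary parts (multiplying them separately is not enough):
for (size_t i = 0; i < (size_t)NI*NJ*NK; ++i) {
out_fft[i].real = in_fft[i].real * ker_fft[i].real - in_fft[i].imag * ker_fft[i].imag;
out_fft[i].imag = in_fft[i].real * ker_fft[i].imag + in_fft[i].imag * ker_fft[i].real;
}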
The output of the inverse FFT is also complex, and assuming your input data is real, you can just grab the .real component and that is your result. This means you'll need a temporary complex output array (say, out_fft as above).
Also note that to avoid wrap-around artifacts, you want the size of your FFT to be (at least) M+N-1 in each dimension, where M and N are the sizes of the input and the kernel along that dimension. Generally you would choose the next highest power of two for speed.
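As a rough illustration of the size rule above, the padded length per dimension could be computed like this. The helper names linearConvLength and nextPowerOfTwo are made up for this sketch and are not part of MKL.

```cpp
#include <cassert>
#include <cstddef>

// Smallest transform length that avoids circular wrap-around when a signal
// of length m is convolved with a kernel of length n (hypothetical helper).
std::size_t linearConvLength(std::size_t m, std::size_t n) {
    assert(m > 0 && n > 0);
    return m + n - 1;
}

// Round len up to the next power of two; returns len if it already is one.
std::size_t nextPowerOfTwo(std::size_t len) {
    assert(len > 0);
    std::size_t p = 1;
    while (p < len) p <<= 1;
    return p;
}
```

For the 300×200×200 array and a 5×5×5 kernel this gives minimum lengths of 304, 204 and 204, which round up to 512, 256 and 256.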
I strongly suggest you implement it in MATLAB first, using FFTs. There are many such implementations available (example), but I would start from the basics and make a simple function on your own.
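Before debugging the FFT version, it can help to have a slow reference to compare against. The following is only a sketch of a direct (non-FFT) 3D convolution that keeps the central part, which is what convn( a, kernel, 'same' ) is understood to return for odd kernel sizes. The function name conv3dSame is an assumption of this example, and the arrays use the column-major layout from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct 3D convolution with zero boundaries, returning an output of the same
// size as the input. Layout is column-major: index = i + ni*j + ni*nj*k.
// The kernel is ks x ks x ks with odd ks (hypothetical reference helper).
std::vector<double> conv3dSame(const std::vector<double>& in,
                               int ni, int nj, int nk,
                               const std::vector<double>& ker, int ks) {
    assert(ks % 2 == 1);
    assert(in.size() == static_cast<std::size_t>(ni) * nj * nk);
    assert(ker.size() == static_cast<std::size_t>(ks) * ks * ks);
    const int off = ks / 2;  // offset of the kernel centre
    std::vector<double> out(in.size(), 0.0);
    for (int k = 0; k < nk; ++k)
        for (int j = 0; j < nj; ++j)
            for (int i = 0; i < ni; ++i) {
                double sum = 0.0;
                for (int c = 0; c < ks; ++c)
                    for (int b = 0; b < ks; ++b)
                        for (int a = 0; a < ks; ++a) {
                            const int ii = i + off - a;
                            const int jj = j + off - b;
                            const int kk = k + off - c;
                            // Samples outside the input count as zero.
                            if (ii < 0 || ii >= ni || jj < 0 || jj >= nj ||
                                kk < 0 || kk >= nk) continue;
                            sum += ker[a + ks * b + ks * ks * c] *
                                   in[ii + ni * jj + ni * nj * kk];
                        }
                out[i + ni * j + ni * nj * k] = sum;
            }
    return out;
}
```

With a = ones( 10, 10, 10 ) and a 3×3×3 kernel of ones, the interior values are 27 and the corner values are 8, which gives a quick sanity check for any FFT-based result.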

Related

LAPACK function gets slower after first iteration

I am implementing an iterative algorithm that uses LAPACK for PSD Projections (doesn't really matter, the point is I'm calling this function over and over):
void useLAPACK(vector<double>& x, int N){
/* Locals */
int n = N, il, iu, m, lda = N, ldz = N, info, lwork, liwork;
double abstol;
double vl,vu;
int iwkopt;
int* iwork;
double wkopt;
double* work;
/* Local arrays */
int isuppz[N];
double w[N], z[N*N];
/* Negative abstol means using the default value */
abstol = -1.0;
/* Set vl, vu to select the eigenvalues in the half-open interval (vl, vu] */
vl = 0;
vu = 1.79769e+308;
/* Query and allocate the optimal workspace */
lwork = -1;
liwork = -1;
dsyevr_( (char*)"Vectors", (char*)"V", (char*)"Upper", &n, &x[0], &lda, &vl, &vu, &il, &iu,
&abstol, &m, w, z, &ldz, isuppz, &wkopt, &lwork, &iwkopt, &liwork,
&info );
lwork = (int)wkopt;
work = (double*)malloc( lwork*sizeof(double) );
liwork = iwkopt;
iwork = (int*)malloc( liwork*sizeof(int) );
/* Solve eigenproblem */
dsyevr_( (char*)"Vectors", (char*)"V", (char*)"Upper", &n, &x[0], &lda, &vl, &vu, &il, &iu,
&abstol, &m, w, z, &ldz, isuppz, work, &lwork, iwork, &liwork,
&info );
/* Check for convergence */
if( info > 0 ) {
printf( "The dsyevr (useLAPACK) failed to compute eigenvalues.\n" );
exit( 1 );
}
/* Print the number of eigenvalues found */
//printf( "\n The total number of eigenvalues found:%2i\n", m );
//print_matrix( "Selected eigenvalues", 1, m, w, 1 );
//print_matrix( "Selected eigenvectors (stored columnwise)", n, m, z, ldz );
//Eigenvectors are returned as stacked columns (in total m)
//Outer sum calculation is fastest.
for(int i = 0; i < N*N; ++i) x[i] = 0;
double lambda;
double vrow1,vrow2;
for(int col = 0; col < m; ++col) {
lambda = w[col];
for (int row1 = 0; row1 < N; ++row1) {
vrow1 = z[N*col+row1];
for(int row2 = 0; row2 < N; ++row2){
vrow2 = z[N*col+row2];
x[row1*N+row2] += lambda*vrow1*vrow2;
}
}
}
free( (void*)iwork );
free( (void*)work );
}
Now my time measurements show that the first call takes about 4ms, but then it increases to 100ms. Is there a good explanation for this in this code? x is the same vector every time.
I think I have figured out the problem. My algorithm starts with the zero matrix, and after the first iteration roughly half of the eigenvalues are positive and half are negative. With those arguments, dsyevr only computes the positive eigenvalues and their eigenvectors. I suppose that if all eigenvalues are zero it doesn't have to compute any eigenvectors, which makes the first call much faster. Thanks for all the answers and sorry about the missing information.
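The same effect shows up in the reconstruction loop of the question: its cost grows with the number m of eigenpairs returned, so a zero matrix (m = 0) skips it entirely. The sketch below isolates that loop without calling LAPACK. The function name rebuildFromEigenpairs and the returned update counter are additions for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Rebuild x = sum over the m selected eigenpairs of w[c] * v_c * v_c^T.
// Eigenvector c is stored in z[n*c .. n*c + n - 1] (stacked columns).
// Returns the number of multiply-add updates, to make the cost visible.
std::size_t rebuildFromEigenpairs(std::vector<double>& x,
                                  const std::vector<double>& w,
                                  const std::vector<double>& z,
                                  std::size_t n, std::size_t m) {
    assert(x.size() == n * n);
    assert(w.size() >= m && z.size() >= n * m);
    std::fill(x.begin(), x.end(), 0.0);
    std::size_t updates = 0;
    for (std::size_t col = 0; col < m; ++col) {
        const double lambda = w[col];
        for (std::size_t row1 = 0; row1 < n; ++row1)
            for (std::size_t row2 = 0; row2 < n; ++row2) {
                x[row1 * n + row2] += lambda * z[n * col + row1] * z[n * col + row2];
                ++updates;
            }
    }
    return updates;
}
```

With m = 0 the function performs no updates at all, while m close to N/2 costs about N³/2 updates, before even counting the extra work dsyevr itself has to do.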

ifft results are different from original signal

FFT works fine, but when I take the IFFT I always see the same graph in its results. The results are complex and the graph is always the same regardless of the original signal.
The real part is a -sin with a period equal to the frame size.
The imaginary part is a -cos with the same period.
Where could the problem be?
Original signal:
IFFT real part (the pictures show only half of a frame):
This is the FFT algorithm that I use:
double** FFT(double** f, int s, bool inverse) {
if (s == 1) return f;
int sH = s / 2;
double** fOdd = new double*[sH];
double** fEven = new double*[sH];
for (int i = 0; i < sH; i++) {
int j = 2 * i;
fOdd[i] = f[j];
fEven[i] = f[j + 1];
}
double** sOdd = FFT(fOdd, sH, inverse);
double** sEven = FFT(fEven, sH, inverse);
double**spectr = new double*[s];
double arg = inverse ? DoublePI / s : -DoublePI / s;
double*oBase = new double[2]{ cos(arg),sin(arg) };
double*o = new double[2]{ 1,0 };
for (int i = 0; i < sH; i++) {
double* sO1 = Mul(o, sOdd[i]);
spectr[i] = Sum(sEven[i], sO1);
spectr[i + sH] = Dif(sEven[i], sO1);
o = Mul(o, oBase);
}
return spectr;
}
The "butterfly" portion is applying the coefficients incorrectly. It should be:
for (int i = 0; i < sH; i++) {
double* sO1 = sOdd[i];
double* sE1 = Mul(o, sEven[i]);
spectr[i] = Sum(sO1, sE1);
spectr[i + sH] = Dif(sO1, sE1);
o = Mul(o, oBase);
}
Side Note:
I kept your notation but it makes things confusing:
fOdd has indexes 0, 2, 4, 6, ... so it should be fEven
fEven has indexes 1, 3, 5, 7, ... so it should be fOdd
really sOdd should be sLower and sEven should be sUpper since they correspond to the 0:s/2 and s/2:s-1 elements of the spectrum respectively:
sLower = FFT(fEven, sH, inverse); // fEven is 0, 2, 4, ...
sUpper = FFT(fOdd, sH, inverse); // fOdd is 1, 3, 5, ...
Then the butterfly becomes:
for (int i = 0; i < sH; i++) {
double* sL1 = sLower[i];
double* sU1 = Mul(o, sUpper[i]);
spectr[i] = Sum(sL1, sU1);
spectr[i + sH] = Dif(sL1, sU1);
o = Mul(o, oBase);
}
When written like this it is easier to compare to this pseudocode example on wikipedia.
And @Dai is correct: you are going to leak a lot of memory.
Regarding the memory, you can use the std::vector to encapsulate dynamically-allocated arrays and to ensure they're deallocated when execution leaves scope. You could use unique_ptr<double[]> but the performance gains are not worth it IMO and you lose the safety of the at() method.
(Based on @Robb's answer)
A few other tips:
Avoid cryptic identifiers - programs should be readable, and names like "f" and "s" make your program harder to read and maintain.
Type-based Hungarian notation is frowned upon as modern editors show type information automatically so it adds unnecessary complication to identifier names.
Use size_t for indexes, not int
The STL is your friend, use it!
Preemptively prevent bugs by using const to prevent accidental mutation of read-only data.
Like so:
#include <cmath>
#include <complex>
#include <vector>
using namespace std;
// Recursive radix-2 FFT. signal.size() must be a power of two.
// The inverse transform is not normalized: divide by the length afterwards.
vector<complex<double>> fastFourierTransform(const vector<complex<double>>& signal, const bool inverse) {
if( signal.size() < 2 ) return signal;
const size_t half = signal.size() / 2;
vector<complex<double>> lower; lower.reserve( half ); // samples 0, 2, 4, ...
vector<complex<double>> upper; upper.reserve( half ); // samples 1, 3, 5, ...
bool isEven = true;
for( size_t i = 0; i < signal.size(); i++ ) {
if( isEven ) lower.push_back( signal.at( i ) );
else upper.push_back( signal.at( i ) );
isEven = !isEven;
}
const vector<complex<double>> lowerFft = fastFourierTransform( lower, inverse );
const vector<complex<double>> upperFft = fastFourierTransform( upper, inverse );
vector<complex<double>> result( signal.size() ); // sized up front so that at() is valid
const double DoublePI = 2.0 * acos( -1.0 ); // assumed to be 2*pi, as in the question
const double arg = ( inverse ? 1 : -1 ) * ( DoublePI / signal.size() );
const complex<double> oBase = polar( 1.0, arg ); // cos(arg) + i*sin(arg)
complex<double> o( 1.0, 0.0 ); // running twiddle factor, starts at 1 (not 0)
for( size_t i = 0; i < half; i++ ) {
const complex<double> lower1 = lowerFft.at( i );
const complex<double> upper1 = o * upperFft.at( i );
result.at( i ) = lower1 + upper1;
result.at( i + half ) = lower1 - upper1;
o *= oBase;
}
// My knowledge of move-semantics of STL containers is a bit rusty - so there's probably a better way to return the output 'result' vector.
return result;
}

3D FFT Using Intel MKL with Zero Padding

I want to compute 3D FFT using Intel MKL of an array which has about 300×200×200 elements. This 3D array is stored as a 1D array of type double in a columnwise fashion:
for( int k = 0; k < nk; k++ ) // Loop through the height.
for( int j = 0; j < nj; j++ ) // Loop through the rows.
for( int i = 0; i < ni; i++ ) // Loop through the columns.
{
ijk = i + ni * j + ni * nj * k;
my3Darray[ ijk ] = 1.0;
}
I want to perform not-in-place FFT on the input array and prevent it from getting modified (I need to use it later in my code) and then do the backward computation in-place. I also want to have zero padding.
My questions are:
How can I perform the zero-padding?
How should I deal with the size of the arrays used by FFT functions when zero padding is included in the computation?
How can I take out the zero padded results and get the actual result?
Here is my attempt at the problem. I would be very thankful for any comment, suggestion, or hint.
#include <stdio.h>
#include "mkl.h"
int max(int a, int b, int c)
{
int m = a;
(m < b) && (m = b);
(m < c) && (m = c);
return m;
}
void FFT3D_R2C( // Real to Complex 3D FFT.
double *in, int nRowsIn , int nColsIn , int nHeightsIn ,
double *out )
{
int n = max( nRowsIn , nColsIn , nHeightsIn );
// Round up to the next highest power of 2.
unsigned int N = (unsigned int) n; // compute the next highest power of 2 of 32-bit n.
N--;
N |= N >> 1;
N |= N >> 2;
N |= N >> 4;
N |= N >> 8;
N |= N >> 16;
N++;
/* Strides describe data layout in real and conjugate-even domain. */
MKL_LONG rs[4], cs[4];
// DFTI descriptor.
DFTI_DESCRIPTOR_HANDLE fft_desc = 0;
// Variables needed for out-of-place computations.
MKL_Complex16 *in_fft = new MKL_Complex16 [ N*N*N ];
MKL_Complex16 *out_fft = new MKL_Complex16 [ N*N*N ];
double *out_ZeroPadded = new double [ N*N*N ];
/* Compute strides */
rs[3] = 1; cs[3] = 1;
rs[2] = (N/2+1)*2; cs[2] = (N/2+1);
rs[1] = N*(N/2+1)*2; cs[1] = N*(N/2+1);
rs[0] = 0; cs[0] = 0;
// Create DFTI descriptor.
MKL_LONG sizes[] = { N, N, N };
DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_REAL, 3, sizes );
// Configure DFTI descriptor.
DftiSetValue( fft_desc, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX );
DftiSetValue( fft_desc, DFTI_PLACEMENT, DFTI_NOT_INPLACE ); // Out-of-place transformation.
DftiSetValue( fft_desc, DFTI_INPUT_STRIDES , rs );
DftiSetValue( fft_desc, DFTI_OUTPUT_STRIDES , cs );
DftiCommitDescriptor( fft_desc );
DftiComputeForward ( fft_desc, in , in_fft );
// Change strides to compute backward transform.
DftiSetValue ( fft_desc, DFTI_INPUT_STRIDES , cs);
DftiSetValue ( fft_desc, DFTI_OUTPUT_STRIDES, rs);
DftiCommitDescriptor( fft_desc );
DftiComputeBackward ( fft_desc, out_fft, out_ZeroPadded );
// Printing the zero padded 3D FFT result.
for( long long i = 0; i < (long long)N*N*N; i++ )
printf("%f\n", out_ZeroPadded[i] );
/* I don't know how to take out the zero padded results and
save the actual result in the variable named "out" */
DftiFreeDescriptor ( &fft_desc );
delete[] in_fft;
delete[] out_ZeroPadded ;
}
int main()
{
int n = 10;
double *a = new double [n*n*n]; // This array is real.
double *afft = new double [n*n*n];
// Fill the array with some 'real' numbers.
for( int i = 0; i < n*n*n; i++ )
a[ i ] = 1.0;
// Calculate FFT.
FFT3D_R2C( a, n, n, n, afft );
printf("FFT results:\n");
for( int i = 0; i < n*n*n; i++ )
printf( "%15.8f\n", afft[i] );
delete[] a;
delete[] afft;
return 0;
}
Just a few hints:
Power-of-2 size
I don't like the way you are computing the size.
Let Nx, Ny, Nz be the size of the input matrix
and nx, ny, nz the size of the padded matrix:
for (nx=1;nx<Nx;nx<<=1);
for (ny=1;ny<Ny;ny<<=1);
for (nz=1;nz<Nz;nz<<=1);
Now zero-pad: memset the padded matrix to zero first, then copy the matrix lines into it.
Padding to N^3 instead of nx*ny*nz can result in big slowdowns
if nx, ny, nz are not close to each other.
Output is complex
If I get it right, a is the real input matrix
and afft is the complex output matrix,
so why not allocate the space for it correctly?
double *afft = new double [2*nx*ny*nz];
A complex number has a real and an imaginary part, so there are 2 values per number.
That also goes for the final print of the result,
and some "\r\n" after the lines would be good for viewing.
3D DFFT
I neither use nor know your DFFT library;
I use my own. Anyway, a 3D DFFT can be done with 1D DFFTs
if you do it line by line ... see this 2D DFCT by 1D DFFT.
In 3D it is the same, but you need to add one more pass and a different normalization constant.
This way you can have a single line buffer double lin[2*max(nx,ny,nz)];
and do the zero padding on the fly (so there is no need to have a bigger matrix in memory),
but that involves copying the lines for each 1D DFFT.
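The zero-pad-then-copy step, and the reverse step of taking the original-size block back out, might look roughly like this for the column-major layout used in the question. The function names zeroPad and cropCorner are invented for this sketch. It assumes the data sits in the low-index corner of the padded volume, which fits a plain forward/backward FFT round trip; a 'same'-style convolution result would need an offset crop instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy a column-major Nx x Ny x Nz volume into the low-index corner of a
// zero-filled nx x ny x nz volume (hypothetical helper).
std::vector<double> zeroPad(const std::vector<double>& in,
                            std::size_t Nx, std::size_t Ny, std::size_t Nz,
                            std::size_t nx, std::size_t ny, std::size_t nz) {
    assert(in.size() == Nx * Ny * Nz);
    assert(nx >= Nx && ny >= Ny && nz >= Nz);
    std::vector<double> out(nx * ny * nz, 0.0);
    for (std::size_t k = 0; k < Nz; ++k)
        for (std::size_t j = 0; j < Ny; ++j)
            for (std::size_t i = 0; i < Nx; ++i)
                out[i + nx * (j + ny * k)] = in[i + Nx * (j + Ny * k)];
    return out;
}

// Take the Nx x Ny x Nz low-index corner back out of the padded volume.
std::vector<double> cropCorner(const std::vector<double>& padded,
                               std::size_t nx, std::size_t ny, std::size_t nz,
                               std::size_t Nx, std::size_t Ny, std::size_t Nz) {
    assert(padded.size() == nx * ny * nz);
    assert(nx >= Nx && ny >= Ny && nz >= Nz);
    std::vector<double> out(Nx * Ny * Nz);
    for (std::size_t k = 0; k < Nz; ++k)
        for (std::size_t j = 0; j < Ny; ++j)
            for (std::size_t i = 0; i < Nx; ++i)
                out[i + Nx * (j + Ny * k)] = padded[i + nx * (j + ny * k)];
    return out;
}
```

The only thing that changes between the two layouts is the stride: Nx and Ny for the original array, nx and ny for the padded one.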

FFTW and OpenCV's C++ interface, real and imaginary part in Mat output

I'm trying to code a FFT/IFFT function with FFTW 3.3 and OpenCV 2.1 using the C++ interface. I've seen a lot of examples using the old OpenCV formats and I did a direct conversion, but something doesn't work.
The objective of my function is to return a Mat object with the real part and the imaginary part of the FFT, like OpenCV's default dft function does. Here is the code of the function. The program crashes with a memory problem in the lines that copy im_data to data_in.
Does somebody know what I am doing wrong? Thank you.
Mat fft_sr(Mat& I)
{
double *im_data;
double *realP_data;
double *imP_data;
fftw_complex *data_in;
fftw_complex *fft;
fftw_plan plan_f;
int width = I.cols;
int height = I.rows;
int step = I.step;
int i, j, k;
Mat realP=Mat::zeros(height,width,CV_64F); // Real Part FFT
Mat imP=Mat::zeros(height,width,CV_64F); // Imaginary Part FFT
im_data = ( double* ) I.data;
realP_data = ( double* ) realP.data;
imP_data = ( double* ) imP.data;
data_in = ( fftw_complex* )fftw_malloc( sizeof( fftw_complex ) * width * height );
fft = ( fftw_complex* )fftw_malloc( sizeof( fftw_complex ) * width * height );
// Problem Here
for( i = 0, k = 0 ; i < height ; i++ ) {
for( j = 0 ; j < width ; j++ ) {
data_in[k][0] = ( double )im_data[i * step + j];
data_in[k][1] = ( double )0.0;
k++;
}
}
plan_f = fftw_plan_dft_2d( height, width, data_in, fft, FFTW_FORWARD, FFTW_ESTIMATE );
fftw_execute( plan_f );
// Copy real and imaginary data
for( i = 0, k = 0 ; i < height ; i++ ) {
for( j = 0 ; j < width ; j++ ) {
realP_data[i * step + j] = ( double )fft[k][0];
imP_data[i * step + j] = ( double )fft[k][1];
k++;
}
}
Mat fft_I(I.size(),CV_64FC2);
Mat fftplanes[] = {Mat_<double>(realP), Mat_<double>(imP)};
merge(fftplanes, 2, fft_I);
fftw_destroy_plan(plan_f);
fftw_free(data_in);
fftw_free(fft);
return fft_I;
}
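You are using step incorrectly. It is a row stride in bytes, meant for indexing into Mat::data (a uchar*). Since you already cast Mat::data to double* when assigning it to im_data, you can index into im_data "normally" (assuming the Mat is continuous, i.e. its rows are not padded):
data_in[k][0] = im_data[i * width + j];
When using step, the correct way to index is to offset the byte pointer first and cast afterwards:
data_in[k][0] = *( ( double* )( I.data + i * step ) + j );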
Update:
Try to access your images row-wise. That way you avoid running into problems with stride/step, while still exploiting fast access:
for (int i = 0; i < I.rows; i++)
{
double* row = I.ptr<double>(i);
for (int j = 0; j < I.cols; j++)
{
// Do something with the current pixel.
double someValue = row[j];
}
}
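To make the byte-stride idea concrete without depending on OpenCV, here is a small sketch using a plain byte buffer as a stand-in for Mat::data. The function readElement and its parameter names are assumptions of this example, not OpenCV API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Read element (row, col) from a 2D array of doubles stored in a raw byte
// buffer whose rows start stepBytes apart. stepBytes may be larger than
// cols * sizeof(double) when rows are padded.
double readElement(const unsigned char* data, std::size_t stepBytes,
                   std::size_t row, std::size_t col) {
    assert(data != nullptr);
    double value;
    // The row offset is in bytes, the column offset is in elements.
    std::memcpy(&value, data + row * stepBytes + col * sizeof(double),
                sizeof(double));
    return value;
}
```

Mixing the two units, as in im_data[i * step + j] with a double pointer, multiplies the row offset by sizeof(double) a second time and runs off the end of the buffer.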
I know this is old, but when you are using FFTW you should initialize fftw_complex *data_in
only after creating the plan for the FFT. If I recall correctly, creating the plan may overwrite the contents of the
input and output arrays (this happens with FFTW_MEASURE; FFTW_ESTIMATE, as used above, leaves them untouched).
So allocate before the plan and initialize after!
The statement
im_data = ( double* ) I.data;
defines im_data as a pointer to double that points at the image data.
I think this requires I to be an image of double values (type CV_64F).

DFT algorithm and convolution. what is wrong?

#include <vector>
using std::vector;
#include <complex>
using std::complex;
using std::polar;
typedef complex<double> Complex;
#define Pi 3.14159265358979323846
// direct Fourier transform
vector<Complex> dF( const vector<Complex>& in )
{
const int N = in.size();
vector<Complex> out( N );
for (int k = 0; k < N; k++)
{
out[k] = Complex( 0.0, 0.0 );
for (int n = 0; n < N; n++)
{
out[k] += in[n] * polar<double>( 1.0, - 2 * Pi * k * n / N );
}
}
return out;
}
// inverse Fourier transform
vector<Complex> iF( const vector<Complex>& in )
{
const int N = in.size();
vector<Complex> out( N );
for (int k = 0; k < N; k++)
{
out[k] = Complex( 0.0, 0.0 );
for (int n = 0; n < N; n++)
{
out[k] += in[n] * polar<double>(1, 2 * Pi * k * n / N );
}
out[k] *= Complex( 1.0 / N , 0.0 );
}
return out;
}
Can anyone tell me what is wrong?
Maybe I don't understand the details of implementing this algorithm, but I can't find the problem.
Also, I need to calculate a convolution,
but I can't find a test example.
UPDATE
// convolution. I suppose that x0.size == x1.size
vector<Complex> convolution( const vector<Complex>& x0, const vector<Complex>& x1 )
{
const int N = x0.size();
vector<Complex> tmp( N );
for ( int i = 0; i < N; i++ )
{
tmp[i] = x0[i] * x1[i];
}
return iF( tmp );
}
I really don't know exactly what you're asking, but your DFT and IDFT algorithms look correct to me. Convolution can be performed using the DFT and IDFT via the circular convolution theorem, which basically states that f**g = IDFT(DFT(f) * DFT(g)), where ** is circular convolution and * is element-wise multiplication.
To compute linear convolution (non-circular) using the DFT, you must zero-pad each of the inputs so that the circular wrap-around only occurs for zero-valued samples and does not affect the output. Each input sequence needs to be zero padded to a length of N >= L+M-1 where L and M are the lengths of the input sequences. Then you perform circular convolution as shown above and the first L+M-1 samples are the linear convolution output (samples beyond this should be zero).
Note: Performing convolution with the DFT and IDFT algorithms you have shown is much less efficient than just computing it directly. The advantage only comes when using FFT and IFFT algorithms (O(N log N)) in place of the DFT and IDFT (O(N^2)).
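Putting the zero-padding recipe together, a minimal sketch of linear convolution through a naive DFT could look like the following. It deliberately uses the O(N^2) transform, so it is only meant for checking results on small inputs. The names dft and linearConvolution are chosen for this example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Naive O(N^2) DFT. sign = -1 gives the forward transform, sign = +1 the
// inverse transform (which is scaled by 1/N).
std::vector<Complex> dft(const std::vector<Complex>& in, int sign) {
    const std::size_t N = in.size();
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<Complex> out(N);
    for (std::size_t k = 0; k < N; ++k) {
        Complex sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double angle =
                sign * twoPi * static_cast<double>(k * n) / static_cast<double>(N);
            sum += in[n] * std::polar(1.0, angle);
        }
        out[k] = (sign > 0) ? sum / static_cast<double>(N) : sum;
    }
    return out;
}

// Linear convolution of f (length L) and g (length M): zero-pad both to
// N = L + M - 1, transform, multiply element-wise, transform back.
std::vector<double> linearConvolution(const std::vector<double>& f,
                                      const std::vector<double>& g) {
    assert(!f.empty() && !g.empty());
    const std::size_t N = f.size() + g.size() - 1;
    std::vector<Complex> F(N), G(N);  // value-initialized, so already zero padded
    for (std::size_t i = 0; i < f.size(); ++i) F[i] = f[i];
    for (std::size_t i = 0; i < g.size(); ++i) G[i] = g[i];
    F = dft(F, -1);
    G = dft(G, -1);
    for (std::size_t i = 0; i < N; ++i) F[i] *= G[i];
    const std::vector<Complex> c = dft(F, +1);
    std::vector<double> out(N);
    for (std::size_t i = 0; i < N; ++i) out[i] = c[i].real();
    return out;
}
```

For example, convolving {1, 2, 3} with {1, 1} should give {1, 3, 5, 3} up to rounding error, which matches the direct sum.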
Check FFTW library "for computing the discrete Fourier transform (DFT)" and its C# wrapper;) Maybe this too;)
Good luck!
The transforms look fine, but there's nothing in the program that is doing convolution.
UPDATE: the convolution code needs to forward transform the inputs first before the element-wise multiplication.