DFT algorithm and convolution. what is wrong? - c++

#include <vector>
using std::vector;
#include <complex>
using std::complex;
using std::polar;
typedef complex<double> Complex;
#define Pi 3.14159265358979323846
// direct Fourier transform
vector<Complex> dF( const vector<Complex>& in )
{
const int N = in.size();
vector<Complex> out( N );
for (int k = 0; k < N; k++)
{
out[k] = Complex( 0.0, 0.0 );
for (int n = 0; n < N; n++)
{
out[k] += in[n] * polar<double>( 1.0, - 2 * Pi * k * n / N );
}
}
return out;
}
// inverse Fourier transform
vector<Complex> iF( const vector<Complex>& in )
{
const int N = in.size();
vector<Complex> out( N );
for (int k = 0; k < N; k++)
{
out[k] = Complex( 0.0, 0.0 );
for (int n = 0; n < N; n++)
{
out[k] += in[n] * polar<double>(1, 2 * Pi * k * n / N );
}
out[k] *= Complex( 1.0 / N , 0.0 );
}
return out;
}
Can anyone tell me what is wrong?
Maybe I don't understand the details of implementing this algorithm, but I can't find the problem.
Also, I need to calculate a convolution,
but I can't find a test example.
UPDATE
// convolution. I suppose that x0.size == x1.size
vector<Complex> convolution( const vector<Complex>& x0, const vector<Complex>& x1)
{
const int N = x0.size();
vector<Complex> tmp( N );
for ( int i = 0; i < N; i++ )
{
tmp[i] = x0[i] * x1[i];
}
return iF( tmp );
}

I don't know exactly what you're asking, but your DFT and IDFT algorithms look correct to me. Convolution can be performed with the DFT and IDFT using the circular convolution theorem, which states that f**g = IDFT(DFT(f) * DFT(g)), where ** is circular convolution and * is element-wise multiplication.
To compute linear convolution (non-circular) using the DFT, you must zero-pad each of the inputs so that the circular wrap-around only occurs for zero-valued samples and does not affect the output. Each input sequence needs to be zero padded to a length of N >= L+M-1 where L and M are the lengths of the input sequences. Then you perform circular convolution as shown above and the first L+M-1 samples are the linear convolution output (samples beyond this should be zero).
Note: Performing convolution with the DFT and IDFT algorithms you have shown is much less efficient than just computing it directly. The advantage only comes when using FFT and IFFT algorithms (O(N log N)) in place of the DFT and IDFT (O(N^2)).
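The zero-padding recipe above can be sketched as follows. This is only an illustration, not the asker's code: the names dft and linear_convolution are made up here, and the naive O(N^2) transform is used just to keep the example self-contained.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Naive O(N^2) DFT. sign = -1 gives the forward transform,
// sign = +1 gives the inverse transform without the 1/N factor.
std::vector<Complex> dft(const std::vector<Complex>& in, int sign)
{
    const double pi = 3.14159265358979323846;
    const std::size_t N = in.size();
    std::vector<Complex> out(N);
    for (std::size_t k = 0; k < N; ++k) {
        Complex sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double angle = sign * 2.0 * pi * double(k) * double(n) / double(N);
            sum += in[n] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// Linear convolution of f (length L) and g (length M) through the DFT.
std::vector<double> linear_convolution(const std::vector<double>& f,
                                       const std::vector<double>& g)
{
    if (f.empty() || g.empty()) return {};
    const std::size_t N = f.size() + g.size() - 1;  // N >= L + M - 1
    std::vector<Complex> F(N), G(N);                // zero-initialized, so already padded
    for (std::size_t i = 0; i < f.size(); ++i) F[i] = f[i];
    for (std::size_t i = 0; i < g.size(); ++i) G[i] = g[i];

    F = dft(F, -1);                                 // forward-transform both inputs first
    G = dft(G, -1);
    for (std::size_t i = 0; i < N; ++i) F[i] *= G[i];

    const std::vector<Complex> y = dft(F, +1);      // inverse, still missing 1/N
    std::vector<double> out(N);
    for (std::size_t i = 0; i < N; ++i) out[i] = y[i].real() / double(N);
    return out;
}
```

For example, convolving {1, 2, 3} with {1, 1} this way should give {1, 3, 5, 3} up to rounding, which is easy to check by hand.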

Check FFTW library "for computing the discrete Fourier transform (DFT)" and its C# wrapper;) Maybe this too;)
Good luck!

The transforms look fine, but there's nothing in the program that is doing convolution.
UPDATE: the convolution code needs to forward transform the inputs first before the element-wise multiplication.

Related

What is wrong with my 2D Array Gaussian Blur function in C++?

I am making a simple Gaussian blur function for a 2D array that is supposed to represent an image. The function just prints out the array values at the end (no actual image processing going on here). I was pretty sure that I had implemented everything correctly, but the values I am getting for (N=3, sigma=1.5) are much lower than expected based on this calculator: http://dev.theomader.com/gaussian-kernel-calculator/
I am following this equation:
void gaussian_filter(int N, double sigma) {
double k[N][N];
for(int i=0; i<N; i++) { //Initialize kernel to 0
for(int j=0; j<N; j++) {
k[i][j] = 0;
}
}
double sum = 0.0; //There is an issue somewhere in this block of code
int change = (N/2);
double r, s = change * sigma * sigma;
for (int x = -change; x <= change; x++) {
for(int y = -change; y <= change; y++) {
r = sqrt(x*x + y*y);
k[x + change][y + change] = (exp(-(r*r)/s))/(M_PI * s);
sum += k[x + change][y + change];
}
}
for(int i = 0; i < N; ++i) { //Normalize
for(int j = 0; j < N; ++j) {
k[i][j] /= sum;
}
}
for(int i = 0; i < N; ++i) { //Print out array
for (int j = 0; j < N; ++j) {
cout<<k[i][j]<<"\t";
}
cout<<endl;
}
}
Here is the expected output for N=3 and Sigma=1.5
Here is the current broken output for N=3 and Sigma=1.5
Why does s depend on change? I think you should do:
double r, s = 2 * sigma * sigma;
// instead of
// double r, s = change * sigma * sigma;
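A minimal sketch of the kernel with that change applied, assuming an odd N. The function name gaussian_kernel and the use of std::vector instead of a variable-length array are choices made for this example only. The constant factor 1/(pi*s) is dropped because it cancels in the normalization step.

```cpp
#include <cmath>
#include <vector>

// Normalized N x N Gaussian kernel sampled at integer offsets (N assumed odd).
std::vector<std::vector<double>> gaussian_kernel(int N, double sigma)
{
    const int half = N / 2;
    const double s = 2.0 * sigma * sigma;   // does not depend on the kernel size
    std::vector<std::vector<double>> k(N, std::vector<double>(N, 0.0));
    double sum = 0.0;
    for (int x = -half; x <= half; ++x) {
        for (int y = -half; y <= half; ++y) {
            const double w = std::exp(-(x * x + y * y) / s);
            k[x + half][y + half] = w;
            sum += w;
        }
    }
    for (auto& row : k) {                   // normalize so the weights sum to 1
        for (double& v : row) v /= sum;
    }
    return k;
}
```

The returned weights sum to 1 and are symmetric about the centre tap, which is a quick sanity check before comparing against any reference values.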
That website computes Gaussian kernels in an unorthodox manner:
The weights are calculated by numerical integration of the continuous gaussian distribution over each discrete kernel tap.
That is, it samples a continuous Gaussian kernel that has been convolved with a uniform (“box”) filter of 1 pixel wide. The resulting Gaussian is wider than advertised. I advise against this method.
The proper way to create a Gaussian kernel is to just sample the Gaussian function at given integer locations, for example x = [-3, -2, -1, 0, 1, 2, 3].
Do note that a 3-pixel kernel is not wide enough to represent a Gaussian. It is important to sample the tail of the curve; without it, the kernel doesn’t have the good properties of the Gaussian kernel. I recommend sampling up to 3 sigma to each side, leading to 2*ceil(3*sigma)+1 pixels. 2 sigma is the bare minimum, useful only when speed is more important than good results.
Do also note that the Gaussian is separable: you can apply two 1D kernels in succession, rather than a single 2D kernel. For the 11x11 kernel you get for sigma=1.5 with the formula above, this translates to 11+11=22 multiplications and additions per pixel, compared to 11x11=121 for the 2D kernel. This is a significant saving!
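As a rough sketch of the sampling advice above, a 1D kernel with 2*ceil(3*sigma)+1 taps could be built like this. The name gaussian_kernel_1d is hypothetical; applying the result once along rows and once along columns is what gives the separable 2D blur.

```cpp
#include <cmath>
#include <vector>

// 1D Gaussian sampled at integer positions out to 3 sigma on each side,
// normalized so the taps sum to 1.
std::vector<double> gaussian_kernel_1d(double sigma)
{
    const int half = static_cast<int>(std::ceil(3.0 * sigma));
    std::vector<double> k(2 * half + 1);
    double sum = 0.0;
    for (int x = -half; x <= half; ++x) {
        const double w = std::exp(-(x * x) / (2.0 * sigma * sigma));
        k[x + half] = w;
        sum += w;
    }
    for (double& v : k) v /= sum;
    return k;
}
```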

What is the fastest way to calculate determinant?

I'm writing a library where I want to have some basic NxN matrix functionality that doesn't have any dependencies, and it is a bit of a learning project. I'm comparing my performance to Eigen. I've been able to be pretty equal and even beat its performance on a couple of fronts with SSE2, and with AVX2 beat it on quite a few fronts (it only uses SSE2, so not super surprising).
My issue is I'm using Gaussian elimination to create an upper triangular matrix, then multiplying the diagonal to get the determinant. I beat Eigen for N < 300, but after that Eigen blows me away and it just gets worse as the matrices get bigger. Given all the memory is accessed sequentially and the compiler disassembly doesn't look terrible, I don't think it is an optimization issue.
There is more optimization that can be done but the timings look much more like an algorithmic timing complexity issue or there is a major SSE advantage I'm not seeing. Simply unrolling the loops a bit hasn't done much for me when trying that.
Is there a better algorithm for calculating determinants?
Scalar code
/*
Warning: Creates Temporaries!
*/
template<typename T, int ROW, int COLUMN> MML_INLINE T matrix<T, ROW, COLUMN>::determinant(void) const
{
/*
This method assumes square matrix
*/
assert(row() == col());
/*
We need to create a temporary
*/
matrix<T, ROW, COLUMN> temp(*this);
/*We convert the temporary to upper triangular form*/
uint N = row();
T det = T(1);
for (uint c = 0; c < N; ++c)
{
det = det*temp(c,c);
for (uint r = c + 1; r < N; ++r)
{
T ratio = temp(r, c) / temp(c, c);
for (uint k = c; k < N; k++)
{
temp(r, k) = temp(r, k) - ratio * temp(c, k);
}
}
}
return det;
}
AVX2
template<> float matrix<float>::determinant(void) const
{
/*
This method assumes square matrix
*/
assert(row() == col());
/*
We need to create a temporary
*/
matrix<float> temp(*this);
/*We convert the temporary to upper triangular form*/
float det = 1.0f;
const uint N = row();
const uint Nm8 = N - 8;
const uint Nm4 = N - 4;
uint c = 0;
for (; c < Nm8; ++c)
{
det *= temp(c, c);
float8 Diagonal = _mm256_set1_ps(temp(c, c));
for (uint r = c + 1; r < N;++r)
{
float8 ratio1 = _mm256_div_ps(_mm256_set1_ps(temp(r,c)), Diagonal);
uint k = c + 1;
for (; k < Nm8; k += 8)
{
float8 ref = _mm256_loadu_ps(temp._v + c*N + k);
float8 r0 = _mm256_loadu_ps(temp._v + r*N + k);
_mm256_storeu_ps(temp._v + r*N + k, _mm256_fnmadd_ps(ratio1, ref, r0)); // r0 - ratio1*ref, matching the scalar and SSE paths
}
/*We go Scalar for the last few elements to handle non-multiples of 8*/
for (; k < N; ++k)
{
_mm_store_ss(temp._v + index(r, k), _mm_sub_ss(_mm_set_ss(temp(r, k)), _mm_mul_ss(_mm256_castps256_ps128(ratio1),_mm_set_ss(temp(c, k)))));
}
}
}
for (; c < Nm4; ++c)
{
det *= temp(c, c);
float4 Diagonal = _mm_set1_ps(temp(c, c));
for (uint r = c + 1; r < N; ++r)
{
float4 ratio = _mm_div_ps(_mm_set1_ps(temp[r*N + c]), Diagonal);
uint k = c + 1;
for (; k < Nm4; k += 4)
{
float4 ref = _mm_loadu_ps(temp._v + c*N + k);
float4 r0 = _mm_loadu_ps(temp._v + r*N + k);
_mm_storeu_ps(temp._v + r*N + k, _mm_sub_ps(r0, _mm_mul_ps(ref, ratio)));
}
float fratio = _mm_cvtss_f32(ratio);
for (; k < N; ++k)
{
temp(r, k) = temp(r, k) - fratio*temp(c, k);
}
}
}
for (; c < N; ++c)
{
det *= temp(c, c);
float Diagonal = temp(c, c);
for (uint r = c + 1; r < N; ++r)
{
float ratio = temp[r*N + c] / Diagonal;
for (uint k = c+1; k < N;++k)
{
temp(r, k) = temp(r, k) - ratio*temp(c, k);
}
}
}
return det;
}
Algorithms to reduce an n by n matrix to upper (or lower) triangular form by Gaussian elimination generally have complexity of O(n^3) (where ^ represents "to power of").
There are alternative approaches for computing the determinant, such as evaluating the set of eigenvalues (the determinant of a square matrix is equal to the product of its eigenvalues). For general matrices, computation of the complete set of eigenvalues is also - practically - O(n^3).
In theory, however, calculation of the set of eigenvalues has complexity of n^w where w is between 2 and 2.376 - which means for (much) larger matrices it will be faster than using Gaussian elimination. Have a look at an article "Fast linear algebra is stable" by James Demmel, Ioana Dumitriu, and Olga Holtz in Numerische Mathematik, Volume 108, Issue 1, pp. 59-91, November 2007. If Eigen uses an approach with complexity less than O(n^3) for larger matrices (I don't know, never having had reason to investigate such things) that would explain your observations.
The answer seems to be that most places use block LU factorization to create a lower-triangular and an upper-triangular matrix in the same memory space. It is ~O(n^2.5), depending on the block size you use.
Here is a power point from Rice University that explains the algorithm.
www.caam.rice.edu/~timwar/MA471F03/Lecture24.ppt
Division by a matrix means multiplication by its inverse.
The idea seems to be to increase the number of n^2 operations significantly but reduce the number of m^3 operations, which in effect lowers the complexity of the algorithm since m is of a fixed small size.
Going to take a little bit to write this up in an efficient manner since to do it efficiently requires 'in place' algorithms I don't have written yet.
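Separately from blocking, note that the elimination loops in the question divide by temp(c, c) without any pivoting, so a zero or tiny diagonal entry breaks the result. Below is a minimal unblocked sketch with partial pivoting, assuming a row-major std::vector<double> rather than the question's matrix class. The function name determinant and its signature are made up for this example, and it is still a plain O(n^3) elimination.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Determinant of an n x n row-major matrix by Gaussian elimination with
// partial pivoting. The matrix is taken by value because it is overwritten.
double determinant(std::vector<double> a, std::size_t n)
{
    double det = 1.0;
    for (std::size_t c = 0; c < n; ++c) {
        // Pick the row with the largest absolute value in column c.
        std::size_t p = c;
        for (std::size_t r = c + 1; r < n; ++r) {
            if (std::fabs(a[r * n + c]) > std::fabs(a[p * n + c])) p = r;
        }
        if (a[p * n + c] == 0.0) return 0.0;    // whole column is zero: singular
        if (p != c) {
            for (std::size_t k = 0; k < n; ++k) std::swap(a[p * n + k], a[c * n + k]);
            det = -det;                         // a row swap flips the sign
        }
        const double pivot = a[c * n + c];
        det *= pivot;
        for (std::size_t r = c + 1; r < n; ++r) {
            const double ratio = a[r * n + c] / pivot;
            for (std::size_t k = c + 1; k < n; ++k) {
                a[r * n + k] -= ratio * a[c * n + k];
            }
        }
    }
    return det;
}
```

A version like this is mainly useful as a correctness reference when testing a faster blocked or vectorized implementation.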

Discrete Fourier Transform implementation gives different result than OpenCV DFT

We have implemented a DFT and wanted to test it against OpenCV's implementation. The results are different.
Our DFT's results are in order from smallest to biggest, whereas OpenCV's results are not in any order.
The first (0th) value is the same for both calculations, as in this case the imaginary part is 0 (since e^0 = 1 in the formula). The other values are different; for example, OpenCV's results contain negative values, whereas ours do not.
This is our implementation of DFT:
// complex number
std::complex<float> j;
j = -1;
j = std::sqrt(j);
std::complex<float> result;
std::vector<std::complex<float>> fourier; // output
// this->N = length of contour, 512 in our case
// foreach fourier descriptor
for (int n = 0; n < this->N; ++n)
{
// Summation in formula
for (int t = 0; t < this->N; ++t)
{
result += (this->centroidDistance[t] * std::exp((-j*PI2 *((float)n)*((float)t)) / ((float)N)));
}
fourier.push_back((1.0f / this->N) * result);
}
and this is how we calculate the DFT with OpenCV:
std::vector<std::complex<float>> fourierCV; // output
cv::dft(std::vector<float>(centroidDistance, centroidDistance + this->N), fourierCV, cv::DFT_SCALE | cv::DFT_COMPLEX_OUTPUT);
The variable centroidDistance is calculated in a previous step.
Note: please avoid answers saying use OpenCV instead of your own implementation.
You forgot to initialise result for each iteration of n:
for (int n = 0; n < this->N; ++n)
{
result = 0.0f; // initialise `result` to 0 here <<<
// Summation in formula
for (int t = 0; t < this->N; ++t)
{
result += (this->centroidDistance[t] * std::exp((-j*PI2 *((float)n)*((float)t)) / ((float)N)));
}
fourier.push_back((1.0f / this->N) * result);
}
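For reference, here is a self-contained sketch of the same scaled DFT with the accumulator declared inside the outer loop, so it cannot carry over between output bins. The function name scaled_dft is hypothetical, and reducing n*t modulo N before converting to float is an extra precaution added here (not part of the original code) to keep the phase argument small in single precision.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// DFT of a real sequence, scaled by 1/N.
std::vector<std::complex<float>> scaled_dft(const std::vector<float>& x)
{
    const float two_pi = 6.2831853071795864769f;
    const std::size_t N = x.size();
    std::vector<std::complex<float>> out;
    out.reserve(N);
    for (std::size_t n = 0; n < N; ++n) {
        std::complex<float> result(0.0f, 0.0f);   // fresh accumulator for every n
        for (std::size_t t = 0; t < N; ++t) {
            const std::size_t m = (n * t) % N;    // the phase is periodic in N
            const float angle = -two_pi * static_cast<float>(m) / static_cast<float>(N);
            result += x[t] * std::polar(1.0f, angle);
        }
        out.push_back(result / static_cast<float>(N));
    }
    return out;
}
```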

3D FFT Using Intel MKL with Zero Padding

I want to compute 3D FFT using Intel MKL of an array which has about 300×200×200 elements. This 3D array is stored as a 1D array of type double in a columnwise fashion:
for( int k = 0; k < nk; k++ ) // Loop through the height.
for( int j = 0; j < nj; j++ ) // Loop through the rows.
for( int i = 0; i < ni; i++ ) // Loop through the columns.
{
ijk = i + ni * j + ni * nj * k;
my3Darray[ ijk ] = 1.0;
}
I want to perform not-in-place FFT on the input array and prevent it from getting modified (I need to use it later in my code) and then do the backward computation in-place. I also want to have zero padding.
My questions are:
How can I perform the zero-padding?
How should I deal with the size of the arrays used by FFT functions when zero padding is included in the computation?
How can I take out the zero padded results and get the actual result?
Here is my attempt at the problem. I would be thankful for any comment, suggestion, or hint.
#include <stdio.h>
#include "mkl.h"
int max(int a, int b, int c)
{
int m = a;
(m < b) && (m = b);
(m < c) && (m = c);
return m;
}
void FFT3D_R2C( // Real to Complex 3D FFT.
double *in, int nRowsIn , int nColsIn , int nHeightsIn ,
double *out )
{
int n = max( nRowsIn , nColsIn , nHeightsIn );
// Round up to the next highest power of 2.
unsigned int N = (unsigned int) n; // compute the next highest power of 2 of 32-bit n.
N--;
N |= N >> 1;
N |= N >> 2;
N |= N >> 4;
N |= N >> 8;
N |= N >> 16;
N++;
/* Strides describe data layout in real and conjugate-even domain. */
MKL_LONG rs[4], cs[4];
// DFTI descriptor.
DFTI_DESCRIPTOR_HANDLE fft_desc = 0;
// Variables needed for out-of-place computations.
MKL_Complex16 *in_fft = new MKL_Complex16 [ N*N*N ];
MKL_Complex16 *out_fft = new MKL_Complex16 [ N*N*N ];
double *out_ZeroPadded = new double [ N*N*N ];
/* Compute strides */
rs[3] = 1; cs[3] = 1;
rs[2] = (N/2+1)*2; cs[2] = (N/2+1);
rs[1] = N*(N/2+1)*2; cs[1] = N*(N/2+1);
rs[0] = 0; cs[0] = 0;
// Create DFTI descriptor.
MKL_LONG sizes[] = { N, N, N };
DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_REAL, 3, sizes );
// Configure DFTI descriptor.
DftiSetValue( fft_desc, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX );
DftiSetValue( fft_desc, DFTI_PLACEMENT, DFTI_NOT_INPLACE ); // Out-of-place transformation.
DftiSetValue( fft_desc, DFTI_INPUT_STRIDES , rs );
DftiSetValue( fft_desc, DFTI_OUTPUT_STRIDES , cs );
DftiCommitDescriptor( fft_desc );
DftiComputeForward ( fft_desc, in , in_fft );
// Change strides to compute backward transform.
DftiSetValue ( fft_desc, DFTI_INPUT_STRIDES , cs);
DftiSetValue ( fft_desc, DFTI_OUTPUT_STRIDES, rs);
DftiCommitDescriptor( fft_desc );
DftiComputeBackward ( fft_desc, out_fft, out_ZeroPadded );
// Printing the zero padded 3D FFT result.
for( long long i = 0; i < (long long)N*N*N; i++ )
printf("%f\n", out_ZeroPadded[i] );
/* I don't know how to take out the zero padded results and
save the actual result in the variable named "out" */
DftiFreeDescriptor ( &fft_desc );
delete[] in_fft;
delete[] out_ZeroPadded ;
}
int main()
{
int n = 10;
double *a = new double [n*n*n]; // This array is real.
double *afft = new double [n*n*n];
// Fill the array with some 'real' numbers.
for( int i = 0; i < n*n*n; i++ )
a[ i ] = 1.0;
// Calculate FFT.
FFT3D_R2C( a, n, n, n, afft );
printf("FFT results:\n");
for( int i = 0; i < n*n*n; i++ )
printf( "%15.8f\n", afft[i] );
delete[] a;
delete[] afft;
return 0;
}
Just a few hints:
Power of 2 size
I don't like the way you are computing the size.
Let Nx, Ny, Nz be the size of the input matrix
and nx, ny, nz the size of the padded matrix:
for (nx=1;nx<Nx;nx<<=1);
for (ny=1;ny<Ny;ny<<=1);
for (nz=1;nz<Nz;nz<<=1);
Now zero-pad: memset the padded matrix to zero first, then copy the matrix lines into it.
Padding to N^3 instead of nx*ny*nz can result in big slowdowns
if nx, ny, nz are not close to each other.
Output is complex
If I get it right, a is the real input matrix
and afft is the complex output matrix,
so why not allocate the space for it correctly?
double *afft = new double [2*nx*ny*nz];
A complex number has a real and an imaginary part, so there are 2 values per number.
That also goes for the final print of the result,
and some "\r\n" after lines would be good for viewing.
3D DFFT
I do not use or know your DFFT library;
I use my own. Anyway, a 3D DFFT can be done with 1D DFFTs
if you do it line by line (see this 2D DFCT by 1D DFFT).
In 3D it is the same, but you need to add one more pass and a different normalization constant.
This way you can have a single line buffer double lin[2*max(nx,ny,nz)];
and do the zero padding on the fly (so there is no need to have the bigger matrix in memory),
but that involves copying the lines on each 1D DFFT.
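The padding and un-padding steps from the first hint can be sketched in plain C++ without touching the FFT library. The helper names next_pow2, zero_pad_3d and crop_3d are made up for this example; the layout assumed is the one from the question (index i + ni*j + ni*nj*k, with i varying fastest).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Smallest power of two that is >= n (returns 1 for n <= 1).
std::size_t next_pow2(std::size_t n)
{
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Copy an ni x nj x nk array into a zero-filled mi x mj x mk array, line by line.
std::vector<double> zero_pad_3d(const std::vector<double>& in,
                                std::size_t ni, std::size_t nj, std::size_t nk,
                                std::size_t mi, std::size_t mj, std::size_t mk)
{
    std::vector<double> out(mi * mj * mk, 0.0);
    for (std::size_t k = 0; k < nk; ++k)
        for (std::size_t j = 0; j < nj; ++j)
            std::copy_n(in.data() + ni * (j + nj * k), ni,
                        out.data() + mi * (j + mj * k));
    return out;
}

// Take the leading ni x nj x nk block back out of a padded array whose
// first two dimensions are mi and mj.
std::vector<double> crop_3d(const std::vector<double>& padded,
                            std::size_t mi, std::size_t mj,
                            std::size_t ni, std::size_t nj, std::size_t nk)
{
    std::vector<double> out(ni * nj * nk);
    for (std::size_t k = 0; k < nk; ++k)
        for (std::size_t j = 0; j < nj; ++j)
            std::copy_n(padded.data() + mi * (j + mj * k), ni,
                        out.data() + ni * (j + nj * k));
    return out;
}
```

With per-dimension sizes mi = next_pow2(ni) and so on, a 300×200×200 input pads to 512×256×256 rather than 512×512×512.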

3D Convolution with Intel MKL

I have written a C/C++ code which uses Intel MKL to compute the 3D convolution of an array which has about 300×200×200 elements. I want to apply a kernel which is either 3×3×3 or 5×5×5. Both the 3D input array and the kernel have real values.
This 3D array is stored as a 1D array of type double in a columnwise fashion. Similarly the kernel is of type double and is saved columnwise. For example,
for( int k = 0; k < nk; k++ ) // Loop through the height.
for( int j = 0; j < nj; j++ ) // Loop through the rows.
for( int i = 0; i < ni; i++ ) // Loop through the columns.
{
ijk = i + ni * j + ni * nj * k;
my3Darray[ ijk ] = 1.0;
}
For the computation of convolution, I want to perform not-in-place FFT on the input array and the kernel and prevent them from getting modified (I need to use them later in my code) and then do the backward computation in-place.
When I compare the result obtained from my code with the one obtained by MATLAB they are very different. Could someone kindly help me fix the issue? What is missing in my code?
Here is the MATLAB code I used:
a = ones( 10, 10, 10 );
kernel = ones( 3, 3, 3 );
aconvolved = convn( a, kernel, 'same' );
Here is my C/C++ code:
#include <stdio.h>
#include "mkl.h"
void Conv3D(
double *in, double *ker, double *out,
int nRows, int nCols, int nHeights)
{
int NI = nRows;
int NJ = nCols;
int NK = nHeights;
double *in_fft = new double [NI*NJ*NK];
double *ker_fft = new double [NI*NJ*NK];
DFTI_DESCRIPTOR_HANDLE fft_desc = 0;
MKL_LONG sizes[] = { NK, NJ, NI };
MKL_LONG strides[] = { 0, NJ*NI, NI, 1 };
DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_REAL, 3, sizes );
DftiSetValue ( fft_desc, DFTI_PLACEMENT , DFTI_NOT_INPLACE); // Out-of-place computation.
DftiSetValue ( fft_desc, DFTI_INPUT_STRIDES , strides );
DftiSetValue ( fft_desc, DFTI_OUTPUT_STRIDES, strides );
DftiSetValue ( fft_desc, DFTI_BACKWARD_SCALE, 1/NI/NJ/NK );
DftiCommitDescriptor( fft_desc );
DftiComputeForward ( fft_desc, in , in_fft );
DftiComputeForward ( fft_desc, ker, ker_fft );
for (long long i = 0; i < (long long)NI*NJ*NK; ++i )
out[i] = in_fft[i]*ker_fft[i];
// In-place computation.
DftiSetValue ( fft_desc, DFTI_PLACEMENT, DFTI_INPLACE );
DftiCommitDescriptor( fft_desc );
DftiComputeBackward ( fft_desc, out );
DftiFreeDescriptor ( &fft_desc );
delete[] in_fft;
delete[] ker_fft;
}
int main(int argc, char* argv[])
{
int n = 10;
int nkernel = 3;
double *a = new double [n*n*n]; // This array is real.
double *aconvolved = new double [n*n*n]; // The convolved array is also real.
double *kernel = new double [nkernel*nkernel*nkernel]; // kernel is real.
// Fill the array with some 'real' numbers.
for( int i = 0; i < n*n*n; i++ )
a[ i ] = 1.0;
// Fill the kernel with some 'real' numbers.
for( int i = 0; i < nkernel*nkernel*nkernel; i++ )
kernel[ i ] = 1.0;
// Calculate the convolution.
Conv3D( a, kernel, aconvolved, n, n, n );
printf("Convolved:\n");
for( int i = 0; i < n*n*n; i++ )
printf( "%15.8f\n", aconvolved[i] );
delete[] a;
delete[] kernel;
delete[] aconvolved;
return 0;
}
You can't reverse the FFT with real-valued frequency data (just the magnitude). A forward FFT needs to output complex data. This is done by setting the DFTI_FORWARD_DOMAIN setting to DFTI_COMPLEX.
DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_COMPLEX, 3, sizes );
Doing this implicitly sets the backward domain to complex too.
You will also need a complex data type. Probably something like,
MKL_Complex16* in_fft = new MKL_Complex16[NI*NJ*NK];
This means you will have to do a full complex multiplication, which mixes the real and imaginary parts:
for (size_t i = 0; i < (size_t)NI*NJ*NK; ++i) {
out_fft[i].real = in_fft[i].real * ker_fft[i].real - in_fft[i].imag * ker_fft[i].imag;
out_fft[i].imag = in_fft[i].real * ker_fft[i].imag + in_fft[i].imag * ker_fft[i].real;
}
The output of the inverse FFT is also complex, and assuming your input data is real, you can just grab the .real component and that is your result. This means you'll need a temporary complex output array (say, out_fft as above).
Also note that to avoid artifacts, you want the size of your fft to be (at least) M+N-1 on each dimension. Generally you would choose the next highest power of two for speed.
I strongly suggest you implement it in MATLAB first, using FFTs. There are many such implementations available (example), but I would start from the basics and make a simple function on your own.
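When debugging an FFT-based version, it can help to have a slow direct convolution to compare against. The sketch below is one such reference, assuming a cubic kernel with odd side length kn, the i-fastest layout from the question, and the central-part cropping that the 'same' option describes. The name conv3d_same is hypothetical and nothing here uses MKL.

```cpp
#include <cstddef>
#include <vector>

// Direct 3D convolution returning an output of the same size as the input.
// Samples outside the input are treated as zero.
std::vector<double> conv3d_same(const std::vector<double>& in,
                                int ni, int nj, int nk,
                                const std::vector<double>& ker, int kn)
{
    const int h = kn / 2;   // kernel half-width (kn assumed odd)
    std::vector<double> out(static_cast<std::size_t>(ni) * nj * nk, 0.0);
    for (int k = 0; k < nk; ++k)
    for (int j = 0; j < nj; ++j)
    for (int i = 0; i < ni; ++i) {
        double acc = 0.0;
        for (int c = 0; c < kn; ++c)
        for (int b = 0; b < kn; ++b)
        for (int a = 0; a < kn; ++a) {
            // Convolution flips the kernel: input index = output + h - kernel index.
            const int ii = i + h - a;
            const int jj = j + h - b;
            const int kk = k + h - c;
            if (ii < 0 || ii >= ni || jj < 0 || jj >= nj || kk < 0 || kk >= nk) continue;
            acc += in[static_cast<std::size_t>(ii + ni * (jj + nj * kk))]
                 * ker[static_cast<std::size_t>(a + kn * (b + kn * c))];
        }
        out[static_cast<std::size_t>(i + ni * (j + nj * k))] = acc;
    }
    return out;
}
```

For the all-ones 10×10×10 input and all-ones 3×3×3 kernel from the question, this gives 27 in the interior and 8 at the corners, since only a 2×2×2 part of the kernel overlaps the data there.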