Making C++ Eigen LU faster (my tests show 2x slower than GSL)

I'm comparing the LU decomposition/solve of Eigen to GSL, and find Eigen to be on the order of 2x slower with -O3 optimizations on g++/OSX. I isolated the timing of the decompose and the solve, but find both to be substantially slower than their GSL counterparts. Am I doing something silly, or does Eigen not perform well for this use case (e.g., very small systems)? Built with Eigen 3.2.8 and an older GSL 1.15. The test case is very contrived, but mirrors the results in some nonlinear-fitting software I'm writing: Eigen is anywhere from 1.5x to 2x+ slower for the total LU/solve operation.
#define NDEBUG
#include <stdio.h>
#include <string.h>
#include "sys/time.h"
#include "gsl/gsl_linalg.h"
#include <Eigen/LU>

// Ax=b is a 3x3 system for which the solution is x=[8,2,3]
double avals_col[9] = { 4, 2, 2, 7, 5, 5, 7, 5, 9 }; // col major
double avals_row[9] = { 4, 7, 7, 2, 5, 5, 2, 5, 9 }; // row major
double bvals[3]     = { 67, 41, 53 };

//----------- helpers
void print_solution( double *x, int dim, const char *which ) {
    printf( "%s solve for x:\n", which );
    for( int j = 0; j < dim; j++ ) {
        printf( "%g ", x[j] );
    }
    printf( "\n" );
}

struct timeval tv;
struct timezone tz;

double timeNow() {
    gettimeofday( &tv, &tz );
    int _mils = tv.tv_usec / 1000;
    int _secs = tv.tv_sec;
    return (double)_secs + ((double)_mils / 1000.0);
}

//-----------
void run_gsl( double *A, double *b, double *x, int dim, int reps ) {
    gsl_matrix_view gslA;
    gsl_vector_view gslB;
    gsl_vector_view gslX;
    gsl_permutation *gslP;
    int sign;

    gslA = gsl_matrix_view_array( A, dim, dim );
    gslP = gsl_permutation_alloc( dim );
    gslB = gsl_vector_view_array( b, dim );
    gslX = gsl_vector_view_array( x, dim );

    int err;
    double t, elapsed;

    t = timeNow();
    for( int i = 0; i < reps; i++ ) {
        // gsl overwrites A during decompose, so we must copy the initial A each time.
        memcpy( A, avals_row, sizeof(avals_row) );
        err = gsl_linalg_LU_decomp( &gslA.matrix, gslP, &sign );
    }
    elapsed = timeNow() - t;
    printf( "GSL decompose (%dx) time = %g\n", reps, elapsed );

    t = timeNow();
    for( int i = 0; i < reps; i++ ) {
        err = gsl_linalg_LU_solve( &gslA.matrix, gslP, &gslB.vector, &gslX.vector );
    }
    elapsed = timeNow() - t;
    printf( "GSL solve (%dx) time = %g\n", reps, elapsed );

    gsl_permutation_free( gslP );
}

void run_eigen( double *A, double *b, double *x, int dim, int reps ) {
    Eigen::PartialPivLU<Eigen::MatrixXd> eigenA_lu;
    Eigen::Map< Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor> > ma( A, dim, dim );
    Eigen::Map<Eigen::MatrixXd> mb( b, dim, 1 );

    double t, elapsed;

    t = timeNow();
    for( int i = 0; i < reps; i++ ) {
        // This memcpy is not necessary for Eigen, which does not overwrite A in the
        // decompose, but do it so that the time is directly comparable to GSL.
        memcpy( A, avals_col, sizeof(avals_col) );
        eigenA_lu.compute( ma );
    }
    elapsed = timeNow() - t;
    printf( "Eigen decompose (%dx) time = %g\n", reps, elapsed );

    t = timeNow();
    Eigen::VectorXd _x;
    for( int i = 0; i < reps; i++ ) {
        _x = eigenA_lu.solve( mb );
    }
    elapsed = timeNow() - t;
    printf( "Eigen solve (%dx) time = %g\n", reps, elapsed );

    // copy solution to the passed x
    for( int i = 0; i < dim; i++ ) {
        x[i] = _x(i);
    }
}

int main() {
    // solve a 3x3 system many times
    double A[9], b[3], x[3];
    int dim = 3;
    int reps = 1000000;
    // init the b vector; A is copied repeatedly inside run_gsl/run_eigen
    memcpy( b, bvals, sizeof(bvals) );

    run_eigen( A, b, x, dim, reps );
    print_solution( x, dim, "Eigen" );
    run_gsl( A, b, x, dim, reps );
    print_solution( x, dim, "GSL" );
    return 0;
}
This produces, for example:
Eigen decompose (1000000x) time = 0.242
Eigen solve (1000000x) time = 0.108
Eigen solve for x:
8 2 3
GSL decompose (1000000x) time = 0.049
GSL solve (1000000x) time = 0.075
GSL solve for x:
8 2 3

Your benchmark is not really fair: you are copying the input matrix twice in the Eigen version, once manually through memcpy and once within PartialPivLU. You should also let Eigen know that mb is a vector by declaring it as a Map<Eigen::VectorXd>. Then I get (GCC 5, -O3, Eigen 3.3):
Eigen decompose (1000000x) time = 0.087
Eigen solve (1000000x) time = 0.036
Eigen solve for x:
8 2 3
GSL decompose (1000000x) time = 0.032
GSL solve (1000000x) time = 0.062
GSL solve for x:
8 2 3
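Concretely, the fix amounts to something like this (my sketch of the changes described above, not the answerer's verbatim code):

Eigen::Map<Eigen::VectorXd> mb( b, dim );   // tell Eigen that b is a vector, not a dim x 1 matrix

t = timeNow();
for( int i = 0; i < reps; i++ ) {
    // no memcpy needed: PartialPivLU::compute() already copies ma internally
    eigenA_lu.compute( ma );
}
elapsed = timeNow() - t;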
Moreover, Eigen's PartialPivLU is not really designed for such extremely tiny matrices (see below). For 3x3 matrices it is better to explicitly compute the inverse (for matrices up to 4x4 this is usually OK, but not for larger ones!). In this case you must fix the sizes at compile time:
Eigen::PartialPivLU<Eigen::Matrix3d> eigenA_lu;
Eigen::Map<Eigen::Matrix3d> ma( avals_col );
Eigen::Map<Eigen::Vector3d> mb( b );
Eigen::Matrix3d inv;
Eigen::Vector3d _x;
double t, elapsed;

t = timeNow();
for( int i = 0; i < reps; i++ ) {
    inv = ma.inverse();
}
elapsed = timeNow() - t;
printf( "Eigen inverse and solve (%dx) time = %g\n", reps, elapsed );

t = timeNow();
for( int i = 0; i < reps; i++ ) {
    _x.noalias() = inv * mb;
}
elapsed = timeNow() - t;
printf( "Eigen solve (%dx) time = %g\n", reps, elapsed );
which gives me:
Eigen inverse and solve (1000000x) time = 0.0209999
Eigen solve (1000000x) time = 0.000999928
which is much faster.
Now if we try a much larger problem, like 3000 x 3000, we get more than one order of magnitude of difference in favor of Eigen:
Eigen decompose (1x) time = 0.411
GSL decompose (1x) time = 6.073
It is precisely the optimizations that enable such performance on large problems that also introduce some overhead for very tiny matrices.

Related

LAPACK function gets slower after first iteration

I am implementing an iterative algorithm that uses LAPACK for PSD Projections (doesn't really matter, the point is I'm calling this function over and over):
void useLAPACK(vector<double>& x, int N){
    /* Locals */
    int n = N, il, iu, m, lda = N, ldz = N, info, lwork, liwork;
    double abstol;
    double vl, vu;
    int iwkopt;
    int* iwork;
    double wkopt;
    double* work;
    /* Local arrays (VLAs; a GCC extension in C++) */
    int isuppz[N];
    double w[N], z[N*N];
    /* Negative abstol means using the default value */
    abstol = -1.0;
    /* With range "V", all eigenvalues in the half-open interval (vl,vu] are computed */
    vl = 0;
    vu = 1.79769e+308;
    /* Query and allocate the optimal workspace */
    lwork = -1;
    liwork = -1;
    dsyevr_( (char*)"Vectors", (char*)"V", (char*)"Upper", &n, &x[0], &lda, &vl, &vu, &il, &iu,
             &abstol, &m, w, z, &ldz, isuppz, &wkopt, &lwork, &iwkopt, &liwork,
             &info );
    lwork = (int)wkopt;
    work = (double*)malloc( lwork*sizeof(double) );
    liwork = iwkopt;
    iwork = (int*)malloc( liwork*sizeof(int) );
    /* Solve eigenproblem */
    dsyevr_( (char*)"Vectors", (char*)"V", (char*)"Upper", &n, &x[0], &lda, &vl, &vu, &il, &iu,
             &abstol, &m, w, z, &ldz, isuppz, work, &lwork, iwork, &liwork,
             &info );
    /* Check for convergence */
    if( info > 0 ) {
        printf( "The dsyevr (useLAPACK) failed to compute eigenvalues.\n" );
        exit( 1 );
    }
    /* Print the number of eigenvalues found */
    //printf( "\n The total number of eigenvalues found:%2i\n", m );
    //print_matrix( "Selected eigenvalues", 1, m, w, 1 );
    //print_matrix( "Selected eigenvectors (stored columnwise)", n, m, z, ldz );

    // Eigenvectors are returned as stacked columns (m in total).
    // Reconstruct x = sum_i lambda_i * v_i * v_i^T; this outer-sum order is fastest.
    for(int i = 0; i < N*N; ++i) x[i] = 0;
    double lambda;
    double vrow1, vrow2;
    for(int col = 0; col < m; ++col) {
        lambda = w[col];
        for (int row1 = 0; row1 < N; ++row1) {
            vrow1 = z[N*col+row1];
            for(int row2 = 0; row2 < N; ++row2){
                vrow2 = z[N*col+row2];
                x[row1*N+row2] += lambda*vrow1*vrow2;
            }
        }
    }
    free( (void*)iwork );
    free( (void*)work );
}
Now my time measurements show that the first call takes about 4ms, but then it increases to 100ms. Is there a good explanation for this in this code? x is the same vector every time.
I think I have figured out the problem. My algorithm starts with the zero matrix, and afterwards the eigenvalues are more or less half positive, half negative. With those arguments dsyevr only computes the positive eigenvalues and the corresponding eigenvectors. I suppose that if all eigenvalues are zero it doesn't really have to compute any eigenvectors, which makes the first call much faster. Thanks for all the answers and sorry about the missing information.
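For reference, the relevant dsyevr semantics: with RANGE = "V" it returns only the eigenvalues in the half-open interval (vl, vu], so vl = 0 excludes the negative half of the spectrum. A minimal sketch of the all-eigenvalues variant, reusing the arguments above (my illustration, not part of the original post):

/* RANGE = "All": compute every eigenpair, so the cost per call no longer
   depends on how many eigenvalues happen to be positive. */
dsyevr_( (char*)"Vectors", (char*)"All", (char*)"Upper", &n, &x[0], &lda,
         &vl, &vu, &il, &iu, &abstol, &m, w, z, &ldz, isuppz,
         work, &lwork, iwork, &liwork, &info );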

ifft results are different from original signal

The FFT works fine, but when I take the IFFT I always see the same graph in its results, regardless of the original signal. The results are complex:
in the real part, the graph is a -sin with period = frame size;
in the imaginary part, it is a -cos with the same period.
Where can the problem be?
Original signal: [plot omitted]
IFFT real value (the pictures show only half of the frame): [plot omitted]
The FFT algorithm that I use:
double** FFT(double** f, int s, bool inverse) {
    if (s == 1) return f;
    int sH = s / 2;
    double** fOdd  = new double*[sH];
    double** fEven = new double*[sH];
    for (int i = 0; i < sH; i++) {
        int j = 2 * i;
        fOdd[i]  = f[j];
        fEven[i] = f[j + 1];
    }
    double** sOdd  = FFT(fOdd, sH, inverse);
    double** sEven = FFT(fEven, sH, inverse);
    double** spectr = new double*[s];
    double arg = inverse ? DoublePI / s : -DoublePI / s;
    double* oBase = new double[2]{ cos(arg), sin(arg) };
    double* o     = new double[2]{ 1, 0 };
    for (int i = 0; i < sH; i++) {
        double* sO1 = Mul(o, sOdd[i]);
        spectr[i]      = Sum(sEven[i], sO1);
        spectr[i + sH] = Dif(sEven[i], sO1);
        o = Mul(o, oBase);
    }
    return spectr;
}
The "butterfly" portion is applying the coefficients incorrectly:
for (int i = 0; i < sH; i++) {
    double* sO1 = sOdd[i];
    double* sE1 = Mul(o, sEven[i]);
    spectr[i]      = Sum(sO1, sE1);
    spectr[i + sH] = Dif(sO1, sE1);
    o = Mul(o, oBase);
}
Side note: I kept your notation, but it makes things confusing:
fOdd has indexes 0, 2, 4, 6, ... so it should be fEven
fEven has indexes 1, 3, 5, 7, ... so it should be fOdd
Really, sOdd should be sLower and sEven should be sUpper, since they correspond to the 0:s/2-1 and s/2:s-1 elements of the spectrum respectively:
    sLower = FFT(fEven, sH, inverse); // fEven is 0, 2, 4, ...
    sUpper = FFT(fOdd, sH, inverse);  // fOdd is 1, 3, 5, ...
Then the butterfly becomes:
for (int i = 0; i < sH; i++) {
    double* sL1 = sLower[i];
    double* sU1 = Mul(o, sUpper[i]);
    spectr[i]      = Sum(sL1, sU1);
    spectr[i + sH] = Dif(sL1, sU1);
    o = Mul(o, oBase);
}
When written like this it is easier to compare to the pseudocode example on Wikipedia.
And @Dai is correct: you are going to leak a lot of memory.
Regarding the memory, you can use std::vector to encapsulate dynamically-allocated arrays and to ensure they're deallocated when execution leaves scope. You could use unique_ptr<double[]>, but the performance gains are not worth it IMO, and you lose the safety of the at() method.
(Based on @Robb's answer.)
A few other tips:
Avoid cryptic identifiers: programs should be readable, and names like "f" and "s" make your program harder to read and maintain.
Type-based Hungarian notation is frowned upon; modern editors show type information automatically, so it adds unnecessary complication to identifier names.
Use size_t for indexes, not int.
The STL is your friend; use it!
Preemptively prevent bugs by using const to guard read-only data against accidental mutation.
Like so:
#include <cmath>
#include <complex>
#include <vector>
using namespace std;

static const double DoublePI = 6.283185307179586; // 2*pi (assumed defined elsewhere in the question's code)

// Samples are complex; the question stored them as double[2] pairs.
vector<complex<double>> fastFourierTransform( const vector<complex<double>>& signal, const bool inverse ) {
    if( signal.size() < 2 ) return signal;
    const size_t half = signal.size() / 2;
    vector<complex<double>> lower; lower.reserve( half );
    vector<complex<double>> upper; upper.reserve( half );
    bool isEven = true;
    for( size_t i = 0; i < signal.size(); i++ ) {
        if( isEven ) lower.push_back( signal.at( i ) );
        else         upper.push_back( signal.at( i ) );
        isEven = !isEven;
    }
    const vector<complex<double>> lowerFft = fastFourierTransform( lower, inverse );
    const vector<complex<double>> upperFft = fastFourierTransform( upper, inverse );
    vector<complex<double>> result( signal.size() );
    const double arg = ( inverse ? 1 : -1 ) * ( DoublePI / signal.size() );
    const complex<double> oBase( cos( arg ), sin( arg ) ); // twiddle-factor step
    complex<double> o( 1, 0 );                             // running twiddle factor; starts at 1, not 0
    for( size_t i = 0; i < half; i++ ) {
        const complex<double> u = o * upperFft.at( i );
        result.at( i )        = lowerFft.at( i ) + u;
        result.at( i + half ) = lowerFft.at( i ) - u;
        o *= oBase;
    }
    return result; // returned by move (or elided), so no copy is made here
}

function template parametrized by another function with a different number of arguments

I'm able to make a function template parametrized by another function; however, I don't know how to do it when I want to parametrize it by functions with different numbers of arguments.
See this code:
#include <stdio.h>
#include <math.h>

template < double FUNC( double a ) >
void seq_op( int n, double * as ){
    for (int i=0; i<n; i++){ printf( " %f \n", FUNC( as[i] ) ); }
}

template < double FUNC( double a, double b ) >
void seq_op_2( int n, double * as, double * bs ){
    for (int i=0; i<n; i++){ printf( " %f \n", FUNC( as[i], bs[i] ) ); }
}

double a_plus_1 ( double a ){ return a + 1.0; }
double a_sq     ( double a ){ return a*a; }
double a_plus_b ( double a, double b ){ return a + b; }
double a_times_b( double a, double b ){ return a * b; }

double as[5] = {1,2,3,4};
double bs[5] = {2,2,2,2};

// FUNCTION ====== main
int main(){
    printf( "seq_op <a_plus_1> ( 4, as );\n" );        seq_op <a_plus_1>   ( 4, as );
    printf( "seq_op <a_sq> ( 4, as );\n" );            seq_op <a_sq>       ( 4, as );
    printf( "seq_op_2 <a_plus_b> ( 4, as, bs );\n" );  seq_op_2 <a_plus_b> ( 4, as, bs );
    printf( "seq_op_2 <a_times_b> ( 4, as, bs );\n" ); seq_op_2 <a_times_b>( 4, as, bs );
}
Is there a way to make a common template for both cases?
Why do I need such a silly thing? A more practical example is these two functions, which differ in only one line:
#define i3D( ix, iy, iz ) ( iz*nxy + iy*nx + ix )

void getLenardJonesFF( int natom, double * Rs_, double * C6, double * C12 ){
    Vec3d * Rs = (Vec3d*) Rs_;
    int nx  = FF::n.x;
    int ny  = FF::n.y;
    int nz  = FF::n.z;
    int nxy = ny * nx;
    Vec3d rProbe; rProbe.set( 0.0, 0.0, 0.0 ); // we may shift here
    for ( int ia=0; ia<nx; ia++ ){
        printf( " ia %i \n", ia );
        rProbe.add( FF::dCell.a );
        for ( int ib=0; ib<ny; ib++ ){
            rProbe.add( FF::dCell.b );
            for ( int ic=0; ic<nz; ic++ ){
                rProbe.add( FF::dCell.c );
                Vec3d f; f.set(0.0,0.0,0.0);
                for(int iatom=0; iatom<natom; iatom++){
                    // only this line differs
                    f.add( forceLJ( Rs[iatom] - rProbe, C6[iatom], C12[iatom] ) );
                }
                FF::grid[ i3D( ia, ib, ic ) ].add( f );
            }
            rProbe.add_mul( FF::dCell.c, -nz );
        }
        rProbe.add_mul( FF::dCell.b, -ny );
    }
}
void getCoulombFF( int natom, double * Rs_, double * kQQs ){
    Vec3d * Rs = (Vec3d*) Rs_;
    int nx  = FF::n.x;
    int ny  = FF::n.y;
    int nz  = FF::n.z;
    int nxy = ny * nx;
    Vec3d rProbe; rProbe.set( 0.0, 0.0, 0.0 ); // we may shift here
    for ( int ia=0; ia<nx; ia++ ){
        printf( " ia %i \n", ia );
        rProbe.add( FF::dCell.a );
        for ( int ib=0; ib<ny; ib++ ){
            rProbe.add( FF::dCell.b );
            for ( int ic=0; ic<nz; ic++ ){
                rProbe.add( FF::dCell.c );
                Vec3d f; f.set(0.0,0.0,0.0);
                for(int iatom=0; iatom<natom; iatom++){
                    // only this line differs
                    f.add( forceCoulomb( Rs[iatom] - rProbe, kQQs[iatom] ) );
                }
                FF::grid[ i3D( ia, ib, ic ) ].add( f );
            }
            rProbe.add_mul( FF::dCell.c, -nz );
        }
        rProbe.add_mul( FF::dCell.b, -ny );
    }
}
You should be able to combine the two functions using a combination of std::bind() and std::function (see the code on coliru):
#include <stdio.h>
#include <functional>

using namespace std::placeholders;

double getLJForceAtoms (int, int, double*, double*, double*)
{
    printf("getLJForceAtoms\n");
    return 0;
}

double getCoulombForceAtoms (int, int, double*, double*)
{
    printf("getCoulombForceAtoms\n");
    return 0;
}

void getFF (int natom, double* Rs_, std::function<double(int, int, double*)> GetForce)
{
    int rProbe = 1;
    double Force = GetForce(rProbe, natom, Rs_);
}

int main ()
{
    double* C6   = nullptr;
    double* C12  = nullptr;
    double* kQQs = nullptr;
    double* Rs_  = nullptr;

    auto getLJForceFunc      = std::bind(getLJForceAtoms, _1, _2, _3, C6, C12);
    auto getCoulombForceFunc = std::bind(getCoulombForceAtoms, _1, _2, _3, kQQs);

    getFF(1, Rs_, getLJForceFunc);
    getFF(1, Rs_, getCoulombForceFunc);

    return 0;
}
which outputs the expected:
getLJForceAtoms
getCoulombForceAtoms
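As an aside (my addition, not part of the original answer), a lambda gives the same binding without std::placeholders and is often easier to read:

// Captures C6/C12 by value; converts to std::function<double(int, int, double*)>
// exactly like the bind expression above.
auto getLJForceFunc = [=](int rProbe, int natom, double* Rs_) {
    return getLJForceAtoms(rProbe, natom, Rs_, C6, C12);
};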
Update -- On Performance
While it is natural to be concerned about the performance of std::function vs. templates, I would not rule out a possible solution without first benchmarking and profiling it.
I can't compare the performance directly, as I would need both your complete source code and your input data set to make accurate benchmarks, but I can do a very simple test to show what it could look like. If I make the force functions do a little work:
double getLJForceAtoms (int x, int y, double* r1, double* r2, double* r3)
{
    return cos(log2(abs(sin(log(pow(x, 2) + pow(y, 2))))));
}
and then have a very simple getFF() function call them 10 million times I can get a rough comparison between the various design methods (tests done on VS2013, release build, fast optimization flags):
Direct Call      = 1900 ms
Switch           = 1900 ms
If (flag)        = 1900 ms
Virtual Function = 2400 ms
std::function    = 2400 ms
So the std::function method is about 25% slower in this case, but the switch and if methods are the same speed as the direct call. Depending on how much work your actual force functions do, you may get worse or better results. These days the compiler's optimizer and the CPU's branch predictor are good enough to do a lot of things that may be surprising or even counter-intuitive, which is why actual testing must be done.
I would do a similar benchmark test with your exact code and data set and see what difference, if any, the various designs have. If you really only have two cases as shown in your question then the "if (flag)" method may be a good choice.
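If you do want the dispatch resolved at compile time rather than through std::function, a single variadic template can also cover both arities. Here is a minimal sketch of that alternative (my illustration, not part of the original answer; seq_op_n is a made-up name):

#include <stdio.h>

// One template handles any number of per-element arrays:
// rest[i]... expands to one element from each extra array.
template< typename F, typename... Arrays >
void seq_op_n( F func, int n, double* as, Arrays*... rest ){
    for( int i = 0; i < n; i++ ){
        printf( " %f \n", func( as[i], rest[i]... ) );
    }
}

// Usage, with the functions from the question:
//   seq_op_n( a_plus_1, 4, as );       // unary
//   seq_op_n( a_plus_b, 4, as, bs );   // binary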

3D FFT Using Intel MKL with Zero Padding

I want to compute the 3D FFT of an array of about 300×200×200 elements using Intel MKL. This 3D array is stored as a 1D array of type double in a columnwise fashion:
for( int k = 0; k < nk; k++ )         // Loop through the height.
    for( int j = 0; j < nj; j++ )     // Loop through the rows.
        for( int i = 0; i < ni; i++ ) // Loop through the columns.
        {
            ijk = i + ni * j + ni * nj * k;
            my3Darray[ ijk ] = 1.0;
        }
I want to perform a not-in-place FFT on the input array, preventing it from being modified (I need to use it later in my code), and then do the backward computation in-place. I also want to have zero padding.
My questions are:
How can I perform the zero-padding?
How should I deal with the size of the arrays used by FFT functions when zero padding is included in the computation?
How can I take out the zero padded results and get the actual result?
Here is my attempt at the problem; I would be absolutely thankful for any comment, suggestion, or hint.
#include <stdio.h>
#include "mkl.h"

int max(int a, int b, int c)
{
    int m = a;
    (m < b) && (m = b);
    (m < c) && (m = c);
    return m;
}

void FFT3D_R2C( // Real-to-complex 3D FFT.
    double *in, int nRowsIn, int nColsIn, int nHeightsIn,
    double *out )
{
    int n = max( nRowsIn, nColsIn, nHeightsIn );

    // Round up to the next highest power of 2 of 32-bit n.
    unsigned int N = (unsigned int) n;
    N--;
    N |= N >> 1;
    N |= N >> 2;
    N |= N >> 4;
    N |= N >> 8;
    N |= N >> 16;
    N++;

    /* Strides describe the data layout in the real and conjugate-even domains. */
    MKL_LONG rs[4], cs[4];

    // DFTI descriptor.
    DFTI_DESCRIPTOR_HANDLE fft_desc = 0;

    // Buffers needed for the out-of-place computations.
    MKL_Complex16 *in_fft = new MKL_Complex16 [ N*N*N ];
    double *out_ZeroPadded = new double [ N*N*N ];

    /* Compute strides. */
    rs[3] = 1;           cs[3] = 1;
    rs[2] = (N/2+1)*2;   cs[2] = (N/2+1);
    rs[1] = N*(N/2+1)*2; cs[1] = N*(N/2+1);
    rs[0] = 0;           cs[0] = 0;

    // Create the DFTI descriptor.
    MKL_LONG sizes[] = { N, N, N };
    DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_REAL, 3, sizes );

    // Configure the DFTI descriptor.
    DftiSetValue( fft_desc, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX );
    DftiSetValue( fft_desc, DFTI_PLACEMENT, DFTI_NOT_INPLACE ); // Out-of-place transformation.
    DftiSetValue( fft_desc, DFTI_INPUT_STRIDES , rs );
    DftiSetValue( fft_desc, DFTI_OUTPUT_STRIDES, cs );
    DftiCommitDescriptor( fft_desc );
    DftiComputeForward  ( fft_desc, in, in_fft );

    // Swap the strides to compute the backward transform,
    // feeding the forward result in_fft back in.
    DftiSetValue        ( fft_desc, DFTI_INPUT_STRIDES , cs );
    DftiSetValue        ( fft_desc, DFTI_OUTPUT_STRIDES, rs );
    DftiCommitDescriptor( fft_desc );
    DftiComputeBackward ( fft_desc, in_fft, out_ZeroPadded );

    // Print the zero-padded 3D FFT result.
    for( long long i = 0; i < (long long)N*N*N; i++ )
        printf( "%f\n", out_ZeroPadded[i] );

    /* I don't know how to take out the zero-padded results and
       save the actual result in the variable named "out". */

    DftiFreeDescriptor( &fft_desc );
    delete[] in_fft;
    delete[] out_ZeroPadded;
}

int main()
{
    int n = 10;
    double *a    = new double [n*n*n]; // This array is real.
    double *afft = new double [n*n*n];

    // Fill the array with some 'real' numbers.
    for( int i = 0; i < n*n*n; i++ )
        a[ i ] = 1.0;

    // Calculate the FFT.
    FFT3D_R2C( a, n, n, n, afft );

    printf("FFT results:\n");
    for( int i = 0; i < n*n*n; i++ )
        printf( "%15.8f\n", afft[i] );

    delete[] a;
    delete[] afft;
    return 0;
}
Just a few hints:

Power of 2 size

I don't like the way you are computing the size, so let Nx,Ny,Nz be the size of the input matrix and nx,ny,nz the size of the padded matrix:
for (nx=1;nx<Nx;nx<<=1);
for (ny=1;ny<Ny;ny<<=1);
for (nz=1;nz<Nz;nz<<=1);
Now zero-pad: memset the padded array to zero first and then copy the input matrix line by line (see the sketch below). Note that padding to N^3 instead of nx*ny*nz can result in big slowdowns if nx,ny,nz are not close to each other.
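For example (my sketch, assuming the column-wise layout from the question, i.e. index = i + Nx*j + Nx*Ny*k, with in being the original array):

double *pad = new double[(size_t)nx*ny*nz];
memset( pad, 0, sizeof(double)*(size_t)nx*ny*nz );   // zero everything first
for( int k = 0; k < Nz; k++ )                        // then copy the input lines
    for( int j = 0; j < Ny; j++ )
        memcpy( pad + (size_t)nx*( j + (size_t)ny*k ),
                in  + (size_t)Nx*( j + (size_t)Ny*k ),
                Nx*sizeof(double) );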
Output is complex

If I get it right, a is the input real matrix and afft the output complex matrix, so why not allocate the space for it correctly?

double *afft = new double [2*nx*ny*nz];

A complex number is a real plus an imaginary part, so 2 values per number. That also goes for the final print of the result, and some "\r\n" after each line would be good for viewing.
3D DFFT

I do not use or know your DFFT library; I use my own. But in any case, a 3D DFFT can be done with 1D DFFTs if you do it line by line ... see this 2D DFCT by 1D DFFT. In 3D it is the same, but you need to add one more pass and a different normalization constant. This way you can get by with a single line buffer, double lin[2*max(nx,ny,nz)];, and do the zero padding on the fly (so there is no need to hold a bigger matrix in memory) ... but that involves copying the lines on each 1D DFFT, as sketched below ...
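For illustration, a rough sketch of the per-line approach (my code, with a hypothetical in-place 1D transform fft1d(buf, n, inverse), a padded complex matrix data with interleaved re/im values, and the same index convention as above; the y and z passes are identical except for the stride):

// Pass along the x axis: gather a line into the buffer, transform, scatter back.
double *lin = new double[2*nx]; // one complex line (re,im interleaved)
for( int z = 0; z < nz; z++ )
    for( int y = 0; y < ny; y++ ) {
        for( int x = 0; x < nx; x++ ) {  // gather
            size_t s = 2*( x + (size_t)nx*( y + (size_t)ny*z ) );
            lin[2*x] = data[s]; lin[2*x+1] = data[s+1];
        }
        fft1d( lin, nx, inverse );       // hypothetical 1D DFFT
        for( int x = 0; x < nx; x++ ) {  // scatter
            size_t s = 2*( x + (size_t)nx*( y + (size_t)ny*z ) );
            data[s] = lin[2*x]; data[s+1] = lin[2*x+1];
        }
    }
delete[] lin;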

3D Convolution with Intel MKL

I have written C/C++ code which uses Intel MKL to compute the 3D convolution of an array of about 300×200×200 elements. I want to apply a kernel which is either 3×3×3 or 5×5×5. Both the 3D input array and the kernel have real values.
This 3D array is stored as a 1D array of type double in a columnwise fashion. Similarly the kernel is of type double and is saved columnwise. For example,
for( int k = 0; k < nk; k++ )         // Loop through the height.
    for( int j = 0; j < nj; j++ )     // Loop through the rows.
        for( int i = 0; i < ni; i++ ) // Loop through the columns.
        {
            ijk = i + ni * j + ni * nj * k;
            my3Darray[ ijk ] = 1.0;
        }
To compute the convolution, I want to perform a not-in-place FFT on the input array and the kernel, preventing them from being modified (I need to use them later in my code), and then do the backward computation in-place.
When I compare the result obtained from my code with the one obtained by MATLAB they are very different. Could someone kindly help me fix the issue? What is missing in my code?
Here is the MATLAB code I used:
a = ones( 10, 10, 10 );
kernel = ones( 3, 3, 3 );
aconvolved = convn( a, kernel, 'same' );
Here is my C/C++ code:
#include <stdio.h>
#include "mkl.h"

void Conv3D(
    double *in, double *ker, double *out,
    int nRows, int nCols, int nHeights)
{
    int NI = nRows;
    int NJ = nCols;
    int NK = nHeights;

    double *in_fft  = new double [NI*NJ*NK];
    double *ker_fft = new double [NI*NJ*NK];

    DFTI_DESCRIPTOR_HANDLE fft_desc = 0;
    MKL_LONG sizes[]   = { NK, NJ, NI };
    MKL_LONG strides[] = { 0, NJ*NI, NI, 1 };

    DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_REAL, 3, sizes );
    DftiSetValue        ( fft_desc, DFTI_PLACEMENT, DFTI_NOT_INPLACE ); // Out-of-place computation.
    DftiSetValue        ( fft_desc, DFTI_INPUT_STRIDES , strides );
    DftiSetValue        ( fft_desc, DFTI_OUTPUT_STRIDES, strides );
    DftiSetValue        ( fft_desc, DFTI_BACKWARD_SCALE, 1.0/(NI*NJ*NK) ); // note: 1/NI/NJ/NK would be integer division, i.e. 0
    DftiCommitDescriptor( fft_desc );
    DftiComputeForward  ( fft_desc, in , in_fft  );
    DftiComputeForward  ( fft_desc, ker, ker_fft );

    for (long long i = 0; i < (long long)NI*NJ*NK; ++i )
        out[i] = in_fft[i]*ker_fft[i];

    // In-place computation.
    DftiSetValue        ( fft_desc, DFTI_PLACEMENT, DFTI_INPLACE );
    DftiCommitDescriptor( fft_desc );
    DftiComputeBackward ( fft_desc, out );

    DftiFreeDescriptor( &fft_desc );
    delete[] in_fft;
    delete[] ker_fft;
}

int main(int argc, char* argv[])
{
    int n = 10;
    int nkernel = 3;

    double *a          = new double [n*n*n]; // This array is real.
    double *aconvolved = new double [n*n*n]; // The convolved array is also real.
    double *kernel     = new double [nkernel*nkernel*nkernel]; // The kernel is real.

    // Fill the array with some 'real' numbers.
    for( int i = 0; i < n*n*n; i++ )
        a[ i ] = 1.0;

    // Fill the kernel with some 'real' numbers.
    for( int i = 0; i < nkernel*nkernel*nkernel; i++ )
        kernel[ i ] = 1.0;

    // Calculate the convolution.
    Conv3D( a, kernel, aconvolved, n, n, n );

    printf("Convolved:\n");
    for( int i = 0; i < n*n*n; i++ )
        printf( "%15.8f\n", aconvolved[i] );

    delete[] a;
    delete[] kernel;
    delete[] aconvolved;
    return 0;
}
You can't reverse the FFT with real-valued frequency data (just the magnitude). A forward FFT needs to output complex data. This is done by setting the DFTI_FORWARD_DOMAIN setting to DFTI_COMPLEX.
DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_COMPLEX, 3, sizes );
Doing this implicitly sets the backward domain to complex too.
You will also need a complex data type. Probably something like,
MKL_Complex16* in_fft = new MKL_Complex16[NI*NJ*NK];
This means you will have to do a full complex multiplication, not multiply the real and imaginary parts separately:
for (size_t i = 0; i < (size_t)NI*NJ*NK; ++i) {
    out_fft[i].real = in_fft[i].real * ker_fft[i].real - in_fft[i].imag * ker_fft[i].imag;
    out_fft[i].imag = in_fft[i].real * ker_fft[i].imag + in_fft[i].imag * ker_fft[i].real;
}
The output of the inverse FFT is also complex, and assuming your input data is real, you can just grab the .real component and that is your result. This means you'll need a temporary complex output array (say, out_fft as above).
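Concretely, that last step might look like this (my sketch, reusing the out_fft buffer above after an in-place DftiComputeBackward( fft_desc, out_fft )):

for (size_t i = 0; i < (size_t)NI*NJ*NK; ++i)
    out[i] = out_fft[i].real; // keep only the real part of each complex sample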
Also note that to avoid wrap-around artifacts, you want the size of your FFT to be at least M+N-1 on each dimension (e.g. 10+3-1 = 12 for the arrays above). Generally you would choose the next highest power of two for speed.
I strongly suggest you implement it in MATLAB first, using FFTs. There are many such implementations available (example), but I would start from the basics and make a simple function on your own.