LAPACK function gets slower after first iteration - C++

I am implementing an iterative algorithm that uses LAPACK for PSD projections (the details don't really matter; the point is that I'm calling this function over and over):
void useLAPACK(vector<double>& x, int N){
/* Locals */
int n = N, il, iu, m, lda = N, ldz = N, info, lwork, liwork;
double abstol;
double vl,vu;
int iwkopt;
int* iwork;
double wkopt;
double* work;
/* Local arrays */
int isuppz[N];
double w[N], z[N*N];
/* Negative abstol means using the default value */
abstol = -1.0;
/* Set il, iu to compute NSELECT smallest eigenvalues */
vl = 0;
vu = 1.79769e+308;
/* Query and allocate the optimal workspace */
lwork = -1;
liwork = -1;
dsyevr_( (char*)"Vectors", (char*)"V", (char*)"Upper", &n, &x[0], &lda, &vl, &vu, &il, &iu,
&abstol, &m, w, z, &ldz, isuppz, &wkopt, &lwork, &iwkopt, &liwork,
&info );
lwork = (int)wkopt;
work = (double*)malloc( lwork*sizeof(double) );
liwork = iwkopt;
iwork = (int*)malloc( liwork*sizeof(int) );
/* Solve eigenproblem */
dsyevr_( (char*)"Vectors", (char*)"V", (char*)"Upper", &n, &x[0], &lda, &vl, &vu, &il, &iu,
&abstol, &m, w, z, &ldz, isuppz, work, &lwork, iwork, &liwork,
&info );
/* Check for convergence */
if( info > 0 ) {
printf( "The dsyevr (useLAPACK) failed to compute eigenvalues.\n" );
exit( 1 );
}
/* Print the number of eigenvalues found */
//printf( "\n The total number of eigenvalues found:%2i\n", m );
//print_matrix( "Selected eigenvalues", 1, m, w, 1 );
//print_matrix( "Selected eigenvectors (stored columnwise)", n, m, z, ldz );
//Eigenvectors are returned as stacked columns (in total m)
//Outer sum calculation is fastest.
for(int i = 0; i < N*N; ++i) x[i] = 0;
double lambda;
double vrow1,vrow2;
for(int col = 0; col < m; ++col) {
lambda = w[col];
for (int row1 = 0; row1 < N; ++row1) {
vrow1 = z[N*col+row1];
for(int row2 = 0; row2 < N; ++row2){
vrow2 = z[N*col+row2];
x[row1*N+row2] += lambda*vrow1*vrow2;
}
}
}
free( (void*)iwork );
free( (void*)work );
}
My time measurements show that the first call takes about 4 ms, but subsequent calls take around 100 ms. Is there a good explanation for this in this code? x is the same vector every time.

I think I have figured out the problem. My algorithm starts with the zero matrix, and in later iterations the eigenvalues are roughly half positive and half negative. With these arguments (vl = 0, vu close to DBL_MAX), dsyevr only computes the positive eigenvalues and the corresponding eigenvectors. For the zero matrix there are no eigenvalues in that range, so it barely has to compute any eigenvectors, which makes that call much faster. Thanks for all the answers and sorry about the missing information.
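A minimal way to check this (sketch only; the driver below is illustrative) is to time each call and uncomment the printf of m inside useLAPACK: for the zero matrix m is 0, so essentially no eigenvector work is done, while later iterates report roughly N/2 eigenvalues.
#include <chrono>
#include <cstdio>
#include <vector>

void useLAPACK(std::vector<double>& x, int N);   // the routine from the question

// Illustrative driver: report the wall-clock time of each projection call.
void timeProjections(std::vector<double>& x, int N, int iterations)
{
    for (int it = 0; it < iterations; ++it) {
        auto t0 = std::chrono::steady_clock::now();
        useLAPACK(x, N);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("call %d took %.2f ms\n", it, ms);
    }
}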

Related

finding global maxima of a function from comparing each processor's local maxima using MPI ring topology

I wish to use an MPI ring topology, passing each processor's maximum around the ring, comparing the local maxima, and then outputting the global maximum on all processors.
I am using a 10-dimensional Monte Carlo integration function. My first idea was to make an array holding each processor's local maximum, then pass those values around, compare them, and output the highest value. But I couldn't find an elegant way to build an array that stores each processor's maximum at the index corresponding to its rank; that way I could also keep track of which processor found the global maximum.
I haven't finished my code yet; right now I am interested in whether such an array of local maxima can be created at all. The way I coded it is very cumbersome, and if there are many processors I have to handle each rank separately, yet I still couldn't produce the array I am looking for.
I am sharing the code here:
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <mpi.h>
using namespace std;
//define multivariate function F(x1, x2, ...xk)
double f(double x[], int n)
{
double y;
int j;
y = 0.0;
for (j = 0; j < n-1; j = j+1)
{
y = y + exp(-pow((1-x[j]),2)-100*(pow((x[j+1] - pow(x[j],2)),2)));
}
y = y;
return y;
}
//define function for Monte Carlo Multidimensional integration
double int_mcnd(double(*fn)(double[],int),double a[], double b[], int n, int m)
{
double r, x[n], v;
int i, j;
r = 0.0;
v = 1.0;
// initial seed value (use system time)
//srand(time(NULL));
// step 1: calculate the common factor V
for (j = 0; j < n; j = j+1)
{
v = v*(b[j]-a[j]);
}
// step 2: integration
for (i = 1; i <= m; i=i+1)
{
// calculate random x[] points
for (j = 0; j < n; j = j+1)
{
x[j] = a[j] + (rand()) /( (RAND_MAX/(b[j]-a[j])));
}
r = r + fn(x,n);
}
r = r*v/m;
return r;
}
double f(double[], int);
double int_mcnd(double(*)(double[],int), double[], double[], int, int);
int main(int argc, char **argv)
{
int rank, size;
MPI_Init (&argc, &argv); // initializes MPI
MPI_Comm_rank (MPI_COMM_WORLD, &rank); // get current MPI process ID: 0, 1, ...
MPI_Comm_size (MPI_COMM_WORLD, &size); // get the total number of processes
/* define how many integrals */
const int n = 10;
double b[n] = {5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0,5.0};
double a[n] = {-5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0,-5.0};
double result, mean;
int m;
const unsigned int N = 5;
double max = -1;
double max_store[4];
cout.precision(6);
cout.setf(ios::fixed | ios::showpoint);
srand(time(NULL) * rank); // each MPI process gets a unique seed
m = 4; // initial number of intervals
// convert command-line input to N = number of points
//N = atoi( argv[1] );
for (unsigned int i=0; i <=N; i++)
{
result = int_mcnd(f, a, b, n, m);
mean = result/(pow(10,10));
if( mean > max)
{
max = mean;
}
//cout << setw(10) << m << setw(10) << max << setw(10) << mean << setw(10) << rank << setw(10) << size <<endl;
m = m*4;
}
//cout << setw(30) << m << setw(30) << result << setw(30) << mean <<endl;
printf("Process %d of %d mean = %1.5e\n and local max = %1.5e\n", rank, size, mean, max );
if (rank==0)
{
max_store[0] = max;
}
else if (rank==1)
{
max_store[1] = max;
}
else if (rank ==2)
{
max_store[2] = max;
}
else if (rank ==3)
{
max_store[3] = max;
}
for( int k = 0; k < 4; k++ )
{
printf( "%1.5e\n", max_store[k]);
}
//double max_store[4] = {4.43095e-02, 5.76586e-02, 3.15962e-02, 4.23079e-02};
double send_junk = max_store[0];
double rec_junk;
MPI_Status status;
// This next if-statement implements the ring topology
// the last process ID is size-1, so the ring topology is: 0->1, 1->2, ... size-1->0
// rank 0 starts the chain of events by passing to rank 1
if(rank==0) {
// only the process with rank ID = 0 will be in this block of code.
MPI_Send(&send_junk, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD); // send data to process 1
MPI_Recv(&rec_junk, 1, MPI_DOUBLE, size-1, 0, MPI_COMM_WORLD, &status); // receive data from process size-1
}
else if( rank == size-1) {
MPI_Recv(&rec_junk, 1, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &status); // receive data from process rank-1 (its "left" neighbor)
MPI_Send(&send_junk, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD); // send data to its "right neighbor", rank 0
}
else {
MPI_Recv(&rec_junk, 1, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &status); // receive data from process rank-1 (its "left" neighbor)
MPI_Send(&send_junk, 1, MPI_DOUBLE, rank+1, 0, MPI_COMM_WORLD); // send data to its "right neighbor" (rank+1)
}
printf("Process %d send %1.5e\n and recieved %1.5e\n", rank, send_junk, rec_junk );
MPI_Finalize(); // programs should always perform a "graceful" shutdown
return 0;
}
Compile with:
mpiCC -o gd test_code.cpp
mpirun -np 4 ./gd
I would appreciate suggestions on the following:
Is there a more elegant way to build the array of local maxima?
How do I compare the local maxima and decide the global maximum while passing the values around the ring?
Also feel free to modify the code to give me a better example to work with. I would appreciate any suggestion. Thanks.
For this sort of thing, you are better off using either MPI_Reduce() or MPI_Allreduce() with MPI_MAX as the operator. The former will compute the max over the values contributed by all processes and give the result to the "root" process only, while the latter will do the same but give the result to all processes.
// Only the process of rank 0 gets the global max
MPI_Reduce( &local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD );
// All processes get the global max
MPI_Allreduce( &local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD );
// All processes get the global max, stored in place of the local max
// after the call ends - this might be the most interesting one for you
MPI_Allreduce( MPI_IN_PLACE, &max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD );
As you can see, you could just insert the 3rd example into your code to solve your problem.
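If you also want to keep track of which rank owns the global maximum (as asked in the question), a minimal sketch using MPI_MAXLOC with the MPI_DOUBLE_INT pair type could look like this (local/global are illustrative names, placed after the local max has been computed):
struct { double value; int rank; } local, global;
local.value = max;    // this rank's local maximum
local.rank  = rank;   // this rank's ID
// Every process receives both the global maximum and the rank that owns it.
MPI_Allreduce( &local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD );
printf( "global max = %1.5e found on rank %d\n", global.value, global.rank );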
BTW, unrelated remark, but this hurts my eyes:
if (rank==0)
{
max_store[0] = max;
}
else if (rank==1)
{
max_store[1] = max;
}
else if (rank ==2)
{
max_store[2] = max;
}
else if (rank ==3)
{
max_store[3] = max;
}
What about something like this:
if ( rank < 4 && rank >= 0 ) {
max_store[rank] = max;
}
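And if you really want the full array of local maxima (one slot per rank) rather than only the global maximum, a gather is the idiomatic way to build it. A minimal sketch (max_all is an illustrative name; it needs #include <vector>):
std::vector<double> max_all( size );   // one slot per rank
// Collect every rank's local maximum on rank 0; element k holds rank k's value.
MPI_Gather( &max, 1, MPI_DOUBLE, &max_all[0], 1, MPI_DOUBLE, 0, MPI_COMM_WORLD );
if( rank == 0 )
    for( int k = 0; k < size; k++ )
        printf( "rank %d local max = %1.5e\n", k, max_all[k] );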

Making C++ Eigen LU faster (my tests show 2x slower than GSL)

I'm comparing the LU decomposition/solve of Eigen to GSL, and find Eigen to be on the order of 2x slower with -O3 optimizations on g++/OSX. I isolated the timing of the decompose and the solve, and find both to be substantially slower than their GSL counterparts. Am I doing something silly, or does Eigen not perform well for this use case (e.g. very small systems)? Built with Eigen 3.2.8 and an older GSL 1.15. The test case is very contrived, but it mirrors the results in some nonlinear-fitting software I'm writing - Eigen being anywhere from 1.5x to 2x+ slower for the total LU/solve operation.
#define NDEBUG
#include "sys/time.h"
#include "gsl/gsl_linalg.h"
#include <Eigen/LU>
// Ax=b is a 3x3 system for which soln is x=[8,2,3]
//
double avals_col[9] = { 4, 2, 2, 7, 5, 5, 7, 5, 9 };
// col major
double avals_row[9] = { 4, 7, 7, 2, 5, 5, 2, 5, 9 };
// row major
double bvals[3] = { 67, 41, 53 };
//----------- helpers
void print_solution( double *x, int dim, char *which ) {
printf( "%s solve for x:\n", which );
for( int j=0; j<3; j++ ) {
printf( "%g ", x[j] );
}
printf( "\n" );
}
struct timeval tv;
struct timezone tz;
double timeNow() {
gettimeofday( &tv, &tz );
int _mils = tv.tv_usec/1000;
int _secs = tv.tv_sec;
return (double)_secs + ((double)_mils/1000.0);
}
//-----------
void run_gsl( double *A, double *b, double *x, int dim, int reps ) {
gsl_matrix_view gslA;
gsl_vector_view gslB;
gsl_vector_view gslX;
gsl_permutation *gslP;
int sign;
gslA = gsl_matrix_view_array( A, dim, dim );
gslP = gsl_permutation_alloc( dim );
gslB = gsl_vector_view_array( b, dim );
gslX = gsl_vector_view_array( x, dim );
int err;
double t, elapsed;
t = timeNow();
for( int i=0; i<reps; i++ ) {
// gsl overwrites A during decompose, so we must copy the initial A each time.
memcpy( A, avals_row, sizeof(avals_row) );
err = gsl_linalg_LU_decomp( &gslA.matrix, gslP, &sign );
}
elapsed = timeNow() - t;
printf( "GSL decompose (%dx) time = %g\n", reps, elapsed );
t = timeNow();
for( int i=0; i<reps; i++ ) {
err = gsl_linalg_LU_solve( &gslA.matrix, gslP, &gslB.vector, &gslX.vector );
}
elapsed = timeNow() - t;
printf( "GSL solve (%dx) time = %g\n", reps, elapsed );
gsl_permutation_free( gslP );
}
void run_eigen( double *A, double *b, double *x, int dim, int reps ) {
Eigen::PartialPivLU<Eigen::MatrixXd> eigenA_lu;
Eigen::Map< Eigen::Matrix < double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor > > ma( A, dim, dim );
Eigen::Map<Eigen::MatrixXd> mb( b, dim, 1 );
int err;
double t, elapsed;
t = timeNow();
for( int i=0; i<reps; i++ ) {
// This memcpy is not necessary for Eigen, which does not overwrite A in the
// decompose, but do it so that the time is directly comparable to GSL.
memcpy( A, avals_col, sizeof(avals_col) );
eigenA_lu.compute( ma );
}
elapsed = timeNow() - t;
printf( "Eigen decompose (%dx) time = %g\n", reps, elapsed );
t = timeNow();
Eigen::VectorXd _x;
for( int i=0; i<reps; i++ ) {
_x = eigenA_lu.solve( mb );
}
elapsed = timeNow() - t;
printf( "Eigen solve (%dx) time = %g\n", reps, elapsed );
// copy soln to passed x
for( int i=0; i<dim; i++ ) {
x[i] = _x(i);
}
}
int main() {
// solve a 3x3 system many times
double A[9], b[3], x[3];
int dim = 3;
int reps = 1000000;
memcpy( b, bvals, sizeof(bvals) );
// init b vector, A is copied multiple times in run_gsl/run_eigen
run_eigen( A, b, x, dim, reps );
print_solution( x, dim, "Eigen" );
run_gsl( A, b, x, dim, reps );
print_solution( x, dim, "GSL" );
return 0;
}
This produces, for example:
Eigen decompose (1000000x) time = 0.242
Eigen solve (1000000x) time = 0.108
Eigen solve for x:
8 2 3
GSL decompose (1000000x) time = 0.049
GSL solve (1000000x) time = 0.075
GSL solve for x:
8 2 3
Your benchmark is not really fair as you are doing the copy of the input matrix twice in the Eigen version: once manually through memcpy, and once within PartialPivLU. You should also let Eigen know that mb is a vector by declaring it as a Map<Eigen::VectorXd>. Then I get (GCC 5, -O3, Eigen 3.3):
Eigen decompose (1000000x) time = 0.087
Eigen solve (1000000x) time = 0.036
Eigen solve for x:
8 2 3
GSL decompose (1000000x) time = 0.032
GSL solve (1000000x) time = 0.062
GSL solve for x:
8 2 3
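For reference, here is a sketch of those two changes inside run_eigen (a sketch only; since PartialPivLU::compute keeps its own copy of the matrix, the input only needs to be copied once, before the timed loop):
// Map b as a vector, not an n-by-1 dynamic matrix.
Eigen::Map<Eigen::VectorXd> mb( b, dim );

// A is not modified by PartialPivLU::compute, so one copy outside the loop suffices.
memcpy( A, avals_col, sizeof(avals_col) );
t = timeNow();
for( int i=0; i<reps; i++ ) {
    eigenA_lu.compute( ma );
}
elapsed = timeNow() - t;
printf( "Eigen decompose (%dx) time = %g\n", reps, elapsed );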
Moreover, Eigen's PartialPivLU is not really designed for such extremely tiny matrices (see below). For 3x3 matrices it is better to explicitly compute the inverse (for matrices up to 4x4 this is usually OK, but not for larger ones!). In this case you must fix the sizes at compile time:
Eigen::PartialPivLU<Eigen::Matrix3d> eigenA_lu;
Eigen::Map<Eigen::Matrix3d> ma(avals_col);
Eigen::Map<Eigen::Vector3d> mb(b);
Eigen::Matrix3d inv;
Eigen::Vector3d _x;
double t, elapsed;
t = timeNow();
for( int i=0; i<reps; i++ ) {
inv = ma.inverse();
}
elapsed = timeNow() - t;
printf( "Eigen decompose (%dx) time = %g\n", reps, elapsed );
t = timeNow();
for( int i=0; i<reps; i++ ) {
_x.noalias() = inv * mb;
}
elapsed = timeNow() - t;
printf( "Eigen solve (%dx) time = %g\n", reps, elapsed );
which gives me:
Eigen inverse and solve (1000000x) time = 0.0209999
Eigen solve (1000000x) time = 0.000999928
so much faster.
Now if we try a much larger problem, like 3000 x 3000, we get more than one order of magnitude of difference in favor of Eigen:
Eigen decompose (1x) time = 0.411
GSL decompose (1x) time = 6.073
It is precisely the optimizations that allow such performance for large problems that also introduce some overhead for very tiny matrices.

Why does this LAPACK program work correctly when I provide the matrix directly, but not when I read it from a file?

Below is LAPACK code for diagonalising a matrix A, which I provide in the form of an array a. It is only a slight modification of an official example and appears to produce correct results. It is impractical, however, because I have to provide the array a directly in the source.
#include <stdlib.h>
#include <stdio.h>
#include <fstream>
#include <vector>
/* DSYEV prototype */
extern "C"{
void dsyev( char* jobz, char* uplo, int* n, double* a, int* lda,
double* w, double* work, int* lwork, int* info );
}
/* Auxiliary routines prototypes */
extern "C"{
void print_matrix( char* desc, int m, int n, double* a, int lda );
}
/* Parameters */
#define N 5
#define LDA N
/* Main program */
int main() {
/* Locals */
int n = N, lda = LDA, info, lwork;
double wkopt;
double* work;
/* Local arrays */
double w[N];
double a[LDA*N] = {
1.96, 0.00, 0.00, 0.00, 0.00,
-6.49, 3.80, 0.00, 0.00, 0.00,
-0.47, -6.39, 4.17, 0.00, 0.00,
-7.20, 1.50, -1.51, 5.70, 0.00,
-0.65, -6.34, 2.67, 1.80, -7.10
};
/* Executable statements */
printf( " DSYEV Example Program Results\n" );
/* Query and allocate the optimal workspace */
lwork = -1;
dsyev( "Vectors", "Upper", &n, a, &lda, w, &wkopt, &lwork, &info );
lwork = (int)wkopt;
work = (double*)malloc( lwork*sizeof(double) );
/* Solve eigenproblem */
dsyev( "Vectors", "Upper", &n, a, &lda, w, work, &lwork, &info );
/* Check for convergence */
if( info > 0 ) {
printf( "The algorithm failed to compute eigenvalues.\n" );
exit( 1 );
}
/* Print eigenvalues */
print_matrix( "Eigenvalues", 1, n, w, 1 );
/* Print eigenvectors */
print_matrix( "Eigenvectors (stored columnwise)", n, n, a, lda );
/* Free workspace */
free( (void*)work );
exit( 0 );
} /* End of DSYEV Example */
/* Auxiliary routine: printing a matrix */
void print_matrix( char* desc, int m, int n, double* a, int lda ) {
int i, j;
printf( "\n %s\n", desc );
for( i = 0; i < m; i++ ) {
for( j = 0; j < n; j++ ) printf( " %6.2f", a[i+j*lda] );
printf( "\n" );
}
}
I merely want to modify the above code so that I can read the array from a file instead of providing it directly. To that end I wrote the function read_covariance, which reads the array from the file peano_covariance.data. The contents of that data file are:
1.96 0.00 0.00 0.00 0.00
-6.49 3.80 0.00 0.00 0.00
-0.47 -6.39 4.17 0.00 0.00
-7.20 1.50 -1.51 5.70 0.00
-0.65 -6.34 2.67 1.80 -7.10
Below is my attempt, which produces very incorrect eigenvalues and eigenvectors.
#include <stdlib.h>
#include <stdio.h>
#include <fstream>
#include <vector>
int read_covariance (std::vector<double> data)
{
double tmp;
std::ifstream fin("peano_covariance.data");
while(fin >> tmp)
{
data.push_back(tmp);
}
return 0;
}
/* DSYEV prototype */
extern "C"{
void dsyev( char* jobz, char* uplo, int* n, double* a, int* lda,
double* w, double* work, int* lwork, int* info );
}
/* Auxiliary routines prototypes */
extern "C"{
void print_matrix( char* desc, int m, int n, double* a, int lda );
}
/* Parameters */
#define N 5
#define LDA N
/* Main program */
int main() {
/* Locals */
std::vector<double> data;
int n = N, lda = LDA, info, lwork;
double wkopt;
double* work;
/* Local arrays */
double w[N];
double a[LDA*N];
read_covariance(data);
std::copy(data.begin(), data.end(), a);
/* Executable statements */
printf( " DSYEV Example Program Results\n" );
/* Query and allocate the optimal workspace */
lwork = -1;
dsyev( "Vectors", "Upper", &n, a, &lda, w, &wkopt, &lwork, &info );
lwork = (int)wkopt;
work = (double*)malloc( lwork*sizeof(double) );
/* Solve eigenproblem */
dsyev( "Vectors", "Upper", &n, a, &lda, w, work, &lwork, &info );
/* Check for convergence */
if( info > 0 ) {
printf( "The algorithm failed to compute eigenvalues.\n" );
exit( 1 );
}
/* Print eigenvalues */
print_matrix( "Eigenvalues", 1, n, w, 1 );
/* Print eigenvectors */
print_matrix( "Eigenvectors (stored columnwise)", n, n, a, lda );
/* Free workspace */
free( (void*)work );
exit( 0 );
} /* End of DSYEV Example */
/* Auxiliary routine: printing a matrix */
void print_matrix( char* desc, int m, int n, double* a, int lda ) {
int i, j;
printf( "\n %s\n", desc );
for( i = 0; i < m; i++ ) {
for( j = 0; j < n; j++ ) printf( " %e", a[i+j*lda] );
printf( "\n" );
}
}
Replace
int read_covariance (std::vector<double> data)
with
int read_covariance (std::vector<double> & data)
You are sending in a copy of the array rather than a reference to it. It is the temporary copy that is being filled with values. This is what bg2b is referring to in his comment.
Personally, though, I would rather write something like
std::vector<double> read_covariance (const std::string & fname)
{
std::ifstream in(fname.c_str());
double val;
std::vector<double> cov;
while(in >> val) cov.push_back(val);
return cov;
}
Even better would be to use a proper multidimensional array library rather than unwieldy 1D vectors. There's a plethora of such libraries available, and I'm not sure which is the best one (the lack of a good multidimensional array class in the C++ standard library is one of the main reasons why I often use Fortran instead), but ndarray looks interesting - it aims to mimic the features of the excellent NumPy array module for Python.
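For completeness, a minimal sketch of how that version could be plugged into the main above (the size check is only an illustration, and std::copy needs <algorithm>):
std::vector<double> data = read_covariance( "peano_covariance.data" );
if( data.size() != (size_t)( LDA*N ) ) {
    printf( "Expected %d values, got %d.\n", LDA*N, (int)data.size() );
    exit( 1 );
}
std::copy( data.begin(), data.end(), a );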

3D FFT Using Intel MKL with Zero Padding

I want to compute the 3D FFT of an array with about 300×200×200 elements using Intel MKL. This 3D array is stored as a 1D array of type double in column-major order:
for( int k = 0; k < nk; k++ ) // Loop through the height.
for( int j = 0; j < nj; j++ ) // Loop through the rows.
for( int i = 0; i < ni; i++ ) // Loop through the columns.
{
ijk = i + ni * j + ni * nj * k;
my3Darray[ ijk ] = 1.0;
}
I want to perform an out-of-place FFT on the input array so that it is not modified (I need to use it later in my code), and then do the backward transform in place. I also want to have zero padding.
My questions are:
How can I perform the zero-padding?
How should I deal with the size of the arrays used by FFT functions when zero padding is included in the computation?
How can I take out the zero padded results and get the actual result?
Here is my attempt at the problem; I would be absolutely thankful for any comment, suggestion, or hint.
#include <stdio.h>
#include "mkl.h"
int max(int a, int b, int c)
{
int m = a;
(m < b) && (m = b);
(m < c) && (m = c);
return m;
}
void FFT3D_R2C( // Real to Complex 3D FFT.
double *in, int nRowsIn , int nColsIn , int nHeightsIn ,
double *out )
{
int n = max( nRowsIn , nColsIn , nHeightsIn );
// Round up to the next highest power of 2.
unsigned int N = (unsigned int) n; // compute the next highest power of 2 of 32-bit n.
N--;
N |= N >> 1;
N |= N >> 2;
N |= N >> 4;
N |= N >> 8;
N |= N >> 16;
N++;
/* Strides describe data layout in real and conjugate-even domain. */
MKL_LONG rs[4], cs[4];
// DFTI descriptor.
DFTI_DESCRIPTOR_HANDLE fft_desc = 0;
// Variables needed for out-of-place computations.
MKL_Complex16 *in_fft = new MKL_Complex16 [ N*N*N ];
MKL_Complex16 *out_fft = new MKL_Complex16 [ N*N*N ];
double *out_ZeroPadded = new double [ N*N*N ];
/* Compute strides */
rs[3] = 1; cs[3] = 1;
rs[2] = (N/2+1)*2; cs[2] = (N/2+1);
rs[1] = N*(N/2+1)*2; cs[1] = N*(N/2+1);
rs[0] = 0; cs[0] = 0;
// Create DFTI descriptor.
MKL_LONG sizes[] = { N, N, N };
DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_REAL, 3, sizes );
// Configure DFTI descriptor.
DftiSetValue( fft_desc, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX );
DftiSetValue( fft_desc, DFTI_PLACEMENT, DFTI_NOT_INPLACE ); // Out-of-place transformation.
DftiSetValue( fft_desc, DFTI_INPUT_STRIDES , rs );
DftiSetValue( fft_desc, DFTI_OUTPUT_STRIDES , cs );
DftiCommitDescriptor( fft_desc );
DftiComputeForward ( fft_desc, in , in_fft );
// Change strides to compute backward transform.
DftiSetValue ( fft_desc, DFTI_INPUT_STRIDES , cs);
DftiSetValue ( fft_desc, DFTI_OUTPUT_STRIDES, rs);
DftiCommitDescriptor( fft_desc );
DftiComputeBackward ( fft_desc, out_fft, out_ZeroPadded );
// Printing the zero padded 3D FFT result.
for( long long i = 0; i < (long long)N*N*N; i++ )
printf("%f\n", out_ZeroPadded[i] );
/* I don't know how to take out the zero padded results and
save the actual result in the variable named "out" */
DftiFreeDescriptor ( &fft_desc );
delete[] in_fft;
delete[] out_ZeroPadded ;
}
int main()
{
int n = 10;
double *a = new double [n*n*n]; // This array is real.
double *afft = new double [n*n*n];
// Fill the array with some 'real' numbers.
for( int i = 0; i < n*n*n; i++ )
a[ i ] = 1.0;
// Calculate FFT.
FFT3D_R2C( a, n, n, n, afft );
printf("FFT results:\n");
for( int i = 0; i < n*n*n; i++ )
printf( "%15.8f\n", afft[i] );
delete[] a;
delete[] afft;
return 0;
}
Just a few hints:
Power-of-2 size
I don't like the way you are computing the padded size. Let Nx, Ny, Nz be the size of the input matrix and nx, ny, nz the size of the padded matrix:
for (nx=1;nx<Nx;nx<<=1);
for (ny=1;ny<Ny;ny<<=1);
for (nz=1;nz<Nz;nz<<=1);
Now zero-pad: memset the padded buffer to zero first and then copy the matrix line by line. Padding to N^3 instead of nx*ny*nz can cause big slowdowns if nx, ny, nz are not close to each other.
Output is complex
If I understand correctly, a is the real input matrix and afft the complex output matrix, so why not allocate the space for it correctly:
double *afft = new double [2*nx*ny*nz];
A complex number has a real and an imaginary part, so there are 2 values per number. The same applies to the final printing of the result, and some line breaks after each row would make the output easier to view.
3D DFFT
I do not use or know your DFFT library; I use my own. In any case, a 3D DFFT can be computed as a sequence of 1D DFFTs applied line by line; see this 2D DFCT by 1D DFFT. In 3D it is the same, except that you need one more pass and a different normalization constant. This way you can work with a single line buffer double lin[2*max(nx,ny,nz)]; and do the zero padding on the fly (so there is no need to hold a bigger matrix in memory), but that involves copying the lines for each 1D DFFT.
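A minimal sketch of the memset-plus-copy padding described above, assuming the column-major layout from the question (padded and the loop bounds are illustrative names; memset/memcpy need <cstring>). Extracting the unpadded result afterwards is the same loop with source and destination swapped:
// Zero-pad the Nx*Ny*Nz input into an nx*ny*nz buffer (nx,ny,nz are the
// rounded-up power-of-two sizes computed above); column-major layout,
// i.e. element (i,j,k) lives at i + lead*j + lead*height*k.
double *padded = new double [nx*ny*nz];
memset( padded, 0, sizeof(double)*nx*ny*nz );         // everything starts as zero
for( int k = 0; k < Nz; k++ )
    for( int j = 0; j < Ny; j++ )
        memcpy( &padded[ nx*(j + ny*k) ],              // destination line in the padded buffer
                &my3Darray[ Nx*(j + Ny*k) ],           // source line in the original array
                sizeof(double)*Nx );                   // one contiguous row of Nx values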

3D Convolution with Intel MKL

I have written C/C++ code which uses Intel MKL to compute the 3D convolution of an array with about 300×200×200 elements. I want to apply a kernel which is either 3×3×3 or 5×5×5. Both the 3D input array and the kernel have real values.
This 3D array is stored as a 1D array of type double in column-major order. Similarly, the kernel is of type double and is stored column-major. For example,
for( int k = 0; k < nk; k++ ) // Loop through the height.
for( int j = 0; j < nj; j++ ) // Loop through the rows.
for( int i = 0; i < ni; i++ ) // Loop through the columns.
{
ijk = i + ni * j + ni * nj * k;
my3Darray[ ijk ] = 1.0;
}
To compute the convolution, I want to perform out-of-place FFTs on the input array and the kernel so that neither gets modified (I need to use them later in my code), and then do the backward transform in place.
When I compare the result obtained from my code with the one obtained in MATLAB, they are very different. Could someone kindly help me fix the issue? What is missing in my code?
Here is the MATLAB code I used:
a = ones( 10, 10, 10 );
kernel = ones( 3, 3, 3 );
aconvolved = convn( a, kernel, 'same' );
Here is my C/C++ code:
#include <stdio.h>
#include "mkl.h"
void Conv3D(
double *in, double *ker, double *out,
int nRows, int nCols, int nHeights)
{
int NI = nRows;
int NJ = nCols;
int NK = nHeights;
double *in_fft = new double [NI*NJ*NK];
double *ker_fft = new double [NI*NJ*NK];
DFTI_DESCRIPTOR_HANDLE fft_desc = 0;
MKL_LONG sizes[] = { NK, NJ, NI };
MKL_LONG strides[] = { 0, NJ*NI, NI, 1 };
DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_REAL, 3, sizes );
DftiSetValue ( fft_desc, DFTI_PLACEMENT , DFTI_NOT_INPLACE); // Out-of-place computation.
DftiSetValue ( fft_desc, DFTI_INPUT_STRIDES , strides );
DftiSetValue ( fft_desc, DFTI_OUTPUT_STRIDES, strides );
DftiSetValue ( fft_desc, DFTI_BACKWARD_SCALE, 1/NI/NJ/NK );
DftiCommitDescriptor( fft_desc );
DftiComputeForward ( fft_desc, in , in_fft );
DftiComputeForward ( fft_desc, ker, ker_fft );
for (long long i = 0; i < (long long)NI*NJ*NK; ++i )
out[i] = in_fft[i]*ker_fft[i];
// In-place computation.
DftiSetValue ( fft_desc, DFTI_PLACEMENT, DFTI_INPLACE );
DftiCommitDescriptor( fft_desc );
DftiComputeBackward ( fft_desc, out );
DftiFreeDescriptor ( &fft_desc );
delete[] in_fft;
delete[] ker_fft;
}
int main(int argc, char* argv[])
{
int n = 10;
int nkernel = 3;
double *a = new double [n*n*n]; // This array is real.
double *aconvolved = new double [n*n*n]; // The convolved array is also real.
double *kernel = new double [nkernel*nkernel*nkernel]; // kernel is real.
// Fill the array with some 'real' numbers.
for( int i = 0; i < n*n*n; i++ )
a[ i ] = 1.0;
// Fill the kernel with some 'real' numbers.
for( int i = 0; i < nkernel*nkernel*nkernel; i++ )
kernel[ i ] = 1.0;
// Calculate the convolution.
Conv3D( a, kernel, aconvolved, n, n, n );
printf("Convolved:\n");
for( int i = 0; i < n*n*n; i++ )
printf( "%15.8f\n", aconvolved[i] );
delete[] a;
delete[] kernel;
delete[] aconvolved;
return 0;
}
You can't invert the FFT from real-valued frequency data (that is only the magnitude information). A forward FFT needs to output complex data. This is done by setting the DFTI_FORWARD_DOMAIN to DFTI_COMPLEX.
DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_COMPLEX, 3, sizes );
Doing this implicitly sets the backward domain to complex too.
You will also need a complex data type. Probably something like,
MKL_Complex16* in_fft = new MKL_Complex16[NI*NJ*NK];
This means the multiplication in the frequency domain has to be a full complex multiplication (not a separate product of the real parts and of the imaginary parts):
for (size_t i = 0; i < (size_t)NI*NJ*NK; ++i) {
out_fft[i].real = in_fft[i].real * ker_fft[i].real - in_fft[i].imag * ker_fft[i].imag;
out_fft[i].imag = in_fft[i].real * ker_fft[i].imag + in_fft[i].imag * ker_fft[i].real;
}
The output of the inverse FFT is also complex, and assuming your input data is real, you can just grab the .real component and that is your result. This means you'll need a temporary complex output array (say, out_fft as above).
Also note that to avoid artifacts, you want the size of your fft to be (at least) M+N-1 on each dimension. Generally you would choose the next highest power of two for speed.
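As a rough sketch of that sizing rule for the example in the question (10×10×10 input, 3×3×3 kernel; paddedSize is just an illustrative helper):
// Smallest power of two >= m + n - 1, the full linear convolution length per dimension.
int paddedSize( int m, int n )
{
    int len = m + n - 1;          // 10 + 3 - 1 = 12 for the example above
    int p = 1;
    while( p < len ) p <<= 1;     // -> 16
    return p;
}
// Each FFT dimension would then be paddedSize(10, 3) = 16 rather than 10.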
I strongly suggest you implement it in MATLAB first, using FFTs. There are many such implementations available (example), but I would start from the basics and make a simple function on your own.