Accelerating FFTW pruning to avoid massive zero padding

Accelerating FFTW pruning to avoid massive zero padding - c++

Suppose that I have a sequence x(n) which is K * N long and that only the first N elements are different from zero. I'm assuming that N << K, say, for example, N = 10 and K = 100000. I want to calculate the FFT, by FFTW, of such a sequence. This is equivalent to having a sequence of length N and having a zero padding to K * N. Since N and K may be "large", I have a significant zero padding. I'm exploring if I can save some computation time avoid explicit zero padding.
The case K = 2
Let us begin by considering the case K = 2. In this case, the DFT of x(n) can be written as
If k is even, namely k = 2 * m, then
which means that such values of the DFT can be calculated through an FFT of a sequence of length N, and not K * N.
If k is odd, namely k = 2 * m + 1, then
which means that such values of the DFT can be again calculated through an FFT of a sequence of length N, and not K * N.
So, in conclusion, I can exchange a single FFT of length 2 * N with 2 FFTs of length N.
The case of arbitrary K
In this case, we have
On writing k = m * K + t, we have
So, in conclusion, I can exchange a single FFT of length K * N with K FFTs of length N. Since the FFTW has fftw_plan_many_dft, I can expect to have some gaining against the case of a single FFT.
To verify that, I have set up the following code
#include <stdio.h>
#include <stdlib.h> /* srand, rand */
#include <time.h> /* time */
#include <math.h>
#include <fstream>
#include <fftw3.h>
#include "TimingCPU.h"
#define PI_d 3.141592653589793
void main() {
const int N = 10;
const int K = 100000;
fftw_plan plan_zp;
fftw_complex *h_x = (fftw_complex *)malloc(N * sizeof(fftw_complex));
fftw_complex *h_xzp = (fftw_complex *)calloc(N * K, sizeof(fftw_complex));
fftw_complex *h_xpruning = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
fftw_complex *h_xhatpruning = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
fftw_complex *h_xhatpruning_temp = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
fftw_complex *h_xhat = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
// --- Random number generation of the data sequence
srand(time(NULL));
for (int k = 0; k < N; k++) {
h_x[k][0] = (double)rand() / (double)RAND_MAX;
h_x[k][1] = (double)rand() / (double)RAND_MAX;
}
memcpy(h_xzp, h_x, N * sizeof(fftw_complex));
plan_zp = fftw_plan_dft_1d(N * K, h_xzp, h_xhat, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_plan plan_pruning = fftw_plan_many_dft(1, &N, K, h_xpruning, NULL, 1, N, h_xhatpruning_temp, NULL, 1, N, FFTW_FORWARD, FFTW_ESTIMATE);
TimingCPU timerCPU;
timerCPU.StartCounter();
fftw_execute(plan_zp);
printf("Stadard %f\n", timerCPU.GetCounter());
timerCPU.StartCounter();
double factor = -2. * PI_d / (K * N);
for (int k = 0; k < K; k++) {
double arg1 = factor * k;
for (int n = 0; n < N; n++) {
double arg = arg1 * n;
double cosarg = cos(arg);
double sinarg = sin(arg);
h_xpruning[k * N + n][0] = h_x[n][0] * cosarg - h_x[n][1] * sinarg;
h_xpruning[k * N + n][1] = h_x[n][0] * sinarg + h_x[n][1] * cosarg;
}
}
printf("Optimized first step %f\n", timerCPU.GetCounter());
timerCPU.StartCounter();
fftw_execute(plan_pruning);
printf("Optimized second step %f\n", timerCPU.GetCounter());
timerCPU.StartCounter();
for (int k = 0; k < K; k++) {
for (int p = 0; p < N; p++) {
h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
}
}
printf("Optimized third step %f\n", timerCPU.GetCounter());
double rmserror = 0., norm = 0.;
for (int n = 0; n < N; n++) {
rmserror = rmserror + (h_xhatpruning[n][0] - h_xhat[n][0]) * (h_xhatpruning[n][0] - h_xhat[n][0]) + (h_xhatpruning[n][1] - h_xhat[n][1]) * (h_xhatpruning[n][1] - h_xhat[n][1]);
norm = norm + h_xhat[n][0] * h_xhat[n][0] + h_xhat[n][1] * h_xhat[n][1];
}
printf("rmserror %f\n", 100. * sqrt(rmserror / norm));
fftw_destroy_plan(plan_zp);
}
The approach I have developed consists of three steps:
Multiplying the input sequence by "twiddle" complex exponentials;
Performing the fftw_many;
Reorganizing the results.
The fftw_many is faster than the single FFTW on K * N input points. However, steps #1 and #3 completely destroy such a gain. I would expect that steps #1 and #3 be computationally much lighter than step #2.
My questions are:
How is it possible that steps #1 and #3 a so computationally more demanding than step #2?
How can I improve steps #1 and #3 to have a net gain against the "standard" approach?
Thank you very much for any hint.
EDIT
I'm working with Visual Studio 2013 and compiling in Release mode.

Several options to run faster:
Run multi-threaded if you're only running single-threaded and have multiple cores available.
Create and save an FFTW wisdom file, especially if the FFT dimensions are known in advance. Use FFTW_EXHAUSTIVE, and reload the FFTW wisdom instead of recalculating it every time. This is also important if you want your results to be consistent. Since FFTW may compute FFTs differently with different calculated wisdom, and the wisdom results aren't necessarily going to always be the same, different runs of your process may produce different results when both are given identical input data.
If you're on x86, run 64-bit. The FFTW algorithm is extremely register-intensive, and an x86 CPU running in 64-bit mode has a lot more general-purpose registers available than it does when running in 32-bit mode.
Since the FFTW algorithm is so register intensive, I've had good success improving FFTW performance by compiling FFTW with compiler options that prevent the use of prefetching and prevent the implicit inlining of functions.

For the third step you might want to try switching the order of the loops:
for (int p = 0; p < N; p++) {
for (int k = 0; k < K; k++) {
h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
}
}
since it's generally more beneficial to have the store addresses be contiguous than the load addresses.
Either way you have a cache-unfriendly access pattern though. You could try working with blocks to improve this, e.g. assuming N is a multiple of 4:
for (int p = 0; p < N; p += 4) {
for (int k = 0; k < K; k++) {
for (int p0 = 0; p0 < 4; p0++) {
h_xhatpruning[(p + p0) * K + k][0] = h_xhatpruning_temp[(p + p0) + k * N][0];
h_xhatpruning[(p + p0) * K + k][1] = h_xhatpruning_temp[(p + p0) + k * N][1];
}
}
}
This should help to reduce the churn of cache lines somewhat. If it does then maybe also experiment with block sizes other than 4 to see if there is a "sweet spot".

Also following Paul R's comments, I have improved my code. Now, the alternative approach is faster than the standard (zero padded) one. Below is the full C++ script. For step #1 and #3, I have commented other tried solutions, which have shown to be slower or as fast as the uncommented one. I have priviledged non-nested for loops, also in view of a simpler future CUDA parallelization. I'm not yet using multi-threading for FFTW.
#include <stdio.h>
#include <stdlib.h> /* srand, rand */
#include <time.h> /* time */
#include <math.h>
#include <fstream>
#include <omp.h>
#include <fftw3.h>
#include "TimingCPU.h"
#define PI_d 3.141592653589793
/******************/
/* STEP #1 ON CPU */
/******************/
void step1CPU(fftw_complex * __restrict h_xpruning, const fftw_complex * __restrict h_x, const int N, const int K) {
// double factor = -2. * PI_d / (K * N);
// int n;
// omp_set_nested(1);
//#pragma omp parallel for private(n) num_threads(4)
// for (int k = 0; k < K; k++) {
// double arg1 = factor * k;
//#pragma omp parallel for num_threads(4)
// for (n = 0; n < N; n++) {
// double arg = arg1 * n;
// double cosarg = cos(arg);
// double sinarg = sin(arg);
// h_xpruning[k * N + n][0] = h_x[n][0] * cosarg - h_x[n][1] * sinarg;
// h_xpruning[k * N + n][1] = h_x[n][0] * sinarg + h_x[n][1] * cosarg;
// }
// }
//double factor = -2. * PI_d / (K * N);
//int k;
//omp_set_nested(1);
//#pragma omp parallel for private(k) num_threads(4)
//for (int n = 0; n < N; n++) {
// double arg1 = factor * n;
// #pragma omp parallel for num_threads(4)
// for (k = 0; k < K; k++) {
// double arg = arg1 * k;
// double cosarg = cos(arg);
// double sinarg = sin(arg);
// h_xpruning[k * N + n][0] = h_x[n][0] * cosarg - h_x[n][1] * sinarg;
// h_xpruning[k * N + n][1] = h_x[n][0] * sinarg + h_x[n][1] * cosarg;
// }
//}
//double factor = -2. * PI_d / (K * N);
//for (int k = 0; k < K; k++) {
// double arg1 = factor * k;
// for (int n = 0; n < N; n++) {
// double arg = arg1 * n;
// double cosarg = cos(arg);
// double sinarg = sin(arg);
// h_xpruning[k * N + n][0] = h_x[n][0] * cosarg - h_x[n][1] * sinarg;
// h_xpruning[k * N + n][1] = h_x[n][0] * sinarg + h_x[n][1] * cosarg;
// }
//}
//double factor = -2. * PI_d / (K * N);
//for (int n = 0; n < N; n++) {
// double arg1 = factor * n;
// for (int k = 0; k < K; k++) {
// double arg = arg1 * k;
// double cosarg = cos(arg);
// double sinarg = sin(arg);
// h_xpruning[k * N + n][0] = h_x[n][0] * cosarg - h_x[n][1] * sinarg;
// h_xpruning[k * N + n][1] = h_x[n][0] * sinarg + h_x[n][1] * cosarg;
// }
//}
double factor = -2. * PI_d / (K * N);
#pragma omp parallel for num_threads(8)
for (int n = 0; n < K * N; n++) {
int row = n / N;
int col = n % N;
double arg = factor * row * col;
double cosarg = cos(arg);
double sinarg = sin(arg);
h_xpruning[n][0] = h_x[col][0] * cosarg - h_x[col][1] * sinarg;
h_xpruning[n][1] = h_x[col][0] * sinarg + h_x[col][1] * cosarg;
}
}
/******************/
/* STEP #3 ON CPU */
/******************/
void step3CPU(fftw_complex * __restrict h_xhatpruning, const fftw_complex * __restrict h_xhatpruning_temp, const int N, const int K) {
//int k;
//omp_set_nested(1);
//#pragma omp parallel for private(k) num_threads(4)
//for (int p = 0; p < N; p++) {
// #pragma omp parallel for num_threads(4)
// for (k = 0; k < K; k++) {
// h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
// h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
// }
//}
//int p;
//omp_set_nested(1);
//#pragma omp parallel for private(p) num_threads(4)
//for (int k = 0; k < K; k++) {
// #pragma omp parallel for num_threads(4)
// for (p = 0; p < N; p++) {
// h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
// h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
// }
//}
//for (int p = 0; p < N; p++) {
// for (int k = 0; k < K; k++) {
// h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
// h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
// }
//}
//for (int k = 0; k < K; k++) {
// for (int p = 0; p < N; p++) {
// h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
// h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
// }
//}
#pragma omp parallel for num_threads(8)
for (int p = 0; p < K * N; p++) {
int col = p % N;
int row = p / K;
h_xhatpruning[col * K + row][0] = h_xhatpruning_temp[col + row * N][0];
h_xhatpruning[col * K + row][1] = h_xhatpruning_temp[col + row * N][1];
}
//for (int p = 0; p < N; p += 2) {
// for (int k = 0; k < K; k++) {
// for (int p0 = 0; p0 < 2; p0++) {
// h_xhatpruning[(p + p0) * K + k][0] = h_xhatpruning_temp[(p + p0) + k * N][0];
// h_xhatpruning[(p + p0) * K + k][1] = h_xhatpruning_temp[(p + p0) + k * N][1];
// }
// }
//}
}
/********/
/* MAIN */
/********/
void main() {
int N = 10;
int K = 100000;
// --- CPU memory allocations
fftw_complex *h_x = (fftw_complex *)malloc(N * sizeof(fftw_complex));
fftw_complex *h_xzp = (fftw_complex *)calloc(N * K, sizeof(fftw_complex));
fftw_complex *h_xpruning = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
fftw_complex *h_xhatpruning = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
fftw_complex *h_xhatpruning_temp = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
fftw_complex *h_xhat = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
//double2 *h_xhatGPU = (double2 *)malloc(N * K * sizeof(double2));
// --- Random number generation of the data sequence on the CPU - moving the data from CPU to GPU
srand(time(NULL));
for (int k = 0; k < N; k++) {
h_x[k][0] = (double)rand() / (double)RAND_MAX;
h_x[k][1] = (double)rand() / (double)RAND_MAX;
}
//gpuErrchk(cudaMemcpy(d_x, h_x, N * sizeof(double2), cudaMemcpyHostToDevice));
memcpy(h_xzp, h_x, N * sizeof(fftw_complex));
// --- FFTW and cuFFT plans
fftw_plan h_plan_zp = fftw_plan_dft_1d(N * K, h_xzp, h_xhat, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_plan h_plan_pruning = fftw_plan_many_dft(1, &N, K, h_xpruning, NULL, 1, N, h_xhatpruning_temp, NULL, 1, N, FFTW_FORWARD, FFTW_ESTIMATE);
double totalTimeCPU = 0., totalTimeGPU = 0.;
double partialTimeCPU, partialTimeGPU;
/****************************/
/* STANDARD APPROACH ON CPU */
/****************************/
printf("Number of processors available = %i\n", omp_get_num_procs());
printf("Number of threads = %i\n", omp_get_max_threads());
TimingCPU timerCPU;
timerCPU.StartCounter();
fftw_execute(h_plan_zp);
printf("\nStadard on CPU: \t \t %f\n", timerCPU.GetCounter());
/******************/
/* STEP #1 ON CPU */
/******************/
timerCPU.StartCounter();
step1CPU(h_xpruning, h_x, N, K);
partialTimeCPU = timerCPU.GetCounter();
totalTimeCPU = totalTimeCPU + partialTimeCPU;
printf("\nOptimized first step CPU: \t %f\n", totalTimeCPU);
/******************/
/* STEP #2 ON CPU */
/******************/
timerCPU.StartCounter();
fftw_execute(h_plan_pruning);
partialTimeCPU = timerCPU.GetCounter();
totalTimeCPU = totalTimeCPU + partialTimeCPU;
printf("Optimized second step CPU: \t %f\n", timerCPU.GetCounter());
/******************/
/* STEP #3 ON CPU */
/******************/
timerCPU.StartCounter();
step3CPU(h_xhatpruning, h_xhatpruning_temp, N, K);
partialTimeCPU = timerCPU.GetCounter();
totalTimeCPU = totalTimeCPU + partialTimeCPU;
printf("Optimized third step CPU: \t %f\n", partialTimeCPU);
printf("Total time CPU: \t \t %f\n", totalTimeCPU);
double rmserror = 0., norm = 0.;
for (int n = 0; n < N; n++) {
rmserror = rmserror + (h_xhatpruning[n][0] - h_xhat[n][0]) * (h_xhatpruning[n][0] - h_xhat[n][0]) + (h_xhatpruning[n][1] - h_xhat[n][1]) * (h_xhatpruning[n][1] - h_xhat[n][1]);
norm = norm + h_xhat[n][0] * h_xhat[n][0] + h_xhat[n][1] * h_xhat[n][1];
}
printf("\nrmserror %f\n", 100. * sqrt(rmserror / norm));
fftw_destroy_plan(h_plan_zp);
}
For the case
N = 10
K = 100000
my timing is the following
Stadard on CPU: 23.895417
Optimized first step CPU: 4.472087
Optimized second step CPU: 4.926603
Optimized third step CPU: 2.394958
Total time CPU: 11.793648

Related

How to make my Cuda kernel run on bigger matrixes?

So my code is suppsed to work like this:
-take in_martix of NxN elements and R factor
-it should give back a matrix of size [N-2R]*[N-2R] with each element being a sum of in_matrix elements in R radius it should work like this for N=4 R=1
Even though my code works for smaller matrixes, for bigger ones like 1024 or 2048 or even bigger R factors it gives back a matrix of 0's. Is it a problem inside my code or just my GPU can't compute more calculations ?
Code: (for testing purposes initial matrix is filled with 1's so every element of out_matrix should == (2R+1)^2
#include "cuda_runtime.h"
#include <stdio.h>
#include <iostream>
#include <cuda_profiler_api.h>
#define N 1024
#define R 128
#define K 1
#define THREAD_BLOCK_SIZE 8
using namespace std;
__global__ void MatrixStencil(int* d_tab_begin, int* d_out_begin, int d_N, int d_R, int d_K) {
int tx = threadIdx.x + blockIdx.x * blockDim.x;
int ty = threadIdx.y + blockIdx.y * blockDim.y;
int out_local = 0;
for (int col = tx; col <= tx + 2 * d_R ; col++)
for (int row = ty; row <= ty + 2 * d_R ; row++)
out_local += *(d_tab_begin + col * d_N + row);
*(d_out_begin + (tx) * (d_N - 2 * R) + ty) = out_local;
}
void random_ints(int tab[N][N]) {
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
tab[i][j] = 1;
}
int main() {
static int tab[N][N];
random_ints(tab);
int tab_size = sizeof(int) * N * N;
int out_size = sizeof(int) * (N - 2 * R) * (N - 2 * R);
dim3 BLOCK_SIZE(THREAD_BLOCK_SIZE, THREAD_BLOCK_SIZE);
dim3 GRID_SIZE(ceil((float)N / (float)(THREAD_BLOCK_SIZE )), ceil((float)N / (float)(THREAD_BLOCK_SIZE )));
void** d_tab;
void** d_out;
cudaMalloc((void**)&d_tab, tab_size);
cudaMalloc((void**)&d_out, out_size);
cudaMemcpyAsync(d_tab, tab, tab_size, cudaMemcpyHostToDevice);
int* d_tab_begin = (int*)(d_tab);
int* d_out_begin = (int*)(d_out);
MatrixStencil << < GRID_SIZE, BLOCK_SIZE>> > (d_tab_begin, d_out_begin, N, R, K);
int* out = (int*)malloc(out_size);
cudaMemcpyAsync(out, d_out, out_size, cudaMemcpyDeviceToHost);
cudaThreadSynchronize();
for (int col = 0; col < N - 2 * R; col++)
{
for (int row = 0; row < N - 2 * R; row++)
{
cout << *(out + ((col * (N - 2 * R)) + row)) << " ";
}
cout << endl;
}
}

Finally thanks to Robert I found how to make the code work - by adding if statment
if ((tx < d_N - 2 * d_R) && (ty < d_N - 2 * d_R)) {
for (int col = tx; col <= tx + 2 * d_R; col++)
for (int row = ty; row <= ty + 2 * d_R; row++)
out_local += *(d_tab_begin + col * d_N + row);
*(d_out_begin + (tx) * (d_N - 2 * R) + ty) = out_local;
}

Partial Derivative of Real Data Using Fourier Transform(FFTW) in 2D Gives Not Giving Correct Results

I am trying to calculate the partial derivative of the function Sin(x)Sin(y) along the x-direction using Fourier Transform (FFTW library in C++). I take my function from real to Fourier space (fftw_plan_dft_r2c_2d), multiply the result by -i * kappa array (wavenumber/spatial frequency), and then transform the result back into real space (fftw_plan_dft_c2r_2d). Finally, I divide the result by the size of my 2D array Nx*Ny (I need to do this based on the documentation). However, the resulting function is scaled up by some value and the amplitude is not 1.
'''
#define Nx 360
#define Ny 360
#define REAL 0
#define IMAG 1
#define Pi 3.14
#include <stdio.h>
#include <math.h>
#include <fftw3.h>
#include <iostream>
#include <iostream>
#include <fstream>
#include <iomanip>
int main() {
int i, j;
int Nyh = Ny / 2 + 1;
double k1;
double k2;
double *signal;
double *result2;
double *kappa1;
double *kappa2;
double df;
fftw_complex *result1;
fftw_complex *dfhat;
signal = (double*)fftw_malloc(sizeof(double) * Nx * Ny );
result2 = (double*)fftw_malloc(sizeof(double) * Nx * Ny );
kappa1 = (double*)fftw_malloc(sizeof(double) * Nx * Nyh );
result1 = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * Nx * Nyh);
dfhat = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * Nx * Nyh);
for (int i = 0; i < Nx; i++){
if ( i < Nx/2 + 1)
k1 = 2 * Pi * (-Nx + i) / Nx;
else
k1 = 2 * Pi * i / Nx;
for (int j = 0; j < Nyh; j++){
kappa1[j + Nyh * i] = k1;
}
}
fftw_plan plan1 = fftw_plan_dft_r2c_2d(Nx,Ny,signal,result1,FFTW_ESTIMATE);
fftw_plan plan2 = fftw_plan_dft_c2r_2d(Nx,Ny,dfhat,result2,FFTW_ESTIMATE);
for (int i=0; i < Nx; i++){
for (int j=0; j < Ny; j++){
double x = (double)i / 180 * (double) Pi;
double y = (double)j / 180 * (double) Pi;
signal[j + Ny * i] = sin(x) * sin(y);
}
}
fftw_execute(plan1);
for (int i = 0; i < Nx; i++) {
for (int j = 0; j < Nyh; j++) {
dfhat[j + Nyh * i][REAL] = 0;
dfhat[j + Nyh * i][IMAG] = 0;
dfhat[j + Nyh * i][IMAG] = -result1[j + Nyh * i][REAL] * kappa1[j + Nyh * i];
dfhat[j + Nyh * i][REAL] = result1[j + Nyh * i][IMAG] * kappa1[j + Nyh * i];
}
}
fftw_execute(plan2);
for (int i = 0; i < Nx; i++) {
for (int j = 0; j < Ny; j++) {
result2[j + Ny * i] /= (Nx*Ny);
}
}
for (int j = 0; j < Ny; j++) {
for (int i = 0; i < Nx; i++) {
std::cout <<std::setw(15)<< result2[j + Ny * i];
}
std::cout << std::endl;
}
fftw_destroy_plan(plan1);
fftw_destroy_plan(plan2);
fftw_free(result1);
fftw_free(dfhat);
return 0;
}
'''
This is how the result looks
I have defined my kappa array to have Nx * Ny/2+1 elements and have split the array at the center (as all references suggest). But it does not work. I would appreciate any help!

fftw in C++ gets slower for power of 2?

I'm working with the fftw library in C++. I know that the calculation of the fft is most efficient for powers of 2, but I created a minimal example of a two-dimensional fft and I get a different result. The 2d-fft with no power of 2 is calculated much faster than the other one. Here is my code:
int N = 2083;
int M = 2087;
int Npow2 = pow(2, ceil(log2(N)));
int Mpow2 = pow(2, ceil(log2(M)));
fftw_complex * signala = (fftw_complex *)fftw_malloc(sizeof(fftw_complex)* N * M);
for (int i = 0; i < N; i++)
{
for (int j = 0; j < M; j++)
{
signala[i*M + j][0] = rand();
signala[i*M + j][0] = 0;
}
}
fftw_complex * signala_ext = (fftw_complex *)fftw_malloc(sizeof(fftw_complex)* Npow2 * Mpow2);
fftw_complex * outa = (fftw_complex *)fftw_malloc(sizeof(fftw_complex)* N * M);
fftw_complex * outaext = (fftw_complex *)fftw_malloc(sizeof(fftw_complex)* Npow2 * Mpow2);
//Create Plans
fftw_plan pa = fftw_plan_dft_2d(N, M, signala, outa, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_plan paext = fftw_plan_dft_2d(Npow2, Mpow2, signala_ext, outaext, FFTW_FORWARD, FFTW_ESTIMATE);
//zeropadding
memset(signala_ext, 0, sizeof(fftw_complex)* Npow2 * Mpow2); //Null setzen
for (int i = 0; i < N; i++)
{
for (int j = 0; j < M; j++)
{
signala_ext[i*Mpow2 + j][0] = signala[i*M + j][0];
signala_ext[i*Mpow2 + j][1] = signala[i*M + j][1];
}
}
//Execute FFT
double tstart1 = clock();
fftw_execute(pa);
double time1 = (clock() - tstart1) / CLOCKS_PER_SEC;
printf("Time: %f sec\n", time1);
double tstart2 = clock();
fftw_execute(paext);
double time2 = (clock() - tstart2) / CLOCKS_PER_SEC;
printf("Time: %f sec\n", time2);
I choosed prime numbers for N and M. My programms returns:
For signala (non-power-of-2): 2.95 sec
For signala_ext (power-of-2): 5.232 sec
Why is the fft with power of 2 so much slower? What have I done wrong?
I will be thankful for any help!

FFTW likes dimensions which are products of powers of small primes. The nearest value above 2083 or 2087 which meets this criterion is 2100 (2100 = 22 * 3 * 52 * 7), so if you go for dimensions of 2100 x 2100 then you should see decent performance.

Optimize log entropy calculation in sparse matrix

I have a 3007 x 1644 dimensional matrix of terms and documents. I am trying to assign weights to frequency of terms in each document so I'm using this log entropy formula http://en.wikipedia.org/wiki/Latent_semantic_indexing#Term_Document_Matrix (See entropy formula in the last row).
I'm successfully doing this but my code is running for >7 minutes.
Here's the code:
int N = mat.cols();
for(int i=1;i<=mat.rows();i++){
double gfi = sum(mat(i,colon()))(1,1); //sum of occurrence of terms
double g =0;
if(gfi != 0){// to avoid divide by zero error
for(int j = 1;j<=N;j++){
double tfij = mat(i,j);
double pij = gfi==0?0.0:tfij/gfi;
pij = pij + 1; //avoid log0
double G = (pij * log(pij))/log(N);
g = g + G;
}
}
double gi = 1 - g;
for(int j=1;j<=N;j++){
double tfij = mat(i,j) + 1;//avoid log0
double aij = gi * log(tfij);
mat(i,j) = aij;
}
}
Anyone have ideas how I can optimize this to make it faster? Oh and mat is a RealSparseMatrix from amlpp matrix library.
UPDATE
Code runs on Linux mint with 4gb RAM and AMD Athlon II dual core
Running time before change: > 7mins
After #Kereks answer: 4.1sec

Here's a very naive rewrite that removes some redundancies:
int const N = mat.cols();
double const logN = log(N);
for (int i = 1; i <= mat.rows(); ++i)
{
double const gfi = sum(mat(i, colon()))(1, 1); // sum of occurrence of terms
double g = 0;
if (gfi != 0)
{
for (int j = 1; j <= N; ++j)
{
double const pij = mat(i, j) / gfi + 1;
g += pij * log(pij);
}
g /= logN;
}
for (int j = 1; j <= N; ++j)
{
mat(i,j) = (1 - g) * log(mat(i, j) + 1);
}
}
Also make sure that the matrix data structure is sane (e.g. a flat array accessed in strides; not a bunch of dynamically allocated rows).
Also, I think the first + 1 is a bit silly. You know that x -> x * log(x) is continuous at zero with limit zero, so you should write:
double const pij = mat(i, j) / gfi;
if (pij != 0) { g += pij + log(pij); }
In fact, you might even write the first inner for loop like this, avoiding a division when it isn't needed:
for (int j = 1; j <= N; ++j)
{
if (double pij = mat(i, j))
{
pij /= gfi;
g += pij * log(pij);
}
}

DFT with Frequency Range

We need to change/reimplement standard DFT implementation in GSL, which is
int
FUNCTION(gsl_dft_complex,transform) (const BASE data[],
const size_t stride, const size_t n,
BASE result[],
const gsl_fft_direction sign)
{
size_t i, j, exponent;
const double d_theta = 2.0 * ((int) sign) * M_PI / (double) n;
/* FIXME: check that input length == output length and give error */
for (i = 0; i < n; i++)
{
ATOMIC sum_real = 0;
ATOMIC sum_imag = 0;
exponent = 0;
for (j = 0; j < n; j++)
{
double theta = d_theta * (double) exponent;
/* sum = exp(i theta) * data[j] */
ATOMIC w_real = (ATOMIC) cos (theta);
ATOMIC w_imag = (ATOMIC) sin (theta);
ATOMIC data_real = REAL(data,stride,j);
ATOMIC data_imag = IMAG(data,stride,j);
sum_real += w_real * data_real - w_imag * data_imag;
sum_imag += w_real * data_imag + w_imag * data_real;
exponent = (exponent + i) % n;
}
REAL(result,stride,i) = sum_real;
IMAG(result,stride,i) = sum_imag;
}
return 0;
}
In this implementation, GSL iterates over input vector twice for sample/input size. However, we need to construct for different frequency bins. For instance, we have 4096 samples, but we need to calculate DFT for 128 different frequencies. Could you help me to define or implement required DFT behaviour? Thanks in advance.
EDIT: We do not search for first m frequencies.
Actually, is below approach correct for finding DFT result with given frequency bin number?
N = sample size
B = frequency bin size
k = 0,...,127 X[k] = SUM(0,N){x[i]*exp(-j*2*pi*k*i/B)}
EDIT: I might have not explained the problem for DFT elaborately, nevertheless, I am happy to provide the answer below:
void compute_dft(const std::vector<double>& signal,
const std::vector<double>& frequency_band,
std::vector<double>& result,
const double sampling_rate)
{
if(0 == result.size() || result.size() != (frequency_band.size() << 1)){
result.resize(frequency_band.size() << 1, 0.0);
}
//note complex signal assumption
const double d_theta = -2.0 * PI * sampling_rate;
for(size_t k = 0; k < frequency_band.size(); ++k){
const double f_k = frequency_band[k];
double real_sum = 0.0;
double imag_sum = 0.0;
for(size_t n = 0; n < (signal.size() >> 1); ++n){
double theta = d_theta * f_k * (n + 1);
double w_real = cos(theta);
double w_imag = sin(theta);
double d_real = signal[2*n];
double d_imag = signal[2*n + 1];
real_sum += w_real * d_real - w_imag * d_imag;
imag_sum += w_real * d_imag + w_imag * d_real;
}
result[2*k] = real_sum;
result[2*k + 1] = imag_sum;
}
}

Assuming you just want the the first m output frequencies:
int
FUNCTION(gsl_dft_complex,transform) (const BASE data[],
const size_t stride,
const size_t n, // input size
const size_t m, // output size (m <= n)
BASE result[],
const gsl_fft_direction sign)
{
size_t i, j, exponent;
const double d_theta = 2.0 * ((int) sign) * M_PI / (double) n;
/* FIXME: check that m <= n and give error */
for (i = 0; i < m; i++) // for each of m output bins
{
ATOMIC sum_real = 0;
ATOMIC sum_imag = 0;
exponent = 0;
for (j = 0; j < n; j++) // for each of n input points
{
double theta = d_theta * (double) exponent;
/* sum = exp(i theta) * data[j] */
ATOMIC w_real = (ATOMIC) cos (theta);
ATOMIC w_imag = (ATOMIC) sin (theta);
ATOMIC data_real = REAL(data,stride,j);
ATOMIC data_imag = IMAG(data,stride,j);
sum_real += w_real * data_real - w_imag * data_imag;
sum_imag += w_real * data_imag + w_imag * data_real;
exponent = (exponent + i) % n;
}
REAL(result,stride,i) = sum_real;
IMAG(result,stride,i) = sum_imag;
}
return 0;
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Accelerating FFTW pruning to avoid massive zero padding - c++

Related

How to make my Cuda kernel run on bigger matrixes?

Partial Derivative of Real Data Using Fourier Transform(FFTW) in 2D Gives Not Giving Correct Results

fftw in C++ gets slower for power of 2?

Optimize log entropy calculation in sparse matrix

DFT with Frequency Range

Categories

Resources