I am parallelizing the execution of the following loop on a CUDA GPU:
// define m, lp, N
for(int i=0; i<N; ++i){
float p, s;
int q;
s = m + sqrt( ARR1[ ARR2[i] ] )*ARR3[i];
if ( ARR4[2*i] <= ARR10[i] ){
if ( s > 0){
p = lp*s;
q = floor( ARR4[2*i+1]*ARR5[i]/p );
} else{
p = -lp/s;
q = -floor( ARR4[2*i+1]*ARR6[i] );
}
} else{
if ( s > 0){
p = lp/s;
q = -floor( ARR4[2*i+1]*ARR6[i] );
} else{
p = -lp*s;
q = floor( ARR4[2*i+1]*ARR5[i]/p );
}
}
if ( q != 0){
ARR7[i] = p;
ARR8[i] = q;
} else{
ARR7[i] = 0;
ARR8[i] = 0;
}
ARR9[i] = i;
}
I would like to evaluate its arithmetic intensity. m and lp are defined outside of the loop.
I count 11 memory operations: ARR2[i], ARR1[ARR2[i]], ARR3[i], ARR4[2*i], ARR4[2*i+1], ARR5[i], ARR6[i], ARR7[i], ARR8[i], ARR9[i], ARR10[i],
... and 9 floating-point operations (counting floor and sqrt as one FLOP each): m + sqrt( ARR1[ ARR2[i] ] )*ARR3[i] (3), p = lp*s or variations (1), q = floor( ARR4[2*i+1]*ARR5[i]/p ) or variations (5, including 2 for index calculation).
Since all array elements are 4 bytes long, this gives me an arithmetic intensity of 9/(4*11) = 0.2045. Is this correct? Am I counting memory and arithmetic operations correctly? In particular, I'm unsure whether the index calculation 2*i+1 should count towards the FLOP count, and whether the scalar values m and lp should count towards the data movement count (or whether they are kept in registers and therefore do not count; see the AXPY example on p. 16 here).
Consider that I need an n-sized vector where each element lies in [-1,1]. The element a[i] is a float generated by -1 + 2*rand(). I need an elegant way to ensure that the sum of the elements of my array is equal to zero.
I've found two possible solutions:
The first one is this MATLAB function: https://www.mathworks.com/matlabcentral/fileexchange/9700-random-vectors-with-fixed-sum. It also has an implementation in R; however, it would be too much work to implement it in C, since this function is used for a 2D array.
The second one is provided in this thread: Generate random values with fixed sum in C++. Essentially, the idea is to generate n numbers with a normal distribution and then normalize them to my sum. (I have implemented it in Python below for a vector that sums to 1.0.) It works for every sum value except zero.
import random as rd
mySum = 1
randomVector = []
randomSum = 0
for i in range(7):
    randomNumber = -1 + 2*rd.random()
    randomVector.append(randomNumber)
    randomSum += randomNumber
coef = mySum/randomSum
myNewList = [j * coef for j in randomVector]
newsum = sum(myNewList)
So, is there a way to do that using C or C++? If you know of an already implemented function, that would be awesome. Thanks.
I figured out a solution to your problem. This is not perfect since its randomness is limited by the range requirement.
The strategy is:
Define a function able to generate a random float in a customizable range. No need to reinvent the wheel: I borrowed it from https://stackoverflow.com/a/44105089/11336762
Malloc the array (I omit the pointer check in my example) and initialize the seed. In my example I just used the current time, but this can be improved.
For every element to be generated, pre-calculate the random range. Given the i-th sum, make sure that the next sum is NEVER out of range: if the sum is positive, the range needs to be (-1,1-sum); if it is negative, the range needs to be (-1-sum,1).
Do this up to the (n-1)th element. The last element must be directly assigned as the sum with the sign changed.
#include<stdio.h>
#include<stdlib.h>
#include<time.h>
float float_rand( float min, float max )
{
float scale = rand() / (float) RAND_MAX; /* [0, 1.0] */
return min + scale * ( max - min ); /* [min, max] */
}
int main( int argc, char *argv[] )
{
if( argc == 2 )
{
int i, n = atoi ( argv[1] );
float *outArr = malloc( n * sizeof( float ) );
float sum = 0;
printf( "Input value: %d\n\n", n );
/* Initialize seed */
srand ( time( NULL ) );
for( i=0; i<n-1; i++ )
{
/* Limit random generation range in order to make sure the next sum is *
* not outside (-1,1) range. */
float min = (sum<0? -1-sum : -1);
float max = (sum>0? 1-sum : 1);
outArr[i] = float_rand( min, max );
sum += outArr[i];
}
/* Set last array element */
outArr[n-1] = -sum;
/* Print results */
sum=0;
for( i=0; i<n; i++ )
{
sum += outArr[i];
printf( " outArr[%d]=%f \t(sum=%f)\n", i, outArr[i], sum );
}
free( outArr );
}
else
{
printf( "Only a parameter allowed (integer N)\n" );
}
}
I tried it, and it works also when n=1. In case of n=0 a sanity check should be added to my example.
Some output examples:
N=1:
Input value: 1
outArr[0]=-0.000000 (sum=-0.000000)
N=4
Input value: 4
outArr[0]=-0.804071 (sum=-0.804071)
outArr[1]=0.810685 (sum=0.006614)
outArr[2]=-0.353444 (sum=-0.346830)
outArr[3]=0.346830 (sum=0.000000)
N=8:
Input value: 8
outArr[0]=-0.791314 (sum=-0.791314)
outArr[1]=0.800182 (sum=0.008867)
outArr[2]=-0.571293 (sum=-0.562426)
outArr[3]=0.293300 (sum=-0.269126)
outArr[4]=-0.082886 (sum=-0.352012)
outArr[5]=0.818639 (sum=0.466628)
outArr[6]=-0.301473 (sum=0.165155)
outArr[7]=-0.165155 (sum=0.000000)
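The strategy above, including the sanity check for n=0 that the answer mentions, could be sketched in C++ roughly as follows. This is not the answer's code: the function name `zeroSumVector` is hypothetical, and `std::mt19937` with `std::uniform_real_distribution` is swapped in for `rand()` as an assumption about what a C++ caller would prefer.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Returns n values whose running sum is kept inside [-1, 1]; the last value
// cancels the sum of all the others. Returns an empty vector for n == 0.
std::vector<double> zeroSumVector(std::size_t n, std::mt19937& gen) {
    std::vector<double> out;
    if (n == 0) return out;  // sanity check for the n == 0 case
    out.reserve(n);
    double sum = 0.0;
    for (std::size_t i = 0; i + 1 < n; ++i) {
        // Shrink the range so that the next partial sum cannot leave [-1, 1].
        const double lo = (sum < 0.0) ? -1.0 - sum : -1.0;
        const double hi = (sum > 0.0) ? 1.0 - sum : 1.0;
        std::uniform_real_distribution<double> dist(lo, hi);
        const double value = dist(gen);
        out.push_back(value);
        sum += value;
    }
    out.push_back(-sum);  // last element: the accumulated sum with the sign changed
    return out;
}
```

A caller would seed the generator once, for example `std::mt19937 gen(std::random_device{}());`, and reuse it across calls. As with the C version, later elements depend on earlier ones, so the values are not independently distributed.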
Thank you guys again for the help.
So, based on the idea of Cryostasys I developed the following C code to solve my problem:
#include <stdio.h> /* printf, scanf, puts, NULL */
#include <stdlib.h> /* srand, rand */
#include <time.h> /* time */
#include <math.h>
int main()
{
int arraySize = 10; //input value
double createdArray[arraySize]; //output value
double randomPositiveVector[arraySize];
double randomNegativeVector[arraySize];
double positiveSum = 0.;
double negativeSum = 0.;
srand(time(NULL)); //seed for random generation
for(int i = 0; i < arraySize; ++i)
{
double randomNumber = -1.+2.*rand()/((double) RAND_MAX); //random in [-1.0,1.0]
printf("%f\n",randomNumber);
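if(randomNumber >=0)
{
randomPositiveVector[i] = randomNumber;
randomNegativeVector[i] = 0.; //keep the unused slot defined, it is read in the next loop
positiveSum += randomNumber;
}
else
{
randomNegativeVector[i] = randomNumber;
randomPositiveVector[i] = 0.; //keep the unused slot defined, it is read in the next loop
negativeSum += randomNumber;
}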
}
if(positiveSum == 0. || negativeSum == 0.) { printf("ERROR\n"); return 1; } //stop here to avoid dividing by zero below
double positiveCoefficient = 1.0/positiveSum;
double negativeCoefficient = -1.0/negativeSum;
for(int i = 0; i < arraySize; ++i)
{
randomPositiveVector[i] = positiveCoefficient * randomPositiveVector[i];
randomNegativeVector[i] = negativeCoefficient * randomNegativeVector[i];
if(fabs(randomPositiveVector[i]) > 1e-6) //near to zero
{
createdArray[i] = randomPositiveVector[i];
}
else
{
createdArray[i] = randomNegativeVector[i];
}
}
for(int i = 0; i < arraySize; ++i)
{
printf("createdArray[%d] = %9f\n",i,createdArray[i]);
}
return(0);
}
Please note that the randomness of the generated values is reduced, as mentioned in the comments on the question. Also, the kind of random distribution is determined by the function used to generate randomNumber above. In this case, I've used rand() from stdlib.h, which generates pseudo-random numbers from a seed. You could use a different option, for instance drand48(), also from stdlib.h.
Nevertheless, at least one positive and one negative value must be generated for this code to work. A verification step was added to the code; if that condition is hit, one should run the code again or handle it in some other way.
Output example (arraySize = 10):
createdArray[0] = -0.013824
createdArray[1] = 0.359639
createdArray[2] = -0.005851
createdArray[3] = 0.126829
createdArray[4] = -0.334745
createdArray[5] = -0.473096
createdArray[6] = -0.172484
createdArray[7] = 0.249523
createdArray[8] = 0.262370
createdArray[9] = 0.001640
One option is to generate some samples and then scale their values around the average. In C++ it would be something like the following
#include <iostream>
#include <iomanip>
#include <random>
#include <algorithm>
#include <cmath>
#include <vector>
int main()
{
std::random_device rd;
std::seed_seq ss{rd(), rd(), rd(), rd()};
std::mt19937 gen{ss};
const int samples = 9;
// Generates the samples in [0, 2]
std::uniform_real_distribution dist(0.0, std::nextafter(2.0, 3.0));
std::vector<double> nums(samples);
double sum = 0.0;
for ( auto & i : nums )
{
i = dist(gen);
sum += i;
}
double average = sum / samples;
double k = 1.0 / std::max(average, 2.0 - average);
// Transform the values (apart from the last) to meet the requirements
sum = 0.0;
for ( size_t i = 0; i < nums.size() - 1; ++i )
{
nums[i] = (nums[i] - average) * k;
sum += nums[i];
};
// This trick (to ensure the needed precision) only works if the sum
// is always evaluated in the same order
nums.back() = 0.0 - sum;
sum = 0.0;
for ( size_t i = 0; i < nums.size(); ++i )
{
sum += nums[i];
std::cout << std::setw(10) << std::fixed << nums[i] << '\n';
}
if (sum != 0.0)
std::cout << "Failed.\n";
}
Testable here.
The FFT works fine, but when I take the IFFT I always see the same graph in its results. The results are complex, and the graph is always the same regardless of the original signal:
the real part is a -sin with period = frame size
the imaginary part is a -cos with the same period
Where could the problem be?
original signal:
IFFT real value (the pictures show only half of the frame):
The FFT algorithm that I use:
double** FFT(double** f, int s, bool inverse) {
if (s == 1) return f;
int sH = s / 2;
double** fOdd = new double*[sH];
double** fEven = new double*[sH];
for (int i = 0; i < sH; i++) {
int j = 2 * i;
fOdd[i] = f[j];
fEven[i] = f[j + 1];
}
double** sOdd = FFT(fOdd, sH, inverse);
double** sEven = FFT(fEven, sH, inverse);
double**spectr = new double*[s];
double arg = inverse ? DoublePI / s : -DoublePI / s;
double*oBase = new double[2]{ cos(arg),sin(arg) };
double*o = new double[2]{ 1,0 };
for (int i = 0; i < sH; i++) {
double* sO1 = Mul(o, sOdd[i]);
spectr[i] = Sum(sEven[i], sO1);
spectr[i + sH] = Dif(sEven[i], sO1);
o = Mul(o, oBase);
}
return spectr;
}
The "butterfly" portion is applying the coefficients incorrectly. It should be:
for (int i = 0; i < sH; i++) {
double* sO1 = sOdd[i];
double* sE1 = Mul(o, sEven[i]);
spectr[i] = Sum(sO1, sE1);
spectr[i + sH] = Dif(sO1, sE1);
o = Mul(o, oBase);
}
Side Note:
I kept your notation but it makes things confusing:
fOdd has indexes 0, 2, 4, 6, ... so it should be fEven
fEven has indexes 1, 3, 5, 7, ... so it should be fOdd
really sOdd should be sLower and sEven should be sUpper since they correspond to the 0:s/2 and s/2:s-1 elements of the spectrum respectively:
sLower = FFT(fEven, sH, inverse); // fEven is 0, 2, 4, ...
sUpper = FFT(fOdd, sH, inverse); // fOdd is 1, 3, 5, ...
Then the butterfly becomes:
for (int i = 0; i < sH; i++) {
double* sL1 = sLower[i];
double* sU1 = Mul(o, sUpper[i]);
spectr[i] = Sum(sL1, sU1);
spectr[i + sH] = Dif(sL1, sU1);
o = Mul(o, oBase);
}
When written like this it is easier to compare to this pseudocode example on Wikipedia.
And @Dai is correct: you are going to leak a lot of memory.
Regarding the memory, you can use std::vector to encapsulate dynamically-allocated arrays and to ensure they're deallocated when execution leaves scope. You could use unique_ptr<double[]>, but the performance gains are not worth it IMO and you lose the safety of the at() method.
(Based on @Robb's answer)
A few other tips:
Avoid cryptic identifiers - programs should be readable, and names like "f" and "s" make your program harder to read and maintain.
Type-based Hungarian notation is frowned upon as modern editors show type information automatically so it adds unnecessary complication to identifier names.
Use size_t for indexes, not int
The STL is your friend, use it!
Preemptively prevent bugs by using const to prevent accidental mutation of read-only data.
Like so:
#include <cmath>
#include <complex>
#include <vector>
using namespace std;
// Assumption: DoublePI is 2*pi, as in the question's code, and every sample is a complex number.
const double DoublePI = 2.0 * acos( -1.0 );
vector<complex<double>> fastFourierTransform(const vector<complex<double>>& signal, const bool inverse) {
    if( signal.size() < 2 ) return signal;
    const size_t half = signal.size() / 2;
    vector<complex<double>> lower; lower.reserve( half );
    vector<complex<double>> upper; upper.reserve( half );
    bool isEven = true;
    for( size_t i = 0; i < signal.size(); i++ ) {
        if( isEven ) lower.push_back( signal.at( i ) );
        else upper.push_back( signal.at( i ) );
        isEven = !isEven;
    }
    const vector<complex<double>> lowerFft = fastFourierTransform( lower, inverse );
    const vector<complex<double>> upperFft = fastFourierTransform( upper, inverse );
    // Size the result up front: at() throws on a vector that was only reserve()d.
    vector<complex<double>> result( signal.size() );
    const double arg = ( inverse ? 1 : -1 ) * ( DoublePI / signal.size() );
    // std::complex stands in for the hand-written Mul, Sum and Dif helpers.
    const complex<double> oBase( cos( arg ), sin( arg ) );
    complex<double> o( 1, 0 ); // the twiddle factor starts at 1 + 0i
    for( size_t i = 0; i < half; i++ ) {
        const complex<double> lower1 = lowerFft.at( i );
        const complex<double> upper1 = o * upperFft.at( i );
        result.at( i ) = lower1 + upper1;
        result.at( i + half ) = lower1 - upper1;
        o *= oBase;
    }
    // Like the question's code, this expects a power-of-two size and leaves the 1/N scaling of the inverse to the caller.
    return result;
}
I need to solve this operation in a while loop. N, X, and Z are integers given by the user.
I tried this, but it does not show me the correct results.
while (i <= n) {
double r = 1, p = 1;
p = x / n + z;
p = p * p;
cout << "Resultado: " <<p<< endl;
i++;
}
Your code has at least three issues:
You're re-declaring and re-initializing p every loop iteration, losing the previous value.
You're setting p to x/n+z every iteration, losing the previous value.
Your x/n+z executes the division before the addition.
You're continuously "resetting" p's value here:
while(i <= n)
{
// ...
// `p` is getting re-initialized to 1 here:
// (losing the previous value)
double r=1, p=1;
// `p` is being set to `x/n+z` here:
// (losing the previous value)
p = x/n+z;
p = p*p;
// ...
}
Make a temporary variable instead, and move p's declaration outside the loop:
double p = 1;
while(i <= n)
{
// ...
double temp = x/n+z;
p = p * temp;
// ...
}
Also, as noted by Daniel S., you need parentheses around n+z:
double temp0 = x/n+z;
// Evaluates to (x/n)+z.
double temp1 = x/(n+z);
// Evaluates to x/(n+z). (Which is what you want.)
This happens because the / division operator has higher precedence than the + addition operator. Learn about operator precedence here.
There are some C++ syntax mistakes and a math error:
int i=1; // don't forget the initialization of i
double p = 1.0/2; // p will be your result, declared outside of the while so it keeps its value (1.0/2, because 1/2 is integer division and gives 0)
while(i<=n) // you want to loop from 1 to n inclusive
{
// we don't need r
p = p * x / (n + z); // you forgot the parentheses here; without them you are computing (x / n) + z
i++; // advance the counter, otherwise the loop never ends
}
So at the start p = 1/2, which is the left part of your equation;
then on each iteration we multiply the current value of p by the factor x / (n + z).
As this factor doesn't change from one iteration to the next, you could also compute it once and store it.
This should be working.
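double s;
double p = 1;
int n, x, z; // read these from the user before the loop
int i = 1;
while (i <= n)
{
p = p*(static_cast<double>(x) / (n + z)); // convert first: x / (n + z) on ints would truncate
i++;
}
s = 0.5 * p; // 1 / 2 would be integer division and evaluate to 0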
I have a problem with the following code:
int *chosen_pts = new int[k];
std::pair<float, int> *dist2 = new std::pair<float, int>[x.n];
// initialize dist2
for (int i = 0; i < x.n; ++i) {
dist2[i].first = std::numeric_limits<float>::max();
dist2[i].second = i;
}
// choose the first point randomly
int ndx = 1;
chosen_pts[ndx - 1] = rand() % x.n;
double begin, end;
double elapsed_secs;
while (ndx < k) {
float sum_distribution = 0.0;
// look for the point that is furthest from any center
begin = omp_get_wtime();
#pragma omp parallel for reduction(+:sum_distribution)
for (int i = 0; i < x.n; ++i) {
int example = dist2[i].second;
float d2 = 0.0, diff;
for (int j = 0; j < x.d; ++j) {
diff = x(example,j) - x(chosen_pts[ndx - 1],j);
d2 += diff * diff;
}
if (d2 < dist2[i].first) {
dist2[i].first = d2;
}
sum_distribution += dist2[i].first;
}
end = omp_get_wtime() - begin;
std::cout << "center assigning -- "
<< ndx << " of " << k << " = "
<< (float)ndx / k * 100
<< "% is done. Elasped time: "<< (float)end <<"\n";
/**/
bool unique = true;
do {
// choose a random interval according to the new distribution
float r = sum_distribution * (float)rand() / (float)RAND_MAX;
float sum_cdf = dist2[0].first;
int cdf_ndx = 0;
while (sum_cdf < r) {
sum_cdf += dist2[++cdf_ndx].first;
}
chosen_pts[ndx] = cdf_ndx;
for (int i = 0; i < ndx; ++i) {
unique = unique && (chosen_pts[ndx] != chosen_pts[i]);
}
} while (! unique);
++ndx;
}
As you can see, I use OpenMP to parallelize the for loop. It works fine and I can achieve a significant speedup. However, if I increase the value of x.n over 20000000, the function stops working after 8-10 loops:
It doesn't produce any output (std::cout)
Only one core works
No error whatsoever
If I comment out the do-while loop, it works again as expected. All cores are busy, there is output after each iteration, and I can increase x.n to over 100 million, just as I need.
It's not the OpenMP parallel for that is getting stuck; it's obviously your serial do-while loop.
One particular issue that I see is that there are no array bounds checks in the inner while loop accessing dist2. In theory, out-of-bounds access should never happen; but in practice it may (see below for why). So first of all I would rewrite the calculation of cdf_ndx to guarantee that the loop ends when all elements have been inspected:
float sum_cdf = 0;
int cdf_ndx = 0;
while (sum_cdf < r && cdf_ndx < x.n ) {
sum_cdf += dist2[cdf_ndx].first;
++cdf_ndx;
}
Now, how can it happen that sum_cdf does not reach r? It is due to the specifics of floating-point arithmetic and the fact that sum_distribution was computed in parallel, while sum_cdf is computed serially. The problem is that the contribution of one element to the sum can be below the precision of a float; in other words, when you add two float values that differ by more than ~8 orders of magnitude, the smaller one does not affect the sum.
So, with 20M floats, after some point it might happen that the next value to add is so small compared to the accumulated sum_cdf that adding it does not change the sum! On the other hand, sum_distribution was essentially computed as several independent partial sums (one per thread) that were then combined. Thus it is more accurate, and possibly bigger than sum_cdf can ever reach.
A solution can be to compute sum_cdf in portions, using two nested loops. For example:
float sum_cdf = 0;
int cdf_ndx = 0;
while (sum_cdf < r && cdf_ndx < x.n ) {
float block_sum = 0;
int block_end = min(cdf_ndx+10000, x.n); // 10000 is an arbitrarily selected block size
for (int i=cdf_ndx; i<block_end; ++i ) {
block_sum += dist2[i].first;
if( sum_cdf+block_sum >=r ) {
block_end = i; // adjust to correctly compute cdf_ndx
break;
}
}
sum_cdf += block_sum;
cdf_ndx = block_end;
}
And after the loop you need to check that cdf_ndx < x.n, otherwise repeat with a new random interval.
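The rounding effect described in this answer can be reproduced in isolation. The sketch below is an illustration, not part of the original code: it adds 1.0f repeatedly, once into a single accumulator and once block by block. The function names are hypothetical, and it assumes IEEE 754 single precision with no extended-precision intermediate results and no fast-math style reordering.

```cpp
#include <cstddef>

// Adds 1.0f `count` times into one float accumulator.
float naiveSum(std::size_t count) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) sum += 1.0f;
    return sum;
}

// Same total, but accumulated block by block so that each partial sum stays small.
float blockedSum(std::size_t count, std::size_t blockSize) {
    float sum = 0.0f;
    for (std::size_t start = 0; start < count; start += blockSize) {
        const std::size_t end = (start + blockSize < count) ? start + blockSize : count;
        float blockSum = 0.0f;
        for (std::size_t i = start; i < end; ++i) blockSum += 1.0f;
        sum += blockSum;  // a whole block is large enough to register in the total
    }
    return sum;
}
```

Under those assumptions `naiveSum(20000000)` stops growing at 16777216 (2^24), because adding 1.0f to that value rounds back to it, while `blockedSum(20000000, 10000)` reaches 20000000. The same mechanism lets a serially accumulated sum_cdf stay below a sum_distribution that was built from per-thread partial sums.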
I want to compute 3D FFT using Intel MKL of an array which has about 300×200×200 elements. This 3D array is stored as a 1D array of type double in a columnwise fashion:
for( int k = 0; k < nk; k++ ) // Loop through the height.
for( int j = 0; j < nj; j++ ) // Loop through the rows.
for( int i = 0; i < ni; i++ ) // Loop through the columns.
{
ijk = i + ni * j + ni * nj * k;
my3Darray[ ijk ] = 1.0;
}
I want to perform not-in-place FFT on the input array and prevent it from getting modified (I need to use it later in my code) and then do the backward computation in-place. I also want to have zero padding.
My questions are:
How can I perform the zero-padding?
How should I deal with the size of the arrays used by FFT functions when zero padding is included in the computation?
How can I take out the zero padded results and get the actual result?
Here is my attempt at the problem; I would be thankful for any comment, suggestion, or hint.
#include <stdio.h>
#include "mkl.h"
int max(int a, int b, int c)
{
int m = a;
(m < b) && (m = b);
(m < c) && (m = c);
return m;
}
void FFT3D_R2C( // Real to Complex 3D FFT.
double *in, int nRowsIn , int nColsIn , int nHeightsIn ,
double *out )
{
int n = max( nRowsIn , nColsIn , nHeightsIn );
// Round up to the next highest power of 2.
unsigned int N = (unsigned int) n; // compute the next highest power of 2 of 32-bit n.
N--;
N |= N >> 1;
N |= N >> 2;
N |= N >> 4;
N |= N >> 8;
N |= N >> 16;
N++;
/* Strides describe data layout in real and conjugate-even domain. */
MKL_LONG rs[4], cs[4];
// DFTI descriptor.
DFTI_DESCRIPTOR_HANDLE fft_desc = 0;
// Variables needed for out-of-place computations.
MKL_Complex16 *in_fft = new MKL_Complex16 [ N*N*N ];
MKL_Complex16 *out_fft = new MKL_Complex16 [ N*N*N ];
double *out_ZeroPadded = new double [ N*N*N ];
/* Compute strides */
rs[3] = 1; cs[3] = 1;
rs[2] = (N/2+1)*2; cs[2] = (N/2+1);
rs[1] = N*(N/2+1)*2; cs[1] = N*(N/2+1);
rs[0] = 0; cs[0] = 0;
// Create DFTI descriptor.
MKL_LONG sizes[] = { N, N, N };
DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_REAL, 3, sizes );
// Configure DFTI descriptor.
DftiSetValue( fft_desc, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX );
DftiSetValue( fft_desc, DFTI_PLACEMENT, DFTI_NOT_INPLACE ); // Out-of-place transformation.
DftiSetValue( fft_desc, DFTI_INPUT_STRIDES , rs );
DftiSetValue( fft_desc, DFTI_OUTPUT_STRIDES , cs );
DftiCommitDescriptor( fft_desc );
DftiComputeForward ( fft_desc, in , in_fft );
// Change strides to compute backward transform.
DftiSetValue ( fft_desc, DFTI_INPUT_STRIDES , cs);
DftiSetValue ( fft_desc, DFTI_OUTPUT_STRIDES, rs);
DftiCommitDescriptor( fft_desc );
DftiComputeBackward ( fft_desc, out_fft, out_ZeroPadded );
// Printing the zero padded 3D FFT result.
for( long long i = 0; i < (long long)N*N*N; i++ )
printf("%f\n", out_ZeroPadded[i] );
/* I don't know how to take out the zero padded results and
save the actual result in the variable named "out" */
DftiFreeDescriptor ( &fft_desc );
delete[] in_fft;
delete[] out_ZeroPadded ;
}
int main()
{
int n = 10;
double *a = new double [n*n*n]; // This array is real.
double *afft = new double [n*n*n];
// Fill the array with some 'real' numbers.
for( int i = 0; i < n*n*n; i++ )
a[ i ] = 1.0;
// Calculate FFT.
FFT3D_R2C( a, n, n, n, afft );
printf("FFT results:\n");
for( int i = 0; i < n*n*n; i++ )
printf( "%15.8f\n", afft[i] );
delete[] a;
delete[] afft;
return 0;
}
Just a few hints:
Power of 2 size
I don't like the way you are computing the size
so let Nx,Ny,Nz be the size of input matrix
and nx,ny,nz size of the padded matrix
for (nx=1;nx<Nx;nx<<=1);
for (ny=1;ny<Ny;ny<<=1);
for (nz=1;nz<Nz;nz<<=1);
now zero pad by memset to zero first and then copy the matrix lines
padding to N^3 instead of nx*ny*nz can result in big slowdowns
if nx,ny,nz are not close to each other
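The padding step from this hint could look roughly like the sketch below. It assumes a plain dense layout with index = i + nx*j + nx*ny*k, as in the question's fill loop, and is not tied to any MKL stride configuration; the names `nextPow2`, `zeroPad` and `crop` are hypothetical. `crop` is included on the assumption that, after a forward and backward transform, the original data sits in the same corner of the padded array.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Smallest power of two that is >= n (returns 1 for n <= 1).
std::size_t nextPow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Copies an Nx*Ny*Nz array into a zero-filled nx*ny*nz array,
// one contiguous line of Nx values at a time.
std::vector<double> zeroPad(const std::vector<double>& in,
                            std::size_t Nx, std::size_t Ny, std::size_t Nz,
                            std::size_t nx, std::size_t ny, std::size_t nz) {
    std::vector<double> out(nx * ny * nz, 0.0);  // zero-initialised, like a memset
    for (std::size_t k = 0; k < Nz; ++k)
        for (std::size_t j = 0; j < Ny; ++j) {
            const double* src = in.data() + Nx * (j + Ny * k);
            std::copy(src, src + Nx, out.data() + nx * (j + ny * k));
        }
    return out;
}

// Inverse of zeroPad: keeps only the original Nx*Ny*Nz corner of the padded array.
std::vector<double> crop(const std::vector<double>& padded,
                         std::size_t nx, std::size_t ny,
                         std::size_t Nx, std::size_t Ny, std::size_t Nz) {
    std::vector<double> out(Nx * Ny * Nz);
    for (std::size_t k = 0; k < Nz; ++k)
        for (std::size_t j = 0; j < Ny; ++j) {
            const double* src = padded.data() + nx * (j + ny * k);
            std::copy(src, src + Nx, out.data() + Nx * (j + Ny * k));
        }
    return out;
}
```

For the 300×200×200 array in the question, per-axis padding gives 512×256×256 elements instead of 512×512×512, which is a quarter of the memory.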
output is complex
if I get it right a is input real matrix
and afft the output complex matrix
so why not allocate the space for it correctly?
double *afft = new double [2*nx*ny*nz];
complex number is real+imaginary part so 2 values per number
that goes also for the final print of result
and some "\r\n" after lines would be good for viewing
3D DFFT
I neither use nor know your DFFT library
I use my own, but anyway a 3D DFFT can be done with 1D DFFTs
if you do it line by line ... see this 2D DFCT by 1D DFFT
in 3D it is the same, but you need to add one more pass and a different normalization constant
this way you can have a single line buffer double lin[2*max(nx,ny,nz)];
and do the zero padding on the fly (so there is no need to keep a bigger matrix in memory)...
but that involves copying the lines for each 1D DFFT ...
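The line-by-line idea can be sketched as follows. To keep the example short, a naive O(n^2) 1D DFT stands in for a real 1D FFT routine, and no normalization constant is applied; both are assumptions of this sketch, and the names `dft1d`, `transformLine` and `dft3d` are hypothetical. The data layout is index = i + ni*j + ni*nj*k, as in the question.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive O(n^2) forward DFT of one line; a stand-in for a real 1D FFT.
std::vector<cd> dft1d(const std::vector<cd>& line) {
    const std::size_t n = line.size();
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = -twoPi * static_cast<double>(k * t) / static_cast<double>(n);
            acc += line[t] * cd(std::cos(angle), std::sin(angle));
        }
        out[k] = acc;
    }
    return out;
}

// Copies one strided line into a buffer, transforms it, and writes it back.
void transformLine(std::vector<cd>& data, std::size_t start,
                   std::size_t stride, std::size_t count) {
    std::vector<cd> line(count);  // the single line buffer
    for (std::size_t t = 0; t < count; ++t) line[t] = data[start + t * stride];
    const std::vector<cd> spectrum = dft1d(line);
    for (std::size_t t = 0; t < count; ++t) data[start + t * stride] = spectrum[t];
}

// In-place 3D DFT built from three passes of 1D transforms.
void dft3d(std::vector<cd>& data, std::size_t ni, std::size_t nj, std::size_t nk) {
    for (std::size_t k = 0; k < nk; ++k)
        for (std::size_t j = 0; j < nj; ++j)
            transformLine(data, ni * j + ni * nj * k, 1, ni);   // lines along i
    for (std::size_t k = 0; k < nk; ++k)
        for (std::size_t i = 0; i < ni; ++i)
            transformLine(data, i + ni * nj * k, ni, nj);       // lines along j
    for (std::size_t j = 0; j < nj; ++j)
        for (std::size_t i = 0; i < ni; ++i)
            transformLine(data, i + ni * j, ni * nj, nk);       // lines along k
}
```

Zero padding on the fly would fit into `transformLine`: allocate the buffer at the padded length and copy only the real samples into it, at the cost of the extra copy per line that the answer mentions.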