Parallelization doubles the execution time - c++

I'm trying to parallelize a loop with OpenMP but the result is that the user time spent (reported by /usr/bin/time) becomes twice as large compared to the unparallelized code. So the wall clock execution times of the parallelized and unparallelized code are approximately the same. Here is the code:
Real ComputeEPPMRMatrixElement7Opt4(
UniformGrid3d const & ugLeft, ColumnVector const & vLeft,
UniformGrid3d const & ugRight, ColumnVector const & vRight
)
{
assert( ugLeft.si == ugRight.si );
UniformGrid3d const ugInt = GridIntersection( ugLeft, ugRight );
if( ! EmptyGrid( ugInt ) )
{
Real rSum = 0.0;
Real const * const arLeft = vLeft.data( );
Real const * const arRight = vRight.data( );
#pragma omp parallel for reduction(+:rSum)
for( Integer nX = ugInt.tiMinX; nX <= ugInt.tiMaxX; nX++ )
{
for( Integer nY = ugInt.tiMinY; nY <= ugInt.tiMaxY; nY++ )
{
// We may do the following optimization because ugInt is the
// intersection of the left and the right grids.
// Actually it should be safe to remove the range checks from
// the two following function calls.
Integer const iLeft =
ugLeft.GetVectorIndexWithCheck( nX, nY, ugInt.tiMinZ );
Integer const iRight =
ugRight.GetVectorIndexWithCheck( nX, nY, ugInt.tiMinZ );
Real const * arLeft1 = &arLeft[ iLeft ];
Real const * arRight1 = &arRight[ iRight ];
for( Integer nZ = ugInt.tiMinZ; nZ <= ugInt.tiMaxZ; nZ++ )
{
Real const rLeft = *arLeft1++;
Real const rRight = *arRight1++;
rSum += rLeft * rRight;
}
}
}
Real const rScale = exp2( -3 * ugInt.si );
return rScale * rSum;
}
else
{
return 0.0;
}
}
Note that Real is an alias for double. What is wrong?


ifft results are different from original signal

My FFT works fine, but when I take the IFFT of its output I always get the same graph. The results are complex, and the graph is the same regardless of the original signal:
the real part is a -sin with period equal to the frame size
the imaginary part is a -cos with the same period
Where could the problem be?
[figure: original signal]
[figure: real part of the IFFT; only half of the frame is shown]
Here is the FFT algorithm that I use:
double** FFT(double** f, int s, bool inverse) {
if (s == 1) return f;
int sH = s / 2;
double** fOdd = new double*[sH];
double** fEven = new double*[sH];
for (int i = 0; i < sH; i++) {
int j = 2 * i;
fOdd[i] = f[j];
fEven[i] = f[j + 1];
}
double** sOdd = FFT(fOdd, sH, inverse);
double** sEven = FFT(fEven, sH, inverse);
double**spectr = new double*[s];
double arg = inverse ? DoublePI / s : -DoublePI / s;
double*oBase = new double[2]{ cos(arg),sin(arg) };
double*o = new double[2]{ 1,0 };
for (int i = 0; i < sH; i++) {
double* sO1 = Mul(o, sOdd[i]);
spectr[i] = Sum(sEven[i], sO1);
spectr[i + sH] = Dif(sEven[i], sO1);
o = Mul(o, oBase);
}
return spectr;
}
The "butterfly" portion is applying the coefficients incorrectly:
for (int i = 0; i < sH; i++) {
double* sO1 = sOdd[i];
double* sE1 = Mul(o, sEven[i]);
spectr[i] = Sum(sO1, sE1);
spectr[i + sH] = Dif(sO1, sE1);
o = Mul(o, oBase);
}
Side Note:
I kept your notation but it makes things confusing:
fOdd has indexes 0, 2, 4, 6, ... so it should be fEven
fEven has indexes 1, 3, 5, 7, ... so it should be fOdd
really sOdd should be sLower and sEven should be sUpper since they correspond to the 0:s/2 and s/2:s-1 elements of the spectrum respectively:
sLower = FFT(fEven, sH, inverse); // fEven is 0, 2, 4, ...
sUpper = FFT(fOdd, sH, inverse); // fOdd is 1, 3, 5, ...
Then the butterfly becomes:
for (int i = 0; i < sH; i++) {
double* sL1 = sLower[i];
double* sU1 = Mul(o, sUpper[i]);
spectr[i] = Sum(sL1, sU1);
spectr[i + sH] = Dif(sL1, sU1);
o = Mul(o, oBase);
}
When written like this it is easier to compare to this pseudocode example on wikipedia.
And @Dai is correct: you are going to leak a lot of memory.
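A handy way to check a corrected butterfly is to compare it against a direct O(N^2) DFT, which has no butterfly to get wrong. The sketch below is only a reference implementation under my own assumptions: it uses std::complex instead of the question's double* pairs, the name naiveDft is made up, and the inverse divides by N (the question's code does not scale at all).

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Direct O(N^2) DFT, intended only as a reference to test an FFT against.
// Forward uses exp(-2*pi*i*k*t/N); inverse uses the opposite sign and
// divides by N so that forward followed by inverse returns the input.
std::vector<Complex> naiveDft(const std::vector<Complex>& x, bool inverse) {
    const std::size_t n = x.size();
    const double twoPi = 2.0 * std::acos(-1.0);
    const double sign = inverse ? 1.0 : -1.0;
    std::vector<Complex> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        Complex sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                sign * twoPi * static_cast<double>(k * t) / static_cast<double>(n);
            sum += x[t] * Complex(std::cos(angle), std::sin(angle));
        }
        out[k] = inverse ? sum / static_cast<double>(n) : sum;
    }
    return out;
}
```

If an FFT and this function disagree on a short input such as {1, 2, 3, 4}, the butterfly or the twiddle sign is the first place to look. If the inverse output differs only by a constant factor of N, the scaling is what is missing.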
Regarding the memory, you can use std::vector to encapsulate dynamically-allocated arrays and to ensure they're deallocated when execution leaves scope. You could use unique_ptr<double[]> but the performance gains are not worth it IMO and you lose the safety of the at() method.
(Based on @Robb's answer)
A few other tips:
Avoid cryptic identifiers - programs should be readable, and names like "f" and "s" make your program harder to read and maintain.
Type-based Hungarian notation is frowned upon as modern editors show type information automatically so it adds unnecessary complication to identifier names.
Use size_t for indexes, not int
The STL is your friend, use it!
Preemptively prevent bugs by using const to prevent accidental mutation of read-only data.
Like so:
#include <cmath>
#include <complex>
#include <vector>
using namespace std;
// std::complex replaces the double[2] pairs and the Mul/Sum/Dif helpers from the question.
using Complex = complex<double>;
// DoublePI is assumed to be 2*pi, as its use in the question implies.
const double DoublePI = 2.0 * acos( -1.0 );
vector<Complex> fastFourierTransform( const vector<Complex>& signal, const bool inverse ) {
    if( signal.size() < 2 ) return signal;
    const size_t half = signal.size() / 2;
    vector<Complex> lower; lower.reserve( half ); // even-indexed samples
    vector<Complex> upper; upper.reserve( half ); // odd-indexed samples
    bool isEven = true;
    for( size_t i = 0; i < signal.size(); i++ ) {
        if( isEven ) lower.push_back( signal.at( i ) );
        else upper.push_back( signal.at( i ) );
        isEven = !isEven;
    }
    const vector<Complex> lowerFft = fastFourierTransform( lower, inverse );
    const vector<Complex> upperFft = fastFourierTransform( upper, inverse );
    // Size the result up front: at() throws on a vector that was only reserve()d.
    vector<Complex> result( signal.size() );
    const double arg = ( inverse ? 1 : -1 ) * ( DoublePI / signal.size() );
    const Complex oBase( cos( arg ), sin( arg ) );
    Complex o( 1, 0 ); // the twiddle factor starts at 1, not 0
    for( size_t i = 0; i < half; i++ ) {
        const Complex lower1 = lowerFft.at( i );
        const Complex upper1 = o * upperFft.at( i );
        result.at( i ) = lower1 + upper1;
        result.at( i + half ) = lower1 - upper1;
        o *= oBase;
    }
    // A local vector returned by value is moved (or elided), so no copy is made here.
    return result;
}

function template parametrized by other function with different number of arguments

I'm able to write a function template parametrized by another function; however, I don't know how to do it when I want to parametrize it by functions with different numbers of arguments.
See this code:
#include <stdio.h>
#include <math.h>
template < double FUNC( double a ) >
void seq_op( int n, double * as ){
for (int i=0; i<n; i++){ printf( " %f \n", FUNC( as[i] ) ); }
}
template < double FUNC( double a, double b ) >
void seq_op_2( int n, double * as, double * bs ){
for (int i=0; i<n; i++){ printf( " %f \n", FUNC( as[i], bs[i] ) ); }
}
double a_plus_1 ( double a ){ return a + 1.0; }
double a_sq ( double a ){ return a*a; }
double a_plus_b ( double a, double b ){ return a + b; }
double a_times_b( double a, double b ){ return a * b; }
double as[5] = {1,2,3,4};
double bs[5] = {2,2,2,2};
// FUNCTION ====== main
int main(){
printf( "seq_op <a_plus_1> ( 5, as );\n"); seq_op <a_plus_1> ( 4, as );
printf( "seq_op <a_sq> ( 5, as );\n"); seq_op <a_sq> ( 4, as );
printf( "seq_op_2 <a_plus_b> ( 5, as, bs );\n"); seq_op_2 <a_plus_b> ( 4, as, bs );
printf( "seq_op_2 <a_times_b> ( 5, as, bs );\n"); seq_op_2 <a_times_b> ( 4, as, bs );
}
Is there a way to write a common template for both cases?
Why do I need such a silly thing? A more practical example is these two functions, which differ in only one line:
#define i3D( ix, iy, iz ) ( iz*nxy + iy*nx + ix )
void getLenardJonesFF( int natom, double * Rs_, double * C6, double * C12 ){
Vec3d * Rs = (Vec3d*) Rs_;
int nx = FF::n.x;
int ny = FF::n.y;
int nz = FF::n.z;
int nxy = ny * nx;
Vec3d rProbe; rProbe.set( 0.0, 0.0, 0.0 ); // we may shift here
for ( int ia=0; ia<nx; ia++ ){
printf( " ia %i \n", ia );
rProbe.add( FF::dCell.a );
for ( int ib=0; ib<ny; ib++ ){
rProbe.add( FF::dCell.b );
for ( int ic=0; ic<nz; ic++ ){
rProbe.add( FF::dCell.c );
Vec3d f; f.set(0.0,0.0,0.0);
for(int iatom=0; iatom<natom; iatom++){
// only this line differs
f.add( forceLJ( Rs[iatom] - rProbe, C6[iatom], C12[iatom] ) );
}
FF::grid[ i3D( ia, ib, ic ) ].add( f );
}
rProbe.add_mul( FF::dCell.c, -nz );
}
rProbe.add_mul( FF::dCell.b, -ny );
}
}
void getCoulombFF( int natom, double * Rs_, double * kQQs ){
Vec3d * Rs = (Vec3d*) Rs_;
int nx = FF::n.x;
int ny = FF::n.y;
int nz = FF::n.z;
int nxy = ny * nx;
Vec3d rProbe; rProbe.set( 0.0, 0.0, 0.0 ); // we may shift here
for ( int ia=0; ia<nx; ia++ ){
printf( " ia %i \n", ia );
rProbe.add( FF::dCell.a );
for ( int ib=0; ib<ny; ib++ ){
rProbe.add( FF::dCell.b );
for ( int ic=0; ic<nz; ic++ ){
rProbe.add( FF::dCell.c );
Vec3d f; f.set(0.0,0.0,0.0);
for(int iatom=0; iatom<natom; iatom++){
// only this line differs
f.add( forceCoulomb( Rs[iatom] - rProbe, kQQs[iatom] ) );
}
FF::grid[ i3D( ia, ib, ic ) ].add( f );
}
rProbe.add_mul( FF::dCell.c, -nz );
}
rProbe.add_mul( FF::dCell.b, -ny );
}
}
You should be able to combine the two functions using a combination of std::bind and std::function (see code on coliru):
#include <stdio.h>
#include <functional>
using namespace std::placeholders;
double getLJForceAtoms (int, int, double*, double*, double*)
{
printf("getLJForceAtoms\n");
return 0;
}
double getCoulombForceAtoms (int, int, double*, double*)
{
printf("getCoulombForceAtoms\n");
return 0;
}
void getFF (int natom, double* Rs_, std::function<double(int, int, double*)> GetForce)
{
int rProbe = 1;
double Force = GetForce(rProbe, natom, Rs_);
}
int main ()
{
double* C6 = nullptr;
double* C12 = nullptr;
double *kQQs = nullptr;
double* Rs_ = nullptr;
auto getLJForceFunc = std::bind(getLJForceAtoms, _1, _2, _3, C6, C12);
auto getCoulombForceFunc = std::bind(getCoulombForceAtoms, _1, _2, _3, kQQs);
getFF(1, Rs_, getLJForceFunc);
getFF(1, Rs_, getCoulombForceFunc);
return 0;
}
which outputs the expected:
getLJForceAtoms
getCoulombForceAtoms
Update -- On Performance
While it is natural to be concerned about performance of using std::function vs templates I would not omit a possible solution without first benchmarking and profiling it.
I can't compare the performance directly as I would need both your complete source code as well as input data set to make accurate benchmarks but I can do a very simple test to show you what it could look like. If I make the force functions do a little work:
double getLJForceAtoms (int x, int y, double* r1, double* r2, double* r3)
{
return cos(log2(abs(sin(log(pow(x, 2) + pow(y, 2))))));
}
and then have a very simple getFF() function call them 10 million times I can get a rough comparison between the various design methods (tests done on VS2013, release build, fast optimization flags):
Direct Call = 1900 ms
Switch = 1900 ms
If (flag) = 1900 ms
Virtual Function = 2400 ms
std::function = 2400 ms
So the std::function method is about 25% slower in this case but the switch and if methods are the same speed as the direct call case. Depending on how much work your actual force functions do you may get worse or better results. These days, the compiler optimizer and the CPU branch predictor are good enough to do a lot of things that may be surprising or even counter-intuitive, which is why actual testing must be done.
I would do a similar benchmark test with your exact code and data set and see what difference, if any, the various designs have. If you really only have two cases as shown in your question then the "if (flag)" method may be a good choice.
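If the std::function overhead turns out to matter, another option is to make the callable itself a template parameter and let a lambda capture the extra arguments. This is only a sketch under my own assumptions: forceLJ and forceCoulomb here are hypothetical scalar stand-ins for your Vec3d-based functions, and sumForces, sumLJ and sumCoulomb are made-up names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scalar stand-ins for the question's force kernels.
double forceLJ(double r, double c6, double c12) { return c12 / (r * r) - c6 / r; }
double forceCoulomb(double r, double kqq) { return kqq / (r * r); }

// One shared loop. ForceAt is any callable that takes the atom index and
// returns that atom's contribution; the extra per-atom arrays live in the
// callable, so the loop does not care how many of them there are.
template <typename ForceAt>
double sumForces(std::size_t natom, ForceAt forceAt) {
    double f = 0.0;
    for (std::size_t iatom = 0; iatom < natom; ++iatom) {
        f += forceAt(iatom);  // the only line that differs between the two cases
    }
    return f;
}

double sumLJ(const std::vector<double>& rs, const std::vector<double>& c6,
             const std::vector<double>& c12) {
    return sumForces(rs.size(),
                     [&](std::size_t i) { return forceLJ(rs[i], c6[i], c12[i]); });
}

double sumCoulomb(const std::vector<double>& rs, const std::vector<double>& kqq) {
    return sumForces(rs.size(),
                     [&](std::size_t i) { return forceCoulomb(rs[i], kqq[i]); });
}
```

Because the lambda's type is known at compile time, the compiler is free to inline the call, which std::function generally prevents. Whether that makes a measurable difference for your code is again something only a benchmark can tell.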

Modifying a function to use SSE intrinsics

I am trying to calculate the approximate value of the radical: sqrt(i + sqrt(i + sqrt(i + ...))) using SSE in order to get a speedup from vectorization (I also read that the SIMD square-root function runs approximately 4.7x faster than the innate FPU square-root function). However, I am having problems getting the same functionality in the vectorized version; I am getting an incorrect value and I'm not sure why.
My original function is this:
template <typename T>
T CalculateRadical( T tValue, T tEps = std::numeric_limits<T>::epsilon() )
{
static std::unordered_map<T,T> setResults;
auto it = setResults.find( tValue );
if( it != setResults.end() )
{
return it->second;
}
T tPrev = std::sqrt(tValue + std::sqrt(tValue)), tCurr = std::sqrt(tValue + tPrev);
// Keep iterating until we get convergence:
while( std::abs( tPrev - tCurr ) > tEps )
{
tPrev = tCurr;
tCurr = std::sqrt(tValue + tPrev);
}
setResults.insert( std::make_pair( tValue, tCurr ) );
return tCurr;
}
And the SIMD equivalent (when this template function is instantiated with T = float and given tEps = 0.0005f) I have written is:
// SSE intrinsics hard-coded function:
__m128 CalculateRadicals( __m128 values )
{
static std::unordered_map<float, __m128> setResults;
// Store our epsilon as a vector for quick comparison:
__declspec(align(16)) float flEps[4] = { 0.0005f, 0.0005f, 0.0005f, 0.0005f };
__m128 eps = _mm_load_ps( flEps );
union U {
__m128 vec;
float flArray[4];
};
U u;
u.vec = values;
float flFirstVal = u.flArray[0];
auto it = setResults.find( flFirstVal );
if( it != setResults.end( ) )
{
return it->second;
}
__m128 prev = _mm_sqrt_ps( _mm_add_ps( values, _mm_sqrt_ps( values ) ) );
__m128 curr = _mm_sqrt_ps( _mm_add_ps( values, prev ) );
while( _mm_movemask_ps( _mm_cmplt_ps( _mm_sub_ps( curr, prev ), eps ) ) != 0xF )
{
prev = curr;
curr = _mm_sqrt_ps( _mm_add_ps( values, prev ) );
}
setResults.insert( std::make_pair( flFirstVal, curr ) );
return curr;
}
I am calling the function in a loop using the following code:
long long N;
std::cin >> N;
float flExpectation = 0.0f;
long long iMultipleOf4 = (N / 4LL) * 4LL;
for( long long i = iMultipleOf4; i > 0LL; i -= 4LL )
{
__declspec(align(16)) float flArray[4] = { static_cast<float>(i - 3), static_cast<float>(i - 2), static_cast<float>(i - 1), static_cast<float>(i) };
__m128 arg = _mm_load_ps( flArray );
__m128 vec = CalculateRadicals( arg );
float flSum = Sum( vec );
flExpectation += flSum;
}
for( long long i = iMultipleOf4; i < N; ++i )
{
flExpectation += CalculateRadical( static_cast<float>(i), 0.0005f );
}
flExpectation /= N;
I get the following outputs for input 5:
With SSE version: 2.20873
With FPU version: 1.69647
Where does the discrepancy come from, what am I doing wrong in the SIMD equivalent?
EDIT: I've realized that the Sum function is relevant here:
float Sum( __m128 vec1 )
{
float flTemp[4];
_mm_storeu_ps( flTemp, vec1 );
return flTemp[0] + flTemp[1] + flTemp[2] + flTemp[3];
}
SSE intrinsics can be pretty tedious sometimes...
But not here. You just got your loop wrong:
for( long long i = iMultipleOf4; i > 0LL; i -= 4LL )
I doubt it does what you expect. If iMultipleOf4 is 4, the vectorized loop computes the values for 4, 3, 2, 1 but not 0, and then your second loop redoes the computation for 4.
The two functions give the same results for me, and the loops give the same flExpectation after the correction. There is still a small difference, probably because the FPU and the SSE unit differ slightly in how they compute.
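To make the index problem visible without any intrinsics, here is a small sketch that only records which values of i each pair of loops feeds into the radical. The function names are made up, and the "fixed" version assumes the intended range is 0..N-1, which is what the scalar tail loop in the question suggests.

```cpp
#include <algorithm>
#include <vector>

// Values visited by the question's loops: blocks of four counting down from
// iMultipleOf4, followed by the scalar tail. Sorted for easy comparison.
std::vector<long long> visitedOriginal(long long n) {
    std::vector<long long> seen;
    const long long multipleOf4 = (n / 4) * 4;
    for (long long i = multipleOf4; i > 0; i -= 4) {
        for (long long lane = i - 3; lane <= i; ++lane) seen.push_back(lane);
    }
    for (long long i = multipleOf4; i < n; ++i) seen.push_back(i);
    std::sort(seen.begin(), seen.end());
    return seen;
}

// One possible correction, ASSUMING the intended range is 0..n-1:
// blocks of four counting up from 0, then the same scalar tail.
std::vector<long long> visitedFixed(long long n) {
    std::vector<long long> seen;
    const long long multipleOf4 = (n / 4) * 4;
    for (long long i = 0; i < multipleOf4; i += 4) {
        for (long long lane = i; lane < i + 4; ++lane) seen.push_back(lane);
    }
    for (long long i = multipleOf4; i < n; ++i) seen.push_back(i);
    return seen;
}
```

For N = 5 the original loops visit 1, 2, 3, 4, 4 while the corrected ones visit 0, 1, 2, 3, 4. In the real code the corrected vector loop would load { i, i + 1, i + 2, i + 3 } into flArray.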

3D Convolution with Intel MKL

I have written a C/C++ code which uses Intel MKL to compute the 3D convolution of an array which has about 300×200×200 elements. I want to apply a kernel which is either 3×3×3 or 5×5×5. Both the 3D input array and the kernel have real values.
This 3D array is stored as a 1D array of type double in a columnwise fashion. Similarly the kernel is of type double and is saved columnwise. For example,
for( int k = 0; k < nk; k++ ) // Loop through the height.
for( int j = 0; j < nj; j++ ) // Loop through the rows.
for( int i = 0; i < ni; i++ ) // Loop through the columns.
{
ijk = i + ni * j + ni * nj * k;
my3Darray[ ijk ] = 1.0;
}
For the computation of convolution, I want to perform not-in-place FFT on the input array and the kernel and prevent them from getting modified (I need to use them later in my code) and then do the backward computation in-place.
When I compare the result obtained from my code with the one obtained by MATLAB they are very different. Could someone kindly help me fix the issue? What is missing in my code?
Here is the MATLAB code I used:
a = ones( 10, 10, 10 );
kernel = ones( 3, 3, 3 );
aconvolved = convn( a, kernel, 'same' );
Here is my C/C++ code:
#include <stdio.h>
#include "mkl.h"
void Conv3D(
double *in, double *ker, double *out,
int nRows, int nCols, int nHeights)
{
int NI = nRows;
int NJ = nCols;
int NK = nHeights;
double *in_fft = new double [NI*NJ*NK];
double *ker_fft = new double [NI*NJ*NK];
DFTI_DESCRIPTOR_HANDLE fft_desc = 0;
MKL_LONG sizes[] = { NK, NJ, NI };
MKL_LONG strides[] = { 0, NJ*NI, NI, 1 };
DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_REAL, 3, sizes );
DftiSetValue ( fft_desc, DFTI_PLACEMENT , DFTI_NOT_INPLACE); // Out-of-place computation.
DftiSetValue ( fft_desc, DFTI_INPUT_STRIDES , strides );
DftiSetValue ( fft_desc, DFTI_OUTPUT_STRIDES, strides );
DftiSetValue ( fft_desc, DFTI_BACKWARD_SCALE, 1/NI/NJ/NK );
DftiCommitDescriptor( fft_desc );
DftiComputeForward ( fft_desc, in , in_fft );
DftiComputeForward ( fft_desc, ker, ker_fft );
for (long long i = 0; i < (long long)NI*NJ*NK; ++i )
out[i] = in_fft[i]*ker_fft[i];
// In-place computation.
DftiSetValue ( fft_desc, DFTI_PLACEMENT, DFTI_INPLACE );
DftiCommitDescriptor( fft_desc );
DftiComputeBackward ( fft_desc, out );
DftiFreeDescriptor ( &fft_desc );
delete[] in_fft;
delete[] ker_fft;
}
int main(int argc, char* argv[])
{
int n = 10;
int nkernel = 3;
double *a = new double [n*n*n]; // This array is real.
double *aconvolved = new double [n*n*n]; // The convolved array is also real.
double *kernel = new double [nkernel*nkernel*nkernel]; // kernel is real.
// Fill the array with some 'real' numbers.
for( int i = 0; i < n*n*n; i++ )
a[ i ] = 1.0;
// Fill the kernel with some 'real' numbers.
for( int i = 0; i < nkernel*nkernel*nkernel; i++ )
kernel[ i ] = 1.0;
// Calculate the convolution.
Conv3D( a, kernel, aconvolved, n, n, n );
printf("Convolved:\n");
for( int i = 0; i < n*n*n; i++ )
printf( "%15.8f\n", aconvolved[i] );
delete[] a;
delete[] kernel;
delete[] aconvolved;
return 0;
}
You can't reverse the FFT with real-valued frequency data (just the magnitude). A forward FFT needs to output complex data. This is done by setting the DFTI_FORWARD_DOMAIN setting to DFTI_COMPLEX.
DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_COMPLEX, 3, sizes );
Doing this implicitly sets the backward domain to complex too.
You will also need a complex data type. Probably something like,
MKL_Complex16* in_fft = new MKL_Complex16[NI*NJ*NK];
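This means the pointwise product has to be a full complex multiplication, not separate products of the real parts and of the imaginary parts:
for (size_t i = 0; i < (size_t)NI*NJ*NK; ++i) {
out_fft[i].real = in_fft[i].real * ker_fft[i].real - in_fft[i].imag * ker_fft[i].imag;
out_fft[i].imag = in_fft[i].real * ker_fft[i].imag + in_fft[i].imag * ker_fft[i].real;
}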
The output of the inverse FFT is also complex, and assuming your input data is real, you can just grab the .real component and that is your result. This means you'll need a temporary complex output array (say, out_fft as above).
Also note that to avoid artifacts, you want the size of your fft to be (at least) M+N-1 on each dimension. Generally you would choose the next highest power of two for speed.
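The size rule is easy to get wrong by one, so here is a minimal sketch of it. The helper names nextPowerOfTwo and paddedFftLength are my own, not part of MKL, and rounding up to a power of two is only the speed heuristic mentioned above, not a requirement.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two that is >= n (n must be positive).
std::size_t nextPowerOfTwo(std::size_t n) {
    assert(n > 0);
    std::size_t p = 1;
    while (p < n) p *= 2;
    return p;
}

// FFT length along one axis for a linear (non-circular) convolution of a
// signal of length m with a kernel of length n: at least m + n - 1,
// rounded up to a power of two.
std::size_t paddedFftLength(std::size_t m, std::size_t n) {
    assert(m > 0 && n > 0);
    return nextPowerOfTwo(m + n - 1);
}
```

For the 10×10×10 test array with a 3×3×3 kernel this gives 16 per axis; for a 300-element axis with a 5-element kernel it gives 512. Both the input and the kernel have to be zero-padded to that size before the forward transforms.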
I strongly suggest you implement it in MATLAB first, using FFTs. There are many such implementations available (example), but I would start from the basics and make a simple function on your own.
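In the same spirit, a direct (non-FFT) convolution in C++ gives you something to compare the FFT result against element by element. This is a sketch under stated assumptions: the name directConv3DSame is made up, it uses the column-wise layout from the question (ijk = i + ni*j + ni*nj*k), a cubic kernel with an odd edge length, and zero padding outside the array, which is what I would expect a 'same'-sized result to use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct "same"-size 3D convolution with zero padding. The kernel is a cube
// of odd edge length kn, stored in the same column-wise order as the input.
std::vector<double> directConv3DSame(const std::vector<double>& in,
                                     int ni, int nj, int nk,
                                     const std::vector<double>& ker, int kn) {
    assert(kn % 2 == 1);
    const int c = kn / 2;  // index of the kernel centre
    std::vector<double> out(in.size(), 0.0);
    for (int k = 0; k < nk; ++k)
    for (int j = 0; j < nj; ++j)
    for (int i = 0; i < ni; ++i) {
        double sum = 0.0;
        for (int r = 0; r < kn; ++r)
        for (int q = 0; q < kn; ++q)
        for (int p = 0; p < kn; ++p) {
            // Convolution flips the kernel: input index = output + centre - kernel.
            const int ii = i + c - p, jj = j + c - q, kk = k + c - r;
            if (ii < 0 || ii >= ni || jj < 0 || jj >= nj || kk < 0 || kk >= nk)
                continue;  // zero padding outside the array
            sum += ker[p + kn * q + kn * kn * r] * in[ii + ni * jj + ni * nj * kk];
        }
        out[i + ni * j + ni * nj * k] = sum;
    }
    return out;
}
```

With the all-ones 10×10×10 input and all-ones 3×3×3 kernel from the question, interior elements come out as 27 and the corners as 8, which is a quick sanity check for any FFT-based version. It is far too slow for the 300×200×200 case, but fine for small test arrays.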

FFTW and OpenCV's C++ interface, real and imaginary part in Mat output

I'm trying to code an FFT/IFFT function with FFTW 3.3 and OpenCV 2.1 using the C++ interface. I've seen a lot of examples using the old OpenCV formats and I did a direct conversion, but something doesn't work.
The objective of my function is to return a Mat object with the real part and the imaginary part of the FFT, like OpenCV's default dft function does. Here is the code of the function. The program fails with a memory problem at the lines that copy im_data to data_in.
Does anybody know what I am doing wrong? Thank you.
Mat fft_sr(Mat& I)
{
double *im_data;
double *realP_data;
double *imP_data;
fftw_complex *data_in;
fftw_complex *fft;
fftw_plan plan_f;
int width = I.cols;
int height = I.rows;
int step = I.step;
int i, j, k;
Mat realP=Mat::zeros(height,width,CV_64F); // Real Part FFT
Mat imP=Mat::zeros(height,width,CV_64F); // Imaginary Part FFT
im_data = ( double* ) I.data;
realP_data = ( double* ) realP.data;
imP_data = ( double* ) imP.data;
data_in = ( fftw_complex* )fftw_malloc( sizeof( fftw_complex ) * width * height );
fft = ( fftw_complex* )fftw_malloc( sizeof( fftw_complex ) * width * height );
// Problem Here
for( i = 0, k = 0 ; i < height ; i++ ) {
for( j = 0 ; j < width ; j++ ) {
data_in[k][0] = ( double )im_data[i * step + j];
data_in[k][1] = ( double )0.0;
k++;
}
}
plan_f = fftw_plan_dft_2d( height, width, data_in, fft, FFTW_FORWARD, FFTW_ESTIMATE );
fftw_execute( plan_f );
// Copy real and imaginary data
for( i = 0, k = 0 ; i < height ; i++ ) {
for( j = 0 ; j < width ; j++ ) {
realP_data[i * step + j] = ( double )fft[k][0];
imP_data[i * step + j] = ( double )fft[k][1];
k++;
}
}
Mat fft_I(I.size(),CV_64FC2);
Mat fftplanes[] = {Mat_<double>(realP), Mat_<double>(imP)};
merge(fftplanes, 2, fft_I);
fftw_destroy_plan(plan_f);
fftw_free(data_in);
fftw_free(fft);
return fft_I;
}
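You are using step incorrectly. It is the row stride in bytes and is meant for indexing into Mat::data, which is a uchar*. Since you have already cast Mat::data to double* when assigning it to im_data, you can index into im_data "normally" (this assumes the Mat is continuous, i.e. its rows have no padding):
data_in[k][0] = im_data[i * width + j];
When using step, the row offset has to be applied in bytes before casting to double*. The same applies to the loop that writes realP_data and imP_data:
data_in[k][0] = ( ( double* )( I.data + i * step ) )[j];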
Update:
Try to access your images row-wise. That way you avoid running into problems with stride/step, while still exploiting fast access:
for (int i = 0; i < I.rows; i++)
{
double* row = I.ptr<double>(i);
for (int j = 0; j < I.cols; j++)
{
// Do something with the current pixel.
double someValue = row[j];
}
}
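To see why mixing a byte stride with a double* index runs past the end of the buffer, here is a sketch that needs no OpenCV at all. The raw buffer layout, readPixel and packRows are hypothetical names for illustration; the only thing borrowed from OpenCV is that the row stride is measured in bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Read element (row, col) of an image of doubles whose rows start every
// stepBytes bytes. The stride is applied to the byte pointer, not to a double*.
double readPixel(const unsigned char* data, std::size_t stepBytes,
                 std::size_t row, std::size_t col) {
    double value;
    std::memcpy(&value, data + row * stepBytes + col * sizeof(double), sizeof(double));
    return value;
}

// Copy a possibly padded image into a tightly packed rows*cols buffer,
// which is the layout the FFTW input loop in the question expects.
std::vector<double> packRows(const unsigned char* data, std::size_t stepBytes,
                             std::size_t rows, std::size_t cols) {
    assert(stepBytes >= cols * sizeof(double));
    std::vector<double> packed(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        std::memcpy(packed.data() + i * cols, data + i * stepBytes,
                    cols * sizeof(double));
    }
    return packed;
}
```

With a double* and an index of i * step + j, the offset is effectively multiplied by sizeof(double) a second time, so even the second row already lands well outside the image.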
I know this is old, but with FFTW it is safest to fill fftw_complex *data_in only after creating the plan for the FFT. Depending on the planner flags, creating the plan may overwrite the input and output arrays: FFTW_MEASURE does, while FFTW_ESTIMATE (as used in the question) should leave them untouched.
So allocate before the plan and initialize after!
The statement
im_data = ( double* ) I.data;
treats the image data as an array of double.
That is only valid if I really is an image of double values (type CV_64F).