I'm trying to figure out the vDSP functions and the results I'm getting are very strange.
Basically, I am trying to make sense of vDSP_vdist, starting from a vector of std::complex< float >. As far as I know, I should be able to calculate the magnitude simply by doing:
// std::abs of a complex does sqrtf( r^2 + i^2 ).
pOut[idx] = std::abs( pIn[idx] );
However, when I do this I see the spectrum reflected around the midpoint of the vector, which is very strange.
Oddly, if I use a vDSP_ztoc followed by a vDSP_vdist I get exactly the results I expect. So I wrote a bit of code to try to understand what's going wrong:
bool VecMagnitude( float* pOut, const std::complex< float >* pIn, unsigned int num )
{
    std::vector< float > realTemp( num );
    std::vector< float > imagTemp( num );
    DSPSplitComplex dspsc;
    dspsc.realp = &realTemp.front();
    dspsc.imagp = &imagTemp.front();
    vDSP_ctoz( (DSPComplex*)pIn, 1, &dspsc, 1, num );
    int idx = 0;
    while( idx < num )
    {
        // Report every element where the split buffers disagree with the input
        if ( fabsf( dspsc.realp[idx] - pIn[idx].real() ) > 0.0001f ||
             fabsf( dspsc.imagp[idx] - pIn[idx].imag() ) > 0.0001f )
        {
            char temp[256];
            sprintf( temp, "%f, %f - %f, %f", dspsc.realp[idx], dspsc.imagp[idx], pIn[idx].real(), pIn[idx].imag() );
            fprintf( stderr, temp );
        }
        idx++;
    }
    return true;
}
Now what's strange is that the above check starts failing at idx = 1 and keeps failing to the end. The reason is that dspsc.realp[1] == pIn[0].imag(). It's as if, instead of splitting the data into 2 different buffers, it has straight memcpy'd half the vector of std::complex values into dspsc.realp, i.e. the 2 floats of pIn[0], then the 2 floats of pIn[1], and so on. dspsc.imagp is much the same: dspsc.imagp[1] == pIn[1].real().
This just makes no sense. Can someone explain what I'm failing to understand here?
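The symptom described above is what one would see if the input stride were counted in float elements rather than in complex elements. The following is a minimal portable sketch of that idea. It is not the vDSP implementation: splitInterleaved and its parameters are hypothetical names, and the stride semantics are an assumption chosen to match the observed output.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an interleaved-to-split conversion (NOT vDSP code).
// Assumption: inStride is counted in float elements, so tightly packed
// complex data needs a stride of 2.
void splitInterleaved(const std::complex<float>* in, std::size_t inStride,
                      float* realOut, float* imagOut, std::size_t count)
{
    // std::complex<float> is layout-compatible with float[2],
    // so viewing the array as plain floats is permitted.
    const float* flat = reinterpret_cast<const float*>(in);
    for (std::size_t n = 0; n < count; ++n) {
        realOut[n] = flat[n * inStride];      // stride 1 lands on pIn[0].imag() for n == 1
        imagOut[n] = flat[n * inStride + 1];  // stride 1 lands on pIn[1].real() for n == 1
    }
}
```

With a stride of 1 this sketch reproduces the mismatch from the question (realp[1] equals pIn[0].imag() and imagp[1] equals pIn[1].real()), while a stride of 2 gives the expected split. Whether the same reasoning applies to the vDSP_ctoz call above is worth checking against the stride description in Apple's documentation.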


OpenMP parallel for

I have the following method called pgain, which calls the method dist that I am trying to parallelize:
/* For a given point x, find the cost of the following operation:
* -- open a facility at x if there isn't already one there,
* -- for points y such that the assignment distance of y exceeds dist(y, x),
* make y a member of x,
* -- for facilities y such that reassigning y and all its members to x
* would save cost, realize this closing and reassignment.
* If the cost of this operation is negative (i.e., if this entire operation
* saves cost), perform this operation and return the amount of cost saved;
* otherwise, do nothing.
*/
/* numcenters will be updated to reflect the new number of centers */
/* z is the facility cost, x is the number of this point in the array
points */
double pgain ( long x, Points *points, double z, long int *numcenters )
{
    int i;
    int number_of_centers_to_close = 0;

    static double *work_mem;
    static double gl_cost_of_opening_x;
    static int gl_number_of_centers_to_close;

    int stride = *numcenters + 2;
    //make stride a multiple of CACHE_LINE
    int cl = CACHE_LINE/sizeof ( double );
    if ( stride % cl != 0 ) {
        stride = cl * ( stride / cl + 1 );
    }
    int K = stride - 2 ; // K==*numcenters

    //my own cost of opening x
    double cost_of_opening_x = 0;

    work_mem = ( double* ) malloc ( 2 * stride * sizeof ( double ) );
    gl_cost_of_opening_x = 0;
    gl_number_of_centers_to_close = 0;

    /*
     * For each center, we have a *lower* field that indicates
     * how much we will save by closing the center.
     */
    int count = 0;
    for ( int i = 0; i < points->num; i++ ) {
        if ( is_center[i] ) {
            center_table[i] = count++;
        }
    }
    work_mem[0] = 0;

    //now we finish building the table. clear the working memory.
    memset ( switch_membership, 0, points->num * sizeof ( bool ) );
    memset ( work_mem, 0, stride*sizeof ( double ) );
    memset ( work_mem+stride,0,stride*sizeof ( double ) );

    //my *lower* fields
    double* lower = &work_mem[0];
    //global *lower* fields
    double* gl_lower = &work_mem[stride];

    #pragma omp parallel for
    for ( i = 0; i < points->num; i++ ) {
        float x_cost = dist ( points->p[i], points->p[x], points->dim ) * points->p[i].weight;
        float current_cost = points->p[i].cost;

        if ( x_cost < current_cost ) {
            // point i would save cost just by switching to x
            // (note that i cannot be a median,
            // or else dist(p[i], p[x]) would be 0)
            switch_membership[i] = 1;
            cost_of_opening_x += x_cost - current_cost;
        } else {
            // cost of assigning i to x is at least current assignment cost of i

            // consider the savings that i's **current** median would realize
            // if we reassigned that median and all its members to x;
            // note we've already accounted for the fact that the median
            // would save z by closing; now we have to subtract from the savings
            // the extra cost of reassigning that median and its members
            int assign = points->p[i].assign;
            lower[center_table[assign]] += current_cost - x_cost;
        }
    }

    // at this time, we can calculate the cost of opening a center
    // at x; if it is negative, we'll go through with opening it
    for ( int i = 0; i < points->num; i++ ) {
        if ( is_center[i] ) {
            double low = z + work_mem[center_table[i]];
            gl_lower[center_table[i]] = low;
            if ( low > 0 ) {
                // i is a median, and
                // if we were to open x (which we still may not) we'd close i

                // note, we'll ignore the following quantity unless we do open x
                cost_of_opening_x -= low;
            }
        }
    }

    //use the rest of working memory to store the following
    work_mem[K] = number_of_centers_to_close;
    work_mem[K+1] = cost_of_opening_x;

    gl_number_of_centers_to_close = ( int ) work_mem[K];
    gl_cost_of_opening_x = z + work_mem[K+1];

    // Now, check whether opening x would save cost; if so, do it, and
    // otherwise do nothing
    if ( gl_cost_of_opening_x < 0 ) {
        // we'd save money by opening x; we'll do it
        for ( int i = 0; i < points->num; i++ ) {
            bool close_center = gl_lower[center_table[points->p[i].assign]] > 0 ;
            if ( switch_membership[i] || close_center ) {
                // Either i's median (which may be i itself) is closing,
                // or i is closer to x than to its current median
                points->p[i].cost = points->p[i].weight * dist ( points->p[i], points->p[x], points->dim );
                points->p[i].assign = x;
            }
        }
        for ( int i = 0; i < points->num; i++ ) {
            if ( is_center[i] && gl_lower[center_table[i]] > 0 ) {
                is_center[i] = false;
            }
        }
        if ( x >= 0 && x < points->num ) {
            is_center[x] = true;
        }
        *numcenters = *numcenters + 1 - gl_number_of_centers_to_close;
    } else {
        gl_cost_of_opening_x = 0; // the value we'll return
    }

    free ( work_mem );
    return -gl_cost_of_opening_x;
}
The function that I am trying to parallelize:
/* compute Euclidean distance squared between two points */
float dist ( Point p1, Point p2, int dim )
{
    float result=0.0;
    #pragma omp parallel for reduction(+:result)
    for (int i=0; i<dim; i++ ){
        result += ( p1.coord[i] - p2.coord[i] ) * ( p1.coord[i] - p2.coord[i] );
    }
    return ( result );
}
With Point being this:
/* this structure represents a point */
/* these will be passed around to avoid copying coordinates */
typedef struct {
float weight;
float *coord;
long assign; /* number of point where this one is assigned */
float cost; /* cost of that assignment, weight*distance */
} Point;
I have a large streamcluster application (815 lines of code) that produces numbers in real time and sorts them in a specific way. I used the Scalasca tool on Linux to measure which methods take up most of the time, and found that the dist method listed above is the most time-consuming. I am trying to use OpenMP, but the parallelized code runs slower than the serial code: if the serial code runs in 1.5 s, the parallelized version takes 20 s, although the results are the same. I am wondering whether this part of the code cannot be parallelized for some reason, or whether I am doing it incorrectly.
The method I am trying to parallelize sits in this call tree: main->pkmedian->pFL->pgain->dist (-> means that the method on the left calls the method on the right).
The code you've chosen to parallelize:
float result=0.0;
#pragma omp parallel for reduction(+:result)
for (int i=0; i<dim; i++ ){
    result += ( p1.coord[i] - p2.coord[i] ) * ( p1.coord[i] - p2.coord[i] );
}
is a poor candidate for parallelization. You should not use parallel for here, and you should probably not parallelize an inner loop at all. If you can parallelize some outer loop, you're much more likely to see gains.
There is an overhead to coordinate the thread team to start the parallel region and another overhead for performing the reduction afterwards. Meanwhile, the parallel region's contents take essentially no time to run. Given that, you'd need dim to be extremely large before you'd expect this to give a performance benefit.
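As a rough illustration of moving the parallelism outward, the sketch below keeps the distance function serial and puts a single parallel region, with a reduction, around the loop over points. The names distSquared and totalCostTo and the flat row-major coordinate layout are assumptions made for this example and are not taken from streamcluster.

```cpp
#include <cstddef>
#include <vector>

// Serial squared Euclidean distance. Deliberately no OpenMP pragma here:
// the loop body is far too cheap to pay for a thread team per call.
float distSquared(const float* a, const float* b, int dim)
{
    float result = 0.0f;
    for (int i = 0; i < dim; ++i) {
        const float d = a[i] - b[i];
        result += d * d;
    }
    return result;
}

// Sum of squared distances from every point to point x.
// Assumed layout: coords holds num points of dim floats each, row-major.
double totalCostTo(const std::vector<float>& coords, int num, int dim, int x)
{
    double total = 0.0;
    // One parallel region for the whole outer loop; each thread accumulates
    // a private copy of total, and the copies are combined at the end.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < num; ++i) {
        total += distSquared(&coords[static_cast<std::size_t>(i) * dim],
                             &coords[static_cast<std::size_t>(x) * dim], dim);
    }
    return total;
}
```

The reduction clause also avoids several threads updating one shared accumulator at the same time. If the code is built without OpenMP support, the pragma is ignored and the loop simply runs serially.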
To express that point more graphically, consider that the math you're doing will take nanoseconds and compare it against this chart showing the overhead of various OpenMP directives.
If you need this to run faster, your first stop should be to use appropriate compilation flags, followed by looking into SIMD operations: SSE and AVX are good keywords. Your compiler might even invoke them automatically.
I've built some test code (see below), compiled it with the optimizations listed below, and run it on arrays of 100,000 elements. Note that with -O3 enabled the run-time is on the order of the overhead of the OpenMP directives. This implies that you'd want arrays of about 400,000 elements before you'd want to think about using OpenMP, and probably more like 1,000,000 to be safe.
No optimizations. Run-time is ~1900μs.
-O3: Enables many optimizations. Run-time is ~200μs.
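-ffast-math: Lets the compiler relax strict IEEE floating-point semantics (for example, by reordering operations). You usually want this, unless your code depends on exact floating-point behavior. Run-time is about the same.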
-march=native: Compile code to use the full capabilities of your CPU, rather than a generic instruction set that would work on many CPUs. Run-time is ~100μs.
So there we go, strategic use of compiler options (-march=native) can double the speed of the code in question without having to muck about in parallelism.
Here is a handy slide presentation with some tips explaining how to use OpenMP in a performant manner.
Test code:
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>

int main(){
    std::vector<double> a;
    std::vector<double> b;
    // Fill both arrays with arbitrary values
    for(int i=0;i<100000;i++){
        a.push_back(rand());
        b.push_back(rand());
    }

    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    float result = 0.0;
    //#pragma omp parallel for reduction(+:result)
    for (unsigned int i=0; i<a.size(); i++ )
        result += ( a[i] - b[i] ) * ( a[i] - b[i] );
    std::chrono::steady_clock::time_point end= std::chrono::steady_clock::now();

    std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds"<<std::endl;
}

Noise++ Perlin Module returns 0 all the time

I'm using the Noise++ library to generate noise in my program; at least, that's the aim.
I have it set up like one of the tests in order to try it out. However, no matter what parameters I give it, I only get 0 back.
If anyone has any experience with Noise++, it would be really helpful if you could check it over and see if I'm doing anything wrong.
// Defaults are
// Frequency = 1
// Octaves = 6
// Seed = 0
// Quality = 1
// Lacunarity = 2
// Persistence = 0.5
// Scale = 2.12
NoiseppNoise::NoiseppNoise( ) : mPipeline2d( 2 )
{
    mThreadCount = noisepp::utils::System::getNumberOfCPUs ();
    if ( mThreadCount > 2 ) {
        mPipeline2d = noisepp::ThreadedPipeline2D( mThreadCount );
    }
    mNoiseID2D = mPerlin.addToPipe ( mPipeline2d );
    mCache2d = mPipeline2d.createCache();
}

double NoiseppNoise::Generate( double x, double y )
{
    return mPipeline2d.getElement( mNoiseID2D )->getValue ( x, y, mCache2d );
}
I have added the following lines to your code to compile (basically no changes except for cleaning the cache):
struct NoiseppNoise
{
    NoiseppNoise();
    double Generate( double x, double y );

    noisepp::ThreadedPipeline2D mPipeline2d;
    noisepp::ElementID mThreadCount;
    noisepp::PerlinModule mPerlin;
    noisepp::ElementID mNoiseID2D;
    noisepp::Cache* mCache2d;
};

/* constructor as in the question */

double NoiseppNoise::Generate( double x, double y )
{
    mPipeline2d.cleanCache (mCache2d); // clean the cache before calculating value
    return mPipeline2d.getElement( mNoiseID2D )->getValue ( x, y, mCache2d );
}
Calling it with
NoiseppNoise np;
actually outputs a good value, 0.0909 for me.
However if you call it with two "integers" (e.g. 3.0 and 5.0) the output will be 0 because at some point something similar to the following statement is executed:
const Real xs = Math::CubicCurve3 (x - Real(x0));
If the parameters are integers then x and Real(x0) are always the same because Real(x0) is basically the integer part of x and so xs will be set to 0. After this there are more calculations to get the actual value but it becomes deterministically 0.
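The effect can be reproduced with a tiny self-contained gradient noise function. This is only a sketch to illustrate the reasoning, not the Noise++ implementation: latticeGradient, cubicCurve and gradientNoise1D are hypothetical names, the hash is arbitrary, and the curve 3t^2 - 2t^3 is an assumption about the kind of smoothing such a library applies.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical pseudo-random gradient (+1 or -1) for an integer lattice position.
static double latticeGradient(std::int64_t ix)
{
    std::uint64_t h = static_cast<std::uint64_t>(ix) * 0x9E3779B97F4A7C15ULL;
    h ^= h >> 32;
    return (h & 1u) ? 1.0 : -1.0;
}

// Smooth interpolation weight 3t^2 - 2t^3: 0 at t = 0 and 1 at t = 1.
static double cubicCurve(double t) { return t * t * (3.0 - 2.0 * t); }

// 1D gradient noise: blends the contributions of the two nearest lattice points.
double gradientNoise1D(double x)
{
    const double x0 = std::floor(x);
    const double t = x - x0;  // fractional part; exactly 0 for integer x
    const std::int64_t ix = static_cast<std::int64_t>(x0);
    const double v0 = latticeGradient(ix) * t;              // left lattice point
    const double v1 = latticeGradient(ix + 1) * (t - 1.0);  // right lattice point
    return v0 + cubicCurve(t) * (v1 - v0);
}
```

For an integer input, t is 0, so the left contribution is 0 and the interpolation weight is 0, which makes the result 0 regardless of the gradients. Sampling at non-integer coordinates, for instance by multiplying integer grid positions by a fractional step, avoids always landing on lattice points.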

Implementing De Boor's algorithm for finding points on a B-spline

I've been working on this for several weeks but have been unable to get my algorithm working properly, and I'm at my wits' end. Here's an illustration of what I have achieved:
If everything were working, I would expect a perfect circle/oval at the end.
My sample points (in white) are recalculated every time a new control point (in yellow) is added. At 4 control points everything looks perfect, and when I add a 5th on top of the 1st things still look all right, but on the 6th it starts to go off to the side, and on the 7th it jumps up to the origin!
Below I'll post my code, where calculateWeightForPointI contains the actual algorithm. For reference, here is the information I'm trying to follow. I'd be so grateful if someone could take a look for me.
void updateCurve(const std::vector<glm::vec3>& controls, std::vector<glm::vec3>& samples)
{
    int subCurveOrder = 4; // = k = I want to break my curve into cubics

    // De Boor 1st attempt
    if(controls.size() >= subCurveOrder)
    {
        createKnotVector(subCurveOrder, controls.size());

        for(int steps=0; steps<=20; steps++)
        {
            // use steps to get a 0-1 range value for progression along the curve
            // then get that value into the range [k-1, n+1]
            // k-1 = subCurveOrder-1
            // n+1 = always the number of total control points
            float t = ( steps / 20.0f ) * ( controls.size() - (subCurveOrder-1) ) + subCurveOrder-1;

            glm::vec3 newPoint(0,0,0);
            for(int i=1; i <= controls.size(); i++)
            {
                float weightForControl = calculateWeightForPointI(i, subCurveOrder, controls.size(), t);
                newPoint += weightForControl * controls.at(i-1);
            }
        }
    }
}

//i = the weight we're looking for, i should go from 1 to n+1, where n+1 is equal to the total number of control points.
//k = curve order = power/degree +1. eg, to break whole curve into cubics use a curve order of 4
//cps = number of total control points
//t = current step/interp value
float calculateWeightForPointI( int i, int k, int cps, float t )
{
    //test if we've reached the bottom of the recursive call
    if( k == 1 )
    {
        if( t >= knot(i) && t < knot(i+1) )
            return 1;
        return 0;
    }

    float numeratorA = ( t - knot(i) );
    float denominatorA = ( knot(i + k-1) - knot(i) );
    float numeratorB = ( knot(i + k) - t );
    float denominatorB = ( knot(i + k) - knot(i + 1) );

    float subweightA = 0;
    float subweightB = 0;

    if( denominatorA != 0 )
        subweightA = numeratorA / denominatorA * calculateWeightForPointI(i, k-1, cps, t);
    if( denominatorB != 0 )
        subweightB = numeratorB / denominatorB * calculateWeightForPointI(i+1, k-1, cps, t);

    return subweightA + subweightB;
}

//returns the knot value at the passed in index
//if i = 1 and we want Xi then we have to remember to index with i-1
float knot(int indexForKnot)
{
    // When getting the index for the knot function I remember to subtract 1 from i because of the difference caused by us counting from i=1 to n+1 and indexing a vector from 0
    return knotVector.at(indexForKnot-1);
}
//calculate the whole knot vector
void createKnotVector(int curveOrderK, int numControlPoints)
int knotSize = curveOrderK + numControlPoints;
for(int count = 0; count < knotSize; count++)
Your algorithm seems to work for any inputs I tried it on. Your problem might be that a control point is not where it is supposed to be, or that the control points haven't been initialized properly. It looks like there are two control points half the height below the bottom left corner.

Can someone write this code with better logic?

I have been stuck on this problem for 2 days. Can someone help me with the logic?
I am working through C++ programs to learn good algorithms, and am now looking at the Danielson-Lanczos algorithm to compute the FFT of a sequence.
Looking at
while (n>mmax) {
    istep = mmax<<1;
    theta = -(2*M_PI/mmax);
    wtemp = sin(0.5*theta);
    wpr = -2.0*wtemp*wtemp;
    wpi = sin(theta);
    wr = 1.0;
    wi = 0.0;
    for (m=1; m < mmax; m += 2) {
        for (i=m; i <= n; i += istep) {
            j = i+mmax;
            tempr = wr*data[j-1] - wi*data[j];
            tempi = wr * data[j] + wi*data[j-1];
            data[j-1] = data[i-1] - tempr;
            data[j] = data[i] - tempi;
            data[i-1] += tempr;
            data[i] += tempi;
        }
        wtemp = wr;
        wr += wr*wpr - wi*wpi;
        wi += wi*wpr + wtemp*wpi;
    }
    mmax = istep;
}
Source: http://www.eetimes.com/design/signal-processing-dsp/4017495/A-Simple-and-Efficient-FFT-Implementation-in-C--Part-I
Is there any way to restructure the code so that the whole for-loop portion is reduced to just 4 lines of code (or even fewer)?
Better indentation would go a long way. I fixed that for you. Also, this seems to beg for better locality of the variables. The variable names are not clear to me, but that might be because I don't know the domain this algorithm belongs to.
Generally, if you want to make complex code easier to understand, identify sub-algorithms and put them into their own (inlined) functions. (Putting a code snippet into a function effectively gives it a name, and makes the passing of variables into and out of the code more obvious. Often, that makes code easier to digest.)
I'm not sure this is necessary for this piece of code, though.
Merely condensing code, however, will not make it more readable. Instead, it will just make it more condensed.
Do not compress your code. Please? Pretty please? With a cherry on top?
Unless you can create a better algorithm, compressing an existing piece of code will only make it look like something straight out of the gates of Hell itself. No one would be able to understand it. You would not be able to understand it, even a few days later.
Even the compiler might get too confused by all the branches to properly optimize it.
If you are trying to improve performance, consider the following:
Premature optimization is the root of all evil.
Work on your algorithms first, then on your code.
The line count may have absolutely no relation to the size of the produced executable code.
Compilers do not like entangled code paths and complex expressions. Really...
Unless the code is really performance critical, readability trumps everything else.
If it is performance critical, profile first, then start optimizing.
You could use a complex number class to reflect the math involved.
A good part of the code is made of two complex multiplications.
You can rewrite your code as :
unsigned long mmax=2;
while (n>mmax)
{
    unsigned long istep = mmax<<1;
    const complex wp = coef( mmax );
    complex w( 1. , 0. );
    for (unsigned long m=1; m < mmax; m += 2)
    {
        for (unsigned long i=m; i <= n; i += istep)
        {
            const unsigned long j = i + mmax;
            complex temp = w * complex( data[j-1] , data[j] );
            complexref( data[j-1] , data[j] ) = complex( data[i-1] , data[i] ) - temp ;
            complexref( data[i-1] , data[i] ) += temp ;
        }
        w += w * wp ;
    }
    mmax = istep;
}
With :
struct complex
{
    double r , i ;
    complex( double r , double i ) : r( r ) , i( i ) {}
    inline complex & operator+=( complex const& ref )
    {
        r += ref.r ;
        i += ref.i ;
        return *this ;
    }
} ;

struct complexref
{
    double & r , & i ;
    complexref( double & r , double & i ) : r( r ) , i( i ) {}
    inline complexref & operator=( complex const& ref )
    {
        r = ref.r ;
        i = ref.i ;
        return *this ;
    }
    inline complexref & operator+=( complex const& ref )
    {
        r += ref.r ;
        i += ref.i ;
        return *this ;
    }
} ;

inline complex operator*( complex const& w , complex const& b )
{
    return complex(
        w.r * b.r - w.i * b.i ,
        w.r * b.i + w.i * b.r
    );
}

inline complex operator-( complex const& w , complex const& b )
{
    return complex( w.r - b.r , w.i - b.i );
}

inline complex coef( unsigned long mmax )
{
    double theta = -(2*M_PI/mmax);
    double wtemp = sin(0.5*theta);
    return complex( -2.0*wtemp*wtemp , sin(theta) );
}
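For comparison, the same butterfly structure can be written with the standard std::complex type instead of a hand-rolled class. The following is a sketch under assumptions rather than a drop-in replacement for the code above: it stores the data as a vector of complex values instead of interleaved doubles, it includes the bit-reversal step that the excerpt in the question does not show, and the names Complex and fft are chosen for this example.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using Complex = std::complex<double>;

// In-place radix-2 FFT (forward transform). data.size() must be a power of two.
void fft(std::vector<Complex>& data)
{
    const std::size_t n = data.size();
    const double pi = std::acos(-1.0);

    // Bit-reversal permutation, so the butterflies below can work in place.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j)
            std::swap(data[i], data[j]);
    }

    // Danielson-Lanczos section: combine transforms of length len/2 into length len.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double theta = -2.0 * pi / static_cast<double>(len);
        const Complex wp(std::cos(theta), std::sin(theta));
        for (std::size_t start = 0; start < n; start += len) {
            Complex w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const Complex temp = w * data[start + k + len / 2];
                data[start + k + len / 2] = data[start + k] - temp;
                data[start + k] += temp;
                w *= wp;  // advance the twiddle factor
            }
        }
    }
}
```

The butterfly itself is three lines, which is roughly as short as it gets without hiding the arithmetic. Note that this sketch multiplies the twiddle factor directly by wp instead of using the w += w * wp recurrence from the code above.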
I don't believe you would be able to make it substantially shorter.
If this code were made much shorter, I would guess that it would significantly diminish readability.
Since the logic is relatively clear, the number of lines does not matter. Unless you're planning on using this on codegolf.stackexchange.com, this is a place where you should trust your compiler to help you (because it will).
This strikes me as premature optimization.

Ramer-Douglas-Peucker path simplification algorithm

I implemented a path simplification algorithm after reading the article here:
It's worked pretty well for generating optimized level geometry for my game. But I'm now using it to clean up A* pathfinding paths, and it has a weird edge case that fails miserably.
Here's a screenshot of it working, optimizing the path from the red circle to the blue circle. The faint green line is the A* output, and the lighter whitish line is the optimized path.
And here's a screenshot of it failing:
Here's my code. I adapted the ObjC code from the article to C++.
Note: vec2fvec is a std::vector< vec2<float> >, and 'real' is just a typedef'd float.
void rdpSimplify( const vec2fvec &in, vec2fvec &out, real threshold )
{
    if ( in.size() <= 2 )
    {
        out = in;
        return;
    }

    // Find the vertex farthest from the line defined by the start and end of the path
    real maxDist = 0;
    size_t maxDistIndex = 0;
    LineSegment line( in.front(), in.back() );
    for ( vec2fvec::const_iterator it(in.begin()),end(in.end()); it != end; ++it )
    {
        real dist = line.distance( *it );
        if ( dist > maxDist )
        {
            maxDist = dist;
            maxDistIndex = it - in.begin();
        }
    }

    // If the farthest vertex is farther away than our threshold, we need to
    // partition and optimize left and right separately
    if ( maxDist > threshold )
    {
        // Partition 'in' into left and right subvectors, and optimize them
        vec2fvec left( maxDistIndex+1 ),
                 right( in.size() - maxDistIndex ),
                 leftSimplified,
                 rightSimplified;
        std::copy( in.begin(), in.begin() + maxDistIndex + 1, left.begin() );
        std::copy( in.begin() + maxDistIndex, in.end(), right.begin() );
        rdpSimplify(left, leftSimplified, threshold );
        rdpSimplify(right, rightSimplified, threshold );

        // Stitch optimized left and right into 'out'
        out.resize( leftSimplified.size() + rightSimplified.size() - 1 );
        std::copy( leftSimplified.begin(), leftSimplified.end(), out.begin());
        std::copy( rightSimplified.begin() + 1, rightSimplified.end(), out.begin() + leftSimplified.size() );
    }
    else
    {
        out.push_back( line.a );
        out.push_back( line.b );
    }
}
I'm really at a loss as to what's going wrong. My spidey sense says it's in the std::copy calls... I must be copying garbage in some circumstances.
I've rewritten the algorithm dropping any use of iterators and std::copy, and the like. It still fails in the exact same way.
void rdpSimplify( const vec2fvec &in, vec2fvec &out, real threshold )
{
    if ( in.size() <= 2 )
    {
        out = in;
        return;
    }

    // Find the vertex farthest from the line defined by the start and end of the path
    real maxDist = 0;
    size_t maxDistIndex = 0;
    LineSegment line( in.front(), in.back() );
    for ( size_t i = 0, N = in.size(); i < N; i++ )
    {
        real dist = line.distance( in[i] );
        if ( dist > maxDist )
        {
            maxDist = dist;
            maxDistIndex = i;
        }
    }

    // If the farthest vertex is farther away than our threshold, we need to
    // partition and optimize left and right separately
    if ( maxDist > threshold )
    {
        // Partition 'in' into left and right subvectors, and optimize them
        vec2fvec left, right, leftSimplified, rightSimplified;
        for ( size_t i = 0; i < maxDistIndex + 1; i++ ) left.push_back( in[i] );
        for ( size_t i = maxDistIndex; i < in.size(); i++ ) right.push_back( in[i] );
        rdpSimplify(left, leftSimplified, threshold );
        rdpSimplify(right, rightSimplified, threshold );

        // Stitch optimized left and right into 'out'
        for ( size_t i = 0, N = leftSimplified.size(); i < N; i++ ) out.push_back(leftSimplified[i]);
        for ( size_t i = 1, N = rightSimplified.size(); i < N; i++ ) out.push_back( rightSimplified[i] );
    }
    else
    {
        out.push_back( line.a );
        out.push_back( line.b );
    }
}
I can't find any faults in your code.
Some things to try:
Add some debug print statements to check what maxDist is in the failing case. It should be really low, but if it comes out high then you know there's a problem with your line segment distance code.
Check that the path you are seeing actually matches the path that your algorithm returns. If not then perhaps there is something wrong with your path rendering? Maybe a bug when the path only has two points?
Check that your input path is what you expect it to be by printing out all its coordinates at the start of the algorithm.
It shouldn't take too long to find the cause of the problem if you just investigate a little. After a few minutes, staring at code is a very poor way to debug.
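Since the first suggestion points at the line segment distance code, here is a small reference implementation that the output of line.distance() could be compared against while debugging. It is only a sketch: Vec2 and pointSegmentDistance are hypothetical names standing in for the vec2<float> and LineSegment types from the question, whose actual implementation is not shown.

```cpp
#include <cmath>

struct Vec2 { float x, y; };  // hypothetical stand-in for vec2<float>

// Distance from p to the segment a-b (not to the infinite line through a and b).
float pointSegmentDistance(Vec2 p, Vec2 a, Vec2 b)
{
    const float abx = b.x - a.x, aby = b.y - a.y;
    const float apx = p.x - a.x, apy = p.y - a.y;
    const float lengthSq = abx * abx + aby * aby;

    // Degenerate segment (a == b): fall back to the distance to that single point.
    if (lengthSq == 0.0f)
        return std::sqrt(apx * apx + apy * apy);

    // Project p onto the segment and clamp the parameter to [0, 1].
    float t = (apx * abx + apy * aby) / lengthSq;
    t = std::fmax(0.0f, std::fmin(1.0f, t));

    const float dx = apx - t * abx;
    const float dy = apy - t * aby;
    return std::sqrt(dx * dx + dy * dy);
}
```

Two cases are worth printing when comparing: a path whose first and last points coincide, which makes the segment degenerate, and points that project beyond either end of the segment. Both are places where a distance routine can return surprising values.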