I have been stuck on this problem for 2 days. Can someone help me with the logic?
I am working on C++ programs for good algorithms. I am now working on the Danielson-Lanczos Algorithm to compute the FFT of a sequence.
Looking at
mmax=2;
while (n>mmax) {
    istep = mmax<<1;
    theta = -(2*M_PI/mmax);
    wtemp = sin(0.5*theta);
    wpr = -2.0*wtemp*wtemp;
    wpi = sin(theta);
    wr = 1.0;
    wi = 0.0;
    for (m=1; m < mmax; m += 2) {
        for (i=m; i <= n; i += istep) {
            j=i+mmax;
            tempr = wr*data[j-1] - wi*data[j];
            tempi = wr * data[j] + wi*data[j-1];
            data[j-1] = data[i-1] - tempr;
            data[j] = data[i] - tempi;
            data[i-1] += tempr;
            data[i] += tempi;
        }
        wtemp=wr;
        wr += wr*wpr - wi*wpi;
        wi += wi*wpr + wtemp*wpi;
    }
    mmax=istep;
}
Source: http://www.eetimes.com/design/signal-processing-dsp/4017495/A-Simple-and-Efficient-FFT-Implementation-in-C--Part-I
Is there a way to write the code so that the whole for-loop portion is reduced to just 4 lines (or even fewer)?
Better indentation would go a long way. I fixed that for you. Also, this seems to beg for better locality of the variables. The variable names are not clear to me, but that might be because I don't know the domain this algorithm belongs to.
Generally, if you want to make complex code easier to understand, identify sub-algorithms and put them into their own (inlined) functions. (Putting a code snippet into a function effectively gives it a name, and makes the passing of variables into and out of the code more obvious. Often, that makes code easier to digest.)
I'm not sure this is necessary for this piece of code, though.
Merely condensing code, however, will not make it more readable. Instead, it will just make it more condensed.
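As a minimal sketch of that suggestion, the innermost statements could be pulled out into a small inline function. The name butterfly and its 0-based indexing are my own assumptions, not something taken from the original article:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (name and 0-based indexing are assumptions).
// data holds interleaved complex values: data[k] is a real part and
// data[k + 1] the matching imaginary part. i and j index the real parts
// of the two elements being combined; (wr, wi) is the twiddle factor.
inline void butterfly(double* data, std::size_t i, std::size_t j,
                      double wr, double wi) {
    assert(data != nullptr);
    // temp = w * data[j], written out as a complex multiplication
    const double tempr = wr * data[j] - wi * data[j + 1];
    const double tempi = wr * data[j + 1] + wi * data[j];
    // data[j] = data[i] - temp;  data[i] = data[i] + temp
    data[j] = data[i] - tempr;
    data[j + 1] = data[i + 1] - tempi;
    data[i] += tempr;
    data[i + 1] += tempi;
}
```

With that in place, the body of the inner loop in the question would reduce to a single call such as butterfly(data, i - 1, j - 1, wr, wi). The line count drops, but only because the work now has a name, not because it was compressed.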
Do not compress your code. Please? Pretty please? With a cherry on top?
Unless you can create a better algorithm, compressing an existing piece of code will only make it look like something straight out of the gates of Hell itself. No one would be able to understand it. You would not be able to understand it, even a few days later.
Even the compiler might get too confused by all the branches to properly optimize it.
If you are trying to improve performance, consider the following:
Premature optimization is the root of all evil.
Work on your algorithms first, then on your code.
The line count may have absolutely no relation to the size of the produced executable code.
Compilers do not like entangled code paths and complex expressions. Really...
Unless the code is really performance critical, readability trumps everything else.
If it is performance critical, profile first, then start optimizing.
You could use a complex number class to reflect the math involved.
A good part of the code is made of two complex multiplications.
You can rewrite your code as:
unsigned long mmax=2;
while (n>mmax)
{
    unsigned long istep = mmax<<1;
    const complex wp = coef( mmax );
    complex w( 1. , 0. );
    for (unsigned long m=1; m < mmax; m += 2)
    {
        for (unsigned long i=m; i <= n; i += istep)
        {
            unsigned long j = i+mmax;
            complex temp = w * complex( data[j-1] , data[j] );
            complexref( data[j-1] , data[j] ) = complex( data[i-1] , data[i] ) - temp ;
            complexref( data[i-1] , data[i] ) += temp ;
        }
        w += w * wp ;
    }
    mmax=istep;
}
With :
struct complex
{
double r , i ;
complex( double r , double i ) : r( r ) , i( i ) {}
inline complex & operator+=( complex const& ref )
{
r += ref.r ;
i += ref.i ;
return *this ;
}
};
struct complexref
{
double & r , & i ;
complexref( double & r , double & i ) : r( r ) , i( i ) {}
inline complexref & operator=( complex const& ref )
{
r = ref.r ;
i = ref.i ;
return *this ;
}
inline complexref & operator+=( complex const& ref )
{
r += ref.r ;
i += ref.i ;
return *this ;
}
} ;
inline complex operator*( complex const& w , complex const& b )
{
return complex(
w.r * b.r - w.i * b.i ,
w.r * b.i + w.i * b.r
);
}
inline complex operator-( complex const& w , complex const& b )
{
return complex( w.r - b.r , w.i - b.i );
}
inline complex coef( unsigned long mmax )
{
double theta = -(2*M_PI/mmax);
double wtemp = sin(0.5*theta);
return complex( -2.0*wtemp*wtemp , sin(theta) );
}
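For comparison, here is a minimal sketch of the same idea using std::complex from the standard library instead of hand-written structs. It assumes the data is held as a vector of complex values rather than interleaved doubles, and that its length is a power of two. The function name fft and the inclusion of the bit-reversal step are my additions, so treat it as an illustration rather than a drop-in replacement:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cd = std::complex<double>;

// In-place radix-2 FFT (forward transform, negative exponent).
// Assumption: a.size() is a power of two.
void fft(std::vector<cd>& a) {
    const std::size_t n = a.size();
    assert((n & (n - 1)) == 0);
    // Bit-reversal permutation, the step that precedes the
    // Danielson-Lanczos section shown in the question.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    const double pi = std::acos(-1.0);
    // Danielson-Lanczos section; half is the current half-block length
    // in complex elements (mmax / 2 in the question's code).
    for (std::size_t half = 1; half < n; half <<= 1) {
        const double theta = -pi / static_cast<double>(half);
        const double s = std::sin(0.5 * theta);
        const cd wp(-2.0 * s * s, std::sin(theta));  // exp(i*theta) - 1
        cd w(1.0, 0.0);
        for (std::size_t m = 0; m < half; ++m) {
            for (std::size_t i = m; i < n; i += 2 * half) {
                const cd temp = w * a[i + half];
                a[i + half] = a[i] - temp;
                a[i] += temp;
            }
            w += w * wp;  // same trigonometric recurrence as above
        }
    }
}
```

The inner loop body is three statements here, which is about as short as it gets without hiding what the butterfly does.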
I don't believe you would be able to make it substantially shorter.
If this code were made much shorter, I would guess that it would significantly diminish readability.
Since the logic is relatively clear, the number of lines does not matter. Unless you're planning on using this on codegolf.stackexchange.com, this is a place where you should trust your compiler to help you (because it will).
This strikes me as premature optimization.
I am trying to implement nonlinear MPC for a 7-DOF manipulator in Drake. To do this, my constraints need dynamic quantities such as the mass matrix M(q) and the bias term C(q,q_dot)*q_dot, but those depend on the decision variables q and q_dot.
I tried the following:
// finalize plant
// create builder, diagram, context, plant context
...
// formulate optimization problem
drake::solvers::MathematicalProgram prog;
// create decision variables
...
std::vector<drake::solvers::VectorXDecisionVariable> q_v;
std::vector<drake::solvers::VectorXDecisionVariable> q_ddot;
for (int i = 0; i < H; i++) {
q_v.push_back(prog.NewContinuousVariables<14>(state_var_name));
q_ddot.push_back(prog.NewContinuousVariables<7>(input_var_name));
}
// add cost
...
// add constraints
...
for (int i = 0; i < H; i++) {
plant.SetPositionsAndVelocities(*plant_context, q_v[i]);
plant.CalcMassMatrix(*plant_context, M);
plant.CalcBiasTerm(*plant_context, C_q_dot);
}
...
for (int i = 0; i < H; i++) {
prog.AddConstraint( M * q_ddot[i] + C_q_dot + G >= lb );
prog.AddConstraint( M * q_ddot[i] + C_q_dot + G <= ub );
}
// solve prog
...
The above code will not work, because plant.SetPositionsAndVelocities(.) doesn't accept symbolic variables.
Is there any way to integrate M and C into my OCP constraints?
I think you want to impose the following nonlinear nonconvex constraint
lb <= M * qddot + C(q, v) + g(q) <= ub
Because this constraint is non-convex, we need to solve the problem through nonlinear optimization and evaluate the constraint in every iteration of the solver. We can't do this evaluation using symbolic computation (it would be horribly slow).
So you will need a constraint evaluator, something like this:
// This constraint takes [q;v;vdot] and evaluates
// M * vdot + C(q, v) + g(q)
class MyConstraint : public solvers::Constraint {
 public:
  MyConstraint(const MultibodyPlant<AutoDiffXd>& plant,
               systems::Context<AutoDiffXd>* context,
               const Eigen::Ref<const Eigen::VectorXd>& lb,
               const Eigen::Ref<const Eigen::VectorXd>& ub)
      : solvers::Constraint(plant.num_velocities(),
                            plant.num_positions() + 2 * plant.num_velocities(),
                            lb, ub),
        plant_{plant},
        context_{context} {
    ...
  }

 private:
  void DoEval(const Eigen::Ref<const AutoDiffVecXd>& x, AutoDiffVecXd* y) const {
    ...
  }
  // Held by reference: MultibodyPlant cannot be copied.
  const MultibodyPlant<AutoDiffXd>& plant_;
  systems::Context<AutoDiffXd>* context_;
};
int main() {
  ...
  // Construct the constraint and add it at every time instance
  std::vector<std::unique_ptr<systems::Context<AutoDiffXd>>> plant_contexts;
  for (int i = 0; i < H; ++i) {
    plant_contexts.push_back(plant.CreateDefaultContext());
    prog.AddConstraint(std::make_shared<MyConstraint>(plant, plant_contexts[i].get(), lb, ub), {q_v[i], q_ddot[i]});
  }
}
You could refer to the class CentroidalMomentumConstraint on how to construct your own MyConstraint class.
I have the following method called pgain, which calls the method dist that I am trying to parallelize:
/******************************************************************************/
/* For a given point x, find the cost of the following operation:
* -- open a facility at x if there isn't already one there,
* -- for points y such that the assignment distance of y exceeds dist(y, x),
* make y a member of x,
* -- for facilities y such that reassigning y and all its members to x
* would save cost, realize this closing and reassignment.
*
* If the cost of this operation is negative (i.e., if this entire operation
* saves cost), perform this operation and return the amount of cost saved;
* otherwise, do nothing.
*/
/* numcenters will be updated to reflect the new number of centers */
/* z is the facility cost, x is the number of this point in the array
points */
double pgain ( long x, Points *points, double z, long int *numcenters )
{
int i;
int number_of_centers_to_close = 0;
static double *work_mem;
static double gl_cost_of_opening_x;
static int gl_number_of_centers_to_close;
int stride = *numcenters + 2;
//make stride a multiple of CACHE_LINE
int cl = CACHE_LINE/sizeof ( double );
if ( stride % cl != 0 ) {
stride = cl * ( stride / cl + 1 );
}
int K = stride - 2 ; // K==*numcenters
//my own cost of opening x
double cost_of_opening_x = 0;
work_mem = ( double* ) malloc ( 2 * stride * sizeof ( double ) );
gl_cost_of_opening_x = 0;
gl_number_of_centers_to_close = 0;
/*
* For each center, we have a *lower* field that indicates
* how much we will save by closing the center.
*/
int count = 0;
for ( int i = 0; i < points->num; i++ ) {
if ( is_center[i] ) {
center_table[i] = count++;
}
}
work_mem[0] = 0;
//now we finish building the table. clear the working memory.
memset ( switch_membership, 0, points->num * sizeof ( bool ) );
memset ( work_mem, 0, stride*sizeof ( double ) );
memset ( work_mem+stride,0,stride*sizeof ( double ) );
//my *lower* fields
double* lower = &work_mem[0];
//global *lower* fields
double* gl_lower = &work_mem[stride];
#pragma omp parallel for
for ( i = 0; i < points->num; i++ ) {
float x_cost = dist ( points->p[i], points->p[x], points->dim ) * points->p[i].weight;
float current_cost = points->p[i].cost;
if ( x_cost < current_cost ) {
// point i would save cost just by switching to x
// (note that i cannot be a median,
// or else dist(p[i], p[x]) would be 0)
switch_membership[i] = 1;
cost_of_opening_x += x_cost - current_cost;
} else {
// cost of assigning i to x is at least current assignment cost of i
// consider the savings that i's **current** median would realize
// if we reassigned that median and all its members to x;
// note we've already accounted for the fact that the median
// would save z by closing; now we have to subtract from the savings
// the extra cost of reassigning that median and its members
int assign = points->p[i].assign;
lower[center_table[assign]] += current_cost - x_cost;
}
}
// at this time, we can calculate the cost of opening a center
// at x; if it is negative, we'll go through with opening it
for ( int i = 0; i < points->num; i++ ) {
if ( is_center[i] ) {
double low = z + work_mem[center_table[i]];
gl_lower[center_table[i]] = low;
if ( low > 0 ) {
// i is a median, and
// if we were to open x (which we still may not) we'd close i
// note, we'll ignore the following quantity unless we do open x
++number_of_centers_to_close;
cost_of_opening_x -= low;
}
}
}
//use the rest of working memory to store the following
work_mem[K] = number_of_centers_to_close;
work_mem[K+1] = cost_of_opening_x;
gl_number_of_centers_to_close = ( int ) work_mem[K];
gl_cost_of_opening_x = z + work_mem[K+1];
// Now, check whether opening x would save cost; if so, do it, and
// otherwise do nothing
if ( gl_cost_of_opening_x < 0 ) {
// we'd save money by opening x; we'll do it
for ( int i = 0; i < points->num; i++ ) {
bool close_center = gl_lower[center_table[points->p[i].assign]] > 0 ;
if ( switch_membership[i] || close_center ) {
// Either i's median (which may be i itself) is closing,
// or i is closer to x than to its current median
points->p[i].cost = points->p[i].weight * dist ( points->p[i], points->p[x], points->dim );
points->p[i].assign = x;
}
}
for ( int i = 0; i < points->num; i++ ) {
if ( is_center[i] && gl_lower[center_table[i]] > 0 ) {
is_center[i] = false;
}
}
if ( x >= 0 && x < points->num ) {
is_center[x] = true;
}
*numcenters = *numcenters + 1 - gl_number_of_centers_to_close;
} else {
gl_cost_of_opening_x = 0; // the value we'll return
}
free ( work_mem );
return -gl_cost_of_opening_x;
}
The function that I am trying to parallelize:
/* compute Euclidean distance squared between two points */
float dist ( Point p1, Point p2, int dim )
{
float result=0.0;
#pragma omp parallel for reduction(+:result)
for (int i=0; i<dim; i++ ){
result += ( p1.coord[i] - p2.coord[i] ) * ( p1.coord[i] - p2.coord[i] );
}
return ( result );
}
With Point being this:
/* this structure represents a point */
/* these will be passed around to avoid copying coordinates */
typedef struct {
float weight;
float *coord;
long assign; /* number of point where this one is assigned */
float cost; /* cost of that assignment, weight*distance */
} Point;
I have a large streamcluster application (815 lines of code) that produces real-time numbers and sorts them in a specific way. I used the Scalasca tool on Linux to measure which methods take up most of the time, and found that the method dist listed above is the most time-consuming. I am trying to use OpenMP, but the parallelized code runs slower than the serial code: if the serial code runs in 1.5 s, the parallelized version takes 20 s, although the results are the same. I am wondering whether this part of the code cannot be parallelized for some reason, or whether I am doing it incorrectly.
The method I am trying to parallelize is in the call tree main->pkmedian->pFL->pgain->dist (-> means that a method calls the following one).
The code you've chosen to parallelize:
float result=0.0;
#pragma omp parallel for reduction(+:result)
for (int i=0; i<dim; i++ ){
result += ( p1.coord[i] - p2.coord[i] ) * ( p1.coord[i] - p2.coord[i] );
}
is a poor candidate to benefit from parallelization. You should not use parallel for here. You should probably not use parallelization on an inner loop. If you can parallelize some outer loop, you're much more likely to see gains.
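As a rough sketch of what parallelizing an outer loop could look like, the example below keeps the distance computation serial and puts a single parallel region around the loop over points. The function total_dist_to and the flat coords layout are hypothetical and are not taken from streamcluster. Note that the running sum is declared in a reduction clause; accumulating into a shared variable from a parallel loop without one is a data race.

```cpp
#include <cassert>
#include <vector>

// Serial inner loop: squared Euclidean distance between two coordinate arrays.
inline float dist_sq(const float* a, const float* b, int dim) {
    float result = 0.0f;
    for (int k = 0; k < dim; ++k) {
        const float d = a[k] - b[k];
        result += d * d;
    }
    return result;
}

// Hypothetical helper: sums the squared distances from every point to point x.
// coords is assumed to hold the points back to back, dim floats per point.
double total_dist_to(const std::vector<float>& coords, int dim, long x) {
    assert(dim > 0);
    const long num_points = static_cast<long>(coords.size()) / dim;
    double total = 0.0;
    // One parallel region for the whole outer loop: each thread performs
    // many distance computations, so the start-up cost is paid only once.
    // Without OpenMP enabled the pragma is ignored and the loop runs serially.
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < num_points; ++i) {
        total += dist_sq(&coords[i * dim], &coords[x * dim], dim);
    }
    return total;
}
```

Whether this pays off still depends on how many points there are per call, so it is worth timing both variants.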
There is an overhead to coordinate the thread team to start the parallel region and another overhead for performing the reduction afterwards. Meanwhile, the parallel region's contents take essentially no time to run. Given that, you'd need dim to be extremely large before you'd expect this to give a performance benefit.
To express that point more graphically, consider that the math you're doing will take nanoseconds and compare it against this chart showing the overhead of various OpenMP directives.
If you need this to run faster, your first stop should be to use appropriate compilation flags, followed by looking into SIMD operations: SSE and AVX are good keywords. Your compiler might even invoke them automatically.
I've built some test code (see below) and compiled it with various optimizations enabled, as listed below, and run it on arrays of 100,000 elements. Note that enabling -O3 results in a run-time that is on the order of the OpenMP directives. This implies that you'd want arrays of about 400,000 before you'd want to think about using OpenMP and probably more like 1,000,000, to be safe.
No optimizations. Run-time is ~1900μs.
-O3: Enables many optimizations. Run-time is ~200μs.
-ffast-math: You want this, unless you're doing some very tricky things. Run-time is about the same.
-march=native: Compile code to use the full capabilities of your CPU, rather than a generic instruction set that would work on many CPUs. Run-time is ~100μs.
So there we go: strategic use of compiler options (-march=native) can double the speed of the code in question without having to muck about with parallelism.
Here is a handy slide presentation with some tips explaining how to use OpenMP in a performant manner.
Test code:
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>
int main(){
std::vector<double> a;
std::vector<double> b;
for(int i=0;i<100000;i++){
a.push_back(rand()/(double)RAND_MAX);
b.push_back(rand()/(double)RAND_MAX);
}
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
float result = 0.0;
//#pragma omp parallel for reduction(+:result)
for (unsigned int i=0; i<a.size(); i++ )
result += ( a[i] - b[i] ) * ( a[i] - b[i] );
std::chrono::steady_clock::time_point end= std::chrono::steady_clock::now();
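std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds"<<std::endl;
// Print the result so the optimizer cannot discard the loop as dead code.
std::cout << "Result = " << result << std::endl;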
}
I wrote a very simple function in GSL, to select a submatrix from an existing matrix in a struct.
EDIT: I had timed it VERY INCORRECTLY and didn't notice the changed number of zeros in front. Still, I hope this can be sped up.
For 100x100 submatrices of a 10000x10000 matrix, it takes 1.2E-5 seconds. So repeating that 1E4 times takes 50 times longer than diagonalising the 100x100 matrix.
EDIT:
I realise it happens even if I comment out everything except return(0);
Thus, I theorize, it must be something about struct TOWER. This is how TOWER looks:
struct TOWER
{
int array_level[TOWERSIZE];
int array_window[TOWERSIZE];
gsl_matrix *matrix_ordered_covariance;
gsl_matrix *matrix_peano_covariance;
double array_angle_tw[XISTEP];
double array_correl_tw[XISTEP];
gsl_interp_accel *acc_correl; // interpolating for correlation
gsl_spline *spline_correl;
double array_all_eigenvalues[TOWERSIZE]; //contains all eiv. of whole matrix
std::vector< std::vector<double> > cropped_peano_covariance, peano_mask;
};
Below comes my function!
/* --- --- */
int monolevelsubmatrix(int i, int j, struct TOWER *tower, gsl_matrix *result) // relying on spline!! // must add auto vanishing
{
int firstrow, firstcol,mu,nu,a,b;
double aux, correl;
firstrow = helix*i;
firstcol = helix*j;
gsl_matrix_view Xi = gsl_matrix_submatrix (tower ->matrix_ordered_covariance, firstrow, firstcol, helix, helix);
gsl_matrix_memcpy (result, &(Xi.matrix));
return(0);
}
/* --- --- */
The problem is almost certainly gsl_matrix_memcpy. The source for that is in copy_source.c, with:
const size_t src_tda = src->tda ;
const size_t dest_tda = dest->tda ;
size_t i, j;
for (i = 0; i < src_size1 ; i++)
{
for (j = 0; j < MULTIPLICITY * src_size2; j++)
{
dest->data[MULTIPLICITY * dest_tda * i + j]
= src->data[MULTIPLICITY * src_tda * i + j];
}
}
This would be quite slow. Note that gsl_matrix_memcpy raises a GSL_ERROR if the matrices are different sizes, so it's very likely the copy could be served with a CRT memcpy on the data members of dest and src.
This loop is very slow. Each cell is dereferenced through the dest and src structs for the data member, and THEN indexed.
You could choose to write a replacement for the library, or write your own personal version of this matrix copy, with something like (untested suggestion code here):
size_t cellsize = sizeof( src->data[0] ); // just pseudocode here
memcpy( dest->data, src->data, cellsize * src_size1 * src_size2 * MULTIPLICITY );
Note that MULTIPLICITY is a define, usually 1 or 2, probably depends on library configuration - might not apply to your usage (if it's 1 )
Now, important caveat....if the source matrix is a subview, then you have to go by rows...that is, a loop of rows in i where crt's memcpy is limited to rows at a time, not the entire matrix as I show above.
In other words, you do have to account for the source matrix geometry from which the subview was taken...that's probably why they index each cell (makes it simple).
If, however, you KNOW the geometry, you can very likely optimize this WAY above the performance you're seeing.
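To make the row-by-row caveat concrete, here is a minimal sketch. It uses a small stand-in struct, MatrixView, which is hypothetical and only mirrors the fields of gsl_matrix that matter here (data, size1, size2, tda); it is not the GSL type itself, and copy_rows is not a GSL function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the gsl_matrix fields used here.
struct MatrixView {
    double* data;       // pointer to the first element of the view
    std::size_t size1;  // number of rows
    std::size_t size2;  // number of columns
    std::size_t tda;    // row stride of the underlying storage, in elements
};

// Copies src into dest one row at a time, so it stays correct when either
// side is a sub-view whose rows are not contiguous in memory.
// Returns false if the shapes differ.
bool copy_rows(MatrixView& dest, const MatrixView& src) {
    assert(dest.data != nullptr && src.data != nullptr);
    if (dest.size1 != src.size1 || dest.size2 != src.size2) return false;
    for (std::size_t i = 0; i < src.size1; ++i) {
        std::memcpy(dest.data + i * dest.tda,
                    src.data + i * src.tda,
                    src.size2 * sizeof(double));
    }
    return true;
}
```

Each row is contiguous, so one memcpy per row is safe; a single memcpy over the whole block is only valid when tda equals size2 on both sides.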
If all you did was take out the src/dest dereference, you'd see SOME performance gain, as in:
const size_t src_tda = src->tda ;
const size_t dest_tda = dest->tda ;
size_t i, j;
double * dest_data = dest->data; // pseudocode here
double * src_data = src->data; // pseudocode here
for (i = 0; i < src_size1 ; i++)
{
for (j = 0; j < MULTIPLICITY * src_size2; j++)
{
dest_data[MULTIPLICITY * dest_tda * i + j]
= src_data[MULTIPLICITY * src_tda * i + j];
}
}
We'd HOPE the compiler recognized that anyway, but...sometimes...
I'm currently implementing the paper of Revelles, Urena and Lastra "An Efficient Parametric Algorithm for Octree Traversal". In Ray - Octree intersection algorithms someone implemented it and pasted his code. My implementation should be the same, except that I used some vectors for computation.
However, using this octree, only the upper right part of the image is rendered; for the rest of the image the octree isn't traversed. The check whether to traverse or not happens in the following method:
bool Octnode::intersect( Ray r, SurfaceData *sd )
{
unsigned int a = 0;
v3d o = r.origin();
v3d d = r.direction();
if ( r.direction()[0] < 0. ) {
o[0] = _size[0] - r.origin()[0];
d[0] = -r.direction()[0];
a |= 4;
}
if ( r.direction()[1] < 0. ) {
o[1] = _size[1] - r.origin()[1];
d[1] = -r.direction()[1];
a |= 2;
}
if ( r.direction()[2] < 0. ) {
o[2] = _size[2] - r.origin()[2];
d[2] = -r.direction()[2];
a |= 1;
}
v3d t0 = ( _min - o ) / d;
v3d t1 = ( _max - o ) / d;
scalar t = std::numeric_limits<double>::max();
// traversal -- if any -- starts here
if ( t0.max() < t1.min() ) {
return processSubtree( t0, t1, r, &t, sd, a );
} else {
return false;
}
}
[Edit] The above method implements the function
void ray_parameter( octree *oct, ray r )
from the paper. As C. Urena pointed out, there is an error in the paper that causes the traversal to be incorrect. Unfortunately, traversal is skipped before this error could come into play.
In the Google group that can be found following C. Urena's link, it seems the size of an octree node is computed differently. I did:
_size = _max - _min;
versus
_size = ( _max - _min ) / 2.;
in the Google group. I'll test that and post another update. [/Edit]
[Edit 2] Applying the fix that Carlos mentioned and reducing the size by half brought me this far:
The spheres should be completely rendered, but at least not all rays for the upper left quarter are rejected. [/Edit 2]
[Edit 3] Using different data sets I get seemingly better results, looks like I'll have to investigate some other parts of the code.
[/Edit 3]
I have no time for a detailed review of your code, but perhaps you should check for an error in the original paper which may also be present in your code. You can see it described here: http://lsi.ugr.es/curena/inves/wscg00/ -- there's a pointer to a Google group with the discussion.
Hope this helps,
Carlos.
I'm trying to figure out the vDSP functions and the results I'm getting are very strange.
This is related to this question:
Using std::complex with iPhone's vDSP functions
Basically I am trying to make sense of vDSP_vdist as I start off with a vector of std::complex< float >. Now AFAIK I should be able to calculate the magnitude by, simply, doing:
// std::abs of a complex does sqrtf( r^2 + i^2 ).
pOut[idx] = std::abs( pIn[idx] );
However when I do this I see the spectrum reflected around the midpoint of the vector. This is very strange.
Oddly, however, if I use a vDSP_ztoc followed by a vDSP_vdist I get exactly the results I expect. So I wrote a bit of code to try and understand what's going wrong.
bool VecMagnitude( float* pOut, const std::complex< float >* pIn, unsigned int num )
{
std::vector< float > realTemp( num );
std::vector< float > imagTemp( num );
DSPSplitComplex dspsc;
dspsc.realp = &realTemp.front();
dspsc.imagp = &imagTemp.front();
vDSP_ctoz( (DSPComplex*)pIn, 1, &dspsc, 1, num );
int idx = 0;
while( idx < num )
{
if ( fabsf( dspsc.realp[idx] - pIn[idx].real() ) > 0.0001f ||
fabsf( dspsc.imagp[idx] - pIn[idx].imag() ) > 0.0001f )
{
char temp[256];
sprintf( temp, "%f, %f - %f, %f", dspsc.realp[idx], dspsc.imagp[idx], pIn[idx].real(), pIn[idx].imag() );
fprintf( stderr, temp );
}
++idx;
}
return true;
}
Now what's strange is that the above check starts failing at idx = 1 and keeps failing to the end. The reason is that dspsc.realp[1] == pIn[0].imag(). It's as if, instead of splitting the input into 2 different buffers, it has straight memcpy'd half the vector of std::complexes into dspsc.realp, i.e. the 2 floats at std::complex[0], then the 2 floats in std::complex[1], and so on. dspsc.imagp is much the same: dspsc.imagp[1] == pIn[1].real().
This just makes no sense. Can someone explain where on earth I'm failing to understand what's going on?
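One reading consistent with the symptom above is that the stride on the interleaved side is counted in floats rather than in complex pairs, so a stride of 1 advances one float at a time while a stride of 2 advances one complex value at a time. That is an assumption about the routine's behaviour, not something confirmed in this post. The sketch below is plain C++, not the Accelerate implementation, and deinterleave is a hypothetical helper that only mimics that reading:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Accelerate: splits an interleaved
// buffer into real and imaginary arrays, stepping through the input
// `stride` floats at a time. The caller must ensure that `in` holds at
// least (count - 1) * stride + 2 floats.
void deinterleave(const float* in, std::size_t stride,
                  std::vector<float>& re, std::vector<float>& im,
                  std::size_t count) {
    assert(in != nullptr && stride >= 1);
    re.resize(count);
    im.resize(count);
    for (std::size_t n = 0; n < count; ++n) {
        re[n] = in[n * stride];      // real part of the n-th output
        im[n] = in[n * stride + 1];  // the float right after it
    }
}
```

With a stride of 1 the helper yields re[1] equal to the first input's imaginary part and im[1] equal to the second input's real part, which is the pattern described above. With a stride of 2 it produces the expected split.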