I am trying to optimize my code by implementing for loops on threads of the GPU. I am trying to eliminate two for loops using thrust::transform. The code in C++ looks like:
ka_index = 0;
for (int i = 0; i < N_gene; i++)
{
for (int j = 0; j < n_ka_d[i]; j++ )
{
co0 = get_coeff0(ka_vec_d[ka_index]);
act[i] += (co0*ka_val_d[ka_index]);
ka_index++;
}
act[i] = pow(act[i],n);
}
I am estimating coefficients for an ordinary differential equation (ODE) in the above loops and have transferred all the data onto the device using thrust. The number of genes is N_gene, so the first for loop runs N_gene times. The second for loop is bounded by the number of activators of each gene (other friendly genes in the gene pool whose presence increases the concentration of gene i), which is stored in the n_ka vector. The value of n_ka[i] can vary from 0 to N_gene - 1. ka_val holds the measure of activation for each activator, and ka_vec_d holds the index of the gene that activates gene i.
I am trying to represent these loops using iterators, but I am unable to do so. I am familiar with using thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple)) for a single for loop, but I am having a tough time coming up with a way to implement two nested for loops using counting_iterator or transform iterators. Any pointers or help to convert these two for loops will be appreciated. Thanks for your time!
This looks like a reduce problem. I think you can use thrust::transform with zip iterators and thrust::reduce_by_key. A sketch of this solution is:
// generate indices
std::vector< int > hindices;
for( size_t i=0 ; i<N_gene ; ++i )
for( size_t j=0 ; j<n_ka_d[i] ; ++j )
hindices.push_back( i );
thrust::device_vector< int > indices = hindices;
// generate tmp
// trafo1 implements get_coeff0( get< 0 >( t ) ) * get< 1 >( t);
thrust::device_vector< double > tmp( ka_vec_d.size() ); // one entry per activator
thrust::transform(
thrust::make_zip_iterator(
thrust::make_tuple( ka_vec_d.begin() , ka_val_d.begin() ) ) ,
thrust::make_zip_iterator(
thrust::make_tuple( ka_vec_d.end() , ka_val_d.end() ) ) ,
tmp.begin() , trafo1 );
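// do the reduction for each ac[i]
// reduce_by_key takes the key output first and the value output second;
// a gene without activators never appears in indices, so it gets no output entry
thrust::device_vector< int > indices_out( N_gene );
thrust::reduce_by_key( indices.begin() , indices.end() , tmp.begin() ,
indices_out.begin() , ac.begin() );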
// do the pow transformation
thrust::transform( ac.begin() , ac.end() , ac.begin() , pow_trafo );
I think this can also be optimized with transform_iterators to reduce the number of calls to thrust::transform and thrust::reduce_by_key.
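For checking the device result, the same three steps can be written on the host in plain C++. This is only a reference sketch under assumptions: get_coeff0 below is a made-up stand-in (the real one is not shown in the question), and the name activation_reference is invented for illustration. It also makes one caveat visible: a by-key reduction emits one value per run of equal keys, so a gene with zero activators produces no entry and the sums have to be placed by key rather than written densely.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's get_coeff0, which is not shown.
double get_coeff0(int gene_index) { return 1.0 + gene_index; }

// Host-side reference: expand keys, transform, sum by key, then apply pow.
std::vector<double> activation_reference(const std::vector<int>& n_ka,
                                         const std::vector<int>& ka_vec,
                                         const std::vector<double>& ka_val,
                                         double n)
{
    assert(ka_vec.size() == ka_val.size());

    // Step 1: one key per activator entry, naming the gene it belongs to.
    std::vector<std::size_t> keys;
    for (std::size_t i = 0; i < n_ka.size(); ++i)
        for (int j = 0; j < n_ka[i]; ++j)
            keys.push_back(i);
    assert(keys.size() == ka_vec.size());

    // Step 2: element-wise product (the job of the first thrust::transform).
    std::vector<double> tmp(ka_vec.size());
    for (std::size_t k = 0; k < tmp.size(); ++k)
        tmp[k] = get_coeff0(ka_vec[k]) * ka_val[k];

    // Step 3: sum per key. Genes without activators simply keep 0.
    std::vector<double> act(n_ka.size(), 0.0);
    for (std::size_t k = 0; k < tmp.size(); ++k)
        act[keys[k]] += tmp[k];

    // Step 4: final element-wise power.
    for (std::size_t i = 0; i < act.size(); ++i)
        act[i] = std::pow(act[i], n);
    return act;
}
```

Comparing this against the device output for a small input is a cheap way to catch a mismatch between the reduced keys and the gene indices.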
Using the Boost C++ odeint library, is it possible to solve a second-order differential equation defined as follows?
m*x''[i] + x'[i] = K*\sum{j=1,N} sin(x[j] - x[i]), where i = 1,2,3..N.
m = 1, K = 1
where the initial value of x is a vector or array of N uniformly generated random numbers between 0 and 2*pi.
I want to integrate the above equation using the runge_kutta stepper of odeint.
I can rewrite the equation as two first-order differential equations, but in that case how would the odeint stepper have to be written or modified?
Just transform your equations to a first-order ODE and use a state type of length 2 N. The first N entries hold the x[i], while the second N entries hold the velocities x'[i]:
void ode( state_type const& x , state_type &dxdt , double t )
{
    for( size_t i=0 ; i<N ; ++i )
    {
        double sum = 0.0;
        // calculate sum
        for( size_t j=0 ; j<N ; ++j ) sum += sin( x[j] - x[i] );
        dxdt[i] = x[i+N];
        // solve m*x'' + x' = K*sum for x''
        dxdt[i+N] = ( K * sum - x[i+N] ) / m;
    }
}
A complete example might look like
size_t N = 512;
typedef std::vector< double > state_type;
state_type x( 2 * N );
// initialize x
double t_start = 0.0 , t_end = 10.0 , dt = 0.01;
odeint::integrate( ode , x , t_start , t_end , dt );
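To see the first-order form working end to end without depending on odeint, here is a sketch that pairs the right-hand side with a hand-written classical Runge-Kutta step. The function rk4_step is a stand-in written for illustration, not odeint's stepper, and the names rhs, m and K are assumptions. The velocity equation comes from solving m*x'' + x' = K*sum for x''.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> state_type;

const double m = 1.0;  // values given in the question
const double K = 1.0;

// State layout: x[0..N-1] are positions, x[N..2N-1] are velocities.
void rhs(const state_type& x, state_type& dxdt, double /*t*/)
{
    assert(x.size() % 2 == 0 && dxdt.size() == x.size());
    const std::size_t N = x.size() / 2;
    for (std::size_t i = 0; i < N; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < N; ++j)
            sum += std::sin(x[j] - x[i]);
        dxdt[i] = x[i + N];                      // x' = v
        dxdt[i + N] = (K * sum - x[i + N]) / m;  // v' = (K*sum - v) / m
    }
}

// Hand-written classical fourth-order Runge-Kutta step (illustrative only).
void rk4_step(state_type& x, double t, double dt)
{
    const std::size_t n = x.size();
    state_type k1(n), k2(n), k3(n), k4(n), tmp(n);
    rhs(x, k1, t);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + 0.5 * dt * k1[i];
    rhs(tmp, k2, t + 0.5 * dt);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + 0.5 * dt * k2[i];
    rhs(tmp, k3, t + 0.5 * dt);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + dt * k3[i];
    rhs(tmp, k4, t + dt);
    for (std::size_t i = 0; i < n; ++i)
        x[i] += dt / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
}
```

With odeint itself, the same rhs should be usable as the system function, for example with integrate_const and a runge_kutta4< state_type > stepper, so nothing in the stepper has to be modified.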
I'd like to create a set of non-square images using CCfits. I can make a single one in the primary HDU, like this:
long axes[2] = { jmax, imax };
std::auto_ptr<CCfits::FITS> pFits(0);
pFits.reset ( new CCfits::FITS ( "fitfile.fits", FLOAT_IMG, 2, axes ) );
std::valarray<double> h2a0array ( jmax * imax );
for ( int i = 0 ; i < imax ; i++ )
for ( int j = 0 ; j < jmax ; j++ )
h2a0array [ j + jmax * i ] = i + j;
pFits->pHDU().write ( fpixel, imax * jmax, h2a0array );
But I don't know how to add other non-square images to my FITS file. I guess I have to use the CCfits addImage function, but I can only obtain square images with it:
long fpixel ( 1 );
std::vector<long> extAx ( 2, dim );
CCfits::ExtHDU* imageExt2 = pFits->addImage ( "h2a0array", FLOAT_IMG, extAx );
imageExt2->write ( fpixel, imax * jmax, h2a0array );
The extAx vector contains only two values, the first is the dimension (1D, 2D, 3D) of the image to add to the FITS file, the second is its size. I don't know any other way to add an image to a FITS file. If someone does, your help is strongly welcome!
Thanks,
Arnaud.
The vector of the last argument of addImage can have any dimension
and different axis lengths in the dimensions. There is no requirement that
the axis lengths are the same ("square" as you seem to call it):
vector<long> extAx ;
extAx.push_back(imax) ;
extAx.push_back(jmax) ;
extAx.push_back(kmax) ;
pFits->addImage("h2a0array", FLOAT_IMG, extAx );
I have a Polynomial class that has a get_vect member function which stores integers in a vector that is to be the representation of the coefficients of the polynomial. Now, I am trying to multiply two polynomials together using a Multiply non-member function, but I get stuck when it comes to the actual multiplication of the vectors. So far, what I have is what is shown below:
Polynomial Multiply(const Polynomial & poly1, const Polynomial & poly2)
{
vector<int> Poly1 = poly1.get_vect();
vector<int> Poly2 = poly2.get_vect();
vector<int> Poly3;
// pad the shorter coefficient vector with zeros so both have the same size
if( Poly1.size() < Poly2.size() )
{
Poly1.resize( Poly2.size(), 0 );
}
else if( Poly1.size() > Poly2.size() )
{
Poly2.resize( Poly1.size(), 0 );
}
return Poly3;
}
I see that it somehow has to follow the below pattern:
Ok, so if I understand the problem correctly, you want Poly3 to be a vector<int> that holds the coefficients that result from a polynomial multiplication between the polynomials represented by Poly1 and Poly2.
Tacit in this request is that all three polynomials are polynomials in a single variable, with each entry being the coefficient of an increasing power of that variable, i.e. { 4, 5, 6, 7 } corresponds to 4 + 5x + 6x^2 + 7x^3.
If so, then the actual multiplication shouldn't be that difficult at all, as long as your polynomials aren't terribly huge. You need code that looks approximately like this:
Poly3.resize(Poly1.size() + Poly2.size() - 1, 0); // Make the output big enough; init to 0
for (size_t i = 0; i != Poly1.size(); i++)
for (size_t j = 0; j != Poly2.size(); j++)
Poly3[i+j] += Poly1[i] * Poly2[j];
Now the result in Poly3 should be the product of Poly1 and Poly2.
It's entirely possible I forgot an edge condition; I'll watch for comments here to point out where I did. In the meantime, though, I did a few tests and it appears this gives the correct output.
If you have rather large polynomials, then you might want to look into math libraries to handle the multiplication. But for anything under about 20 - 30 terms? Unless your code leans very hard on this polynomial multiplication, I suspect this won't be your bottleneck.
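One edge condition worth spelling out: if either vector is empty, size() + size() - 1 wraps around because size_t is unsigned. A minimal sketch that guards against this is below. The function name multiply_coefficients is made up, and treating an empty vector as the zero polynomial is an assumption, since the Polynomial class is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply two polynomials given as coefficient vectors (lowest power first).
// Assumption: an empty vector stands for the zero polynomial.
std::vector<int> multiply_coefficients(const std::vector<int>& a,
                                       const std::vector<int>& b)
{
    if (a.empty() || b.empty())
        return std::vector<int>();  // avoids size() + size() - 1 wrapping around

    std::vector<int> product(a.size() + b.size() - 1, 0);
    for (std::size_t i = 0; i != a.size(); ++i)
        for (std::size_t j = 0; j != b.size(); ++j)
            product[i + j] += a[i] * b[j];
    return product;
}

// Self-check: (4 + 5x)(1 + 2x + 3x^2) = 4 + 13x + 22x^2 + 15x^3.
void check_multiply()
{
    const std::vector<int> p = {4, 5};
    const std::vector<int> q = {1, 2, 3};
    const std::vector<int> expected = {4, 13, 22, 15};
    assert(multiply_coefficients(p, q) == expected);
}
```

Note that no zero padding of the inputs is needed for this: the double loop already handles vectors of different lengths.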
The Short Version
In the following line:
aData[i] = aData[i] + ( aOn * sin( i ) );
If aOn is 0 or 1, does the processor actually perform the multiplication, or does it conditionally work out the result (0 for 0, other-value for 1)?
The Long Version
I'm looking into algorithm performance consistency, which partly involves a look into the effect of Branch Prediction.
The hypothesis is that this code:
for ( i = 0; i < iNumSamples; i++ )
aData[i] = aData[i] + ( aOn * sin( i ) );
will provide more stable performance than this code (where branch prediction may destabilise performance):
for ( i = 0; i < iNumSamples; i++ )
{
if ( aOn )
aData[i] = aData[i] + sin( i );
}
with aOn being either 0 or 1, and it can toggle during the loop execution by another thread.
The actual conditional calculation (+ sin( i ) in the example above) involves more processing, and the if condition must be within the loop (there is a multitude of conditions, not just one as in the example above; also, changes to aOn should take effect immediately and not once per loop).
Ignoring performance consistency, the performance tradeoff between the two options is in the time it takes to execute the if statement and that of a multiplication.
Regardless, it is easy to spot that if a processor would not perform the actual multiplication for values like 1 and 0, the first option could be a win-win solution (no branch prediction, better performance).
Processors perform regular multiplication with 0s and 1s.
The reason is that if the processor checked for 0 and 1 before each calculation, the check itself would cost extra cycles. While you would gain performance for 0 and 1 multipliers, you would lose performance for any other values (which are much more probable).
A simple program can illustrate this:
#include <cstdio>   // printf
#include <cstdlib>  // rand
#include <ctime>    // clock, clock_t
void Loop( float aCoefficient )
{
float iSum = 0.0f;
clock_t iStart, iEnd;
iStart = clock();
for ( int i = 0; i < 100000000; i++ )
{
iSum += aCoefficient * rand();
}
iEnd = clock();
printf("Coefficient: %f: %li clock ticks\n", aCoefficient, (long)( iEnd - iStart ) );
}
int main(int argc, const char * argv[])
{
Loop( 0.0f );
Loop( 1.0f );
Loop( 0.25f );
return 0;
}
For which the output is:
Coefficient: 0.000000: 1380620 clock ticks
Coefficient: 1.000000: 1375345 clock ticks
Coefficient: 0.250000: 1374483 clock ticks
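To compare the two loop shapes from the question directly, something like the following sketch could be used as a starting point. The function names are made up, and modelling aOn as a std::atomic<int> is an assumption based on the statement that another thread may toggle it; a plain int that never changes inside the loop could let the compiler hoist the test out of the loop, which would hide the effect being measured.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Variant 1: always multiply by the flag; the loop body has no branch on it.
void add_sine_multiply(std::vector<double>& data, const std::atomic<int>& on)
{
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] += on.load(std::memory_order_relaxed) * std::sin(static_cast<double>(i));
}

// Variant 2: test the flag on every iteration.
void add_sine_branch(std::vector<double>& data, const std::atomic<int>& on)
{
    for (std::size_t i = 0; i < data.size(); ++i)
        if (on.load(std::memory_order_relaxed))
            data[i] += std::sin(static_cast<double>(i));
}

// Both variants must agree whenever the flag is 0 or 1.
void check_variants_agree(int flag)
{
    assert(flag == 0 || flag == 1);
    std::vector<double> a(1000, 0.5), b(1000, 0.5);
    std::atomic<int> on(flag);
    add_sine_multiply(a, on);
    add_sine_branch(b, on);
    assert(a == b);
}
```

Timing each variant over many runs, with the flag fixed and with it toggled from a second thread, would show whether the branch actually makes the run time less stable on a given machine.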
Basic Question: I have a k dimensional box. I have a vector of upper bounds and lower bounds. What is the most efficient way to enumerate the coordinates of the vertices?
Background: As an example, say I have a 3 dimensional box. What is the most efficient algorithm / code to obtain:
vertex[0] = ( 0, 0, 0 ) -> ( L_0, L_1, L_2 )
vertex[1] = ( 0, 0, 1 ) -> ( L_0, L_1, U_2 )
vertex[2] = ( 0, 1, 0 ) -> ( L_0, U_1, L_2 )
vertex[3] = ( 0, 1, 1 ) -> ( L_0, U_1, U_2 )
vertex[4] = ( 1, 0, 0 ) -> ( U_0, L_1, L_2 )
vertex[5] = ( 1, 0, 1 ) -> ( U_0, L_1, U_2 )
vertex[6] = ( 1, 1, 0 ) -> ( U_0, U_1, L_2 )
vertex[7] = ( 1, 1, 1 ) -> ( U_0, U_1, U_2 )
where L_0 corresponds to the 0'th element of the lower bound vector & likewise U_2 is the 2nd element of the upper bound vector.
My Code:
const unsigned int nVertices = ((unsigned int)(floor(std::pow( 2.0, double(nDimensions)))));
for ( unsigned int idx=0; idx < nVertices; ++idx )
{
for ( unsigned int k=0; k < nDimensions; ++k )
{
if ( 0x00000001 & (idx >> k) )
{
bound[idx][k] = upperBound[k];
}
else
{
bound[idx][k] = lowerBound[k];
}
}
}
where the variable bound is declared as:
std::vector< std::vector<double> > bound(nVertices);
but I've pre-sized it so as not to waste time in the loop allocating memory. I need to call the above procedure about 50,000,000 times every time I run my algorithm -- so I need this to be really efficient.
Possible Sub-Questions: Does it tend to be faster to shift by k instead of always shifting by 1 and storing an intermediate result? (Should I be using >>= ??)
It will probably go faster if you can reduce conditional branching:
bound[idx][k] = upperLowerBounds[(idx >> k) & 1][k];
You might improve things even more if you can interleave the upper and lower bounds in a single array:
bound[idx][k] = upperLowerBounds[(k << 1) | ((idx >> k) & 1)];
I don't know if shifting idx incrementally helps. It's simple enough to implement, so it's worth a try.
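Put together, the interleaved lookup could look like the sketch below. The function name enumerate_vertices and the flat row-major output (instead of a vector of vectors) are assumptions made for illustration. Bit k of the vertex index selects the upper bound in dimension k, matching the code in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fill `vertices` (row-major, nVertices x nDimensions) with every corner of
// the box. Bit k of the vertex index picks the upper bound in dimension k.
void enumerate_vertices(const std::vector<double>& lowerBound,
                        const std::vector<double>& upperBound,
                        std::vector<double>& vertices)
{
    assert(lowerBound.size() == upperBound.size());
    const std::size_t nDimensions = lowerBound.size();
    assert(nDimensions < 8 * sizeof(std::size_t));
    const std::size_t nVertices = std::size_t(1) << nDimensions;

    // Interleave the bounds: entry 2k is the lower, 2k+1 the upper bound.
    std::vector<double> interleaved(2 * nDimensions);
    for (std::size_t k = 0; k < nDimensions; ++k) {
        interleaved[2 * k] = lowerBound[k];
        interleaved[2 * k + 1] = upperBound[k];
    }

    vertices.resize(nVertices * nDimensions);
    for (std::size_t idx = 0; idx < nVertices; ++idx)
        for (std::size_t k = 0; k < nDimensions; ++k)
            vertices[idx * nDimensions + k] =
                interleaved[(k << 1) | ((idx >> k) & 1)];
}
```

If this runs tens of millions of times, keeping interleaved and vertices alive between calls avoids repeated allocation, and 1 << nDimensions replaces the floating-point pow call. Whether either change matters in practice is something only a profile of the real workload can tell.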