OpenMP parallel for - C++

I have the following method, pgain, which calls the method dist that I am trying to parallelize:
/******************************************************************************/
/* For a given point x, find the cost of the following operation:
* -- open a facility at x if there isn't already one there,
* -- for points y such that the assignment distance of y exceeds dist(y, x),
* make y a member of x,
* -- for facilities y such that reassigning y and all its members to x
* would save cost, realize this closing and reassignment.
*
* If the cost of this operation is negative (i.e., if this entire operation
* saves cost), perform this operation and return the amount of cost saved;
* otherwise, do nothing.
*/
/* numcenters will be updated to reflect the new number of centers */
/* z is the facility cost, x is the number of this point in the array
points */
double pgain ( long x, Points *points, double z, long int *numcenters )
{
    int i;
    int number_of_centers_to_close = 0;
    static double *work_mem;
    static double gl_cost_of_opening_x;
    static int gl_number_of_centers_to_close;
    int stride = *numcenters + 2;
    // make stride a multiple of CACHE_LINE
    int cl = CACHE_LINE / sizeof ( double );
    if ( stride % cl != 0 ) {
        stride = cl * ( stride / cl + 1 );
    }
    int K = stride - 2; // K == *numcenters
    // my own cost of opening x
    double cost_of_opening_x = 0;
    work_mem = ( double* ) malloc ( 2 * stride * sizeof ( double ) );
    gl_cost_of_opening_x = 0;
    gl_number_of_centers_to_close = 0;
    /*
     * For each center, we have a *lower* field that indicates
     * how much we will save by closing the center.
     */
    int count = 0;
    for ( int i = 0; i < points->num; i++ ) {
        if ( is_center[i] ) {
            center_table[i] = count++;
        }
    }
    work_mem[0] = 0;
    // now we finish building the table. clear the working memory.
    memset ( switch_membership, 0, points->num * sizeof ( bool ) );
    memset ( work_mem, 0, stride * sizeof ( double ) );
    memset ( work_mem + stride, 0, stride * sizeof ( double ) );
    // my *lower* fields
    double* lower = &work_mem[0];
    // global *lower* fields
    double* gl_lower = &work_mem[stride];
    #pragma omp parallel for
    for ( i = 0; i < points->num; i++ ) {
        float x_cost = dist ( points->p[i], points->p[x], points->dim ) * points->p[i].weight;
        float current_cost = points->p[i].cost;
        if ( x_cost < current_cost ) {
            // point i would save cost just by switching to x
            // (note that i cannot be a median,
            // or else dist(p[i], p[x]) would be 0)
            switch_membership[i] = 1;
            cost_of_opening_x += x_cost - current_cost;
        } else {
            // cost of assigning i to x is at least current assignment cost of i
            // consider the savings that i's **current** median would realize
            // if we reassigned that median and all its members to x;
            // note we've already accounted for the fact that the median
            // would save z by closing; now we have to subtract from the savings
            // the extra cost of reassigning that median and its members
            int assign = points->p[i].assign;
            lower[center_table[assign]] += current_cost - x_cost;
        }
    }
    // at this time, we can calculate the cost of opening a center
    // at x; if it is negative, we'll go through with opening it
    for ( int i = 0; i < points->num; i++ ) {
        if ( is_center[i] ) {
            double low = z + work_mem[center_table[i]];
            gl_lower[center_table[i]] = low;
            if ( low > 0 ) {
                // i is a median, and
                // if we were to open x (which we still may not) we'd close i
                // note, we'll ignore the following quantity unless we do open x
                ++number_of_centers_to_close;
                cost_of_opening_x -= low;
            }
        }
    }
    // use the rest of working memory to store the following
    work_mem[K] = number_of_centers_to_close;
    work_mem[K+1] = cost_of_opening_x;
    gl_number_of_centers_to_close = ( int ) work_mem[K];
    gl_cost_of_opening_x = z + work_mem[K+1];
    // Now, check whether opening x would save cost; if so, do it, and
    // otherwise do nothing
    if ( gl_cost_of_opening_x < 0 ) {
        // we'd save money by opening x; we'll do it
        for ( int i = 0; i < points->num; i++ ) {
            bool close_center = gl_lower[center_table[points->p[i].assign]] > 0;
            if ( switch_membership[i] || close_center ) {
                // Either i's median (which may be i itself) is closing,
                // or i is closer to x than to its current median
                points->p[i].cost = points->p[i].weight * dist ( points->p[i], points->p[x], points->dim );
                points->p[i].assign = x;
            }
        }
        for ( int i = 0; i < points->num; i++ ) {
            if ( is_center[i] && gl_lower[center_table[i]] > 0 ) {
                is_center[i] = false;
            }
        }
        if ( x >= 0 && x < points->num ) {
            is_center[x] = true;
        }
        *numcenters = *numcenters + 1 - gl_number_of_centers_to_close;
    } else {
        gl_cost_of_opening_x = 0; // the value we'll return
    }
    free ( work_mem );
    return -gl_cost_of_opening_x;
}
The function that I am trying to parallelize:
/* compute Euclidean distance squared between two points */
float dist ( Point p1, Point p2, int dim )
{
    float result = 0.0;
    #pragma omp parallel for reduction(+:result)
    for ( int i = 0; i < dim; i++ ) {
        result += ( p1.coord[i] - p2.coord[i] ) * ( p1.coord[i] - p2.coord[i] );
    }
    return ( result );
}
With Point being this:
/* this structure represents a point */
/* these will be passed around to avoid copying coordinates */
typedef struct {
    float weight;
    float *coord;
    long assign; /* number of point where this one is assigned */
    float cost;  /* cost of that assignment, weight*distance */
} Point;
I have a large streamcluster application (815 lines of code) that produces real-time numbers and sorts them in a specific way. I used the Scalasca tool on Linux to measure which methods take up most of the time, and found that the method dist listed above is the most time-consuming. I am trying to use OpenMP, but the parallelized code takes longer to run than the serial code: if the serial code runs in 1.5 seconds, the parallelized version takes 20, although the results are the same. I am wondering whether there is some reason this part of the code can't be parallelized, or whether I am simply not doing it correctly.
The method I am trying to parallelize is in the call tree main -> pkmedian -> pFL -> pgain -> dist (where -> means "calls the following method").

The code you've chosen to parallelize:
float result=0.0;
#pragma omp parallel for reduction(+:result)
for (int i=0; i<dim; i++ ){
    result += ( p1.coord[i] - p2.coord[i] ) * ( p1.coord[i] - p2.coord[i] );
}
is a poor candidate to benefit from parallelization. You should not use parallel for here. More generally, you should probably not parallelize an inner loop; if you can parallelize some outer loop instead, you're much more likely to see gains.
There is an overhead to coordinate the thread team when starting the parallel region, and another overhead for performing the reduction afterwards. Meanwhile, the parallel region's contents take essentially no time to run. Given that, dim would need to be extremely large before you'd expect a performance benefit.
To express that point more graphically, consider that the math you're doing takes nanoseconds, and compare it against this chart showing the overhead of various OpenMP directives.
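If you do reach for OpenMP here, the loop worth targeting is the outer per-point loop in pgain, not the loop inside dist. A minimal sketch, reusing the names from your pgain (the reduction handles cost_of_opening_x; the shared lower[] update is guarded with an atomic here, though per-thread copies of lower merged at the end would likely scale better):
#pragma omp parallel for reduction(+:cost_of_opening_x)
for ( int i = 0; i < points->num; i++ ) {
    float x_cost = dist ( points->p[i], points->p[x], points->dim ) * points->p[i].weight;
    float current_cost = points->p[i].cost;
    if ( x_cost < current_cost ) {
        switch_membership[i] = 1;
        cost_of_opening_x += x_cost - current_cost;
    } else {
        int assign = points->p[i].assign;
        #pragma omp atomic
        lower[center_table[assign]] += current_cost - x_cost;
    }
}
Even then, measure before and after: the per-iteration work is small, so the gain depends heavily on points->num.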
If you need this to run faster, your first stop should be to use appropriate compilation flags, followed by looking into SIMD operations: SSE and AVX are good keywords. Your compiler might even invoke them automatically.
I've built some test code (see below), compiled it with various optimizations enabled as listed below, and run it on arrays of 100,000 elements. Note that enabling -O3 results in a run-time that is on the order of the OpenMP directives' overhead. This implies that you'd want arrays of about 400,000 elements before you'd want to think about using OpenMP, and probably more like 1,000,000 to be safe.
No optimizations. Run-time is ~1900μs.
-O3: Enables many optimizations. Run-time is ~200μs.
-ffast-math: You want this, unless you're doing some very tricky things. Run-time is about the same.
-march=native: Compile code to use the full capabilities of your CPU, rather than a generic instruction set that would work on many CPUs. Run-time is ~100μs.
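For reference, the invocation combining all of the above would be something like this (assuming GCC or Clang):
g++ -O3 -ffast-math -march=native test.cpp -o test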
So there we go, strategic use of compiler options (-march=native) can double the speed of the code in question without having to muck about in parallelism.
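And if you later outgrow what the compiler generates on its own, an explicit AVX version of dist might look roughly like the sketch below. This is an illustration, not drop-in code: it assumes FMA-capable hardware and, for brevity, that dim is a multiple of 8 (a real version needs a scalar tail loop), and it should be compiled with -mavx2 -mfma or -march=native.
#include <immintrin.h>
float dist_avx ( Point p1, Point p2, int dim )
{
    __m256 acc = _mm256_setzero_ps();
    for ( int i = 0; i < dim; i += 8 ) {
        // d = p1.coord[i..i+7] - p2.coord[i..i+7]
        __m256 d = _mm256_sub_ps( _mm256_loadu_ps( p1.coord + i ),
                                  _mm256_loadu_ps( p2.coord + i ) );
        acc = _mm256_fmadd_ps( d, d, acc ); // acc += d*d, 8 lanes at a time
    }
    // horizontal sum of the 8 lanes
    __m128 lo = _mm_add_ps( _mm256_castps256_ps128( acc ),
                            _mm256_extractf128_ps( acc, 1 ) );
    lo = _mm_hadd_ps( lo, lo );
    lo = _mm_hadd_ps( lo, lo );
    return _mm_cvtss_f32( lo );
}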
Here is a handy slide presentation with some tips explaining how to use OpenMP in a performant manner.
Test code:
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>

int main(){
    std::vector<double> a;
    std::vector<double> b;
    for(int i=0;i<100000;i++){
        a.push_back(rand()/(double)RAND_MAX);
        b.push_back(rand()/(double)RAND_MAX);
    }

    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();

    float result = 0.0;
    //#pragma omp parallel for reduction(+:result)
    for (unsigned int i=0; i<a.size(); i++ )
        result += ( a[i] - b[i] ) * ( a[i] - b[i] );

    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds" << std::endl;
}

Related

Accumulating Doubles Into Bins via intrinsics

I have a vector of observations and an equal length vector of offsets assigning observations to a set of bins. The value of each bin should be the sum of all observations assigned to that bin, and I'm wondering if there's a vectorized method to do the reduction.
A naive implementation is below:
const int N_OBS = 100'000'000;
const int N_BINS = 16;
double obs[N_OBS];     // Observations
int8_t offsets[N_OBS];
double acc[N_BINS] = {0};
for (int i = 0; i < N_OBS; ++i) {
    acc[offsets[i]] += obs[i]; // accumulate obs value into its assigned bin
}
Is this possible using simd/avx intrinsics? Something similar to the above will be run millions of times. I've looked at scatter/gather approaches, but can't seem to figure out a good way to get it done.
Modern CPUs are surprisingly good at running your naïve version. On AMD Zen3, I'm getting 48 ms for 100M random numbers of input; that's 18 GB/sec of RAM read bandwidth, which is about 35% of the hard bandwidth limit on my computer (dual-channel DDR4-3200).
I'm afraid no SIMD is going to help here. Still, the best version I got is the following. Compile with OpenMP support; the switch depends on your C++ compiler.
#include <cstring>
#include <cstdint>
#include <immintrin.h>

void computeHistogramScalarOmp( const double* rsi, const int8_t* indices, size_t length, double* rdi )
{
    // Count of OpenMP threads = CPU cores to use
    constexpr int ompThreadsCount = 4;
    // Use an independent set of accumulators per thread; otherwise concurrency is going to corrupt the data.
    // Aligning by 64 = cache line; we want to assign cache lines to CPU cores, sharing them is extremely expensive.
    alignas( 64 ) double accumulators[ 16 * ompThreadsCount ];
    memset( &accumulators, 0, sizeof( accumulators ) );
    // Minimize OMP overhead by dispatching very few large tasks
    #pragma omp parallel for schedule(static, 1)
    for( int i = 0; i < ompThreadsCount; i++ )
    {
        // Grab a slice of the output buffer
        double* const acc = &accumulators[ i * 16 ];
        // Compute a slice of the source data for this thread
        const size_t first = i * length / ompThreadsCount;
        const size_t last = ( i + 1 ) * length / ompThreadsCount;
        // Accumulate into the thread-local portion of the buffer
        for( size_t j = first; j < last; j++ )
        {
            const int8_t idx = indices[ j ];
            acc[ idx ] += rsi[ j ];
        }
    }
    // Reduce 16*N scalars to 16 with a few AVX instructions
    for( int i = 0; i < 16; i += 4 )
    {
        __m256d v = _mm256_load_pd( &accumulators[ i ] );
        for( int j = 1; j < ompThreadsCount; j++ )
        {
            __m256d v2 = _mm256_load_pd( &accumulators[ i + j * 16 ] );
            v = _mm256_add_pd( v, v2 );
        }
        _mm256_storeu_pd( rdi + i, v );
    }
}
The above version runs in 20.5 ms, which translates to 88% of the RAM bandwidth limit.
P.S. I have no idea why the optimal thread count is 4 here; I have 8 cores/16 threads in the CPU. Both lower and higher values decrease the bandwidth. The constant is probably CPU-specific.
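For completeness, a hypothetical driver wiring this to the arrays from the question (compile with something like g++ -O2 -fopenmp -mavx):
double acc[N_BINS];
computeHistogramScalarOmp( obs, offsets, N_OBS, acc );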
If the offsets really do not change for thousands (probably even tens) of iterations, it is likely worthwhile to "transpose" them, i.e., to store all indices which need to be added to acc[0], then all indices which need to be added to acc[1], etc.
Essentially, what you are doing originally is a sparse-matrix times dense-vector product, with the matrix in compressed-column-storage (CCS) format (without explicitly storing the 1-values).
As shown in this answer, sparse GEMV products are usually faster if the matrix is stored in compressed-row-storage (CRS) format (even without AVX2's gather instruction, you don't need to load and store the accumulated value every time).
Untested example implementation:
#include <cstdint>
#include <vector>

using sparse_matrix = std::vector<std::vector<int> >;

// call this once:
sparse_matrix transpose(uint8_t const* offsets, int n_bins, int n_obs){
    sparse_matrix res;
    res.resize(n_bins);
    // collect the observation indices belonging to each bin:
    for(int i=0; i<n_obs; ++i) {
        // assert(offsets[i] < n_bins);
        res[offsets[i]].push_back(i);
    }
    return res;
}

void accumulate(double acc[], sparse_matrix const& indexes, double const* obs){
    for(std::size_t row=0; row<indexes.size(); ++row) {
        double sum = 0;
        for(int col : indexes[row]) {
            // you can manually vectorize this using _mm256_i32gather_pd,
            // but clang/gcc should autovectorize this with -ffast-math -O3 -march=native
            sum += obs[col];
        }
        acc[row] = sum;
    }
}
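A hypothetical usage with the arrays from the question (note the question used int8_t while transpose takes uint8_t, so non-negative offsets are assumed; n_reps is just an illustrative name):
sparse_matrix m = transpose( (uint8_t const*)offsets, N_BINS, N_OBS ); // once
for (int rep = 0; rep < n_reps; ++rep)
    accumulate( acc, m, obs ); // reusable for as long as the offsets stay fixed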

Implementing De Boor's algorithm for finding points on a B-spline

I've been working on this for several weeks but have been unable to get my algorithm working properly, and I'm at my wit's end. Here's an illustration of what I have achieved:
If everything were working, I would expect a perfect circle/oval at the end.
My sample points (in white) are recalculated every time a new control point (in yellow) is added. At 4 control points everything looks perfect; as I add a 5th on top of the 1st, things still look alright; but then on the 6th it starts to go off to the side, and on the 7th it jumps up to the origin!
Below I'll post my code, where calculateWeightForPointI contains the actual algorithm. For reference, here is the information I'm trying to follow. I'd be so grateful if someone could take a look for me.
void updateCurve(const std::vector<glm::vec3>& controls, std::vector<glm::vec3>& samples)
{
    int subCurveOrder = 4; // = k = I want to break my curve into cubics

    // De Boor, 1st attempt
    if(controls.size() >= subCurveOrder)
    {
        createKnotVector(subCurveOrder, controls.size());
        samples.clear();
        for(int steps=0; steps<=20; steps++)
        {
            // use steps to get a 0-1 range value for progression along the curve
            // then get that value into the range [k-1, n+1]
            // k-1 = subCurveOrder-1
            // n+1 = always the number of total control points
            float t = ( steps / 20.0f ) * ( controls.size() - (subCurveOrder-1) ) + subCurveOrder-1;

            glm::vec3 newPoint(0,0,0);
            for(int i=1; i <= controls.size(); i++)
            {
                float weightForControl = calculateWeightForPointI(i, subCurveOrder, controls.size(), t);
                newPoint += weightForControl * controls.at(i-1);
            }
            samples.push_back(newPoint);
        }
    }
}

// i   = the weight we're looking for; i should go from 1 to n+1, where n+1 equals the total number of control points.
// k   = curve order = power/degree + 1, e.g. to break the whole curve into cubics, use a curve order of 4
// cps = number of total control points
// t   = current step/interp value
float calculateWeightForPointI( int i, int k, int cps, float t )
{
    // test if we've reached the bottom of the recursive call
    if( k == 1 )
    {
        if( t >= knot(i) && t < knot(i+1) )
            return 1;
        else
            return 0;
    }

    float numeratorA   = ( t - knot(i) );
    float denominatorA = ( knot(i + k-1) - knot(i) );
    float numeratorB   = ( knot(i + k) - t );
    float denominatorB = ( knot(i + k) - knot(i + 1) );

    float subweightA = 0;
    float subweightB = 0;

    if( denominatorA != 0 )
        subweightA = numeratorA / denominatorA * calculateWeightForPointI(i, k-1, cps, t);
    if( denominatorB != 0 )
        subweightB = numeratorB / denominatorB * calculateWeightForPointI(i+1, k-1, cps, t);

    return subweightA + subweightB;
}

// returns the knot value at the passed-in index
// if i = 1 and we want Xi then we have to remember to index with i-1
float knot(int indexForKnot)
{
    // subtract 1 from i because we count from i=1 to n+1 but index the vector from 0
    return knotVector.at(indexForKnot-1);
}

// calculate the whole knot vector
void createKnotVector(int curveOrderK, int numControlPoints)
{
    int knotSize = curveOrderK + numControlPoints;
    for(int count = 0; count < knotSize; count++)
    {
        knotVector.push_back(count);
    }
}
Your algorithm seems to work for any inputs I tried it on. Your problem might be that a control point is not where it is supposed to be, or that the control points haven't been initialized properly. It looks like there are two control points half the height below the bottom-left corner.
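One quick way to test that theory is the partition-of-unity property of B-spline basis functions: for every sampled t, the weights returned by calculateWeightForPointI must sum to 1, so a sum far from 1 points at the knot vector or an indexing error, while a correct sum with a wrong curve points at the control points themselves. A sketch you could drop into updateCurve's sampling loop (assumes <cassert> and <cmath> are available):
// Hypothetical sanity check on the basis weights for the current t:
float weightSum = 0.0f;
for( int i = 1; i <= (int)controls.size(); i++ )
    weightSum += calculateWeightForPointI( i, subCurveOrder, controls.size(), t );
assert( std::fabs( weightSum - 1.0f ) < 1e-4f );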

Fast percentile in C++

My program calculates a Monte Carlo simulation for the value-at-risk metric. To simplify as much as possible, I have:
1/ simulated daily cashflows
2/ to get a sample of a possible 1-year cashflow,
I need to draw 365 random daily cashflows and sum them
Hence, the daily cashflows are an empirically given distribution function to be sampled 365 times. For this, I
1/ sort the daily cashflows into an array called *this->distro*
2/ calculate 365 percentiles corresponding to random probabilities
I need to do this simulation of a yearly cashflow, say, 10K times to get a population of simulated yearly cashflows to work with. Having the distribution function of daily cashflows prepared, I do the sampling like...
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
    generatedVal = 0.0;
    for ( register unsigned int idxDay = 0; idxDay < 365; idxDay++ )
    {
        prob = (FLT_TYPE)fastrand();     // prob [0,1]
        dIdx = prob * dMaxDistroIndex;   // scale prob to distro function size
                                         // to get an index into distro array
        _floor = ((FLT_TYPE)(long)dIdx); // fast version of floor
        _ceil = _floor + 1.0f;           // 'fast' ceil:)
        iIdx1 = (unsigned int)( _floor );
        iIdx2 = iIdx1 + 1;
        // interpolation per se
        generatedVal += this->distro[iIdx1] * ( _ceil - dIdx );
        generatedVal += this->distro[iIdx2] * ( dIdx - _floor );
    }
    this->yearlyCashflows[idxSim] = generatedVal;
}
The code inside both for loops does linear interpolation. If, say, USD 1000 corresponds to prob=0.01 and USD 10000 corresponds to prob=0.1, then if I don't have an empirical number for p=0.05, I want to get USD 5000 by interpolation.
The question: this code runs correctly, but the profiler says that the program spends circa 60% of its runtime on the interpolation per se. So my question is, how can I make this task faster? Sample runtimes reported by VTune for specific lines are as follows:
prob = (FLT_TYPE)fastrand();                          // 0.727 s
dIdx = prob * dMaxDistroIndex;                        // 1.435 s
_floor = ((FLT_TYPE)(long)dIdx);                      // 0.718 s
_ceil = _floor + 1.0f;                                // -
iIdx1 = (unsigned int)( _floor );                     // 4.949 s
iIdx2 = iIdx1 + 1;                                    // -
// interpolation per se
generatedVal += this->distro[iIdx1] * ( _ceil - dIdx );  // -
generatedVal += this->distro[iIdx2] * ( dIdx - _floor ); // 12.704 s
Dashes mean the profiler reports no runtimes for those lines.
Any hint will be greatly appreciated.
Daniel
EDIT:
Both c.fogelklou and MSalters have pointed out great enhancements. The best code in line with what c.fogelklou said is
converter = distroDimension / (FLT_TYPE)(RAND_MAX + 1);
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
    generatedVal = 0.0;
    for ( register unsigned int idxDay = 0; idxDay < 365; idxDay++ )
    {
        dIdx = (FLT_TYPE)fastrand() * converter;
        iIdx1 = (unsigned long)dIdx;
        _floor = (FLT_TYPE)iIdx1;
        generatedVal += this->distro[iIdx1] + this->diffs[iIdx1] * ( dIdx - _floor );
    }
}
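(this->diffs above is presumably an array of adjacent distro differences precomputed once, along these hypothetical lines, with distroSize standing in for the distro array's length:)
// diffs[i] = distro[i+1] - distro[i], so each sample needs one multiply-add.
for ( unsigned int i = 0; i + 1 < distroSize; i++ )
    this->diffs[i] = this->distro[i+1] - this->distro[i];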
and the best I have along MSalters' lines is
normalizer = 1.0 / (FLT_TYPE)(RAND_MAX + 1);
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
    generatedVal = 0.0;
    for ( register unsigned int idxDay = 0; idxDay < 365; idxDay++ )
    {
        dIdx = (FLT_TYPE)fastrand() * normalizer;
        iIdx1 = fastrand() % _g.xDayCount;
        generatedVal += this->distro[iIdx1];
        generatedVal += this->diffs[iIdx1] * dIdx;
    }
}
The second code is approx. 30 percent faster. Now, of 95 s of total runtime, the last line consumes 68 s. The next-to-last line consumes only 3.2 s, hence the double*double multiplication must be the devil. I thought of SSE (saving the last three operands into an array, carrying out a vector multiplication of this->diffs[i]*dIdx[i], and adding the result to this->distro[i]), but that code ran 50 percent slower. Hence, I think I hit the wall.
Many thanks to all.
D.
This is a proposal for a small optimization, removing the need for ceil, two casts, and one of the multiplies. If you are running on a fixed-point processor, that would explain why the multiplies and the casts between float and int are taking so long. In that case, try using fixed-point optimizations, or turn on floating point in your compiler if the CPU supports it!
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
    generatedVal = 0.0;
    for ( register unsigned int idxDay = 0; idxDay < 365; idxDay++ )
    {
        prob = (FLT_TYPE)fastrand();   // prob [0,1]
        dIdx = prob * dMaxDistroIndex; // scale prob to distro function size
                                       // to get an index into distro array
        iIdx1 = (long)dIdx;
        _floor = (FLT_TYPE)iIdx1;      // fast version of floor
        iIdx2 = iIdx1 + 1;
        // interpolation per se
        {
            const FLT_TYPE diff = this->distro[iIdx2] - this->distro[iIdx1];
            const FLT_TYPE interp = this->distro[iIdx1] + diff * ( dIdx - _floor );
            generatedVal += interp;
        }
    }
    this->yearlyCashflows[idxSim] = generatedVal;
}
I would recommend fixing fastrand. Floating-point code isn't the fastest in the world, but what is especially slow is the switching between floating-point and integer code. Since you need an integer index, use an integer random function.
It may even be advantageous to pre-generate all 365 random values in a loop. Since you need only log2(dMaxDistroIndex) bits of randomness per value, you may be able to reduce the number of RNG calls.
You would subsequently pick a random number between 0 and 1 for the interpolation fraction.
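A sketch of that bit-reuse idea (hypothetical: it assumes fastrand returns at least 30 random bits and the distro has at most 2^15 entries; the modulo also introduces a slight bias):
unsigned int r = fastrand();                                  // one RNG call...
unsigned int iIdxA = ( r & 0x7FFF ) % _g.xDayCount;           // ...yields two
unsigned int iIdxB = ( ( r >> 15 ) & 0x7FFF ) % _g.xDayCount; // independent indices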

Parallel Computing with OpenMP or _gnu_parallel does not speed up the code

I have this piece of code. I am trying to apply OpenMP as well as __gnu_parallel::for_each to make it parallel, but neither method is working.
What should I do?
Here make is a vector of sets, and the type contained in the sets is OctCell*.
The algorithm gives the correct output, but does not speed up the code. I have 4 cores.
void Oct::applyFunction3(void (*Function)(OctCell* cell), unsigned int level)
{
    __gnu_parallel::for_each(make.at(level).begin(), make.at(level).end(), Function);
}
The Function is
void directionalSweepX(OctCell* cell) {
    OctCell *positiveCell, *negativeCell;
    positiveCell = cell->getNeighbour(RIGHT);
    negativeCell = cell->getNeighbour(LEFT);
    addFluxToConserveds(cell, positiveCell, negativeCell, X);
}
addFluxToConserveds does the following:
void addFluxToConserveds(OctCell* cell, OctCell* positiveCell, OctCell* negativeCell, SWEEP_DIRECTION direction) {
    double deltaT = pow(2.0, cell->getLevel() - cell->getParentOct()->lMin) * gDeltaT;
    // You have corrected that delta t is delta (L)
    double alpha = (1 << (int) cell->getParentOct()->lMin) * gDeltaT / gL; // what's the purpose of <<?
    double beta = alpha / 8.0;
    double gamma;
    double Flux[5] = {0.0, 0.0, 0.0, 0.0, 0.0};

    if ( positiveCell == 0 ) {
        Flux[direction+1] = getPressure(cell);
    } else if ( positiveCell->isLeaf() ) {
        computeFlux(cell, positiveCell, direction, Flux);
        gamma = (positiveCell->getLevel() == cell->getLevel()) ? alpha : beta;
    }
    for (int i=0; i<5; i++) {
        cell->mConserveds_n[i] -= alpha * Flux[i];
        if (positiveCell) positiveCell->mConserveds_n[i] += gamma * Flux[i];
    }

    Flux[0] = Flux[1] = Flux[2] = Flux[3] = Flux[4] = 0.0;
    if ( negativeCell == 0 ) {
        Flux[direction+1] = getPressure(cell);
    } else if ( negativeCell->isLeaf() && negativeCell->getLevel() == cell->getLevel() - 1 ) {
        computeFlux(negativeCell, cell, direction, Flux);
    }
    for (int i=0; i<5; i++) {
        cell->mConserveds_n[i] += alpha * Flux[i];
        if (negativeCell) negativeCell->mConserveds_n[i] -= beta * Flux[i];
    }
}
Use #include <omp.h>.
In the function addFluxToConserveds you can add a #pragma omp for to the two for loops, because each iteration does not depend on the others to complete.
Because the second for loop depends on sequential code before it, you can't work with sections or tasks here.
What is the sequential implementation of applyFunction3?
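For illustration, the suggestion above would look something like this (a bare #pragma omp for must sit inside an enclosing #pragma omp parallel region, so the combined form is shown; with only 5 iterations per loop, though, the threading overhead will almost certainly dominate any gain):
#pragma omp parallel for
for (int i = 0; i < 5; i++) {
    cell->mConserveds_n[i] -= alpha * Flux[i];
    if (positiveCell) positiveCell->mConserveds_n[i] += gamma * Flux[i];
}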
You have to remember one critical thing about OpenMP: a program compiled on one architecture does not become optimized for every other architecture, even within the same family of processors (Intel Core Duo vs. Intel dual-core, Intel vs. AMD, etc.).
This means it runs fast on the architecture it was compiled on; on other ones, it's just luck.

Different number of threads, different answers [closed]

So I have some neural network simulator code that works correctly on the CPU, and the parallel version agrees with the serial version to at least 6 decimal places with a single 32-thread block on both of my CUDA PCs under Win7. But with 1 block and 64 threads, slightly different values for Wt are generated: the Wt values often agree to no more than 3 decimal places, and when I attempt to eliminate race conditions by embedding __syncthreads() within the loops, the Wt values come back as Not A Number when copied to the CPU.
Can someone give me a hint about what I might be doing wrong? I've included the parallelized code below; knlBackProp is being called with lSampleQtyReq=10000, o=1, and Option='R':
// device-global variables to facilitate data transfer
__device__ __constant__ __align__(8) struct rohanContext devSes;
__device__ __constant__ struct rohanLearningSet devLearn;
__device__ __align__(16) struct rohanNetwork devNet;
__device__ double devdReturn[1024*1024];
__device__ double devdRMSE=0;
__device__ int devlReturn[1024*1024];
__device__ int devlTrainable=0;
extern"C"
int knlBackProp(struct rohanContext& rSes, long lSampleQtyReq, long o, char Option)
{mIDfunc /*! divides error in yielded values and back-propagates corrections among weights */
// Option S - single sample correction only
// Option E - keep existing weights, count trainable samples only
// Option R - perform corrections for all trainable samples
int lTotal=0;
cudaMemcpyToSymbol( "devlTrainable", &lTotal, sizeof(int) ); // init return value on both sides
mCheckCudaWorked
cudaEvent_t start, stop;
cudaEventCreate( &start);
cudaEventCreate( &stop);
cudaEventRecord( start, 0);
mtkBackPropMT<<< rSes.iBpropBlocks , rSes.iBpropThreads >>>( lSampleQtyReq, o, Option);
cudaEventRecord( stop, 0);
mCheckCudaWorked
cudaMemcpyFromSymbol( &lTotal, "devlTrainable", sizeof(long) ); // retrieve return value
mCheckCudaWorked
cudaEventSynchronize( stop);
float elapsedTime;
cudaEventElapsedTime( &elapsedTime, start, stop);
conPrintf("DEVICE: Time to complete BackProp kernel: %3.1f ms\n", elapsedTime);
cudaEventDestroy( start);
cudaEventDestroy( stop);
return lTotal;
}
__global__ __device__ void mtkBackPropMT( long lSampleQtyReq, long o, char Option)
{/*! divides error in yielded values and back-propagates corrections among weights */
// Option S - single sample correction only
// Option E - keep existing weights, count trainable samples only
// Option R - perform corrections for all trainable samples
if(Option=='E' || Option=='e'){ //
devlTrainable=0; // reset global mem trainable counter
subkBackPropEoptMT(lSampleQtyReq, o);
}
if(Option=='S' || Option=='s'){
devlTrainable=0; // reset global mem trainable counter
subkBackPropSoptMT(lSampleQtyReq, false, devNet, devNet.Signals, devNet.Zs, devNet.Wt, devNet.Deltas, devLearn.gpuXInputs, devLearn.gpuYEval, devLearn.gpudYEval);
}
if(Option=='R' || Option=='r'){ //
devlTrainable=0; // reset global mem trainable counter
subkBackPropRoptMT(lSampleQtyReq, o);
}
}
__device__ void subkBackPropRoptMT(long lSampleQtyReq, long o)
{/*! flags and counts samples meeting */
long OUTROWLEN=devLearn.iOutputQty+1; // prepare array index and width
//long tIx = threadIdx.x + devSes.iEvalThreads * blockIdx.x; // tIx is thread index over the kernel
long tIx = threadIdx.x + blockDim.x * blockIdx.x; // tIx is thread index over the kernel
//long lTotalThreads = devSes.iBpropThreads * devSes.iBpropBlocks; // total number of threads
double maxSquared = devSes.dMAX * devSes.dMAX ; //needed to compart to stored delta squared values
devlTrainable=0; // clear global mem accumulator; out of bound samples will remain at this value
for (long s=0; s<lSampleQtyReq; ++s){ // iterate over samples
if( devLearn.gpudSE1024[IDX2C( o, s, OUTROWLEN )] > maxSquared ){ // if the MAX criterion is exceeded
if(tIx==0)++devlTrainable; // increment the counter
subkBackPropSoptMT( s, true, devNet, devNet.Signals, devNet.Zs, devNet.Wt, devNet.Deltas, devLearn.gpuXInputs, devLearn.gpuYEval, devLearn.gpudYEval);
}
}
}
__device__ void subkBackPropSoptMT(long s, int o, rohanNetwork& Net, cuDoubleComplex * Signals, cuDoubleComplex * Zs, cuDoubleComplex * Wt, cuDoubleComplex * Deltas, cuDoubleComplex * XInputs, cuDoubleComplex * YEval, double * dYEval )
{/*! propagates adjustment of weights backwards preceeding layers from the chosen network output. */
// s is sample's index
// o is an optional method selection parameter; print/don't print as of 2/29/12
long index, kindex; // for warpwise loops
long tIx = threadIdx.x + blockDim.x * blockIdx.x; // tIx is thread index over the kernel
long lTotalThreads = gridDim.x * blockDim.x; // total number of threads
const cuDoubleComplex cdcZero = { 0, 0 };
/* clear all temp values BP0 */
for (long offset=0; (index =offset+tIx)< MAXNEURONS ; offset+=lTotalThreads){ // index stands for i
Deltas[index]=cdcZero;
Signals[index]=cdcZero;
Zs[index]=cdcZero;
}
/* re-evaluate sample to load temp values. BPI */
subkEvalSampleBetaMT( devSes, s, Net, (s==0), Signals, Zs, Wt, XInputs, YEval, dYEval);
/* begin error calculation. BPII */
cuDoubleComplex Deltastar /* measured error at the chosen network output. */ ;
/* calc top layer deltas. */
long TOP=Net.iLayerQty-1;
int ROWLEN=Net.iNeuronQTY[TOP];
//for(int i=0; i<Net.iNeuronQTY[TOP]; ++i){
for (long offset=0; (index =offset+tIx)< Net.iNeuronQTY[TOP] ; offset+=lTotalThreads){ // index stands for i
// delta-star = D - Y = Desired output minus actual output from evaluation
// D is the cplx coords of the sector of the desired answer Y is the complex result of evaluation of the given sample, unactivated. */
Deltastar = CxSubtractCxUT(
devLearn.gpuDOutputs[ IDX2C( index, s, ROWLEN ) ],
Signals[Net.iNeuronOfst[TOP]+index] );
/* divide the correction; delta = alpha * delta-star / n+1 (but alpha is always 1 for now). */
//Deltas[Net.iNeuronOfst[TOP]+index] = CxDivideRlUT( Deltastar, Net.iDendrtQTY[TOP] );
Deltas[Net.iNeuronOfst[TOP]+index] = CxMultiplyRlUT( Deltastar, Net.dINV_S[TOP] );
}
__syncthreads();
/* Now distribute the correction to lower layers if any. BPII.1 */
if (Net.iLayerQty>2){ /* remember layer 0 = inputs, layer 1 = bottom row, layer {2..iLayerQty-2} = middle row, layer iLayerQty-1 = top row. */
for (int L=Net.iLayerQty-1; L>1; --L){
long LAY = L; /* setup access to layers. */
long TRIB = L-1; /* trib for tributary.*/
int iTributQTY=Net.iNeuronQTY[TRIB];
//int Sj=Net.iDendrtQTY[TRIB]; if (TRIB==1) Sj=1; // Sj=1 for firest hidden layer
for (int i=1; i<Net.iNeuronQTY[LAY]; ++i) { // skip 0th neuron as its weights are either 1 (div identity) or 0 (div forbidden) and don't change anyway
// k index must begin at 1, neuron zero not valid for correction
//for (int k=1; k<iTributQTY; ++k) { /* the contribution to ith neuron's kth tributary's delta = i's delta/i's weight k. */
for (long offset=1; ( kindex =offset+tIx)< iTributQTY ; offset+=lTotalThreads){ // kindex stands for k
Deltas[Net.iNeuronOfst[TRIB]+kindex]
= CxAddCxUT ( Deltas[Net.iNeuronOfst[TRIB]+kindex] ,
CxDivideCxUT(
Deltas[Net.iNeuronOfst[LAY]+i] ,
Wt[IDX2C( Net.iWeightOfst[LAY]+kindex, i, iTributQTY )] ));
}
}
for (long offset=1; ( kindex =offset+tIx)< iTributQTY ; offset+=lTotalThreads){ // kindex stands for k
//cuDoubleComplex preDiv=Deltas[Net.iNeuronOfst[TRIB]+kindex]; // diagnostic purpose only, remove if removing other diags
//Deltas[Net.iNeuronOfst[TRIB]+kindex]
// = CxDivideRlUT(
// Deltas[Net.iNeuronOfst[TRIB]+kindex] ,
// Sj );
Deltas[Net.iNeuronOfst[TRIB]+kindex]
= CxMultiplyRlUT(
Deltas[Net.iNeuronOfst[TRIB]+kindex] ,
Net.dINV_S[TRIB] );
}
}
}
__syncthreads();
/* error distribution completed */
/* and now update the weights BP III */
/* adj weights on first hidden layer. */
int FHID = 1;
int SIG = 0;
int iSignalQTY=Net.iNeuronQTY[SIG]; //rSes.rLearn->iInputQty+1;
int iHidWidth=Net.iNeuronQTY[FHID];
for (int k=1; k<iHidWidth; ++k){
//for (int i=0; i<iSignalQTY; ++i){
for (long offset=0; ( index =offset+tIx)< iSignalQTY ; offset+=lTotalThreads){ // index stands for i
/* dW=d*xbar/s1/|z|= neuron's delta * input's conjugate / ( dendrites+1 * abs of input i ). */
Wt[IDX2C( Net.iWeightOfst[FHID]+index, k, iSignalQTY )]
=CxAddCxUT( Wt[IDX2C( Net.iWeightOfst[FHID]+index, k, iSignalQTY )] ,
CxDivideRlUT(
CxMultiplyCxUT(
Deltas[Net.iNeuronOfst[FHID]+k] ,
CxConjugateUT( Signals[Net.iNeuronOfst[SIG]+index] )
) ,
CxAbsUT( Zs[Net.iNeuronOfst[FHID]+k] ) // N+1 denominator factor is considered redundant - JAW & IA 2/27/12
)
);
}
}
__syncthreads();
/* re-evaluate sample to update temp values. */
subkEvalSampleBetaMT( devSes, s, Net, false, Signals, Zs, Wt, XInputs, YEval, dYEval);
if (Net.iLayerQty>2){
/* now use those outputs' conjugates and the deltas to adjust middle layers. BP III.1 */
for (int L=2; L<Net.iLayerQty-1; ++L){
/* setup access to layers. */
long LAY = L;
long TRIB = L-1;
//int iLayWidth=Net.iNeuronQTY[LAY];
int iTribWidth=Net.iNeuronQTY[TRIB];
for (int k=1; k<Net.iNeuronQTY[LAY]; ++k){
//for (int i=0; i<Net.iNeuronQTY[TRIB]; ++i){
for (long offset=0; ( index =offset+tIx)< Net.iNeuronQTY[TRIB] ; offset+=lTotalThreads){ // index stands for i
/* the adjustment added to kth neuron's ith trib's weight = k's delta * complex conjugate of i's signal / (abs of k's previous-wt product-sum * dendrites+1) . */
Wt[IDX2C( Net.iWeightOfst[LAY]+index, k, iTribWidth )]
=CxAddCxUT( Wt[IDX2C( Net.iWeightOfst[LAY]+index, k, iTribWidth )] ,
CxDivideRlUT(
CxMultiplyCxUT(
Deltas[Net.iNeuronOfst[LAY]+k] ,
CxConjugateUT( Signals[Net.iNeuronOfst[TRIB]+index] )
) ,
(
CxAbsUT( Zs[Net.iNeuronOfst[LAY]+k] ) // N+1 denominator factor is considered redundant - JAW & IA 2/27/12
)
)
);
}
}
/* layer is complete. */
subkEvalSampleBetaMT( devSes, s, Net, true, Signals, Zs, Wt, XInputs, YEval, dYEval);
}
}
__syncthreads();
/* correct output layer BP III.3 */
long SUB = TOP-1;
//int iTopWidth=Net.iNeuronQTY[TOP];
int iSubWidth=Net.iNeuronQTY[SUB];
for (int k=1; k<Net.iNeuronQTY[TOP]; ++k){
//for (int i=0; i<Net.iNeuronQTY[SUB]; ++i){
for (long offset=0; ( index =offset+tIx)< Net.iNeuronQTY[SUB] ; offset+=lTotalThreads){ // index stands for i
/* For last layer only, adjustment to kth neuron's ith weight = k's delta * complex conjugate of i's signal / ( dendrites+1) . */
Wt[IDX2C( Net.iWeightOfst[TOP]+index, k, iSubWidth )]
=CxAddCxUT( Wt[IDX2C( Net.iWeightOfst[TOP]+index, k, iSubWidth )] ,
CxMultiplyCxUT(
Deltas[Net.iNeuronOfst[TOP]+k] ,
CxConjugateUT( Signals[Net.iNeuronOfst[SUB]+index] )
)
); // N+1 denominator factor is considered redundant - JAW & IA 2/27/12
}
}
/* backprop is complete. */
}
__device__ void subkEvalSampleBetaMT(rohanContext& Ses, long s, rohanNetwork& Net, int o, cuDoubleComplex * Signals, cuDoubleComplex * Zs, cuDoubleComplex * Wt, cuDoubleComplex * XInputs, cuDoubleComplex * YEval, double * dYEval )
{// Beta uses fixed length fields instead of nested pointer layers
// delta squared is not updated, since they'll be updated when RMSE is checked at the end of a pass through the learning set
long index, kindex; // for warpwise loops
long tIx = threadIdx.x + blockDim.x * blockIdx.x; // tIx is thread index over the kernel
long lTotalThreads = gridDim.x * blockDim.x; // total number of threads
const cuDoubleComplex cdcZero = { 0, 0 };
/*! layer zero (inputs) is special. */
long INROWLEN=Net.iNeuronQTY[0];//rSes.rLearn->iInputQty+1;
//for (int i=0; i<INROWLEN; ++i){
for (long offset=0; (index =offset+tIx)< INROWLEN ; offset+=lTotalThreads){ // index stands for i
Signals[Net.iNeuronOfst[0]+index]= XInputs[IDX2C( index, s, INROWLEN )];
}
/*! middle and top layers. */
for (int L=1; L<Net.iLayerQty; ++L){
//struct rohanLayer& lay = Net.rLayer[L];
long LAY=L;
int TRIB=L-1; // index of previous layer
int iNeuronQTY=Net.iNeuronQTY[LAY];
int iSignalQTY=Net.iDendrtQTY[LAY]; // signal qty depends on size of previous layer
//for (int k=0; k<iNeuronQTY; ++k){ //Neuron zero is not skipped, its output should be 1+0i as a check
for (long offset=0; (kindex =offset+tIx)< iNeuronQTY ; offset+=lTotalThreads){ // kindex stands for k
Zs[Net.iNeuronOfst[LAY]+kindex]=cdcZero;
for (int i=0; i<iSignalQTY; ++i){ //walk weights on inputs from previous layer
Zs[Net.iNeuronOfst[LAY]+kindex] =
CxAddCxUT( Zs[Net.iNeuronOfst[LAY]+kindex] ,
CxMultiplyCxUT(
Wt[IDX2C( Net.iWeightOfst[LAY] + i, kindex, iSignalQTY )],
Signals[Net.iNeuronOfst[TRIB]+i] ) ) ;
}
// ACTIVATE //
Signals[Net.iNeuronOfst[LAY]+kindex] = CxActivateUT( Zs[Net.iNeuronOfst[LAY]+kindex]);
}
}
/*! last layer values are converted and stored here */
long TOP = Net.iLayerQty-1;
long OUTROWLEN=Net.iNeuronQTY[TOP];
//for (int i=0; i<Net.iNeuronQTY[TOP]; ++i){ // continuous conversion begins here
for (long offset=0; (index =offset+tIx)< OUTROWLEN ; offset+=lTotalThreads){ // index stands for i
YEval[IDX2C( index, s, OUTROWLEN )]= Signals[Net.iNeuronOfst[TOP]+index] ; // store final complex output(s)
dYEval[IDX2C( index, s, OUTROWLEN )]=FUnitCxUT( YEval[IDX2C( index, s, OUTROWLEN )] ) * Net.iSectorQty; // convert final complex outputs to sectors and store that
if(devLearn.iContOutputs==false) // round off decimal if disc activation is set
dYEval[IDX2C( index, s, OUTROWLEN )]=int(dYEval[IDX2C( index, s, OUTROWLEN )]);
}
/*! end of sample evaluation. */
}
__device__ cuDoubleComplex CxActivateUT(const cuDoubleComplex Z)
{/// applies ContActivation or discrete activation function to cx neuron output and returns Phi(Z)
/// This fn should be phased out in favor of a GPU device vector based fn
cuDoubleComplex phi;
if (devNet.bContActivation) { // apply ContActivation activation function to weighted sum : phi(z)=z/|z|
phi = CxDivideRlUT( Z, CxAbsUT( Z ) );
}
else { // apply Discrete activation function to weighted sum : s=int(arctan(z)*k/2pi), phi(z)=(X(s),Y(s))
double theta = atan2(Z.y, Z.x); // theta = arctan y/x
int iSector = (int)((theta * devNet.dK_DIV_TWO_PI) + devNet.iSectorQty) % devNet.iSectorQty;
phi = devNet.gpuSectorBdry[iSector];
//printf(" %f+%fi %d Activate\n", phi.x, phi.y, threadIdx.x);
}
return phi;
}
So, I'm not going to read all that code, but I can give you a strong hint. The warp size is 32 threads, so the 64-thread case runs two warps per block. In the former case you can't have any instruction-pointer-based race conditions; in the latter, you effectively have two groups of threads with different instruction pointers, scheduled at different times. You may already know much of this (hence the syncthreads), but the above makes it almost certain that you simply have one more race condition you haven't accounted for yet.
Putting in the __syncthreads() calls is a good start for trying to isolate it. Are you sure that, in your loops, the source data of one warp is not overwritten by the other warp? If not, try putting __syncthreads() into your inner loops, just for debugging purposes, to see what may be causing the race condition.
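To make the two-warp point concrete, here is a minimal, hypothetical pattern (unrelated to your kernels) that is accidentally safe with 32 threads but races with 64:
// Each thread reads a neighbour's slot and then overwrites its own. With a
// single warp the reads and writes stay in lockstep, so every read sees the
// old value. With 64 threads, warp 1 may write its slots before warp 0 has
// read them; the __syncthreads() between the read and the write restores
// correctness. Note it is reached by all threads unconditionally: calling
// __syncthreads() inside a branch that only some threads take is undefined
// behaviour and can easily manifest as the NaNs you are seeing.
__global__ void shiftDown( double* data, int n )
{
    int i = threadIdx.x;
    double v = ( i + 32 < n ) ? data[i + 32] : 0.0; // read phase
    __syncthreads();                                // all reads complete here
    if ( i < n ) data[i] = v;                       // write phase
}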