OpenMP nested for loop with ordered output - C++

I'm currently trying to find a fast and reliable way to parallelize a set of loops with if conditions where I need to save a result in the inner loop.
The code is supposed to go through a huge number of points in a 3D grid. For some points within this volume I have to check another condition (checking for an angle), and if this condition is fulfilled I have to calculate a density.
The fastest approaches so far were #pragma omp parallel for private(x,y,z) collapse(3) outside of all for loops, or #pragma omp parallel for on the innermost loop (phiInd), which is not only the largest loop but also calls a CPU-intensive function.
I need to store the density value in densityarr within the inner loop. The density array is then saved separately later.
My problem is that, depending on the number of threads I set, I get different results in my density array. The serial version and an OpenMP run with just 1 thread produce identical results.
Increasing the number of threads leads to results at the same points, but those results are different from the serial version.
I know there is #pragma omp for ordered, but that makes the calculation far too slow.
Is there a way to parallelize this loop while still getting my results ordered according to my points (x,y,z)?
Or, maybe clearer: why does increasing the thread number change my result?
double phipoint, Rpoint, zpoint;
double phiplane;
double distphi = 2.0 * M_PI / nPlanes; // set desired distance to phi to assign point to fluxtube plane

double* densityarr = new double[max_x_steps * max_y_steps * max_z_steps];

for (z = 0; z < max_z_steps; z++) {
    for (x = 0; x < max_x_steps; x++) {
        for (y = 0; y < max_y_steps; y++) {
            double x_center = x * stepSizeGrid - max_x / 2;
            double y_center = y * stepSizeGrid - max_y / 2;
            double z_center = z * stepSizeGrid - max_z / 2;
            cartesianCoordinate* pos = new cartesianCoordinate(x_center, y_center, z_center);
            linearToroidalCoordinate* tor = linearToroidal(*pos);
            simpleToroidalCoordinate* stc = simpleToroidal(*pos);
            phipoint = tor->phi;
            if (stc->r <= 0.174 /*0.175*/) { // check if point is in vessel
                for (int phiInd = 0; phiInd < nPlanes; ++phiInd) {
                    phiplane = phis[phiInd];
                    // std::abs: the int overload of plain abs would truncate doubles
                    if (std::abs(phipoint - phiplane) <= distphi) { // find right plane for point
                        Rpoint = tor->R;
                        zpoint = tor->z;
                        densityarr[z * max_y_steps * max_x_steps + x * max_y_steps + y] =
                            TubePlanes[phiInd].getMinDistDensity(Rpoint, zpoint);
                    }
                }
            }
            // Each pointer needs its own delete: "delete pos, tor, stc;" uses the
            // comma operator and only deletes pos.
            delete pos;
            delete tor;
            delete stc;
        }
    }
}

First, you need to address the errors in your parallel versions. You have race conditions when writing to the shared variables phipoint (with the outer loops parallel) and phiplane, Rpoint, zpoint (with any loop parallel). You must declare those private, or better yet, declare them locally in the first place (which makes them implicitly private). That way the code is much easier to reason about, which is very important for parallel code.
Parallelizing the outer loops as you describe is the obvious and very likely the most efficient approach. If there are severe load imbalances (the points satisfying stc->r <= 0.174 not being evenly distributed), you might want to use schedule(dynamic).
Parallelizing the inner loop seems unnecessary in your case. Outer loops generally give better efficiency because of lower overhead, unless they don't expose enough parallel work or suffer from race conditions, dependencies, or cache issues. It would, however, be a worthwhile exercise to try it and measure.
However, there may be a race condition when writing to densityarr if more than one of the phis satisfies the condition. Overall that loop seems a bit odd: since you only keep at most one of the results in densityarr, you could instead iterate the loop in reverse and break as soon as you find the first match, which is exactly the value the serial code keeps (the last forward match overwrites the earlier ones). That helps serial execution a lot, but may inhibit parallelizing that particular loop.
Also, if no phi satisfies the condition, or if the point is not in the vessel, the respective entry in densityarr remains uninitialized. That is very dangerous, because you cannot later determine whether a value is valid. Initialize the array with a sentinel value instead.
A general remark: don't allocate objects with new unless you need to. Just put pos on the stack; that likely gives you better performance. Allocating memory within a (parallel) loop can be a performance issue, so you might want to rethink the way you get your Toroidals.
Note that I do assume that TubePlanes[phiInd].getMinDistDensity has no side effects, otherwise parallelization would be problematic.
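Putting those fixes together, here is a minimal sketch of the outer-loop version. The coordinate types, linearToroidal, simpleToroidal, phis and TubePlanes are taken from your code; whether pos can actually live on the stack, and whether the Toroidal factory functions must return heap pointers, depends on those APIs:

// Sentinel-initialize so entries that are never written stay detectable.
std::fill(densityarr, densityarr + max_x_steps * max_y_steps * max_z_steps, -1.0);

#pragma omp parallel for collapse(3) schedule(dynamic)
for (int z = 0; z < max_z_steps; z++) {
    for (int x = 0; x < max_x_steps; x++) {
        for (int y = 0; y < max_y_steps; y++) {
            // Everything declared inside the loop is implicitly private.
            double x_center = x * stepSizeGrid - max_x / 2;
            double y_center = y * stepSizeGrid - max_y / 2;
            double z_center = z * stepSizeGrid - max_z / 2;
            cartesianCoordinate pos(x_center, y_center, z_center); // stack, no new
            linearToroidalCoordinate* tor = linearToroidal(pos);
            simpleToroidalCoordinate* stc = simpleToroidal(pos);
            if (stc->r <= 0.174) { // check if point is in vessel
                // Iterate in reverse and stop at the first match: the serial
                // code kept the value of the last forward match.
                for (int phiInd = nPlanes - 1; phiInd >= 0; --phiInd) {
                    if (std::abs(tor->phi - phis[phiInd]) <= distphi) {
                        densityarr[z * max_y_steps * max_x_steps + x * max_y_steps + y] =
                            TubePlanes[phiInd].getMinDistDensity(tor->R, tor->z);
                        break;
                    }
                }
            }
            delete tor;
            delete stc;
        }
    }
}

Each iteration now writes only to its own entry of densityarr, so the output no longer depends on the thread count.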

Related

Running over an unrolled linked list takes around 40% of the code runtime - are there any obvious ways to optimise it?

I am the author of an open source scientific code called vampire ( http://github.com/richard-evans/vampire ), and since it is compute intensive, any improvement in code performance can significantly increase the amount of research that can be done. Typical runtimes of this code are hundreds of core hours, so I am always looking for ways to improve the performance-critical sections. However, I have become stuck on the following relatively innocuous-looking piece of code, which makes up around 40% of the runtime:
for (int atom = start_index; atom < end_index; atom++) {
    register double Hx = 0.0;
    register double Hy = 0.0;
    register double Hz = 0.0;
    const int start = atoms::neighbour_list_start_index[atom];
    const int end = atoms::neighbour_list_end_index[atom] + 1;
    for (int nn = start; nn < end; nn++) {
        const int natom = atoms::neighbour_list_array[nn];
        const double Jij = atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij;
        Hx -= Jij * atoms::x_spin_array[natom];
        Hy -= Jij * atoms::y_spin_array[natom];
        Hz -= Jij * atoms::z_spin_array[natom];
    }
    atoms::x_total_spin_field_array[atom] += Hx;
    atoms::y_total_spin_field_array[atom] += Hy;
    atoms::z_total_spin_field_array[atom] += Hz;
}
The high-level overview of the function and variables is as follows: there is a 1D array of a physical vector called 'spin', split into three 1D arrays, one per component (atoms::x_spin_array, etc.), for memory caching purposes. Each of these spins interacts with some other spins, and all the interactions are stored as a 1D neighbour list (atoms::neighbour_list_array). The relevant range of interactions for each atom is given by start and end indices into the neighbour list array, held in two separate arrays. At the end of the calculation, each atomic spin has an effective field which is the vector sum of its interactions.
Given the small amount of code and the sizable fraction of the runtime it occupies, I have done my best, but I feel there must be a way to optimize this further; as a physicist rather than a computer scientist, maybe I am missing something?
You've got a constant stream of multiplies, subtracts and adds on contiguous data. That seems like an ideal use of SSE. If it's memory-bandwidth limited, then OpenCL/CUDA instead.
Try using this library if you aren't familiar with all the low level instructions.
That inner loop could then potentially be restructured significantly, maybe leading to speed-ups.
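As a hedged illustration of the SIMD idea (AVX2 with FMA rather than plain SSE, because the indirect natom indexing needs gather instructions; this also assumes the Jij values have been pre-expanded into a contiguous per-interaction array jij_list, since the original double indirection through neighbour_interaction_type_array cannot be gathered in one step, and that the spin arrays are accessible as raw double pointers):

#include <immintrin.h>

const double *xs = atoms::x_spin_array; // or .data() if these are std::vectors
const double *ys = atoms::y_spin_array;
const double *zs = atoms::z_spin_array;
__m256d hx = _mm256_setzero_pd(), hy = _mm256_setzero_pd(), hz = _mm256_setzero_pd();
int nn = start;
for (; nn + 4 <= end; nn += 4) {
    // Load 4 neighbour indices, then gather the corresponding spin components.
    __m128i idx = _mm_loadu_si128(
        reinterpret_cast<const __m128i*>(&atoms::neighbour_list_array[nn]));
    __m256d jij = _mm256_loadu_pd(&jij_list[nn]);      // contiguous (assumed)
    __m256d sx = _mm256_i32gather_pd(xs, idx, 8);
    __m256d sy = _mm256_i32gather_pd(ys, idx, 8);
    __m256d sz = _mm256_i32gather_pd(zs, idx, 8);
    hx = _mm256_fnmadd_pd(jij, sx, hx);                // hx -= jij * sx
    hy = _mm256_fnmadd_pd(jij, sy, hy);
    hz = _mm256_fnmadd_pd(jij, sz, hz);
}
double t[4];
_mm256_storeu_pd(t, hx);
double Hx = t[0] + t[1] + t[2] + t[3]; // likewise for hy and hz,
                                       // plus a scalar tail loop for nn < end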
If the x, y, z components were indeed linked lists, doing x[i], y[i] and z[i] would cause the lists to be traversed multiple times, giving (n^2)/2 iterations; with vectors, each access is an O(1) operation.
You mention that the three coordinates are split out for memory caching purposes, but this layout hurts Level 1 and Level 2 cache locality, because you are accessing three different areas of memory. A linked list would also hurt your cache locality.
Using something like:
struct vector3d {
    double x;
    double y;
    double z;
};

std::vector<vector3d> spin;
std::vector<vector3d> total_spin;
This should improve the cache locality, as the x, y and z values are adjacent in memory and the spins occupy a linear block of memory.
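As a hedged sketch of the loop with that layout (the neighbour and exchange arrays from the question are reused unchanged):

for (int atom = start_index; atom < end_index; atom++) {
    vector3d H = {0.0, 0.0, 0.0};
    const int start = atoms::neighbour_list_start_index[atom];
    const int end = atoms::neighbour_list_end_index[atom] + 1;
    for (int nn = start; nn < end; nn++) {
        const int natom = atoms::neighbour_list_array[nn];
        const double Jij = atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij;
        // One cache line now brings in all three components of the neighbour's spin.
        H.x -= Jij * spin[natom].x;
        H.y -= Jij * spin[natom].y;
        H.z -= Jij * spin[natom].z;
    }
    total_spin[atom].x += H.x;
    total_spin[atom].y += H.y;
    total_spin[atom].z += H.z;
}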
I feel the following suggestions can help you optimize the code a bit, if not completely:
Use initialization instead of assignment wherever possible.
Prefer pre-increment over post-increment; for iterators and other non-trivial types it does make a difference.
Apart from that, I think the code is just fine. Every data structure has its pros and cons; you have to live with the trade-offs.
Happy Coding!

OpenMP and memory bandwidth restriction

Edit: My first code sample was wrong. Fixed with a simpler one.
I am implementing a C++ library for algebraic operations on large vectors and matrices.
I found that on x86-64 CPUs, OpenMP-parallel vector additions, dot products, etc. are not much faster than single-threaded code: parallel operations range from 1% slower to 6% faster than single-threaded.
This happens because of the memory bandwidth limitation (I think).
So, the question is: is there a real performance benefit for code like this:
void DenseMatrix::identity()
{
    assert(height == width);
    #pragma omp parallel for if (height > OPENMP_BREAK2)
    for (unsigned int y = 0; y < height; y++)
        for (unsigned int x = 0; x < width; x++)
            // Index computed per iteration: a shared running counter
            // (the original size_t i with i++) would be a data race here.
            elements[y * width + x] = x == y ? 1 : 0;
}
In this sample there is no serious drawback to using OpenMP.
But if I am working with sparse vectors and sparse matrices, I cannot use, for instance, *.push_back(), and then the question becomes serious. (Elements of sparse vectors are not contiguous like those of dense vectors, so parallel programming has a drawback: result elements can arrive at any time, not in order from lower to higher index.)
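One pattern that sidesteps the ordering problem, as a hedged sketch: let every thread push_back into its own private buffer and concatenate afterwards. With schedule(static), each thread receives one contiguous block of indices, so concatenating the buffers in thread order keeps the elements sorted by index (compute(i) below is a hypothetical stand-in for producing the i-th value):

#include <omp.h>
#include <utility>
#include <vector>

std::vector<std::vector<std::pair<size_t, double>>> parts(omp_get_max_threads());
#pragma omp parallel
{
    std::vector<std::pair<size_t, double>>& local = parts[omp_get_thread_num()];
    #pragma omp for schedule(static)
    for (long i = 0; i < n; ++i) {
        const double v = compute(i);     // hypothetical per-element work
        if (v != 0.0)
            local.push_back(std::make_pair((size_t)i, v)); // private buffer: no race
    }
}
std::vector<std::pair<size_t, double>> result;
for (size_t t = 0; t < parts.size(); ++t)
    result.insert(result.end(), parts[t].begin(), parts[t].end());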
I don't think this is a problem of memory bandwidth. I see a clear problem with r: it is accessed from multiple threads, which causes both data races and false sharing. False sharing can dramatically hurt your performance.
I'm wondering whether you can even get the correct answer, because there are data races on r. Did you get the correct answer?
However, the solution is very simple. The operation conducted on r is a reduction, which can easily be expressed with OpenMP's reduction clause.
http://msdn.microsoft.com/en-us/library/88b1k8y5(v=vs.80).aspx
Try simply appending reduction(+ : r) to the #pragma omp parallel directive.
(Note: floating-point addition is not associative, so you may see small precision differences between the parallel result and that of the serial code.)
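For reference, a minimal sketch of the pattern on a dot product (the dot-product shape of the removed first code sample is an assumption):

double dot(const double* a, const double* b, size_t n)
{
    double r = 0.0;
    // reduction gives each thread a private copy of r and combines the
    // copies at the end, removing both the data race and the false sharing.
    #pragma omp parallel for reduction(+ : r)
    for (long i = 0; i < (long)n; ++i)
        r += a[i] * b[i];
    return r;
}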

Efficiently Building Summed Area Table

I am trying to construct a summed area table for later use in an adaptive thresholding routine. Since this code is going to be used in time-critical software, I am trying to squeeze as many cycles as possible out of it.
For performance, the table stores an unsigned integer for every pixel.
When I attach my profiler, it shows that my largest performance bottleneck is the x-pass.
The simple math expression for the computation is:
sat_[y * width + x] = sat_[y * width + x - 1] + buff_[y * width + x]
where the running sum resets at every new y position.
In this case, sat_ is a 1-D pointer of unsigned integers representing the SAT, and buff_ is an 8-bit unsigned monochrome buffer.
My implementation looks like the following:
uint *pSat = sat_;
unsigned char *pBuff = buff_; // 8-bit unsigned pixels: plain char may be signed
                              // and would sign-extend values above 127
for (size_t y = 0; y < height; ++y, pSat += width, pBuff += width)
{
    uint curr = 0;
    for (uint x = 0; x < width; x += 4)
    {
        pSat[x + 0] = curr += pBuff[x + 0];
        pSat[x + 1] = curr += pBuff[x + 1];
        pSat[x + 2] = curr += pBuff[x + 2];
        pSat[x + 3] = curr += pBuff[x + 3];
    }
}
The loop is unrolled manually because my compiler (VC11) didn't do it for me. The problem is that the entire segmentation routine spends an extraordinary amount of time in this loop, and I am wondering if anyone has thoughts on how to speed it up. I have access to all of the SSE instruction sets, and AVX, on any machine this routine will run on, so if there is something there, that would be extremely useful.
Also, once I squeeze out the last cycles, I plan to extend this to multiple cores, but I want to get the single-threaded computation as tight as possible before making the model more complex.
You have a dependency chain running along each row; each result depends on the previous one. So you cannot vectorise/parallelise in that direction.
But it sounds like each row is independent of all the others, so you can vectorise/parallelise by computing multiple rows simultaneously. You'd need to transpose your arrays in order to allow the vector instructions to access neighbouring elements in memory.*
However, that creates a problem. Walking along rows would now be absolutely terrible from a cache point of view (every iteration would be a cache miss). The way to solve this is to interchange the loop order.
Note, though, that each element is read precisely once. And you're doing very little computation per element. So you'll basically be limited by main-memory bandwidth well before you hit 100% CPU usage.
* This restriction may be lifted in AVX2, I'm not sure...
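Independent of vectorisation, the row independence also maps directly onto the multi-core step mentioned in the question. A minimal sketch, reusing sat_, buff_, width and height from the posted code (VC11's OpenMP 2.0 requires a signed loop index):

#pragma omp parallel for
for (int y = 0; y < (int)height; ++y)
{
    const unsigned char *src = buff_ + (size_t)y * width;
    uint *dst = sat_ + (size_t)y * width;
    uint curr = 0;
    for (size_t x = 0; x < width; ++x)
        dst[x] = curr += src[x]; // the running sum stays serial within a row
}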
Algorithmically, I don't think there is anything you can do to optimize this further. Even though you didn't use the term OLAP cube in your description, you are basically just building an OLAP cube. The code you have is the standard approach to building an OLAP cube.
If you give details about the hardware you're working with, there might be some optimizations available. For example, there is a GPU programming approach that may or may not be faster. Note: another answer on this thread says that parallelization is not possible. This isn't necessarily true: your current loop can't run in parallel as written, but there are scan algorithms that preserve data-level parallelism, which could be exploited with a GPU approach.
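To illustrate the kind of data-level parallelism meant here, a hedged sketch of the classic two-pass block scan, shown with OpenMP on the CPU rather than a GPU (it would be applied per row of the table):

#include <omp.h>
#include <vector>

// Pass 1 computes each thread's block total, a serial scan of the (tiny)
// array of block totals yields each block's starting offset, and pass 2
// re-scans each block starting from that offset.
void parallel_prefix_sum(const unsigned char* in, unsigned int* out, size_t n)
{
    std::vector<unsigned int> block(omp_get_max_threads() + 1, 0);
    #pragma omp parallel
    {
        const int T = omp_get_num_threads();
        const int t = omp_get_thread_num();
        const size_t lo = n * t / T, hi = n * (t + 1) / T;
        unsigned int s = 0;
        for (size_t i = lo; i < hi; ++i) s += in[i]; // pass 1: block total
        block[t + 1] = s;
        #pragma omp barrier
        #pragma omp single
        for (int k = 1; k <= T; ++k) block[k] += block[k - 1];
        // implicit barrier after single; block[t] is now this block's offset
        unsigned int run = block[t];
        for (size_t i = lo; i < hi; ++i) out[i] = run += in[i]; // pass 2
    }
}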

Digital filter and std::inner_product optimization

In a digital filtering C++ application, I use std::inner_product (with std::vector<double> and std::deque<double>) to compute the dot product between the filter coefficients and the input data, for each data sample. After profiling my application, I figured out that no less than 85% of the execution time is spent in std::inner_product!
To what extent is std::inner_product optimized, in GCC for example?
Does it use SIMD instructions? Does it perform loop unrolling? How can I make sure of that?
Based on this, would it be worth implementing custom dot product function(s), especially if the number of coefficients is low? (I would still like to keep the function as generic as possible.)
More specifically, this is the piece of code I use to apply a filter:
std::deque<double> in(filterNum.size(), 0.0);
std::deque<double> out(filterDenom.size() - 1, 0.0);
const double gain = filterDenom.back();

for (unsigned int s = 0, size = data.size(); s < size; ++s) {
    in.pop_front();
    in.push_back(data[s] / gain);
    data[s] = inner_product(in.begin(), in.end(), filterNum.begin(),
                            -inner_product(out.begin(), out.end(),
                                           filterDenom.begin(), 0.0));
    out.pop_front();
    out.push_back(data[s]);
}
Typically, I use second order bandpass IIR filters, which means that the size of filterNum and filterDenom (numerator and denominator coefficients of the filter) is 5. data is the vector containing the input samples.
Getting an additional factor of 2 out of this shouldn't be hard if you just write the code directly. Part of it might come from removing some of the generality of inner_product, but most would come from eliminating the deques: each of those inner_product calls has to advance iterators through deques, and deque iterators are not nearly as efficient as ++ on a pure pointer; there is at least a test on every ++, and there may be more than one assignment. If you instead keep a pointer into your input array, you can index off it and off the filter array in the inner loop, and increment the pointer to the input array in the outer loop. Most of the (coding) effort then goes into handling the edge conditions. And take that division out of there; it should be a multiplication by a constant calculated outside the loop.
inner_product itself is pretty efficient (there's not much to do there), but it needs to increment two iterators on each pass through the inner loop. There is no explicit loop unrolling, but a good compiler can unroll a loop that simple, and the compiler is more likely to know how far to unroll before running into instruction cache issues.
This is what a simple (FIR) filter can look like, leaving out the code for the edge conditions (which goes outside the loop):
double norm = 1.0 / sum;
double *p = data.values();   // start of input data
double *q = output.values(); // start of output buffer
int width = data.size() - filter.size();

for (int i = 0; i < width; ++i)
{
    double *f = filter.values();
    double accumulator = f[0] * p[i];
    for (int j = 1; j < filter.size(); ++j)
    {
        // index with j (and p[i + j]); the original f[i] * p[i] was a typo
        accumulator += f[j] * p[i + j];
    }
    *q++ = accumulator * norm;
}
Note that there are messy details left out, and this is not the same as your filter, but it gives the idea. What's inside the outer loop easily fits in a modern instruction cache. The inner loop may be unrolled by the compiler. Most modern architectures can do the add and multiply in parallel.
You can ask GCC to compute most of the algorithms in <algorithm> and <numeric> in parallel mode; it may give a performance boost if your data set is very large (I think it just uses OpenMP inside).
However, on small data sets it may cause a performance hit.
A comparison with the other solution would be more than welcome!
http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
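For reference, parallel mode is opt-in at compile time; a minimal sketch (the flag and the __gnu_parallel namespace are the ones documented on that page, and for a 5-tap filter the threading overhead will almost certainly dominate, as noted above):

// Build everything in parallel mode:
//     g++ -O2 -fopenmp -D_GLIBCXX_PARALLEL myfilter.cpp
// Or opt in per call site, leaving the rest of the program serial:
#include <parallel/numeric>

double r = __gnu_parallel::inner_product(in.begin(), in.end(),
                                         filterNum.begin(), 0.0);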

openMP histogram comparison

I am working on code that compares image histograms by calculating correlation, intersection, chi-square and a few other metrics. The general structure of these functions is very similar.
Usually I work with pthreads, but this time I decided to build a small prototype with OpenMP (due to its simplicity) and see what kind of results I would get.
This is an example of comparison by correlation; the code is identical to the serial implementation except for the single OpenMP line.
double comp(CHistogram* h1, CHistogram* h2) {
    double Sa = 0;
    double Sb = 0;
    double Saa = 0;
    double Sbb = 0;
    double Sab = 0;
    double a, b;
    int N = h1->length;
    #pragma omp parallel for reduction(+:Sa,Sb,Saa,Sbb,Sab) private(a,b)
    for (int i = 0; i < N; i++) {
        a = h1->data[i];
        b = h2->data[i];
        Sa += a;
        Sb += b;
        Saa += a * a;
        Sbb += b * b;
        Sab += a * b;
    }
    double sUp = Sab - Sa * Sb / N;
    double sDown = (Saa - Sa * Sa / N) * (Sbb - Sb * Sb / N);
    return sUp / sqrt(sDown);
}
Are there more ways to speed up this function with openMP ?
Thanks!
PS: I know that the fastest way would be to compare different pairs of histograms across multiple threads, but that is not applicable to my situation, since only 2 histograms are available at a time.
Tested on a quad-core machine.
I am a little uncertain about the results: over a longer run, OpenMP seems to perform better than the serial version. But if I compare just a single histogram and measure the time in microseconds, then the serial version is faster by a factor of about 20. I guess OpenMP applies some optimization once it sees the enclosing loop. But in the real solution I will have some code in between the histogram comparisons, and I am not sure it will behave the same way.
OpenMP takes some time to set up the parallel region. You need to be careful that this overhead isn't greater than the performance gained by parallelizing; in your case that means only when N reaches a certain size will OpenMP speed up the calculation.
You should also think about ways to reduce the total number of parallel-region entries, for instance by setting up a parallel region outside this function so that you compare different histograms in parallel. For small N you can keep the serial path with OpenMP's if clause, as sketched below.
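A minimal sketch of that if-clause idea applied to the correlation loop above (OPENMP_MIN_N is a hypothetical cutoff you would tune by measurement; a and b are declared inside the loop, which makes them private without a clause):

#pragma omp parallel for reduction(+:Sa,Sb,Saa,Sbb,Sab) if (N > OPENMP_MIN_N)
for (int i = 0; i < N; i++) {
    // The if clause disables the parallel region entirely when N is small,
    // so tiny histograms run serially with no OpenMP overhead.
    const double a = h1->data[i];
    const double b = h2->data[i];
    Sa += a;
    Sb += b;
    Saa += a * a;
    Sbb += b * b;
    Sab += a * b;
}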