OpenMP and memory bandwidth restriction - c++

Edit: My first code sample was wrong. I've replaced it with a simpler one.
I'm implementing a C++ library for algebraic operations on large vectors and matrices.
I found that on x86-64 CPUs, OpenMP-parallel vector additions, dot products etc. are not noticeably faster than the single-threaded versions.
Parallel operations range from about 1% slower to 6% faster than single-threaded ones.
I think this happens because of the memory bandwidth limitation.
So the question is: is there a real performance benefit from code like this?
void DenseMatrix::identity()
{
    assert(height == width);
    // The index is computed from y and x, so no shared counter is needed
    // when the loop runs in parallel.
    #pragma omp parallel for if (height > OPENMP_BREAK2)
    for (unsigned int y = 0; y < height; y++)
        for (unsigned int x = 0; x < width; x++)
            elements[y * width + x] = (x == y) ? 1 : 0;
}
In this sample there is no serious drawback to using OpenMP.
But when I work with OpenMP on sparse vectors and sparse matrices I cannot, for instance, use *.push_back(), and then the question becomes serious. Elements of a sparse vector are not contiguous like those of a dense vector, so parallel programming has a drawback: result elements can arrive at any time, not in order from lower to higher index.
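One common workaround, sketched below, is to give every thread its own local buffer and append the buffers in thread order afterwards; SparseEntry, parallel_sparse_op and the placeholder element computation are made up for illustration, not taken from the library:

#include <omp.h>
#include <vector>
#include <cstddef>

struct SparseEntry { std::size_t index; double value; };   // hypothetical element type

// Build a sparse result in parallel without calling push_back on a shared vector.
// Each thread fills a private buffer; schedule(static) hands out contiguous index
// blocks in thread order, so appending the buffers in thread order keeps the
// entries sorted from lower to higher index.
std::vector<SparseEntry> parallel_sparse_op(std::size_t n)
{
    std::vector<SparseEntry> result;
    #pragma omp parallel
    {
        std::vector<SparseEntry> local;
        #pragma omp for schedule(static) nowait
        for (long long i = 0; i < static_cast<long long>(n); ++i) {
            double v = 0.0;                    // placeholder for the real element-wise operation
            if (v != 0.0)
                local.push_back({static_cast<std::size_t>(i), v});
        }
        // Append the per-thread buffers one thread at a time, in thread order.
        #pragma omp for ordered schedule(static)
        for (int t = 0; t < omp_get_num_threads(); ++t) {
            #pragma omp ordered
            result.insert(result.end(), local.begin(), local.end());
        }
    }
    return result;
}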

I don't think this is a memory bandwidth problem. I see a clear problem with r: it is accessed from multiple threads, which causes both data races and false sharing. False sharing can dramatically hurt your performance.
I'm also wondering whether you even get the correct answer, because of the data races on r. Did you get the correct answer?
However, the solution is very simple. The operation performed on r is a reduction, which can easily be expressed with OpenMP's reduction clause.
http://msdn.microsoft.com/en-us/library/88b1k8y5(v=vs.80).aspx
Simply append reduction(+ : r) to the #pragma omp parallel for directive.
(Note: addition on double is not associative, so you may see small precision differences compared with the result of the serial code.)
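For reference, here is a minimal sketch of what the corrected dot-product loop could look like; the names r, a, b and n are assumptions, since the original code is not shown:

double r = 0.0;
// reduction(+ : r) gives every thread a private copy of r, initialized to 0,
// and sums the private copies into r at the end of the loop.
#pragma omp parallel for reduction(+ : r)
for (int i = 0; i < n; ++i)
    r += a[i] * b[i];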

Related

Openmp nested for loop with ordered output

I'm currently trying to find a fast and reliable way to parallelize a set of loops with if conditions, where I need to save a result in the inner loop.
The code is supposed to go through a huge number of points in a 3D grid. For some points within this volume I have to check another condition (an angle check), and if that condition is fulfilled I have to calculate a density.
The fastest variants so far were #pragma omp parallel for private(x,y,z) collapse(3) outside of all for loops, or #pragma omp parallel for on the innermost loop (phiInd), which is not only the largest loop but also calls a CPU-intensive function.
I need to store the density value in densityarr within the inner loop; the density array is then saved separately later.
My problem is that, depending on the number of threads I set, I get different results in my density array. The serial version and an OpenMP run with just 1 thread produce identical results.
Increasing the number of threads gives results at the same points, but those results differ from the serial version.
I know there is #pragma omp for ordered, but that makes the calculation far too slow.
Is there a way to parallelize this loop while still getting my results ordered according to my points (x,y,z)?
Or, put more clearly: why does increasing the thread count change my results?
double phipoint, Rpoint, zpoint;
double phiplane;
double distphi = 2.0 * M_PI / nPlanes; // desired distance in phi to assign a point to a fluxtube plane
double* densityarr = new double[max_x_steps * max_y_steps * max_z_steps];

for (z = 0; z < max_z_steps; z++) {
    for (x = 0; x < max_x_steps; x++) {
        for (y = 0; y < max_y_steps; y++) {
            double x_center = x * stepSizeGrid - max_x / 2;
            double y_center = y * stepSizeGrid - max_y / 2;
            double z_center = z * stepSizeGrid - max_z / 2;
            cartesianCoordinate* pos = new cartesianCoordinate(x_center, y_center, z_center);
            linearToroidalCoordinate* tor = linearToroidal(*pos);
            simpleToroidalCoordinate* stc = simpleToroidal(*pos);
            phipoint = tor->phi;
            if (stc->r <= 0.174/*0.175*/) { // check if point is in vessel
                for (int phiInd = 0; phiInd < nPlanes; ++phiInd) {
                    phiplane = phis[phiInd];
                    if (abs(phipoint - phiplane) <= distphi) { // find right plane for point
                        Rpoint = tor->R;
                        zpoint = tor->z;
                        densityarr[z * max_y_steps * max_x_steps + x * max_y_steps + y] =
                            TubePlanes[phiInd].getMinDistDensity(Rpoint, zpoint);
                    }
                }
            }
            delete pos, tor, stc;
        }
    }
}
First, you need to address the errors in your parallel versions. You have race conditions when writing to the shared variables phipoint (with the outer loops parallel) and phiplane, Rpoint, zpoint (with any of the loops parallel). You must declare those private, or better yet, declare them locally in the first place (which makes them implicitly private). That way the code is much easier to reason about, which is very important for parallel code.
Parallelizing the outer loops like you describe is the obvious and very likely most efficient approach. If there are severe load imbalances (stc->r <= 0.174 not being evenly distributed among the points), you might want to use schedule(dynamic).
Parallelizing the inner loop seems unnecessary in your case. Outer loops generally provide better efficiency because of lower overhead, unless they don't expose enough parallel work, or have race conditions, dependencies, or cache issues. It would, however, be a worthwhile exercise to try it and measure. Note that there may be a race condition when writing to densityarr if more than one of the phis satisfies the condition. Overall that loop seems a bit odd: since only one result ends up in densityarr, you could instead reverse the loop and cancel once you find the first match. That helps serial execution a lot, but may inhibit parallelization. Also, if no phi satisfies the condition, or if the point is not in the vessel, the respective entry in densityarr remains uninitialized. That can be very dangerous, because you cannot later determine whether the value is valid.
A general remark: don't allocate objects with new unless you need to. Just put pos on the stack; that will likely give you better performance. Allocating memory within a (parallel) loop can be a performance issue, so you might want to rethink the way you obtain your Toroidals.
Note that I do assume that TubePlanes[phiInd].getMinDistDensity has no side effects, otherwise parallelization would be problematic.
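Putting those points together, a corrected outer-loop parallelization could look roughly like the sketch below. It reuses the names from the question, assumes linearToroidal, simpleToroidal and getMinDistDensity are safe to call from several threads, and adds two things that are not in the original: a -1.0 sentinel so untouched entries are detectable, and a reversed phi loop with a break so the search stops at the matching plane.

// Sketch only, not the original code.
for (int idx = 0; idx < max_x_steps * max_y_steps * max_z_steps; ++idx)
    densityarr[idx] = -1.0;                       // sentinel: no density computed here

#pragma omp parallel for collapse(3)              // add schedule(dynamic) if the in-vessel test causes load imbalance
for (int z = 0; z < max_z_steps; z++) {
    for (int x = 0; x < max_x_steps; x++) {
        for (int y = 0; y < max_y_steps; y++) {
            double x_center = x * stepSizeGrid - max_x / 2;
            double y_center = y * stepSizeGrid - max_y / 2;
            double z_center = z * stepSizeGrid - max_z / 2;
            cartesianCoordinate pos(x_center, y_center, z_center);   // stack object, no new
            linearToroidalCoordinate* tor = linearToroidal(pos);
            simpleToroidalCoordinate* stc = simpleToroidal(pos);
            if (stc->r <= 0.174) {                                   // point is in vessel
                // reversed loop: the last matching plane wins, as in the original
                for (int phiInd = nPlanes - 1; phiInd >= 0; --phiInd) {
                    if (fabs(tor->phi - phis[phiInd]) <= distphi) {
                        densityarr[z * max_y_steps * max_x_steps + x * max_y_steps + y] =
                            TubePlanes[phiInd].getMinDistDensity(tor->R, tor->z);
                        break;
                    }
                }
            }
            delete tor;                                              // delete each object individually
            delete stc;
        }
    }
}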

What is the most time efficient way to square each element of a vector of vectors c++

I currently have a vector of vectors of a float type, which contain some data:
vector<vector<float> > v1;
vector<vector<float> > v2;
I wanted to know the fastest way to square each element in v1 and store the result in v2. Currently I just access each element of v1, multiply it by itself and store it in v2, as seen below:
for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 10; j++) {
        v2[i][j] = v1[i][j] * v1[i][j];
    }
}
With a bit of luck the compiler you are using will understand what you want to do and convert it to use the CPU's SSE instructions, which do the squaring in parallel. In that case your code is close to optimal speed (on a single core). You could also try the Eigen library (http://eigen.tuxfamily.org/), which provides more reliable means of achieving high performance. You would then get something like
ArrayXXf v1 = ArrayXXf::Random(10, 10);
ArrayXXf v2 = v1.square();
which also makes your intention more clear.
If you want to stay in the CPU world, OpenMP should help you easily. A single #pragma omp parallel for will divide the load between the available cores, and you can get further gains by telling the compiler to vectorize with ivdep and simd pragmas.
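For example, a sketch of that approach for the vector-of-vectors case (assuming v1 and v2 already have identical dimensions):

#include <vector>
using std::vector;

// Rows are split across threads; the inner loop is a simple contiguous pass
// that the compiler can auto-vectorize, or that omp simd vectorizes explicitly.
void square_all(const vector<vector<float>>& v1, vector<vector<float>>& v2)
{
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(v1.size()); ++i) {
        #pragma omp simd
        for (long j = 0; j < static_cast<long>(v1[i].size()); ++j)
            v2[i][j] = v1[i][j] * v1[i][j];
    }
}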
If a GPU is an option, this is a matrix calculation that is perfect for OpenCL. Google for OpenCL matrix multiplication examples. Basically, you can have 2000 threads each executing a single operation, or fewer threads each operating on a chunk of the vector, and the kernel is very simple to write.

Cache-friendly operation on a matrix in C++?

My application performs some operations on large matrices.
I recently came across the concept of the cache and the performance effect it can have, through this answer.
I would like to know which of the following algorithms is the most cache friendly for my case.
Algorithm 1:
for (int i = 0; i < size; i++)
{
    for (int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * c[j][j]; // K is a constant double; c is a 2-D array of double
    }
}
Algorithm 2:
double *A = new double[size];
for (int n = 0; n < size; n++)
    A[n] = c[n][n];
for (int i = 0; i < size; i++)
{
    for (int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * A[j];
    }
}
The size of my array is more than 1000x1000.
Benchmarking on my laptop shows Algorithm 2 is better than 1 for size 5000x5000.
Please note that I have multithreaded my application so that each thread operates on a set of rows.
For example: For array of size 1000x1000.
thread1 -> row 0 to row 249
thread2 -> row 250 to row 499
thread3 -> row 500 to row 749
thread4 -> row 750 to row 999
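For reference, here is one way Algorithm 2 combines with that row partitioning if the threading is done with OpenMP; the question doesn't say which threading API is used, so this is only a sketch:

// Copy the diagonal once, then let each thread work on its own block of rows.
double* A = new double[size];
for (int n = 0; n < size; n++)
    A[n] = c[n][n];

#pragma omp parallel for schedule(static)   // contiguous row blocks, like the split above
for (int i = 0; i < size; i++)
    for (int j = i + 1; j < size; j++)
        c[i][j] -= K * A[j];

delete[] A;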
If your benchmarks show a significant improvement for the second case, then it most likely is the better choice. But of course, to know what happens on "an average CPU" we'd have to measure a large number of CPUs that could be called average; there is no other way. And it really depends on the definition of "average CPU": are we talking about "any x86 (AMD or Intel) CPU", or "any random CPU that we can find in anything from a watch to the latest super-fast creation in the x86 range"?
The "copy the data in c[n][n]" method helps because it gets its own address, and doesn't get thrown out of the (L1) cache when the code walks its way over the larger matrix [and all the data you need for the multiplication is "close together". If you walk c[j][j], every j steps will jump sizeof(double) * (size * j + 1) bytes per iteration, so if size is anything more than 4, the next item needed wont be in the same cache-line, so another memory read is needed to get that data.
In other words, for anything that has a decent size cache (bigger than size * sizeof(double)), it's a definite benefit. Even with smaller cache, it's quite likely SOME benefit, but the chances are higher that the cached copy will be thrown out by some part of c[i][j].
In summary, the second algorithm is very likely better for nearly all options.
Algorithm 2 benefits from what's called "spatial locality": moving the diagonal into a one-dimensional array makes it reside at consecutive addresses in memory, and thereby:
It enjoys the benefit of fetching multiple useful elements per cache line (presumably 64 bytes, depending on your CPU), making better use of the cache and of memory bandwidth (whereas reading c[n][n] directly also drags in a lot of useless data that shares its cache lines).
It enjoys the benefit of the hardware stream prefetchers (assuming your CPU has them), which aggressively run ahead of your code along the page and bring the data into the lower cache levels in advance, hiding memory latency.
It should be pointed out that moving the data into A doesn't necessarily improve cacheability, since A still competes with a lot of data constantly streaming in from c and thrashing the cache. However, since A is used over and over, there's a good chance a decent LRU policy will keep it in the cache anyway. You could help that by using streaming (non-temporal) memory operations for array c; note that these are very volatile performance tools and may in some scenarios reduce performance if used incorrectly.
Another potential benefit could come from mixing in software prefetches slightly ahead of reaching each new line of the array.
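As an illustration, a software prefetch could be mixed into Algorithm 2 roughly like this (a sketch only; _mm_prefetch is the x86 intrinsic, and the prefetch placement and distance would need tuning on the target machine):

#include <xmmintrin.h>  // _mm_prefetch / _MM_HINT_T0 (x86 intrinsic)

// While the current row is still being processed, request the first cache
// line of the next row, so it is already on its way when the loop gets there;
// the hardware prefetcher can then take over for the rest of that row.
for (int i = 0; i < size; i++)
{
    if (i + 1 < size)
        _mm_prefetch(reinterpret_cast<const char*>(&c[i + 1][i + 2]), _MM_HINT_T0);
    for (int j = i + 1; j < size; j++)
        c[i][j] -= K * A[j];
}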

Where is the bottleneck in this code?

I have the following tight loop that makes up the serial bottleneck of my code. Ideally I would parallelize the function that calls this, but that is not possible.
//n is about 60
for (int k = 0; k < n; k++)
{
    double fone = z[k*n+i+1];
    double fzer = z[k*n+i];
    z[k*n+i+1] = s*fzer + c*fone;
    z[k*n+i]   = c*fzer - s*fone;
}
Are there any optimizations that can be made such as vectorization or some evil inline that can help this code?
I am looking into finding eigensolutions of tridiagonal matrices. http://www.cimat.mx/~posada/OptDoglegGraph/DocLogisticDogleg/projects/adjustedrecipes/tqli.cpp.html
Short answer: Change the memory layout of your matrix from row-major order to column-major order.
Long answer:
It seems you are accessing the i-th and (i+1)-th columns of a matrix stored in row-major order, probably a big matrix that doesn't fit into the CPU cache as a whole. Basically, on every loop iteration the CPU has to wait for RAM (on the order of a hundred cycles). After a few iterations the address prediction should, in theory, kick in and the CPU should speculatively load the data even before the loop accesses it. That should help with RAM latency. But it still leaves the problem that the code uses the memory bus inefficiently: CPU and memory never exchange single bytes, only whole cache lines (64 bytes on current processors). Of every 64-byte cache line loaded and stored, your code touches only 16 bytes (a quarter).
Transposing the matrix and accessing it in its native major order would increase memory bus utilization four-fold. Since that is probably the bottleneck of your code, you can expect a speedup of about the same order.
Whether it is worth it, depends on the rest of your algorithm. Other parts may of course suffer because of the changed memory layout.
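Concretely, if the matrix is stored transposed (call it zt, an assumed name, with zt[i*n + k] == z[k*n + i]), the loop walks two contiguous runs of n doubles:

// zt is the transposed matrix: columns i and i+1 of the original layout are
// now two contiguous blocks of n doubles, so every loaded cache line is fully used.
for (int k = 0; k < n; k++)
{
    double fone = zt[(i + 1) * n + k];
    double fzer = zt[i * n + k];
    zt[(i + 1) * n + k] = s * fzer + c * fone;
    zt[i * n + k]       = c * fzer - s * fone;
}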
I take it you are rotating something (or rather, lots of things) by the same angle, with s being a sine and c a cosine?
Counting backwards is always good fun and cuts out a variable comparison on each iteration, and it should work here. Making the counter the index might also save a bit of time (it cuts out a bit of arithmetic, as others have said).
for (int k = (n - 1) * n + i; k >= 0; k -= n)
{
    double fone = z[k+1];
    double fzer = z[k];
    z[k+1] = s*fzer + c*fone;
    z[k]   = c*fzer - s*fone;
}
Nothing dramatic here, but it looks tidier if nothing else.
As a first move I'd cache the pointer in this loop:
//n is about 60
double *cur_z = &z[0*n + i];
for (int k = 0; k < n; k++)
{
    double fone = *(cur_z + 1);
    double fzer = *cur_z;
    *(cur_z + 1) = s*fzer + c*fone;
    *cur_z       = c*fzer - s*fone;
    cur_z += n;
}
Second, I think it's better to make a templatized version of this function. That way you can get a good performance benefit if your matrix holds integer values (since FPU operations are slower).

openMP histogram comparison

I am working on code that compares image histograms by calculating correlation, intersection, chi-square and a few other metrics. The general structure of these functions is very similar.
Usually I work with pthreads, but this time I decided to build a small prototype with OpenMP (due to its simplicity) and see what kind of results I would get.
This is the example comparing by correlation; the code is identical to the serial implementation except for the single OpenMP loop directive.
double comp(CHistogram* h1, CHistogram* h2) {
    double Sa = 0;
    double Sb = 0;
    double Saa = 0;
    double Sbb = 0;
    double Sab = 0;
    double a, b;
    int N = h1->length;

    #pragma omp parallel for reduction(+:Sa,Sb,Saa,Sbb,Sab) private(a,b)
    for (int i = 0; i < N; i++) {
        a = h1->data[i];
        b = h2->data[i];
        Sa  += a;
        Sb  += b;
        Saa += a*a;
        Sbb += b*b;
        Sab += a*b;
    }

    double sUp   = Sab - Sa*Sb / N;
    double sDown = (Saa - Sa*Sa / N) * (Sbb - Sb*Sb / N);
    return sUp / sqrt(sDown);
}
Are there any more ways to speed up this function with OpenMP?
Thanks!
PS: I know the fastest way would be to compare different pairs of histograms across multiple threads, but that is not applicable to my situation, since only two histograms are available at a time.
Tested on a quad-core machine.
I'm a little uncertain: over a longer run OpenMP seems to perform better than the serial version, but if I compare just a single histogram and measure the time in microseconds, then the serial version is about 20 times faster.
I guess OpenMP applies some optimization once it sees the outer for loop. But in the real solution I will have some code between the histogram comparisons, and I'm not sure it will behave the same way.
OpenMP takes some time to set up the parallel region. This means you need to be careful that the setup overhead isn't greater than the performance gained from running the loop in parallel. In your case, OpenMP will only speed up the calculation once N reaches a certain size.
You should also think about ways to reduce the total number of parallel regions; for instance, is it possible to set up a parallel region outside this function so that you compare different histograms in parallel? A cheap mitigation inside the function is sketched below.
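One option is OpenMP's if clause, which falls back to serial execution when the histogram is too short to pay for the parallel region; a sketch of the loop inside comp() (OMP_MIN_LEN is a made-up tuning constant, not an OpenMP setting, and its value would need measuring):

const int OMP_MIN_LEN = 10000;  // machine-dependent; measure to find the break-even point
#pragma omp parallel for reduction(+:Sa,Sb,Saa,Sbb,Sab) if (N > OMP_MIN_LEN)
for (int i = 0; i < N; i++) {
    double a = h1->data[i];     // declared inside the loop, so no private clause is needed
    double b = h2->data[i];
    Sa  += a;
    Sb  += b;
    Saa += a*a;
    Sbb += b*b;
    Sab += a*b;
}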