OpenMP histogram comparison - C++

I am working on code that compares image histograms by calculating correlation, intersection, chi-square and a few other measures. These functions all look very similar to one another.
Usually I work with pthreads, but this time I decided to build a small prototype with OpenMP (due to its simplicity) and see what kind of results I would get.
This is an example of comparison by correlation; the code is identical to the serial implementation except for the single OpenMP line.
double comp(CHistogram* h1, CHistogram* h2){
    double Sa = 0;
    double Sb = 0;
    double Saa = 0;
    double Sbb = 0;
    double Sab = 0;
    double a, b;
    int N = h1->length;

    #pragma omp parallel for reduction(+:Sa,Sb,Saa,Sbb,Sab) private(a,b)
    for (int i = 0; i < N; i++){
        a = h1->data[i];
        b = h2->data[i];
        Sa  += a;
        Sb  += b;
        Saa += a*a;
        Sbb += b*b;
        Sab += a*b;
    }

    double sUp   = Sab - Sa*Sb / N;
    double sDown = (Saa - Sa*Sa / N) * (Sbb - Sb*Sb / N);
    return sUp / sqrt(sDown);
}
Are there more ways to speed up this function with OpenMP?
Thanks!
PS: I know that the fastest way would be to compare different pairs of histograms across multiple threads, but that is not applicable to my situation, since only 2 histograms are available at a time.
Tested on a quad-core machine.
I am a little uncertain about the results: over a longer run OpenMP seems to perform better than the serial version, but if I compare just a single pair of histograms and measure the time in microseconds, the serial version is about 20 times faster.
I guess OpenMP applies some optimization once it sees the outer for loop. But in the real solution I will have some code in between histogram comparisons, and I am not sure it will behave the same way.

OpenMP takes some time to set up the parallel region. This overhead means you need to be careful that it isn't greater than the performance gained by parallelizing the work. In your case this means that only once N reaches a certain size will OpenMP speed up the calculation.
You should think about ways to reduce the total number of parallel regions; for instance, is it possible to set up a parallel region outside this function so that you compare different histograms in parallel?
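As a minimal sketch of the first point, OpenMP's if clause lets small histograms skip the parallel overhead entirely (the cutoff constant is an assumption that would need tuning by measurement):

const int MIN_PARALLEL_N = 10000; // assumed threshold, tune by measurement

#pragma omp parallel for reduction(+:Sa,Sb,Saa,Sbb,Sab) if(N > MIN_PARALLEL_N)
for (int i = 0; i < N; i++){
    double a = h1->data[i]; // declared locally, so implicitly private
    double b = h2->data[i];
    Sa  += a;
    Sb  += b;
    Saa += a*a;
    Sbb += b*b;
    Sab += a*b;
}

When the if clause evaluates to false, the loop runs on a single thread without forking a team, so the small-N case should behave essentially like the serial code.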

Related

Strange behavior in matrix formation (C++, Armadillo)

I have a while loop that continues as long as an energy variable (type double) has not converged to below a certain threshold. One of the variables needed to calculate this energy is an Armadillo matrix of doubles, named f_mo. In the while loop, f_mo is updated iteratively, so I calculate it at the beginning of each loop as:
arma::mat f_mo = h_core_mo; // h_core_mo is an Armadillo matrix of doubles
for(size_t p = 0; p < n_mo; p++) {          // n_mo is of type size_t
    for(size_t q = 0; q < n_mo; q++) {
        double sum = 0.0;
        for(size_t i = 0; i < n_occ; i++) { // n_occ is of type size_t
            //f_mo(p,q) += 2.0*g_mat_full_qp1_qp1_mo(p*n_mo + q, i*n_mo + i) - g_mat_full_qp1_qp1_mo(p*n_mo + i, i*n_mo + q); // all g_mat_ are Armadillo matrices of doubles
            sum += 2.0*g_mat_full_qp1_qp1_mo(p*n_mo + q, i*n_mo + i) - g_mat_full_qp1_qp1_mo(p*n_mo + i, i*n_mo + q);
        }
        for(size_t i2 = 0; i2 < n_occ2; i2++) { // n_occ2 is of type size_t
            //f_mo(p,q) -= 1.0*g_mat_full_qp1_qp2_mo(p*n_mo + q, i2*n_mo2 + i2);
            sum -= 1.0*g_mat_full_qp1_qp2_mo(p*n_mo + q, i2*n_mo2 + i2);
        }
        f_mo(p,q) += sum;
    }
}
But say I replace the sum (which I add to f_mo(p,q) at the end) with direct addition to f_mo(p,q) (the commented-out code). The output f_mo matrices are identical to machine precision. Nothing about the code should change; the only variables affected in the loop are sum and f_mo. And yet the code converges to a different energy and in a vastly different number of while-loop iterations. I am at a loss as to the cause of the difference. When I run the same code 2, 3, 4, 5 times, I get the same result every time. When I recompile with no optimization, I get the same issue. When I run on a different computer (controlling for environment), I again get a discrepancy in the number of while-loop iterations despite identical f_mo, and the total iteration counts for the two methods (sum += versus f_mo(p,q) +=) differ.
It is worth noting that the point at which the outputs of the two codes diverge is always g_mat_full_qp1_qp2_mo, which is recalculated later in the while loop. HOWEVER, every variable going into the calculation of g_mat_full_qp1_qp2_mo is identical between the two codes. This leads me to think there is something more profound about C++ that I do not understand. I welcome any ideas on how to proceed in debugging this behavior (I am all but certain it is not a typical bug, and I've controlled for environment and optimization).
I'm going to assume this is a Hartree-Fock or some other kind of electronic structure calculation where you are adding the two-electron integrals to the core Hamiltonian, and apply some domain knowledge.
Part of that assumption is that the individual elements of the two-electron integrals are very small, in particular compared to the core Hamiltonian. Hence, as 1201ProgramAlarm mentions in their comment, the order of addition will matter. You will get a more accurate result if you add the smaller numbers together first, to avoid losing precision when adding two numbers many orders of magnitude apart. Because you iterate this process until the Fock matrix f_mo has tightly converged, you eventually converge to the same value.
In order to add up the numbers in a more accurate order, and hopefully converge faster, most electronic structure programs have a separate routine to calculate the two-electron integrals and then add them to the core Hamiltonian, which is what you are doing, element by element, in your example code.
Presentation on numerical computing.
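To illustrate why the order of addition matters for doubles, here is a standalone toy example (not the asker's code), adding a small value to a large one in two different orders:

#include <cstdio>

int main() {
    double big   = 1.0e16;
    double small = 1.0;
    // Floating-point addition is not associative: near 1e16 the spacing
    // between representable doubles is 2.0, so adding 1.0 directly is lost.
    double left  = (big + small) + small; // small is absorbed both times
    double right = big + (small + small); // 2.0 survives and is representable
    std::printf("%.1f vs %.1f\n", left - big, right - big); // prints 0.0 vs 2.0
    return 0;
}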

OpenMP nested for loop with ordered output

I'm currently trying to find a fast and reliable way to parallelize a set of loops with if conditions, where I need to save a result in the inner loop.
The code is supposed to go through a huge number of points in a 3D grid. For some points within this volume I have to check another condition (an angle check), and if that condition is fulfilled I have to calculate a density.
The fastest approaches so far were #pragma omp parallel for private(x,y,z) collapse(3) outside all the for loops, or #pragma omp parallel for on the innermost loop (phiInd), which is not only the largest loop but also calls a CPU-intensive function.
I need to store the density value in densityarr within the inner loop. The density array is then saved separately later.
My problem now is that, depending on the number of threads I set, I get different results in my density array. The serial version and an OpenMP run with just 1 thread produce identical results.
Increasing the number of threads leads to results at the same points, but those results differ from the serial version.
I know there is #pragma omp for ordered, but that makes the calculation far too slow.
Is there a way to parallelize this loop while still getting my results ordered according to my points (x,y,z)?
Or, put more clearly: why does increasing the thread count change my result?
double phipoint, Rpoint, zpoint;
double phiplane;
double distphi = 2.0 * M_PI / nPlanes; // set desired distance to phi to assign point to fluxtube plane
double* densityarr = new double[max_x_steps * max_y_steps * max_z_steps];

for (z = 0; z < max_z_steps; z++) {
    for (x = 0; x < max_x_steps; x++) {
        for (y = 0; y < max_y_steps; y++) {
            double x_center = x * stepSizeGrid - max_x / 2;
            double y_center = y * stepSizeGrid - max_y / 2;
            double z_center = z * stepSizeGrid - max_z / 2;
            cartesianCoordinate* pos = new cartesianCoordinate(x_center, y_center, z_center);
            linearToroidalCoordinate* tor = linearToroidal(*pos);
            simpleToroidalCoordinate* stc = simpleToroidal(*pos);
            phipoint = tor->phi;
            if (stc->r <= 0.174/*0.175*/) { // check if point is in vessel
                for (int phiInd = 0; phiInd < nPlanes; ++phiInd) {
                    phiplane = phis[phiInd];
                    if (abs(phipoint - phiplane) <= distphi) { // find right plane for point
                        Rpoint = tor->R;
                        zpoint = tor->z;
                        densityarr[z * max_y_steps * max_x_steps + x * max_y_steps + y] = TubePlanes[phiInd].getMinDistDensity(Rpoint, zpoint);
                    }
                }
            }
            delete pos, tor, stc;
        }
    }
}
First, you need to address the errors in your parallel versions. You have race conditions writing to the shared variables phipoint (with the outer loops parallel) and phiplane, Rpoint, zpoint (with any of the loops parallel). You must declare those private, or better yet, declare them locally in the first place (which makes them implicitly private). That way the code is much easier to reason about, which is very important for parallel code.
Parallelizing the outer loops as you describe is the obvious and very likely most efficient approach. If there are severe load imbalances (stc->r <= 0.174 not being evenly distributed among the points), you might want to use schedule(dynamic).
Parallelizing the inner loop seems unnecessary in your case. Generally, outer loops provide better efficiency because of lower overhead, unless they don't expose enough parallel work, or have race conditions, dependencies, or cache issues. It would however be a worthwhile exercise to try it and measure. Note also that there may be a race condition when writing to densityarr if more than one of the phis satisfies the condition. Overall that loop seems a bit odd: since you only use at most one of the results in densityarr, you could instead reverse the loop and cancel once you have found the first match. That helps serial execution a lot, but may inhibit parallelization. Also, if you don't find a phi that satisfies the condition, or if the point is not in the vessel, the respective entry in densityarr remains uninitialized; that can be very dangerous, because you cannot later determine whether the value is valid.
A general remark: don't allocate objects with new unless you need to. Just put pos on the stack; that will likely give you better performance. Allocating memory inside a (parallel) loop can be a performance issue, so you might want to rethink the way you obtain your Toroidals.
Note that I assume TubePlanes[phiInd].getMinDistDensity has no side effects, otherwise parallelization would be problematic.
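A sketch of how the outer loops might look with the temporaries declared locally, following the advice above and keeping the question's types and names (this is untested and only meant to show the structure):

#pragma omp parallel for collapse(3) schedule(dynamic)
for (int z = 0; z < max_z_steps; z++) {
    for (int x = 0; x < max_x_steps; x++) {
        for (int y = 0; y < max_y_steps; y++) {
            double x_center = x * stepSizeGrid - max_x / 2;
            double y_center = y * stepSizeGrid - max_y / 2;
            double z_center = z * stepSizeGrid - max_z / 2;
            cartesianCoordinate pos(x_center, y_center, z_center); // on the stack
            linearToroidalCoordinate* tor = linearToroidal(pos);
            simpleToroidalCoordinate* stc = simpleToroidal(pos);
            double phipoint = tor->phi; // declared locally => implicitly private
            if (stc->r <= 0.174) {      // check if point is in vessel
                for (int phiInd = 0; phiInd < nPlanes; ++phiInd) {
                    if (abs(phipoint - phis[phiInd]) <= distphi) { // find right plane
                        densityarr[z * max_y_steps * max_x_steps + x * max_y_steps + y] =
                            TubePlanes[phiInd].getMinDistDensity(tor->R, tor->z);
                    }
                }
            }
            // delete each pointer separately; the comma form in the question
            // (delete pos, tor, stc;) only deletes the first one
            delete tor;
            delete stc;
        }
    }
}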

Running over an unrolled linked list takes around 40% of the code runtime - are there any obvious ways to optimise it?

I am the author of an open source scientific code called vampire (http://github.com/richard-evans/vampire), and since it is compute-intensive, any improvement in code performance can significantly increase the amount of research that can be done. Typical runtimes of this code can be hundreds of core hours, so I am always looking for ways to improve the performance-critical sections. However, I have become a bit stuck on the following relatively innocuous-looking bit of code, which makes up around 40% of the runtime:
for (int atom = start_index; atom < end_index; atom++){
    register double Hx = 0.0;
    register double Hy = 0.0;
    register double Hz = 0.0;
    const int start = atoms::neighbour_list_start_index[atom];
    const int end   = atoms::neighbour_list_end_index[atom] + 1;
    for (int nn = start; nn < end; nn++){
        const int natom  = atoms::neighbour_list_array[nn];
        const double Jij = atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij;
        Hx -= Jij * atoms::x_spin_array[natom];
        Hy -= Jij * atoms::y_spin_array[natom];
        Hz -= Jij * atoms::z_spin_array[natom];
    }
    atoms::x_total_spin_field_array[atom] += Hx;
    atoms::y_total_spin_field_array[atom] += Hy;
    atoms::z_total_spin_field_array[atom] += Hz;
}
The high-level overview of the function and variables is as follows: there is a 1D array of a physical vector (split into three 1D arrays for the x, y, z components for memory-caching purposes: atoms::x_spin_array, etc.) called 'spin'. Each of these spins interacts with some other spins, and all the interactions are stored as a 1D neighbour list (atoms::neighbour_list_array). The relevant range of interactions for each atom is given by a start and end index into the neighbour list array, held in two separate arrays. At the end of the calculation each atomic spin has an effective field, which is the vector sum of its interactions.
Given the small amount of code and the sizable fraction of the runtime it occupies, I have done my best, but I feel there must be a way to optimize this further; as a physicist rather than a computer scientist, maybe I am missing something?
You've got a constant stream of multiplies, subtracts and adds on contiguous data. That seems like an ideal use of SSE. If it's memory-bandwidth limited, then consider OpenCL/CUDA instead.
Try using this library if you aren't familiar with all the low-level instructions.
That inner loop could potentially then be restructured significantly, maybe leading to speed-ups.
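As a rough sketch of how one might at least hint vectorization without writing intrinsics by hand (assuming an OpenMP 4.0+ compiler; the indirect loads through natom may still limit the gain):

for (int atom = start_index; atom < end_index; atom++){
    double Hx = 0.0, Hy = 0.0, Hz = 0.0;
    const int start = atoms::neighbour_list_start_index[atom];
    const int end   = atoms::neighbour_list_end_index[atom] + 1;
    // ask the compiler to vectorize the inner reduction
    #pragma omp simd reduction(+:Hx,Hy,Hz)
    for (int nn = start; nn < end; nn++){
        const int natom  = atoms::neighbour_list_array[nn];
        const double Jij = atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij;
        Hx -= Jij * atoms::x_spin_array[natom];
        Hy -= Jij * atoms::y_spin_array[natom];
        Hz -= Jij * atoms::z_spin_array[natom];
    }
    atoms::x_total_spin_field_array[atom] += Hx;
    atoms::y_total_spin_field_array[atom] += Hy;
    atoms::z_total_spin_field_array[atom] += Hz;
}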
If the x, y, z components are indeed linked lists, doing x[i], y[i] and z[i] will cause the lists to be traversed multiple times, giving (n^2)/2 iterations. Using vectors makes each access an O(1) operation.
You mention that the three coordinates are split out for memory-caching purposes, but this affects the level-1 and level-2 cache locality because you are accessing three different areas of memory. The linked list also hurts your cache locality.
Using something like:
struct vector3d {
    double x;
    double y;
    double z;
};

std::vector<vector3d> spin;
std::vector<vector3d> total_spin;
This should improve the cache locality, as the x, y and z values are adjacent in memory and the spins occupy a linear block of memory.
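For illustration, the inner-loop body might then read like this (a sketch only, reusing the index names from the question):

for (int nn = start; nn < end; nn++){
    const int natom  = atoms::neighbour_list_array[nn];
    const double Jij = atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij;
    // one contiguous vector3d per neighbour instead of three scattered reads
    Hx -= Jij * spin[natom].x;
    Hy -= Jij * spin[natom].y;
    Hz -= Jij * spin[natom].z;
}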
I feel the following suggestions can help you optimize the code a bit, if not completely:
Use initialization over assignment wherever possible.
Prefer pre-increment over post-increment for better speed (believe me, it does make a difference).
Apart from that I think the code is just fine. Every data structure has its pros and cons; you have to live with them.
Happy coding!

OpenMP and memory bandwidth restriction

Edit: my first code sample was wrong. Fixed with a simpler one.
I am implementing a C++ library for algebraic operations on large vectors and matrices.
I found that on x86-64 CPUs, OpenMP-parallel vector additions, dot products, etc. are not much faster than single-threaded code.
Parallel operations are between -1% and 6% faster than single-threaded.
This happens because of the memory bandwidth limitation (I think).
So, the question is: is there a real performance benefit for code like this:
void DenseMatrix::identity()
{
    assert(height == width);
    size_t i = 0;
    #pragma omp parallel for if (height > OPENMP_BREAK2)
    for(unsigned int y = 0; y < height; y++)
        for(unsigned int x = 0; x < width; x++, i++)
            elements[i] = x == y ? 1 : 0;
}
In this sample there is no serious drawback to using OpenMP.
But if I work with OpenMP on sparse vectors and sparse matrices, I cannot use, for instance, *.push_back(), and in that case the question becomes serious. (Elements of sparse vectors are not contiguous like those of dense vectors, so parallel programming has a drawback, because result elements can arrive at any time, not in order from lower to higher index.)
I don't think this is a memory-bandwidth problem. I clearly see a problem with r: r is accessed from multiple threads, which causes both data races and false sharing. False sharing can dramatically hurt your performance.
I'm wondering whether you even get the correct answer, because there are data races on r. Did you get the correct answer?
However, the solution is very simple. The operation conducted on r is a reduction, which can easily be expressed with OpenMP's reduction clause.
http://msdn.microsoft.com/en-us/library/88b1k8y5(v=vs.80).aspx
Try simply appending reduction(+ : r) to #pragma omp parallel.
(Note: additions on double are not associative. You may see some precision errors, or some differences from the result of the serial code.)
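Since the original code sample was edited out of the question, here is a generic sketch of what the suggested reduction looks like on a dot product (the names are illustrative, not taken from the library):

double dot(const double* a, const double* b, int n)
{
    double r = 0.0;
    // each thread accumulates a private copy of r; OpenMP sums them at the end
    #pragma omp parallel for reduction(+ : r)
    for (int i = 0; i < n; i++)
        r += a[i] * b[i];
    return r;
}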

Where is the bottleneck in this code?

I have the following tight loop that makes up the serial bottleneck of my code. Ideally I would parallelize the function that calls it, but that is not possible.
//n is about 60
for (int k = 0; k < n; k++)
{
    double fone = z[k*n+i+1];
    double fzer = z[k*n+i];
    z[k*n+i+1] = s*fzer + c*fone;
    z[k*n+i]   = c*fzer - s*fone;
}
Are there any optimizations that can be made, such as vectorization or some evil inlining, that could help this code?
I am looking into finding eigensolutions of tridiagonal matrices. http://www.cimat.mx/~posada/OptDoglegGraph/DocLogisticDogleg/projects/adjustedrecipes/tqli.cpp.html
Short answer: change the memory layout of your matrix from row-major order to column-major order.
Long answer:
It seems you are accessing the i-th and (i+1)-th columns of a matrix stored in row-major order, probably a big matrix that doesn't fit into the CPU cache as a whole. Basically, on every loop iteration the CPU has to wait for RAM (on the order of a hundred cycles). After a few iterations the hardware prefetcher should, in theory, kick in and speculatively load the data before the loop accesses it. That helps with RAM latency. But it still leaves the problem that the code uses the memory bus inefficiently: CPU and memory never exchange single bytes, only cache lines (64 bytes on current processors). Of every 64-byte cache line loaded and stored, your code only touches 16 bytes (a quarter).
Transposing the matrix and accessing it in its native major order would increase memory bus utilization four-fold. Since that is probably the bottleneck of your code, you can expect a speedup of about the same order.
Whether it is worth it depends on the rest of your algorithm. Other parts may of course suffer because of the changed memory layout.
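For illustration, if the data were kept in a transposed copy zt (a hypothetical name; element (k, i) of the original would sit at zt[i*n + k]), the loop would walk two contiguous rows:

// sketch only: zt is a transposed / column-major copy of z
for (int k = 0; k < n; k++)
{
    double fone = zt[(i+1)*n + k];
    double fzer = zt[i*n + k];
    zt[(i+1)*n + k] = s*fzer + c*fone;
    zt[i*n + k]     = c*fzer - s*fone;
}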
I take it you are rotating something (or rather, lots of things) by the same angle (s being a sine, c being a cosine)?
Counting backwards is always good fun and cuts out a loop-variable comparison on each iteration, and should work here. Making the counter the index might save a bit of time also (it cuts out a bit of arithmetic, as said by others).
for (int k = (n-1)*n + i; k >= 0; k -= n)
{
    double fone = z[k+1];
    double fzer = z[k];
    z[k+1] = s*fzer + c*fone;
    z[k]   = c*fzer - s*fone;
}
Nothing dramatic here, but it looks tidier if nothing else.
As a first move I'd cache pointers in this loop:
//n is about 60
double *cur_z = &z[0*n + i];
for (int k = 0; k < n; k++)
{
    double fone = *(cur_z + 1);
    double fzer = *cur_z;
    *(cur_z + 1) = s*fzer + c*fone;
    *cur_z       = c*fzer - s*fone;
    cur_z += n;
}
Second, I think it's better to make a templatized version of this function. As a result, you can get a good performance benefit if your matrix holds integer values (since FPU operations are slower).
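A sketch of what such a templated variant might look like (the function name and signature are invented for illustration; T would be double in the current code):

template <typename T>
void rotate_pair(T* z, int n, int i, T s, T c)
{
    T* cur_z = &z[i];
    for (int k = 0; k < n; k++)
    {
        T fone = *(cur_z + 1);
        T fzer = *cur_z;
        *(cur_z + 1) = s*fzer + c*fone;
        *cur_z       = c*fzer - s*fone;
        cur_z += n; // step to the next row, same column pair
    }
}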