C++ OpenMP directives for a parallel for loop?

I am trying OpenMP on a particular code snippet. I'm not sure whether the snippet needs a revamp; perhaps it is set up too rigidly for a sequential implementation. Anyway, here is the (pseudo-)code that I'm trying to parallelize:
#pragma omp parallel for private(id, local_info, current_local_cell_id, local_subdomain_size) shared(cells, current_global_cell_id, global_id)
for(id = 0; id < grid_size; ++id) {
    local_info = cells.get_local_subdomain_info(id);
    local_subdomain_size = local_info.size();
    ...do other stuff...
    do {
        current_local_cell_id = cells.get_subdomain_cell_id(id);
        global_id.set(id, current_global_cell_id + current_local_cell_id);
    } while(id < local_subdomain_size && ++id);
    current_global_cell_id += local_subdomain_size;
}
This makes complete sense (after staring at it for some time) when executed sequentially, which may also mean that it needs to be rewritten for OpenMP. My concern is that current_local_cell_id and local_subdomain_size are private, but current_global_cell_id and global_id are shared.
Hence the statement current_global_cell_id += local_subdomain_size after the inner loop:
do {
...
} while(...)
current_global_cell_id += local_subdomain_size;
might lead to errors in the OpenMP setting, I suspect. I would greatly appreciate it if any of the OpenMP experts out there could point me to the OMP directives that would let me keep the changes to the code to a minimum while still taking advantage of OpenMP for this type of for loop.

I'm not sure I understand your code. However, I think you really want some kind of parallel accumulation.
You could use a pattern like
size_t total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < MAXITEMS; i++)
{
    total += getvalue(i); // TODO replace with your logic
}
// total has been 'magically' combined by OMP
On a related note, when you use gcc you can just use the __gnu_parallel::accumulate drop-in replacement for std::accumulate, which does exactly the same. See Chapter 18. Parallel Mode
size_t total = __gnu_parallel::accumulate(c.begin(), c.end(), 0, &myvalue_accum);
You can even compile with -D_GLIBCXX_PARALLEL, which makes all uses of the std algorithms automatically parallelized where possible. Don't use that unless you know what you're doing! Frequently performance just suffers, and the risk of introducing bugs due to unexpected parallelism is real.
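For completeness, here is a self-contained sketch of the accumulate variant (compile with g++ -fopenmp; GCC/libstdc++ only). The myvalue_accum function here is just a placeholder binary operation standing in for your own logic, and the size_t initial value keeps the accumulation itself in size_t:
#include <parallel/numeric>  // __gnu_parallel::accumulate (libstdc++ parallel mode)
#include <vector>
#include <cstddef>

// Placeholder accumulation function; replace with your own logic.
static std::size_t myvalue_accum(std::size_t acc, int item) {
    return acc + static_cast<std::size_t>(item);
}

int main() {
    std::vector<int> c(1000, 1);
    std::size_t total = __gnu_parallel::accumulate(
        c.begin(), c.end(), std::size_t{0}, myvalue_accum);
    return total == 1000 ? 0 : 1;
}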

Changing id inside the loop is not correct. There is no way to dispatch the loop iterations to different threads, because the loop step does not produce a predictable sequence of id values.
Why are you using id inside that do-while loop?

Related

how to optimize this eigen library code via openMP or by any other means

I wanted to parallelize and optimize my code, which uses the Eigen library, but I am stuck in the following situation.
A part of the code that takes 2-3 seconds per iteration is run many times because of a while loop. I am unable to use OpenMP on the while loop itself, and when I use it on just that part of the code it shows no speedup.
Code structure:
while(error > 1e-6) {
    // some code ...

    // part that I want to optimize
    #pragma omp for
    for(int i = 0; i < 18; i++)
    {
        XFG_e.coeffRef(IDOF(i)-1) += XFE(i);
        XFG_i.coeffRef(IDOF(i)-1) += XFI(i);
        for (int j = 0; j < 18; j++)
        {
            XKG.coeffRef(IDOF(i)-1, IDOF(j)-1) += XKT(i,j);
            XMG.coeffRef(IDOF(i)-1, IDOF(j)-1) += XME(i,j);
        }
    }
}
Please suggest ways to optimize this code: any better technique for using OpenMP, any library-level optimization, alternative libraries, etc.
That bit of code is first of all tiny: 18-squared iterations. The barrier at the end of the parallel loop may very well be more expensive than the operations.
Next, that bit is written the wrong way for omp parallelization:
x(i+1) += y(i)
x(i-1) += y(i)
gives no guarantee that the left-hand sides are disjoint, so you will get parallel write conflicts. You first have to rewrite that as
x(i) += y(i-1) + y(i+1)
But first ask yourself: what does this code do? Given the intended algorithm, would it be possible to make the left-hand sides obviously disjoint with some rewrite?
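As a minimal illustration of that gather-style rewrite (using a plain std::vector rather than the Eigen objects from the question), each iteration now writes only its own x[i], so the writes are disjoint and the loop can be parallelized safely:
#include <vector>

// Assumes y.size() >= x.size(); iteration i reads its neighbours in y and
// writes only x[i], so there are no parallel write conflicts.
void gather_update(std::vector<double> &x, const std::vector<double> &y) {
    const int n = static_cast<int>(x.size());
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i)
        x[i] += y[i - 1] + y[i + 1];
}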

parallel programming in OpenMP

I have the following piece of code.
for (i = 0; i < n; ++i) {
    ++cnt[offset[i]];
}
where offset is an array of size n containing values in the range [0, m) and cnt is an array of size m initialized to 0. I use OpenMP to parallelize it as follows.
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
    ++cnt[offset[i]];
}
According to the discussion in this post, if offset[i1] == offset[i2] for i1 != i2, the above piece of code may result in incorrect cnt. What can I do to avoid this?
This code:
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
    ++cnt[offset[i]];
}
contains a race condition on the updates of the array cnt. To solve it you need to guarantee mutual exclusion of those updates, which can be achieved with (for instance) #pragma omp atomic update (a sketch follows below), but as already pointed out in the comments:
However, this resolves just correctness and may be terribly inefficient due to heavy cache contention and synchronization needs (including false sharing). The only solution then is to have each thread keep its private copy of cnt and reduce these copies at the end.
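For reference, a minimal sketch of that atomic variant applied to the loop from the question (correct, but every update synchronizes on cnt):
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
    #pragma omp atomic update
    ++cnt[offset[i]];
}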
The alternative solution is to have a private array per thread and, at the end of the parallel region, perform a manual reduction of all those arrays into one. An example of such an approach can be found here.
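Since that link may not be at hand, here is a minimal sketch of the manual reduction, assuming cnt and offset are plain int arrays of length m and n respectively:
#include <vector>

void count_with_private_copies(int *cnt, const int *offset, int n, int m) {
    #pragma omp parallel
    {
        // Each thread fills its own zero-initialized copy of the histogram.
        std::vector<int> local(m, 0);

        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            ++local[offset[i]];

        // Merge the private copies into the shared array; the critical
        // section keeps the merge itself race-free.
        #pragma omp critical
        for (int j = 0; j < m; ++j)
            cnt[j] += local[j];
    }
}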
Fortunately, with OpenMP 4.5 you can reduce arrays using a dedicated pragma, namely:
#pragma omp parallel for reduction(+:cnt)
You can have a look at this example to see how to apply that feature.
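For instance, assuming cnt is accessed through a pointer so that an explicit array section is needed (for a fixed-size array, reduction(+:cnt) alone is enough):
// Requires OpenMP 4.5 or later (array-section reductions).
#pragma omp parallel for reduction(+:cnt[:m])
for (int i = 0; i < n; ++i)
    ++cnt[offset[i]];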
Regarding the reduction of arrays versus the atomic approach, it is worth mentioning this point kindly made by Jérôme Richard:
Note that this is fast only if the array is not huge (the atomic-based solution could be faster in this specific case depending on the platform and on whether the values are conflicting). So that is m << n.
As always, profiling is the key; you should test your code with the aforementioned approaches to find out which one is the most efficient.

Is there any difference between variables in a private clause and variables defined within a parallel region in OpenMP?

I was wondering if there is any reason for preferring the private(var) clause in OpenMP over the local definition of (private) variables, i.e.
int var;
#pragma omp parallel private(var)
{
...
}
vs.
#pragma omp parallel
{
int var;
...
}
Also, I'm wondering what the point of private clauses is then. This question was partially explained in OpenMP: are local variables automatically private?, but I don't like the answer, since even C89 doesn't bar you from defining variables in the middle of functions as long as they are at the beginning of a scope (which is automatically the case when you enter a parallel region). So even for old-fashioned C programmers this shouldn't make any difference.
Should I consider this as syntactic sugar that allows for a "define-variables-in-the-beginning-of-your-function" style as used in the good old days?
By the way: in my opinion the second version also prevents programmers from using private variables after the parallel region in the hope that they may contain something useful, so that's another -1 for the private clause.
But since I'm quite new to OpenMP I don't want to question anything without having a good explanation for it. Thanks in advance for your answers!
It's not just syntactic sugar. One of the goals OpenMP strives for is to leave the serial code unchanged when it is not compiled with OpenMP. Any construct you use as part of a pragma is ignored if you don't compile with OpenMP. This way you can use things like private, firstprivate, collapse, and parallel for without changing your code. Changing the code can affect, for example, how the code is optimized by the compiler.
If you have code like
int i,j;
#pragma omp parallel for private(j)
for(i = 0; i < n; i++) {
for(j = 0; j < n; j++) {
}
}
The only way to do this without private in C89 is to change the code by defining j inside the parallel region, e.g.:
int i, j;
#pragma omp parallel
{
    int j;
    #pragma omp for
    for(i = 0; i < n; i++) {
        for(j = 0; j < n; j++) {
        }
    }
}
Here's a C++ example with firstprivate. Let's say you have a vector that you want to be private. If you use firstprivate you don't have to change your code, but if you declare a private copy inside the parallel region you do change your code, and if you compile that without OpenMP it makes an unnecessary copy:
vector<int> a;
#pragma omp parallel
{
    vector<int> a_private = a;
}
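For contrast, a sketch of the firstprivate version, which leaves the body untouched:
vector<int> a;
#pragma omp parallel firstprivate(a)
{
    // use a as before; each thread works on its own copy of the vector,
    // and without OpenMP this is simply the original a
}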
This logic applies to many other constructs, for example collapse: you could manually fuse a loop, which changes your code, or you can use collapse and have the fusion happen only when compiled with OpenMP.
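A minimal collapse sketch (assuming the two loops are perfectly nested and rectangular, which collapse requires):
// The two loops are fused into a single iteration space only when compiled
// with OpenMP; without it this is just an ordinary nested loop.
#pragma omp parallel for collapse(2)
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        // work on element (i, j)
    }
}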
However, having said all that, in practice I often find that I need to change the code anyway to get the best parallel result, so I usually define everything inside parallel regions and don't use features such as private, firstprivate, or collapse (not to mention that OpenMP implementations in C++ often struggle with non-POD types anyway, so it's often better to do it yourself).

Parallelization of nested loops with OpenMP

I was trying to parallelize the following loop in my code with OpenMP
double pottemp, pot2body;
pot2body = 0.0;
pottemp = 0.0;
#pragma omp parallel for reduction(+:pot2body) private(pottemp) schedule(dynamic)
for(int i = 0; i < nc2; i++)
{
    pottemp = ener2body[i]->calculatePot(ener2body[i]->m_mols);
    pot2body += pottemp;
}
A very important loop inside the function calculatePot has also been parallelized with OpenMP:
CEnergymulti::calculatePot(vector<CMolecule*> m_mols)
{
    ...
    #pragma omp parallel for reduction(+:dev) schedule(dynamic)
    for (int i = 0; i < i_max; i++)
    {
        ...
    }
}
So it seems that my parallelization involves nested loops. When I removed the parallelization of the outermost loop, the program ran much faster than the version with the outermost loop parallelized. The test was performed on 8 cores.
I think this low efficiency of parallelization might be related to the nested loops. Someone suggested using 'collapse' while parallelizing the outermost loop. However, since there is still code between the outermost loop and the inner loop, it was said that 'collapse' cannot be used in this circumstance. Are there any other ways I could make this parallelization more efficient while still using OpenMP?
Thanks a lot.
If i_max is independent of the i in the outer loop, you can try fusing the loops (essentially what collapse does). It's something I do often and it frequently gives me a small boost. I also prefer fusing the loops "by hand" rather than with OpenMP, because Visual Studio only supports OpenMP 2.0, which does not have collapse, and I want my code to work on Windows and Linux.
#pragma omp parallel for reduction(+:pot2body) schedule(dynamic)
for(int n = 0; n < (nc2*i_max); n++) {
    int i = n/i_max; // index of the outer loop
    int j = n%i_max; // index of the inner loop (the i inside calculatePot)
    double pottmp_j = ...
    pot2body += pottmp_j;
}
If i_max depends on the outer index i then this won't work. In that case follow Grizzly's advice. But there is one more thing you can try. OpenMP has overhead; if i_max is too small, using OpenMP could actually be slower. If you add an if clause at the end of the pragma, OpenMP will only run in parallel when the condition is true. Like this:
const int threshold = ... // smallest value for which OpenMP gives a speedup.
#pragma omp parallel for reduction(+:dev) schedule(dynamic) if(i_max > threshold)

OpenMP Performance impact: private directive vs. declaring variable inside for construct

Performance wise, which of the following is more efficient?
Assigning in the master thread and copying the value to all threads:
int i = 0;
#pragma omp parallel for firstprivate(i)
for( ; i < n; i++){
...
}
Declaring and assigning the variable in each thread
#pragma omp parallel for
for(int i = 0; i < n; i++){
...
}
Declaring the variable in the master thread but assigning it in each thread.
int i;
#pragma omp parallel for private(i)
for(i = 0; i < n; i++){
...
}
It may seem a silly question and/or the performance impact may be negligible. But I'm parallelizing a loop that does a small amount of computation and is called a large number of times, so any optimization I can squeeze out of this loop is helpful.
I'm looking for a more low level explanation and how OpenMP handles this.
For example, if parallelizing for a large number of threads, I assume the second implementation would be more efficient, since initializing a variable using xor is far more efficient than copying the variable to all the threads.
There is not much of a difference in terms of performance among the three versions you presented, since each one of them uses #pragma omp parallel for. OpenMP will automatically assign the for iterations to different threads, so the variable i becomes private to each thread and each thread gets a different range of iterations to work with. The loop variable i is made private automatically in order to avoid race conditions when it is updated. Since i will be private on the parallel for anyway, there is no need to put private(i) on the #pragma omp parallel for.
Nevertheless, your first version will produce an error, since OpenMP expects the loop right underneath #pragma omp parallel for to have the following format:
for(init-expr; test-expr; incr-expr)
in order to precompute the range of work.
The for directive places restrictions on the structure of all associated for-loops. Specifically, all associated for-loops must have the following canonical form:
for (init-expr; test-expr; incr-expr) structured-block
(OpenMP Application Program Interface, pp. 39-40)
Edit: I tested your last two versions and inspected the generated assembly. Both versions produce the same assembly, as you can see: version 2 and version 3.