Parallelization of nested loops with OpenMP - C++

I was trying to parallelize the following loop in my code with OpenMP:

double pottemp, pot2body;
pot2body = 0.0;
pottemp = 0.0;

#pragma omp parallel for reduction(+:pot2body) private(pottemp) schedule(dynamic)
for (int i = 0; i < nc2; i++)
{
    pottemp = ener2body[i]->calculatePot(ener2body[i]->m_mols);
    pot2body += pottemp;
}
Inside the function 'calculatePot', an important loop has also been parallelized with OpenMP:

CEnergymulti::calculatePot(vector<CMolecule*> m_mols)
{
    ...
    #pragma omp parallel for reduction(+:dev) schedule(dynamic)
    for (int i = 0; i < i_max; i++)
    {
        ...
    }
}
So my parallelization involves nested loops. When I remove the parallelization of the outermost loop,
the program runs much faster than the version with the outermost loop parallelized. The test was performed on 8 cores.
I think this poor parallel efficiency might be related to the nested loops. Someone suggested using 'collapse' when parallelizing the outermost loop. However, since there is still code between the outermost loop and the inner loop, I was told that 'collapse' cannot be used in this situation. Are there any other ways I could make this parallelization more efficient while still using OpenMP?
Thanks a lot.

If i_max is independent of the i in the outer loop, you can try fusing the loops (essentially what collapse does). It's something I do regularly and it often gives me a small boost. I also prefer fusing the loops "by hand" rather than with OpenMP, because Visual Studio only supports OpenMP 2.0, which does not have collapse, and I want my code to work on both Windows and Linux.
#pragma omp parallel for reduction(+:pot2body) schedule(dynamic)
for (int n = 0; n < (nc2 * i_max); n++) {
    int i = n / i_max;  // index of the outer loop
    int j = n % i_max;  // index of the inner loop
    double pottmp_j = ...
    pot2body += pottmp_j;
}
If i_max depends on the outer index i, then this won't work; in that case follow Grizzly's advice. But there is one more thing you can try. OpenMP has an overhead, so if i_max is too small, using OpenMP could actually be slower. If you add an if clause at the end of the pragma, then OpenMP will only run when the condition is true, like this:
const int threshold = ... // smallest value for which OpenMP gives a speedup.
#pragma omp parallel for reduction(+:dev) schedule(dynamic) if(i_max > threshold)
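Putting the two suggestions together, here is a minimal self-contained sketch of the fused loop with the if clause (the helper calculatePairPot, the threshold value, and the function names are my own placeholders, not code from the question or the answer):

#include <omp.h>

// Hypothetical stand-in for the per-pair energy evaluation from the question.
double calculatePairPot(int i, int j) { return 0.001 * i + 0.002 * j; }

double computeTwoBodyPot(int nc2, int i_max)
{
    double pot2body = 0.0;
    const int threshold = 1000;  // assumed tuning value; measure on your own machine

    // Fuse the nc2 x i_max iteration space into one loop so a single parallel
    // region covers all the work (only valid if i_max does not depend on i).
    #pragma omp parallel for reduction(+:pot2body) schedule(dynamic) \
                             if(nc2 * i_max > threshold)
    for (int n = 0; n < nc2 * i_max; n++) {
        int i = n / i_max;   // index of the outer loop
        int j = n % i_max;   // index of the inner loop
        pot2body += calculatePairPot(i, j);
    }
    return pot2body;
}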

Related

Performance of matrix multiplications remains unchanged with OpenMP in C++

auto t1 = chrono::steady_clock::now();
#pragma omp parallel
{
    for (int i = 0; i < n; i++)
    {
        #pragma omp for collapse(2)
        for (int j = 0; j < n; j++)
        {
            for (int k = 0; k < n; k++)
            {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
auto t2 = chrono::steady_clock::now();
auto t = std::chrono::duration_cast<chrono::microseconds>(t2 - t1).count();
With and without the parallelization, the variable t remains fairly constant. I am not sure why this is happening. Also, once in a while t is output as 0.
Another problem I am facing is that if I increase the value of n to something like 500, the program fails to run. (Here I've taken n=100.)
I am using Code::Blocks with the GNU GCC compiler.
The proposed OpenMP parallelization is not correct and may lead to wrong results. When collapse(2) is specified, threads execute the (j,k) iterations "simultaneously". If two (or more) threads work on the same j but different k, they accumulate the result of A[i][k]*B[k][j] into the same array location C[i][j]. This is a so-called race condition, i.e. "two or more threads can access shared data and they try to change it at the same time" (What is a race condition?). Even though the code is not valid OpenMP, the data race does not necessarily lead to wrong results: whether it does depends on several factors (scheduling, compiler implementation, number of threads, ...). To fix the problem in the code above, OpenMP offers the reduction clause:
#pragma omp parallel
{
    for (int i = 0; i < n; i++) {
        #pragma omp for collapse(2) reduction(+:C)
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
so that "a private copy is created in each implicit task (...) and is initialized with the initializer value of the reduction-identifier. After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction-identifier" (http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf). Note that the reduction on arrays in C is directly supported by the standard since OpenMP 4.5 (check if the compiler support it, otherwise there are old manual ways to achieve it, Reducing on array in OpenMp).
However, for the given code, it is probably better not to parallelize the innermost loops at all, so that the reduction is not needed:
#pragma omp parallel
{
    #pragma omp for collapse(2)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
The serial version can be faster than the OpenMP version for small matrix sizes and/or a small number of threads.
On my Intel machine (up to 16 cores, n=1000, GNU compiler v6.1), the break-even point is around 4 cores when the -O3 optimization is activated, while it is around 2 cores when compiling with -O0. For clarity, here are the timings I measured (in microseconds):
Serial        418020

             WRONG ORIG    +REDUCTION    OUTER.COLLAPSE    OUTER.NOCOLLAPSE
OpenMP-1        1924950       2841993           1450686             1455989
OpenMP-2         988743       2446098            747333              745830
OpenMP-4         515266       3182262            396524              387671
OpenMP-8         280285       5510023            219506              211913
OpenMP-16       2227567      10807828            150277              123368
With the reduction, the performance loss is dramatic (reversed speed-up). The outer parallelization (with or without collapse) is the best option.
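For reference, the OUTER.NOCOLLAPSE variant in the table corresponds to parallelizing only the outermost loop; a minimal reconstruction (mine, not necessarily the exact code that was benchmarked) looks like:

#pragma omp parallel for
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];

Here each thread owns a distinct set of rows i, so no two threads ever write to the same C[i][j] and no reduction is needed.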
As for your failure with large matrices, a possible reason is the size of the available stack. Try enlarging both the system and OpenMP stack sizes, i.e.
ulimit -s unlimited
export OMP_STACKSIZE=10000000
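Alternatively (my suggestion, not part of the original answer), allocating the matrices on the heap avoids the stack limit altogether, for example with flat std::vector storage:

#include <vector>

void multiply(int n)
{
    // Flat, heap-allocated n x n matrices indexed as M[i*n + j]; the element
    // type double and the initial values are placeholders.
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}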
The collapse directive may actually be responsible for this, because the index j is recreated using divide/mod operations.
Did you try without collapse?

Situations faced in OpenMP on for() loops

I'm using OpenMP for this and I'm not confident of my answer either. I really need your help with this. I've been wondering which method (serial or parallel) runs faster here. My #pragma directives (currently commented out) are shown below.
Triangle Triangle::t_ID_lookup(Triangle a[], int ID, int n)
{
    Triangle res;
    int i;

    //#pragma omp for schedule(static) ordered
    for (i = 0; i < n; i++)
    {
        if (ID == a[i].t_ID)
        {
            //#pragma omp ordered
            return (res = a[i]); // <- changed into "res = a[i]" instead of "return(...)"
        }
    }
    return res;
}
It depends on n. If n is small, then the overhead required for the OMP threads makes the OMP version slower. This can be overcome by adding an if clause: #pragma omp parallel if (n > YourThreshold).
If the a[i].t_ID values are not unique, then you may receive different results from the same data when using OMP.
If your function does more than just a single comparison, consider adding a shared flag variable indicating that the item was found, so that a check like if(found) continue; can be added at the beginning of the loop.
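A minimal sketch of that shared-flag idea (my own illustration; the Triangle stand-in, the threshold, and the atomic flag are assumptions, not code from the question):

#include <atomic>

struct Triangle { int t_ID; /* ... */ };   // stand-in for the question's type

Triangle t_ID_lookup_parallel(Triangle a[], int ID, int n)
{
    Triangle res{};
    std::atomic<bool> found{false};        // shared flag, safe to read and write concurrently

    #pragma omp parallel for schedule(static) if(n > 100000)   // threshold is a guess
    for (int i = 0; i < n; i++)
    {
        if (found.load(std::memory_order_relaxed))
            continue;                      // skip the remaining work once a match is known
        if (ID == a[i].t_ID)
        {
            #pragma omp critical
            res = a[i];                    // protect the shared result
            found.store(true, std::memory_order_relaxed);
        }
    }
    return res;
}

Note that this is not a true early exit (break is not allowed inside an OpenMP worksharing loop); it only makes the remaining iterations cheap.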
I have no experience with ordered, so if that was the crux of your question, ignore all the above and consider this answer.
Profile. In the end, there is no better answer.
If you still want a theoretical answer: a lookup of a randomly positioned ID is O(n) with a mean of n/2 comparisons, while the OMP version needs roughly n/k comparisons per thread, where k is the number of threads/cores, not counting the overhead.
For an alternative way of writing your loop, see Z Boson's answer to a different question.

Parallelize some nested for loops in OpenMP C++

My serial code for the convolution between a matrix and a kernel works like this:
int index1, index2, a, b;
for (int x = 0; x < rows; ++x) {
    for (int y = 0; y < columns; ++y) {
        for (int i = 0; i < krows; ++i) {
            for (int j = 0; j < kcolumns; ++j) {
                a = x + i - krows/2;
                b = y + j - kcolumns/2;

                if (a < 0)
                    index1 = rows + a;
                else if (a > rows - 1)
                    index1 = a - rows;
                else
                    index1 = a;

                if (b < 0)
                    index2 = columns + b;
                else if (b > columns - 1)
                    index2 = b - columns;
                else
                    index2 = b;

                output[x*columns + y] += input[index1*columns + index2] * kernel[i*kcolumns + j];
            }
        }
    }
}
The convolution uses cyclic (wrap-around) treatment of the borders. Now I want to parallelize the code with OpenMP. I thought about collapsing the first two for loops into a single one and using this syntax:
#pragma omp parallel
#pragma omp for private(x, y, a, b, index1, index2)
for (int z = 0; z < rows*columns; z++) {
    x = z / columns;
    y = z % columns;
    ...
I see that parallelizing like this reduces the CPU time, but I'm not an OpenMP expert, so I was wondering whether there are other, more efficient solutions. I don't think it is a good idea to also parallelize the other two nested loops.
With an input matrix of dimensions 1000*10000 and a square kernel matrix 9*9 I obtain these times:
4823 ms for 1 thread
2696 ms for 2 threads
2513 ms for 4 threads.
I hope someone can give me some useful suggestions. What about the for reduction syntax?
My suggestion is to change the approach altogether. If you are using cyclic treatment for the borders (i.e. your problem is periodic), the fast way to do this is an FFT-based spectral approach:
- Fourier transform the matrix and the kernel
- compute their product
- inverse Fourier transform the product (this gives you the convolution)
This is (1) much more efficient (unless the dimensions of the kernel are much smaller than those of the matrix) and (2) it lets you use an FFT library that supports multithreading (like FFTW) and have it deal with the parallelism.
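For illustration only, here is a rough sketch of how such an FFT-based cyclic convolution could look with FFTW's real-to-complex interface (the function name, the assumption that the kernel has already been zero-padded and circularly shifted to rows x columns, and the normalization are mine, not part of the answer):

#include <fftw3.h>
#include <complex>
#include <vector>

std::vector<double> fftConvolve(std::vector<double> input,
                                std::vector<double> kernel,
                                int rows, int columns)
{
    const int nc = columns / 2 + 1;                  // width of the r2c spectrum
    std::vector<std::complex<double>> In(rows * nc), Kn(rows * nc);
    std::vector<double> out(rows * columns);

    fftw_plan pi = fftw_plan_dft_r2c_2d(rows, columns, input.data(),
                       reinterpret_cast<fftw_complex*>(In.data()), FFTW_ESTIMATE);
    fftw_plan pk = fftw_plan_dft_r2c_2d(rows, columns, kernel.data(),
                       reinterpret_cast<fftw_complex*>(Kn.data()), FFTW_ESTIMATE);
    fftw_execute(pi);
    fftw_execute(pk);

    // Pointwise product in frequency space; FFTW transforms are unnormalized,
    // so divide once by rows*columns.
    for (int i = 0; i < rows * nc; ++i)
        In[i] *= Kn[i] / double(rows * columns);

    fftw_plan po = fftw_plan_dft_c2r_2d(rows, columns,
                       reinterpret_cast<fftw_complex*>(In.data()),
                       out.data(), FFTW_ESTIMATE);
    fftw_execute(po);

    fftw_destroy_plan(pi);
    fftw_destroy_plan(pk);
    fftw_destroy_plan(po);
    return out;
}

FFTW's own multithreading (fftw_init_threads and fftw_plan_with_nthreads) can then be enabled instead of writing OpenMP loops by hand.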
You don't need to change the for loops. You can make each thread iterate through all rows of a column or through all columns of a row. Also, bear in mind that if the number of threads is higher than the number of physical cores, the performance won't change much.
OpenMP already takes care of the number of threads it should create, using the logical core count. This might be a problem on Intel i3 and i7 CPUs, since they have hyperthreading and thus the performance gain per extra thread won't be big.
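If you want to override that default, the thread count can be set explicitly; a small sketch (the halving heuristic for 2-way hyperthreading is my assumption):

#include <omp.h>
#include <cstdio>

int main()
{
    // Roughly one thread per physical core on a 2-way hyperthreaded CPU.
    omp_set_num_threads(omp_get_num_procs() / 2);

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}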
In summary, you can do either:
#pragma omp parallel for private(a, b, index1, index2)
for (int x = 0; x < rows; ++x) {
    for (int y = 0; y < columns; ++y) {
        // ...
    }
}
Or:
for (int x = 0; x < rows; ++x) {
    #pragma omp parallel for private(a, b, index1, index2)
    for (int y = 0; y < columns; ++y) {
        // ...
    }
}
If you are using OpenMP 3.0 or greater you may exploit the collapse clause of the loop work-sharing construct:
The collapse clause may be used to specify how many loops are
associated with the loop construct. The parameter of the collapse
clause must be a constant positive integer expression. If no collapse
clause is present, the only loop that is associated with the loop
construct is the one that immediately follows the loop directive
This means that you may write the following:
#pragma omp parallel for collapse(2)
for (int x = 0; x < rows; ++x) {
    for (int y = 0; y < columns; ++y) {
        /* Work here */
    }
}
and obtain exactly the same result as your linearized loop:
#pragma omp parallel for
for (int z = 0; z < rows*columns; z++) {
    x = z / columns;
    y = z % columns;
    /* Work here */
}
As you can see, with the collapse clause no modification of your serial code is needed, and you can easily experiment with collapsing more loops by changing the number in the clause.

OpenMP Performance impact: private directive vs. declaring variable inside for construct

Performance-wise, which of the following is more efficient?
Assigning in the master thread and copying the value to all threads:
int i = 0;
#pragma omp parallel for firstprivate(i)
for( ; i < n; i++){
...
}
Declaring and assigning the variable in each thread
#pragma omp parallel for
for(int i = 0; i < n; i++){
...
}
Declaring the variable in the master thread but assigning it in each thread.
int i;
#pragma omp parallel for private(i)
for(i = 0; i < n; i++){
...
}
It may seem a silly question and/or the performance impact may be negligible. But I'm parallelizing a loop that does a small amount of computation and is called a large number of times, so any optimization I can squeeze out of this loop is helpful.
I'm looking for a more low level explanation and how OpenMP handles this.
For example, when parallelizing for a large number of threads, I assume the second implementation would be more efficient, since initializing a variable (e.g. with an xor) is far cheaper than copying its value to all the threads.
There is not much difference in performance among the three versions you presented, since each one of them uses #pragma omp parallel for. Hence, OpenMP automatically assigns the for iterations to different threads. The loop variable i becomes private to each thread, and each thread works on a different range of iterations. The variable i is automatically made private in order to avoid race conditions when it is updated. Since i will be private in the parallel for anyway, there is no need to put private(i) on the #pragma omp parallel for.
Nevertheless, your first version will produce an error, since OpenMP expects the loop right underneath the #pragma omp parallel for to have the canonical form

for (init-expr; test-expr; incr-expr)

in order to precompute the range of work.
The for directive places restrictions on the structure of all
associated for-loops. Specifically, all associated for-loops must
have the following canonical form:
for (init-expr; test-expr; incr-expr) structured-block (OpenMP Application Program Interface, pp. 39-40).
Edit: I tested your last two versions and inspected the generated assembly. Both versions produce the same assembly, as you can see for version 2 and version 3.

C++ OpenMP directives for a parallel for loop?

I am trying OpenMP on a particular code snippet. I'm not sure if the snippet needs a revamp; perhaps it is set up too rigidly for a sequential implementation. Anyway, here is the (pseudo-)code that I'm trying to parallelize:
#pragma omp parallel for private(id, local_info, current_local_cell_id, local_subdomain_size) shared(cells, current_global_cell_id, global_id)
for (id = 0; id < grid_size; ++id) {
    local_info = cells.get_local_subdomain_info(id);
    local_subdomain_size = local_info.size();
    ...do other stuff...
    do {
        current_local_cell_id = cells.get_subdomain_cell_id(id);
        global_id.set(id, current_global_cell_id + current_local_cell_id);
    } while (id < local_subdomain_size && ++id);
    current_global_cell_id += local_subdomain_size;
}
This makes complete sense (after staring at it for some time) in a sequential setting, which might also mean that it needs to be rewritten for OpenMP. My concern is that current_local_cell_id and local_subdomain_size are private, but current_global_cell_id and global_id are shared.
Hence the statement current_global_cell_id += local_subdomain_size after the inner loop:
do {
...
} while(...)
current_global_cell_id += local_subdomain_size;
might lead to errors in the OpenMP setting, I suspect. I would greatly appreciate it if any OpenMP experts out there could give me some pointers on special OMP directives I can use to make minimal changes to the code while still taking advantage of OpenMP for this type of for loop.
I'm not sure I understand your code. However, I think you really want some kind of parallel accumulation.
You could use a pattern like
size_t total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < MAXITEMS; i++)
{
    total += getvalue(i); // TODO replace with your logic
}
// total has been 'magically' combined by OMP
On a related note, when you use gcc you can just use the __gnu_parallel::accumulate drop-in replacement for std::accumulate, which does exactly the same thing. See Chapter 18, Parallel Mode.
size_t total = __gnu_parallel::accumulate(c.begin(), c.end(), 0, &myvalue_accum);
You can even compile with -D_GLIBCXX_PARALLEL, which will make all uses of the std algorithms automatically parallelized where possible. Don't use that unless you know what you're doing! Performance frequently just suffers, and the chance of introducing bugs due to unexpected parallelism is real.
Changing id inside the loop is not correct. There is no way to dispatch the loop iterations to different threads, as the loop step does not produce a predictable id value.
Why are you using the id inside that do-while loop?