OpenMP parallel for inside do-while - c++

I am trying to refactor an OpenMP-based program and have run into a serious scalability issue. The following (admittedly not very meaningful) OpenMP program seems to reproduce the problem. Of course, the tiny sample code could be rewritten as a nested for-loop, and with collapse(2) almost perfect scalability can be achieved; however, the original program I am working on does not allow that.
Therefore, I am looking for a fix that keeps the do-while structure. From my understanding, OpenMP should be smart enough to keep the threads alive between iterations, so I expected good scalability. Why is this not the case?
#include <stdio.h>
#include <float.h>

#define MAX(a, b) (((a) > (b)) ? (a) : (b))

int main() {
    const int N = 6000;
    const int MAX_ITER = 2000000;
    double max = DBL_MIN;
    int iter = 0;
    do {
        #pragma omp parallel for reduction(max:max) schedule(static)
        for (int i = 1; i < N; ++i) {
            max = MAX(max, 3.3*i);
        }
        ++iter;
    } while (iter < MAX_ITER);
    printf("max=%f\n", max);
}
I measured the following runtimes with the Cray compiler, version 8.3.4.
OMP_NUM_THREADS=1 : 0m21.535s
OMP_NUM_THREADS=2 : 0m12.191s
OMP_NUM_THREADS=4 : 0m9.610s
OMP_NUM_THREADS=8 : 0m9.767s
OMP_NUM_THREADS=16: 0m13.571s
This seems to be similar to this question. Thanks in advance. Help is appreciated! :)
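For reference, the nested-loop collapse(2) rewrite mentioned above (the one that scales almost perfectly but is not applicable to the real program) would look roughly like this sketch:

#pragma omp parallel for collapse(2) reduction(max:max) schedule(static)
for (int iter = 0; iter < MAX_ITER; ++iter) {
    for (int i = 1; i < N; ++i) {
        max = MAX(max, 3.3*i);
    }
}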

You could go for something like this:
#include <stdio.h>
#include <float.h>
#include <omp.h>

#define MAX( a, b ) ((a)>(b))?(a):(b)

int main() {
    const int N = 6000;
    const int MAX_ITER = 2000000;
    double max = DBL_MIN;

    #pragma omp parallel reduction( max : max )
    {
        int iter = 0;
        int nbth = omp_get_num_threads();
        int tid = omp_get_thread_num();
        int myMaxIter = MAX_ITER / nbth;
        if ( tid < MAX_ITER % nbth ) myMaxIter++;
        int chunk = N / nbth;

        do {
            #pragma omp for schedule(dynamic,chunk) nowait
            for(int i = 1; i < N; ++i) {
                max = MAX(max, 3.3*i);
            }
            ++iter;
        } while(iter < myMaxIter);
    }
    printf("max=%f\n", max);
}
I'm pretty sure scalability should improve noticeably.
NB: I had to come back to this a few times. Since the number of iterations of the outer (do-while) loop can differ between threads, it is crucial that the scheduling of the omp for loop is not static; otherwise there is a potential for deadlock at the last iteration.
I did a few tests and I think that the proposed solution is both safe and effective.
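For completeness, another sketch that keeps the do-while intact is to hoist the parallel region outside the loop, so the same thread team services every pass (this assumes the N, MAX_ITER and MAX definitions from the question and is illustrative, not a tested drop-in):

double max = DBL_MIN;
#pragma omp parallel
{
    int iter = 0;                      // each thread keeps its own counter
    do {
        // worksharing loop with its own reduction; the implicit barrier at
        // its end keeps the threads in lock-step across passes
        #pragma omp for reduction(max:max) schedule(static)
        for (int i = 1; i < N; ++i) {
            max = MAX(max, 3.3*i);
        }
        ++iter;
    } while (iter < MAX_ITER);
}
printf("max=%f\n", max);

Whether this beats re-entering a parallel for two million times depends mainly on how cheap the barrier at the end of each omp for is on the target machine.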

Related

C++ OpenMP: nested loops where the inner iterator depends on the outer one

Consider the following code:
#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <cmath>
#include <omp.h>
using namespace std;
typedef std::chrono::steady_clock myclock;
double measure_time(myclock::time_point begin, myclock::time_point end)
{
    return std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()/(double)1e6;
}

int main()
{
    int n = 20000;
    vector<double> v(n);
    iota(v.begin(), v.end(), 1.5);
    vector< vector<double> > D(n, vector<double>(n,0.0));
    myclock::time_point begin, end;
    begin = myclock::now();

    //#pragma omp parallel for collapse(2)
    //#pragma omp parallel for
    for(size_t i = 0; i < n - 1; i++){
        for(size_t j = i+1; j < n; j++){
            double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
            D[i][j] = d;
            D[j][i] = d;
        }
    }

    end = myclock::now();
    double time = measure_time(begin, end);
    cout<<"Time: "<<time<<" (s)"<<endl;
    return 0;
}
For compiling:
g++ -std=c++11 -fopenmp -o main main.cpp
I obtained the following run times:
With #pragma omp parallel for collapse(2): 7.9425 (s)
With #pragma omp parallel for: 3.73262 (s)
Without OpenMP: 11.0935 (s)
System settings: Linux Mint 18.3 64-bit, g++ 5.4.0, quad-core processor.
I would expect the first to be faster than the second (which parallelizes only the outer loop) and much faster than the third.
What did I do wrong? Both the first and the second versions ran on all 8 threads.
Thank you in advance for your help!
The collapse clause should not be used when the bounds of the inner loop depend on the outer loop. See Understanding the collapse clause in openmp.
In your case you are only running over the upper triangle of the matrix (excluding the diagonal) because of symmetry. This cuts the number of iterations roughly in half. If you want to fuse/collapse the double loop you can do it by hand like this (see the end of this answer for more details).
for(size_t k = 0; k < n*(n-1)/2; k++) {
    size_t i = k/n, j = k%n;
    if(j <= i) i = n - i - 2, j = n - j - 1;
    double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
    D[i][j] = d;
    D[j][i] = d;
}
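With the iteration space flattened by hand, the fused loop can then be parallelized directly; a minimal sketch, assuming the same v, D and n as in the question:

#pragma omp parallel for schedule(static)
for (size_t k = 0; k < (size_t)n*(n-1)/2; k++) {
    size_t i = k/n, j = k%n;
    if (j <= i) { i = n - i - 2; j = n - j - 1; }
    double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
    D[i][j] = d;
    D[j][i] = d;
}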
I think most people assume that collapsing a loop is going to give better performance, but this is often not the case. In my experience, most of the time there is no difference in performance, in some cases it is much worse due to cache issues, and in a few cases it is better. You have to test it yourself.
As to why your code was twice as slow with the collapse clause, I can only guess: since the behaviour is unspecified when the inner loop bounds depend on the outer loop, your OpenMP implementation may have run j over [0,n), i.e. the full matrix rather than half the matrix.

Same program execution time on different thread numbers openMP

I have a C++ program that multiplies two matrices. I have to use OpenMP. This is what I have so far: https://pastebin.com/wn0AXFBG
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include <iostream>
#include <fstream>

using namespace std;

int main()
{
    int n = 1;
    int Matrix1[1000][100];
    int Matrix2[100][2];
    int Matrix3[1000][2];
    int sum = 0;
    ofstream fr("rez.txt");

    double t1 = omp_get_wtime();
    omp_set_num_threads(n);

    #pragma omp parallel for collapse(2) num_threads(n)
    for ( int i = 0; i < 10; i++) {
        for ( int j = 0; j < 10; j++) {
            Matrix1[i][j] = i * j;
        }
    }

    #pragma omp simd
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 2; j++) {
            int t = rand() % 100;
            if (t < 50) Matrix2[i][j] = -1;
            if (t >= 50) Matrix2[i][j] = 1;
        }
    }

    #pragma omp parallel for collapse(3) num_threads(n)
    for (int ci = 0; ci < 1000; ci++) {
        for (int cj = 0; cj < 2; cj++) {
            for (int i = 0; i < 100; i++) {
                if(i==0) Matrix3[ci][cj] = 0;
                Matrix3[ci][cj] += Matrix1[ci][i] * Matrix2[i][cj];
            }
        }
    }

    double t2 = omp_get_wtime();
    double time = t2 - t1;
    fr << time;
    return 0;
}
The problem is that I get the same execution times whether I use 1 thread or 8. Pictures of timing added.
I have to show that the time is reduced by a factor of nearly 8. I am using the Intel C++ compiler with OpenMP enabled. Please advise.
First of all, I think there is a small bug in your program where you initialize the entries of Matrix1 as Matrix1[i][j] = i * j: i and j do not go up to 1000 and 100, respectively.
Also, I am not sure whether your computer actually has 8 logical cores. If it does not, the 8 threads will be context-switched across fewer logical cores, which brings performance down and increases execution time. So check how many logical cores are actually available and pass at most that number to num_threads().
Now coming to the question: the collapse clause fuses all the loops into one and schedules that fused loop across the threads. I am not sure how it handles the race condition here, but if you parallelize the innermost loop without fusing all 3 loops, there is a race condition, since each thread will try to update Matrix3[ci][cj] concurrently; some synchronization mechanism, e.g. an atomic update or a reduction clause, is needed to ensure correctness (see the sketch below).
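For illustration only, a hypothetical variant that parallelizes just the innermost loop would need a reduction on the running sum to stay race-free (this is a sketch of the synchronization idea, not the recommended change):

for (int ci = 0; ci < 1000; ci++) {
    for (int cj = 0; cj < 2; cj++) {
        int sum = 0;
        // each thread accumulates a private partial sum; OpenMP combines them at the end
        #pragma omp parallel for reduction(+:sum) num_threads(n)
        for (int i = 0; i < 100; i++) {
            sum += Matrix1[ci][i] * Matrix2[i][cj];
        }
        Matrix3[ci][cj] = sum;
    }
}

In practice the overhead of opening a parallel region 2000 times for only 100 iterations each would dominate.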
I am pretty sure that you can parallelize the outer loop without any race condition and get a speedup close to the number of threads you employ (again, as long as that number is less than or equal to the number of logical cores). I would suggest changing that segment of your code as below.
// You can also use this function to set the number of threads:
// omp_set_num_threads(n);
#pragma omp parallel for num_threads(n)
for (int ci = 0; ci < 1000; ci++) {
    for (int cj = 0; cj < 2; cj++) {
        for (int i = 0; i < 100; i++) {
            if(i==0) Matrix3[ci][cj] = 0;
            Matrix3[ci][cj] += Matrix1[ci][i] * Matrix2[i][cj];
        }
    }
}
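For completeness, the initialization bug mentioned at the start would be fixed by running the first loop nest over the full dimensions of Matrix1; a small sketch:

#pragma omp parallel for collapse(2) num_threads(n)
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 100; j++) {
        Matrix1[i][j] = i * j;
    }
}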

C++ OpenMP: Writing to a matrix inside of for loop slows down the for loop significantly

I have the following code. The bitCount function simply counts the number of bits set in a 64-bit integer. The test function is an example of something similar I am doing in a more complicated piece of code; it replicates how writing to a matrix slows down the performance of the for loop significantly, and I am trying to figure out why it does so and whether there is any solution to it.
#include <vector>
#include <cmath>
#include <cstdint>   // for uint64_t
#include <omp.h>

// Count the number of set bits
inline int bitCount(uint64_t n){
    int count = 0;
    while(n){
        n &= (n-1);   // clear the lowest set bit
        count++;
    }
    return count;
}
void test(){
    int nthreads = omp_get_max_threads();
    omp_set_dynamic(0);
    omp_set_num_threads(nthreads);

    // I need a priority queue per thread
    std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000,-INFINITY));
    std::vector<uint64_t> vals(100,1);

    # pragma omp parallel for shared(mat,vals)
    for(int i = 0; i < 100000000; i++){
        std::vector<double> &tid_vec = mat[omp_get_thread_num()];
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            tid_vec[j] = total_count; // if I comment out this line, performance increases drastically
        }
    }
}
This code runs in about 11 seconds. If I comment out the following line:
tid_vec[j] = total_count;
the code runs in about 2 seconds. Is there a reason why writing to a matrix in my case costs so much in performance?
Since you said nothing about your compiler/system specs, I'm assuming you are compiling with GCC and flags -O2 -fopenmp.
If you comment the line:
tid_vec[j] = total_count;
The compiler will optimize away all the computations whose result is not used. Therefore:
total_count += bitCount(vals[j]);
is optimized away too. If your application's main kernel is not actually used, it makes sense that the program runs much faster.
On the other hand, I would not implement a bit count function myself but rather rely on functionality that is already provided to you. For example, GCC builtin functions include __builtin_popcount (and __builtin_popcountll for 64-bit operands), which does exactly what you are trying to do.
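A minimal sketch of what that could look like for the 64-bit values in the question (assuming GCC or Clang; C++20 alternatively offers std::popcount in <bit>):

#include <cstdint>

// Same interface as the hand-written bitCount, but using the compiler builtin;
// the 'll' variant handles 64-bit operands.
inline int bitCount(uint64_t n) {
    return __builtin_popcountll(n);
}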
As a bonus: it is way better to work on private data than to work on a common array using different array elements. It improves locality (especially important when access to memory is not uniform, a.k.a. NUMA) and may reduce access contention.
# pragma omp parallel shared(mat,vals)
{
    std::vector<double> local_vec(1000,-INFINITY);

    #pragma omp for
    for(int i = 0; i < 100000000; i++) {
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            local_vec[j] = total_count;
        }
    }

    // Copy local_vec to mat[omp_get_thread_num()]
}
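The copy that the final comment alludes to could be done before leaving the parallel region, for example (a sketch; std::copy needs <algorithm>):

    // still inside the parallel region, after the omp for loop
    int tid = omp_get_thread_num();
    std::copy(local_vec.begin(), local_vec.end(), mat[tid].begin());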

No speedup with OpenMP

I am working with OpenMP in order to obtain an algorithm with a near-linear speedup.
Unfortunately I noticed that I could not get the desired speedup.
So, in order to understand the error in my code, I wrote another, simpler program, just to double-check that the speedup was in principle obtainable on my hardware.
This is the toy example I wrote:
#include <omp.h>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <algorithm>
#include "mkl.h"

int main () {
    int number_of_threads = 1;
    int n = 600;
    int m = 50;
    int N = n/number_of_threads;
    int time_limit = 600;
    double total_clock = omp_get_wtime();
    int time_flag = 0;

    #pragma omp parallel num_threads(number_of_threads)
    {
        int thread_id = omp_get_thread_num();
        int iteration_number_local = 0;
        double *C = new double[n];  std::fill(C, C+n, 3.0);
        double *D = new double[n];  std::fill(D, D+n, 3.0);
        double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

        while (time_flag == 0){
            for (int i = 0; i < N; i++)
                for(int z = 0; z < m; z++)
                    for(int x = 0; x < n; x++)
                        for(int c = 0; c < n; c++){
                            CD[c] = C[z]*D[x];
                            C[z] = CD[c] + D[x];
                        }
            iteration_number_local++;
            if ((omp_get_wtime() - total_clock) >= time_limit)
                time_flag = 1;
        }

        #pragma omp critical
        std::cout<<"I am "<<thread_id<<" and I got "<<iteration_number_local<<" iterations."<<std::endl;
    }
}
I want to highlight again that this code is only a toy-example to try to see the speedup: the first for-cycle becomes shorter when the number of parallel threads increases (since N decreases).
However, when I go from 1 to 2-4 threads the number of iterations doubles as expected; this is not the case when I use 8-10-20 threads: the number of iterations does not increase linearly with the number of threads.
Could you please help me with this? Is the code correct? Should I expect a near-linear speedup?
Results
Running the code above I got the following results.
1 thread: 23 iterations.
20 threads: 397-401 iterations per thread (instead of 420-460).
Your measurement methodology is wrong, especially for a small number of iterations.
1 thread: 3 iterations.
3 reported iterations actually means that 2 iterations finished in less than 120 s. The third one took longer. The time of 1 iteration is between 40 and 60 s.
2 threads: 5 iterations per thread (instead of 6).
4 iterations finished in less than 120 s. The time of 1 iteration is between 24 and 30 s.
20 threads: 40-44 iterations per thread (instead of 60).
40 iterations finished in less than 120 s. The time of 1 iteration is between 2.9 and 3 s.
As you can see your results actually do not contradict linear speedup.
It would be much simpler and more accurate to execute and time one single outer loop iteration, and you will likely see almost perfect linear speedup.
Some reasons (non-exhaustive) why you don't see linear speedup are:
Memory-bound performance. Not the case in your toy example with n = 1000. More generally speaking: contention for a shared resource (main memory, caches, I/O).
Synchronization between threads (e.g. critical sections). Not the case in your toy example.
Load imbalance between threads. Not the case in your toy example.
Turbo mode will use lower frequencies when all cores are utilized. This can happen in your toy example.
From your toy example I would say that your approach to OpenMP can be improved by better using the high level abstractions, e.g. for.
More general advice would be too broad for this format and would require more specific information about the non-toy example.
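To illustrate both points (time a single outer sweep and let the for worksharing construct split it across the team), a minimal sketch that reuses n and m from the question and keeps the per-thread scratch arrays:

double t0 = omp_get_wtime();

#pragma omp parallel
{
    // per-thread working arrays, as in the original toy example
    double *C = new double[n];  std::fill(C, C+n, 3.0);
    double *D = new double[n];  std::fill(D, D+n, 3.0);
    double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

    #pragma omp for
    for (int i = 0; i < n; i++)          // full range; OpenMP splits it across threads
        for (int z = 0; z < m; z++)
            for (int x = 0; x < n; x++)
                for (int c = 0; c < n; c++){
                    CD[c] = C[z]*D[x];
                    C[z] = CD[c] + D[x];
                }

    delete[] C; delete[] D; delete[] CD;
}

double elapsed = omp_get_wtime() - t0;
std::cout << "one sweep took " << elapsed << " s" << std::endl;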
You make some declarations inside the parallel region, which means the memory will be allocated and filled number_of_threads times. Instead I recommend:
double *C = new double[n];  std::fill(C, C+n, 3.0);
double *D = new double[n];  std::fill(D, D+n, 3.0);
double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

#pragma omp parallel firstprivate(C,D,CD) num_threads(number_of_threads)
{
    int thread_id = omp_get_thread_num();
    int iteration_number_local = 0;
}
Your hardware has a limited number of hardware threads, which depends on the number of cores of your processor. You may have 2 or 4 cores.
A parallel region by itself doesn't speed up your code. With OpenMP you should use #pragma omp parallel for to speed up a for loop, or
#pragma omp parallel
{
#pragma omp for
{
}
}
This notation is equivalent to #pragma omp parallel for. It will use several threads (depending on your hardware) to process the for loop faster.
Be careful:
#pragma omp parallel
{
for
{
}
}
will run the entire for loop in every thread, which will not speed up your program.
You should try
int iteration_number = 0; // shared total, declared before the parallel region
#pragma omp parallel num_threads(number_of_threads)
{
    int thread_id = omp_get_thread_num();
    int iteration_number_local = 0;
    double *C = new double[n];  std::fill(C, C+n, 3.0);
    double *D = new double[n];  std::fill(D, D+n, 3.0);
    double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

    while (time_flag == 0){
        #pragma omp for
        for (int i = 0; i < N; i++)
            for(int z = 0; z < m; z++)
                for(int x = 0; x < n; x++)
                    for(int c = 0; c < n; c++)
                        CD[c] = C[z]*D[x];
        iteration_number_local++;
        if ((omp_get_wtime() - total_clock) >= time_limit)
            time_flag = 1;
    }

    if(thread_id == 0)
        iteration_number = iteration_number_local;
}
std::cout<<"Iterations= "<<iteration_number<<std::endl;
}

C++ OpenMP code is slower than sequential C++

I have the following piece of code. I run it on a sample of N=3000, and the sequential C++ code is faster by 3 seconds, which is not good at all.
This code fills the array jsd[N] with calculated values, and I want to locate the maximum value and its position.
So:
1- Is this OpenMP conversion correct, and is there any better suggestion to make it more professional?
2- Why is it slower than the equivalent C++ code, and why does it get slower the more threads I create?
Thanks in advance.
double maxval = 0;
int pos = -1;
double jsd[N];

#pragma omp parallel for num_threads(4)
for (int i = 0; i < N; i++) {
    double Hl = obj.function1(sequenceVctr, i, LEFT);
    double Hr = obj.function1(sequenceVctr, i, RIGHT);
    jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);
    if (jsd[i] >= maxval) {
        #pragma omp critical
        {
            maxval = jsd[i];
            pos = i;
        }
    }
} // for
Update: I changed the code as follows, but it is still slow and gets even slower with more threads.
double maxval = 0;
int pos = -1;
double jsd[N];

#pragma omp parallel num_threads(50)
for (int i = 0; i < N; i++) {
    double Hl = obj.function1(sequenceVctr, i, LEFT);
    double Hr = obj.function1(sequenceVctr, i, RIGHT);
    jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);
} // for

#pragma omp master
{
    vector<double> jsd2 (jsd,jsd+N);
    vector<double>::iterator jsditer;
    jsditer = std::max_element(jsd2.begin(), jsd2.end());
    maxval = *jsditer;
    pos = std::distance(jsd2.begin(), jsditer);
    // cout<<"pos"<<pos<<endl;
}
#pragma omp barrier
The first optimization I would suggest is to first compute all jsd values in the loop, then find the maximum element via std::max_element().
This way you are not forcing the threads to synchronise.
The second thing I would do is move over to Intel TBB instead of OpenMP and use parallel_reduce().
But the biggest question is: how complex are the objective functions you are evaluating?
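A minimal sketch of that first suggestion, reusing the names from the posted code (obj, sequenceVctr, H, LEFT and RIGHT are taken from the question and are not defined here):

std::vector<double> jsd(N);

// Phase 1: fill jsd in parallel; no synchronisation is needed inside the loop
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    double Hl = obj.function1(sequenceVctr, i, LEFT);
    double Hr = obj.function1(sequenceVctr, i, RIGHT);
    jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);
}

// Phase 2: find the maximum and its position sequentially
std::vector<double>::iterator it = std::max_element(jsd.begin(), jsd.end());
double maxval = *it;
int pos = static_cast<int>(std::distance(jsd.begin(), it));

If function1 and function2 are very cheap, the loop body may simply be too light for the threading overhead to pay off, which is why the cost of those functions is the key question.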