I am working with OpenMP, trying to get an algorithm with near-linear speedup.
Unfortunately I noticed that I could not get the desired speedup.
So, in order to understand the error in my code, I wrote a second, much simpler program, just to double-check that the speedup was obtainable in principle on my hardware.
This is the toy example I wrote:
#include <omp.h>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <algorithm>
#include "mkl.h"
int main () {
    int number_of_threads = 1;
    int n = 600;
    int m = 50;
    int N = n/number_of_threads;
    int time_limit = 600;
    double total_clock = omp_get_wtime();
    int time_flag = 0;

    #pragma omp parallel num_threads(number_of_threads)
    {
        int thread_id = omp_get_thread_num();
        int iteration_number_local = 0;
        double *C = new double[n]; std::fill(C, C+n, 3.0);
        double *D = new double[n]; std::fill(D, D+n, 3.0);
        double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

        while (time_flag == 0){
            for (int i = 0; i < N; i++)
                for(int z = 0; z < m; z++)
                    for(int x = 0; x < n; x++)
                        for(int c = 0; c < n; c++){
                            CD[c] = C[z]*D[x];
                            C[z] = CD[c] + D[x];
                        }
            iteration_number_local++;
            if ((omp_get_wtime() - total_clock) >= time_limit)
                time_flag = 1;
        }
        #pragma omp critical
        std::cout<<"I am "<<thread_id<<" and I got "<<iteration_number_local<<" iterations."<<std::endl;
    }
}
I want to highlight again that this code is only a toy example to check whether the speedup is there: the first for loop becomes shorter as the number of parallel threads increases (since N decreases).
When I go from 1 to 2 or 4 threads, the number of iterations per thread doubles as expected; but this is not the case when I use 8, 10 or 20 threads: the number of iterations no longer increases linearly with the number of threads.
Could you please help me with this? Is the code correct? Should I expect a near-linear speedup?
Results
Running the code above I got the following results.
1 thread: 23 iterations.
20 threads: 397-401 iterations per thread (instead of 420-460).
Your measurement methodology is flawed, especially for a small number of iterations.
1 thread: 3 iterations.
3 reported iterations means that 2 iterations finished in less than 120 s and the third took longer, so the time of one iteration is between 40 and 60 s.
2 threads: 5 iterations per thread (instead of 6).
4 iterations finished in less than 120 s, so the time of one iteration is between 24 and 30 s.
20 threads: 40-44 iterations per thread (instead of 60).
40 iterations finished in less than 120 s, so the time of one iteration is between 2.9 and 3 s.
As you can see, your results do not actually contradict a linear speedup.
It would be much simpler and more accurate to execute and time one single outer loop; you would then likely see an almost perfect linear speedup.
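For example, a minimal sketch of that measurement, reusing the variable names from your toy example (the thread count is hard-coded here only for brevity), could look like this:
#include <omp.h>
#include <algorithm>
#include <iostream>

int main() {
    const int number_of_threads = 4;      // vary this and compare the measured times
    const int n = 600, m = 50;
    const int N = n / number_of_threads;  // per-thread share of the outer loop

    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(number_of_threads)
    {
        double *C  = new double[n]; std::fill(C,  C  + n, 3.0);
        double *D  = new double[n]; std::fill(D,  D  + n, 3.0);
        double *CD = new double[n]; std::fill(CD, CD + n, 0.0);

        // One single pass of the original outer loop per thread, nothing else.
        for (int i = 0; i < N; i++)
            for (int z = 0; z < m; z++)
                for (int x = 0; x < n; x++)
                    for (int c = 0; c < n; c++) {
                        CD[c] = C[z] * D[x];
                        C[z]  = CD[c] + D[x];
                    }

        delete[] C; delete[] D; delete[] CD;
    }
    double elapsed = omp_get_wtime() - t0;
    std::cout << "One outer pass with " << number_of_threads
              << " threads took " << elapsed << " s" << std::endl;
}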
Some (non-exhaustive) reasons why you might not see linear speedup are:
Memory-bound performance. Not the case in your toy example. More generally: contention for a shared resource (main memory, caches, I/O).
Synchronization between threads (e.g. critical sections). Not the case in your toy example.
Load imbalance between threads. Not the case in your toy example.
Turbo mode, which uses lower frequencies when all cores are utilized. This can happen in your toy example.
From your toy example I would say that your use of OpenMP could be improved by making better use of its high-level abstractions, e.g. the for worksharing construct.
More general advice would be too broad for this format and would require more specific information about the non-toy example.
You make some declarations inside the parallel region, which means the memory will be allocated and filled number_of_threads times. Instead I recommend:
double *C = new double[n]; std::fill(C, C+n, 3.0);
double *D = new double[n]; std::fill(D, D+n, 3.0);
double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

#pragma omp parallel firstprivate(C,D,CD) num_threads(number_of_threads)
{
    int thread_id = omp_get_thread_num();
    int iteration_number_local = 0;
}
Your hardware has a limited number of hardware threads, which depends on the number of cores in your processor. You may only have 2 or 4 cores.
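If you are unsure how much hardware parallelism you actually have, a quick check (just a sketch) is:
#include <omp.h>
#include <cstdio>

int main() {
    // Number of logical processors the OpenMP runtime can see.
    printf("Logical processors: %d\n", omp_get_num_procs());
    // Default team size a parallel region would use.
    printf("Default max threads: %d\n", omp_get_max_threads());
}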
A parallel region by itself doesn't speed up your code. With OpenMP you should use #pragma omp parallel for to speed up a for loop, or
#pragma omp parallel
{
    #pragma omp for
    for (...)
    {
    }
}
This notation is equivalent to #pragma omp parallel for. It will use several threads (the number depends on your hardware) to process the loop faster.
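As a self-contained illustration (not taken from your program), a complete minimal example of the combined construct might be:
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> a(1000000);
    // The iterations are split among the threads; each index is written by
    // exactly one thread, so no synchronization is needed.
    #pragma omp parallel for
    for (int i = 0; i < (int)a.size(); ++i)
        a[i] = 2.0 * i;
    printf("a[42] = %f\n", a[42]);
}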
Be careful:
#pragma omp parallel
{
    for (...)
    {
    }
}
will execute the entire for loop in every thread, which will not speed up your program.
You should try
int iteration_number = 0;

#pragma omp parallel num_threads(number_of_threads)
{
    int thread_id = omp_get_thread_num();
    int iteration_number_local = 0;
    double *C = new double[n]; std::fill(C, C+n, 3.0);
    double *D = new double[n]; std::fill(D, D+n, 3.0);
    double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

    while (time_flag == 0){
        #pragma omp for
        for (int i = 0; i < N; i++)
            for(int z = 0; z < m; z++)
                for(int x = 0; x < n; x++)
                    for(int c = 0; c < n; c++)
                        CD[c] = C[z]*D[x];
        iteration_number_local++;
        if ((omp_get_wtime() - total_clock) >= time_limit)
            time_flag = 1;
    }
    if(thread_id == 0)
        iteration_number = iteration_number_local;
}
std::cout<<"Iterations= "<<iteration_number<<std::endl;
Related
I am trying to parallelize a loop of 50 million iterations with several thread counts: first 1, then 4, 8 and 16. Below is the code implementing this.
#include <iostream>
#include <omp.h>

using namespace std;

void someFoo();

int main() {
    someFoo();
}

void someFoo() {
    long sum = 0;
    int numOfThreads[] = {1, 4, 8, 16};
    for(int j = 0; j < sizeof(numOfThreads) / sizeof(int); j++) {
        omp_set_num_threads(numOfThreads[j]);
        start = omp_get_wtime();
        #pragma omp parallel for
        for(int i = 0; i<50000000; i++) {
            sum += i * 10;
        }
        #pragma omp end parallel
        end = omp_get_wtime();
        cout << "Result: " << sum << ". Spent time: " << (end - start) << "\n";
    }
}
It is expected that with 4 threads the program will run faster than with 1, with 8 threads faster than with 4, and with 16 threads faster than with 8, but in practice this is not the case: the run times vary chaotically and there is almost no difference. Also, the task manager does not show the program as parallelized. I have a computer with 8 logical processors and 4 cores.
Please tell me where I made a mistake and how to properly parallelize the loop over N threads.
There is a race condition in your code because sum is read and written from multiple threads at the same time, which causes wrong results. You can fix this with a reduction, using the directive #pragma omp parallel for reduction(+:sum). Note that OpenMP does not check whether your loop can be parallelized; that is your responsibility.
Additionally, the parallel computation might be slower than the sequential one, since a clever compiler can see that sum = 50000000*(50000000-1)/2*10 = 12499999750000000 (AFAIK, Clang does that). As a result, the benchmark is certainly flawed. Note also that this value is bigger than what a 32-bit long can hold (long is 32 bits on Windows, for example), so there is likely an overflow in your code.
Moreover, AFAIK, there is no such directive as #pragma omp end parallel.
Finally, note that you can control the number of threads using the OMP_NUM_THREADS environment variable, which is generally more convenient than setting it in your application (hard-wiring a given number of threads in the application code is generally not a good idea, even for benchmarks).
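Putting these points together, a minimal sketch of the corrected benchmark loop could look like this (keeping in mind the caveat above that a clever compiler may still evaluate the sum at compile time):
#include <omp.h>
#include <cstdio>

int main() {
    long long sum = 0;  // long long avoids the overflow of a 32-bit long
    double start = omp_get_wtime();
    // The reduction gives every thread a private copy of sum and combines
    // them at the end, removing the race condition.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 50000000; i++) {
        sum += (long long)i * 10;
    }
    double end = omp_get_wtime();
    printf("Result: %lld. Spent time: %f\n", sum, end - start);
}
Run it, for example, with OMP_NUM_THREADS=8 ./a.out to try different thread counts without recompiling.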
Please tell me where I made a mistake and how to properly parallelize the loop in N threads.
First you need to fix some of the compiler issues in your code example, like removing pragmas such as #pragma omp end parallel, declaring the variables correctly, and so on. Second, you need to fix the race condition on the update of the variable sum: that variable is shared among threads and updated concurrently. The easiest way is to use the reduction clause of OpenMP; your code would then look like the following:
#include <stdio.h>
#include <omp.h>

void someFoo();

int main() {
    someFoo();
}

void someFoo() {
    int numOfThreads[] = {1, 4, 8, 16};
    for(int j = 0; j < sizeof(numOfThreads) / sizeof(int); j++) {
        omp_set_num_threads(numOfThreads[j]);
        double start = omp_get_wtime();
        double sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for(int i = 0; i<50000000; i++) {
            sum += i * 10;
        }
        double end = omp_get_wtime();
        printf("Result: '%f' : '%f'\n", sum, (end - start));
    }
}
With that you should get some speedup when running on multiple cores.
NOTE: To solve the overflow first mentioned by @Jérôme Richard, I changed the sum variable from long to double.
I have a C++ program that multiplies 2 matrices. I have to use OpenMP. This is what I have so far: https://pastebin.com/wn0AXFBG
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include <iostream>
#include <fstream>

using namespace std;

int main()
{
    int n = 1;
    int Matrix1[1000][100];
    int Matrix2[100][2];
    int Matrix3[1000][2];
    int sum = 0;
    ofstream fr("rez.txt");

    double t1 = omp_get_wtime();
    omp_set_num_threads(n);

    #pragma omp parallel for collapse(2) num_threads(n)
    for ( int i = 0; i < 10; i++) {
        for ( int j = 0; j < 10; j++) {
            Matrix1[i][j] = i * j;
        }
    }

    #pragma omp simd
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 2; j++) {
            int t = rand() % 100;
            if (t < 50) Matrix2[i][j] = -1;
            if (t >= 50) Matrix2[i][j] = 1;
        }
    }

    #pragma omp parallel for collapse(3) num_threads(n)
    for (int ci = 0; ci < 1000; ci++) {
        for (int cj = 0; cj < 2; cj++) {
            for (int i = 0; i < 100; i++) {
                if(i==0) Matrix3[ci][cj] = 0;
                Matrix3[ci][cj] += Matrix1[ci][i] * Matrix2[i][cj];
            }
        }
    }

    double t2 = omp_get_wtime();
    double time = t2 - t1;
    fr << time;
    return 0;
}
The problem is that I get the same execution times whether I use 1 thread or 8.
I have to show that the time is reduced by nearly a factor of 8. I am using the Intel C++ compiler with OpenMP enabled. Please advise.
First of all, I think there is a small bug in your program where you initialize the entries of Matrix1 as Matrix1[i][j] = i * j: the indices i and j only go up to 10, not up to 1000 and 100 respectively.
Also, I am not sure whether your computer actually has 8 logical cores. If it does not, the 8 threads you create will be context-switched on fewer logical cores, which brings performance down and increases the execution time. So check how many logical cores are actually available and pass at most that number to num_threads().
Now, coming to the question: the collapse clause fuses the loops into one and distributes the fused iteration space among the threads. In your code, collapsing all three loops means different threads can update the same Matrix3[ci][cj], which is the same race condition you would get by parallelizing only the innermost loop, and some kind of synchronization mechanism (for example an atomic update or a reduction clause) would be needed to ensure correctness.
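For illustration, parallelizing only the innermost loop would need something like the sketch below, which reuses Matrix1, Matrix2, Matrix3 and n from your code; the outer-loop version suggested right after is the better option here.
for (int ci = 0; ci < 1000; ci++) {
    for (int cj = 0; cj < 2; cj++) {
        int acc = 0;
        // The scalar reduction removes the concurrent updates of Matrix3[ci][cj].
        #pragma omp parallel for reduction(+:acc) num_threads(n)
        for (int i = 0; i < 100; i++) {
            acc += Matrix1[ci][i] * Matrix2[i][cj];
        }
        Matrix3[ci][cj] = acc;
    }
}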
I am pretty sure that you can parallelize the outer loop without any race condition and get a speedup close to the number of threads you employ (again, as long as that number does not exceed the number of logical cores). I would suggest changing that segment of your code as below.
// You can also use this function to set the number of threads:
// omp_set_num_threads(n);
#pragma omp parallel for num_threads(n)
for (int ci = 0; ci < 1000; ci++) {
    for (int cj = 0; cj < 2; cj++) {
        for (int i = 0; i < 100; i++) {
            if(i==0) Matrix3[ci][cj] = 0;
            Matrix3[ci][cj] += Matrix1[ci][i] * Matrix2[i][cj];
        }
    }
}
I have the following code. The bitCount function simply counts the number of set bits in a 64-bit integer. The test function mimics something similar I am doing in a more complicated piece of code, where writing to a matrix slows down the for loop significantly. I am trying to figure out why that is, and whether there is a solution.
#include <vector>
#include <cmath>
#include <cstdint>
#include <omp.h>

// Count the number of set bits
inline int bitCount(uint64_t n){
    int count = 0;
    while(n){
        n &= (n-1);
        count++;
    }
    return count;
}

void test(){
    int nthreads = omp_get_max_threads();
    omp_set_dynamic(0);
    omp_set_num_threads(nthreads);

    // I need a priority queue per thread
    std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000,-INFINITY));
    std::vector<uint64_t> vals(100,1);

    #pragma omp parallel for shared(mat,vals)
    for(int i = 0; i < 100000000; i++){
        std::vector<double> &tid_vec = mat[omp_get_thread_num()];
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            tid_vec[j] = total_count; // if I comment out this line, performance increases drastically
        }
    }
}
This code runs in about 11 seconds. If I comment out the following line:
tid_vec[j] = total_count;
the code runs in about 2 seconds. Is there a reason why writing to a matrix in my case costs so much in performance?
Since you said nothing about your compiler/system specs, I'm assuming you are compiling with GCC and flags -O2 -fopenmp.
If you comment out the line:
tid_vec[j] = total_count;
the compiler will optimize away all the computations whose result is not used. Therefore:
total_count += bitCount(vals[j]);
is optimized away as well. If the main kernel of your application is not actually executed, it is no surprise that the program runs much faster.
On the other hand, I would not implement a bit-count function myself but rather rely on functionality that is already provided. For example, GCC's builtins include __builtin_popcountll (the unsigned long long variant of __builtin_popcount), which does exactly what you are trying to do for 64-bit integers.
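For instance, the inner loop from the question could use the builtin like this (just a sketch reusing vals and tid_vec; on x86, compiling with -mpopcnt or -march=native lets GCC turn it into a single popcnt instruction):
int total_count = 0;
for (unsigned int j = 0; j < vals.size(); j++) {
    total_count += __builtin_popcountll(vals[j]);  // counts the set bits of a 64-bit value
    tid_vec[j] = total_count;
}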
As a bonus: it is much better to work on private data than to have different threads work on different elements of a common array. It improves locality (especially important when access to memory is not uniform, i.e. NUMA) and may reduce contention.
#pragma omp parallel shared(mat,vals)
{
    std::vector<double> local_vec(1000,-INFINITY);

    #pragma omp for
    for(int i = 0; i < 100000000; i++) {
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            local_vec[j] = total_count;
        }
    }

    // Copy local_vec to mat[omp_get_thread_num()]
}
I am trying to refactor an OpenMP-based program and have encountered a terrible scalability issue. The following (obviously not very meaningful) OpenMP program seems to reproduce the problem. Of course, the tiny sample code can be rewritten as a nested for loop, and with collapse(2) almost perfect scalability can be achieved. However, the original program I am working on does not allow that.
Therefore, I am looking for a fix that keeps the do-while structure. From my understanding, OpenMP should be smart enough to keep the threads alive between the iterations, so I expected good scalability. Why is this not the case?
#include <stdio.h>
#include <float.h>

#define MAX( a, b ) ((a)>(b))?(a):(b)

int main() {
    const int N = 6000;
    const int MAX_ITER = 2000000;
    double max = DBL_MIN;
    int iter = 0;
    do {
        #pragma omp parallel for reduction(max:max) schedule(static)
        for(int i = 1; i < N; ++i) {
            max = MAX(max, 3.3*i);
        }
        ++iter;
    } while(iter < MAX_ITER);
    printf("max=%f\n", max);
}
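For reference, the collapse(2) rewrite I mentioned (which the real code does not allow) would look roughly like this:
// The do-while becomes an ordinary outer loop and both loops are fused
// into one parallel iteration space.
#pragma omp parallel for collapse(2) reduction(max:max) schedule(static)
for (int iter = 0; iter < MAX_ITER; ++iter) {
    for (int i = 1; i < N; ++i) {
        max = MAX(max, 3.3*i);
    }
}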
I have measured the following runtimes with Cray compiler Version 8.3.4.
OMP_NUM_THREADS=1 : 0m21.535s
OMP_NUM_THREADS=2 : 0m12.191s
OMP_NUM_THREADS=4 : 0m9.610s
OMP_NUM_THREADS=8 : 0m9.767s
OMP_NUM_THREADS=16: 0m13.571s
This seems to be similar to this question. Thanks in advance. Help is appreciated! :)
You could go for something like this:
#include <stdio.h>
#include <float.h>
#include <omp.h>

#define MAX( a, b ) ((a)>(b))?(a):(b)

int main() {
    const int N = 6000;
    const int MAX_ITER = 2000000;
    double max = DBL_MIN;

    #pragma omp parallel reduction( max : max )
    {
        int iter = 0;
        int nbth = omp_get_num_threads();
        int tid = omp_get_thread_num();
        int myMaxIter = MAX_ITER / nbth;
        if ( tid < MAX_ITER % nbth ) myMaxIter++;
        int chunk = N / nbth;

        do {
            #pragma omp for schedule(dynamic,chunk) nowait
            for(int i = 1; i < N; ++i) {
                max = MAX(max, 3.3*i);
            }
            ++iter;
        } while(iter < myMaxIter);
    }
    printf("max=%f\n", max);
}
I'm pretty sure scalability should improve noticeably.
NB: I had to come back to this a few times, since I realised that, because the number of iterations of the outer (do-while) loop can differ between threads, it is crucially important that the scheduling of the omp for loop is not static; otherwise there is a potential deadlock at the last iteration.
I did a few tests and I think that the proposed solution is both safe and effective.
I am trying to calculate the integral of 4/(1+x^2) from 0 to 1 in C++ with multi-threading using OpenMP.
I took a serial program (which is correct) and changed it.
My idea is:
Assume that X is the number of threads.
Divide the area beneath the function into X parts: first from 0 to 1/X, then from 1/X to 2/X, and so on.
Each thread will calculate its part of the area, and I will sum it all up.
This is how I implemented it:
// No. of threads to do the task
cout<<"Enter num of threads"<<endl;
int num_threads;
cin>>num_threads;

int i; double x,pi,sum=0.0;
step=1.0/(double)num_steps;
int steps_for_thread=num_steps/num_threads;
cout<<"Steps for thread : "<<steps_for_thread<<endl;

//Split to threads
omp_set_num_threads(num_threads);
#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    thread_id++;

    if (thread_id == 1)
    {
        double sum1=0.0;
        double x1;
        for(i=0;i<num_steps/num_threads;i++)
        {
            x1=(i+0.5)*step;
            sum1 = sum1+4.0/(1.0+x1*x1);
        }
        sum+=sum1;
    }
    else
    {
        double sum2=0.0;
        double x2;
        for(i=num_steps/thread_id;i<num_steps/(num_threads-thread_id+1);i++)
        {
            x2=(i+0.5)*step;
            sum2 = sum2+4.0/(1.0+x2*x2);
        }
        sum+=sum2;
    }
}
Explanation:
The i'th thread will calculate the area between i/n and (i+1)/n and add it to the sum.
The problem is that not only is the output wrong, but each time I run the program I get a different result.
Any help will be welcome.
Thanks
You're making this problem much harder than it needs to be. One of OpenMP's goals is to not have to change your serial code. You usually only need to add some pragma statements. So you should write the serial method first.
#include <stdio.h>

double pi(int n) {
    int i;
    double dx, x;
    double sum = 0.0;
    dx = 1.0/n;
    #pragma omp parallel for reduction(+:sum) private(x)
    for(i=0; i<n; i++) {
        x = i*dx;
        sum += 1.0/(1+x*x);
    }
    sum *= 4.0/n;
    return sum;
}

int main(void) {
    printf("%f\n",pi(100000000));
}
Output: 3.141593
Notice that in the function pi the only difference between the serial code and the parallel version is the statement
#pragma omp parallel for reduction(+:sum) private(x)
You should also not normally worry about setting the number of threads.