I am trying to calculate the integral of 4/(1+x^2) from 0 to 1 in C++ with multi-threading using OpenMP.
I took a serial program (which is correct) and changed it.
My idea is:
Assume that X is the number of threads.
Divide the area beneath the function into X parts, first from 0 to 1/X, 1/X to 2/X...
Each thread will calculate its area, and I will sum it all up.
This is how I implemented it:
// No. of threads to do the task
cout << "Enter num of threads" << endl;
int num_threads;
cin >> num_threads;
int i; double x, pi, sum = 0.0;
step = 1.0/(double)num_steps;
int steps_for_thread = num_steps/num_threads;
cout << "Steps for thread : " << steps_for_thread << endl;
// Split to threads
omp_set_num_threads(num_threads);
#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    thread_id++;
    if (thread_id == 1)
    {
        double sum1 = 0.0;
        double x1;
        for (i = 0; i < num_steps/num_threads; i++)
        {
            x1 = (i + 0.5)*step;
            sum1 = sum1 + 4.0/(1.0 + x1*x1);
        }
        sum += sum1;
    }
    else
    {
        double sum2 = 0.0;
        double x2;
        for (i = num_steps/thread_id; i < num_steps/(num_threads - thread_id + 1); i++)
        {
            x2 = (i + 0.5)*step;
            sum2 = sum2 + 4.0/(1.0 + x2*x2);
        }
        sum += sum2;
    }
}
Explanation:
The i-th thread will calculate the area between i/n and (i+1)/n and add it to the sum.
The problem is that not only is the output wrong, but I also get different output each time I run the program.
Any help will be welcomed
Thanks
You're making this problem much harder than it needs to be. One of OpenMP's goals is to let you avoid changing your serial code. You usually only need to add some pragma statements. So you should write the serial method first.
#include <stdio.h>

double pi(int n) {
    int i;
    double dx, sum, x;
    dx = 1.0/n;
    sum = 0.0;  // must be initialized before the reduction
    #pragma omp parallel for reduction(+:sum) private(x)
    for (i = 0; i < n; i++) {
        x = i*dx;
        sum += 1.0/(1 + x*x);
    }
    sum *= 4.0/n;
    return sum;
}

int main(void) {
    printf("%f\n", pi(100000000));
}
Output: 3.141593
Notice that in the function pi the only difference between the serial code and the parallel version is the statement
#pragma omp parallel for reduction(+:sum) private(x)
You should also not normally worry about setting the number of threads.
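If you do need to control the thread count (say, for a scaling experiment), the usual options are the OMP_NUM_THREADS environment variable or a num_threads clause on the pragma; for example (a sketch of the same loop):

#pragma omp parallel for reduction(+:sum) private(x) num_threads(4)

Left alone, the runtime will generally pick a sensible default for your hardware.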
My task is to parallelize the creation, doubling, and summation of the array seen in my code below using C++ and OpenMP. However, I cannot get the summation to work properly in parallel. This is my first time using OpenMP, and I am also quite new to C++. I have tried what can be seen in my code below as well as other variations (having the sum outside of the for loop, defining a sum in parallel to add to the global sum, what is suggested here, etc.). The sum should be 4.15362e-14, but when I use multiple threads I get different, incorrect results each time. What is the proper way to achieve this?
P.S. We have only been taught the critical, master, barrier, and single constructs thus far, so I would appreciate it if answers did not include any others. Thanks!
#include <iostream>
#include <cmath>
#include <omp.h>
using namespace std;

int main()
{
    const int size = 256;
    double* sinTable = new double[256];
    double sum = 0.0;

    // parallelized
    #pragma omp parallel
    {
        for (int n = 0; n < size; n++)
        {
            sinTable[n] = std::sin(2 * M_PI * n / size); // calculate and insert element into array
            sinTable[n] = sinTable[n] * 2;               // double current element in array
            #pragma omp critical
            sum += sinTable[n];                          // add element to total sum (one thread at a time)
        }
    }

    // print sum and exit
    cout << "Sum: " << sum << endl;
    return 0;
}
Unfortunately your code is not OK: with a plain #pragma omp parallel, every thread runs the entire for loop, so the work is repeated number-of-threads times instead of being distributed. You should use:
#pragma omp parallel for
to distribute the iterations among the threads.
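Since you have only been taught critical, master, barrier, and single so far, here is a sketch that sticks to those constructs: each thread accumulates a private partial sum over its share of the iterations, and the critical section is entered only once per thread instead of once per element.

double sum = 0.0;
#pragma omp parallel
{
    double partial = 0.0;                  // private to each thread
    #pragma omp for
    for (int n = 0; n < size; n++)
    {
        sinTable[n] = 2.0 * std::sin(2 * M_PI * n / size);
        partial += sinTable[n];            // no synchronization needed here
    }
    #pragma omp critical
    sum += partial;                        // one guarded update per thread
}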
Another alternative is to use reduction:
#include <iostream>
#include <cmath>
#include <omp.h>
using namespace std;

int main()
{
    const int size = 256;
    const double step = (2.0 * M_PI) / static_cast<double>(size);
    double* sinTable = new double[size];
    double sum = 0.0;

    // parallelized
    #pragma omp parallel for reduction(+:sum)
    for (int n = 0; n < size; n++)
    {
        sinTable[n] = std::sin(static_cast<double>(n) * step); // calculate and insert element into array
        sinTable[n] = sinTable[n] * 2.0;                       // double current element in array
        sum += sinTable[n];                                    // add to this thread's partial sum; the reduction combines them
    }

    // print sum and exit
    cout << "Sum: " << sum << endl;
    delete[] sinTable;
    return 0;
}
Note that in theory the sum should be zero. The value you obtain depends on the order of the additions, so slight differences can be observed due to rounding errors:
size=256 sum(openmp)=2.84217e-14 sum(no openmp)= 4.15362e-14
size=512 sum(openmp)=5.68434e-14 sum(no openmp)= 5.68434e-14
size=1024 sum(openmp)=0 sum(no openmp)=-2.83332e-14
Hi, I am new to C++ and I made a code which runs, but it is slow because of the many nested for loops. I want to speed it up with OpenMP; can anyone guide me? I tried to use '#pragma omp parallel' before the ip loop, and inside this loop I used '#pragma omp parallel for' before the it loop, but it does not work:
#pragma omp parallel
for (int ip = 0; ip != nparticle; ip++) {
    inf14 >> r >> xp >> yp >> zp;
    zp /= sqrt(gamma2);
    counter++;
    double para[7] = {0, 0, Vz, x0-xp, y0-yp, z0-zp, 0};
    if (ip >= 0 && ip <= 43) {
        #pragma omp parallel for
        for (int it = 0; it < NT; it++) {
            para[6] = PosT[it];
            for (int ix = 0; ix < NumX; ix++) {
                para[3] = PosX[ix] - xp;
                for (int iy = 0; iy < NumY; iy++) {
                    para[4] = PosY[iy] - yp;
                    for (int iz = 0; iz < NumZ; iz++) {
                        para[5] = PosZ[iz] - zp;
                        int position = it*NumX*NumY*NumZ + ix*NumY*NumZ + iy*NumZ + iz;
                        rotation(para, &Field[3*position]);
                        MagX[position] += chg*Field[3*position];
                        MagY[position] += chg*Field[3*position+1];
                        MagZ[position] += chg*Field[3*position+2];
                    }
                }
            }
        }
    }
}
My rotation function also contains a for loop with an open-ended (infinite) integration range, as given below:
for (int i = 1; ; i++) {
    gsl_integration_qag(&F, 10*i, 10*i+10, 1.0e-8, 1.0e-8, 100, 2, w, &temp, &error);
    result += temp;
    if (std::abs(temp/result) < ACCURACY) {
        break;
    }
}
I am using the GSL libraries as well. So, how can I speed up this process with OpenMP?
If you don't have inter-loop dependences, you can use the collapse clause to parallelize multiple loops altogether. Example:
void scale(int N, int M, float *A, float *B, float alpha) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            A[i*M + j] = alpha * B[i*M + j];   // element (i,j) of an N x M matrix
        }
    }
}
I suggest you check out the OpenMP C/C++ cheat sheet (PDF), which contains all the specifications for loop parallelization.
Do not put parallel pragmas inside another parallel pragma: you might overload the machine by creating more threads than it can handle. I would establish the parallelization in the outer loop (if it is big enough):
#pragma omp parallel for
for(int ip=0; ip !=nparticle; ip++)
Also make sure you do not have any race conditions between threads (e.g. read-after-write hazards).
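In this code the race is real: position depends only on it, ix, iy and iz, not on ip, so different ip iterations accumulate into the same MagX/MagY/MagZ entries. A minimal sketch of one fix, keeping your variable names, is to make those accumulations atomic:

#pragma omp atomic
MagX[position] += chg*Field[3*position];
#pragma omp atomic
MagY[position] += chg*Field[3*position+1];
#pragma omp atomic
MagZ[position] += chg*Field[3*position+2];

Be aware that rotation(para, &Field[3*position]) also writes the shared Field array at the same positions, and the inf14 >> reads are shared too, so those would need per-thread storage or restructuring before the outer loop is safe to parallelize.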
Advice: if you do not get a good speed-up, a common practice is to iterate in chunks rather than one increment at a time. For instance:
int num_threads = 1;
#pragma omp parallel
{
    #pragma omp single
    num_threads = omp_get_num_threads();
}

int chunkSize = 20; // Define your own chunk here
for (int position = 0; position < total; position += chunkSize*num_threads) {
    int endOfChunk = std::min(position + chunkSize*num_threads, total); // don't run past the end
    #pragma omp parallel for
    for (int ip = position; ip < endOfChunk; ip += chunkSize) {
        // Code
    }
}
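For what it is worth, OpenMP's schedule clause gives you this kind of chunking without the manual bookkeeping, assuming the loop body is otherwise thread-safe; a minimal sketch:

#pragma omp parallel for schedule(static, 20)
for (int ip = 0; ip < nparticle; ip++) {
    // Code
}

schedule(dynamic, chunk) is the variant to reach for when iterations have very uneven cost.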
I am working with OpenMP in order to obtain an algorithm with a near-linear speedup.
Unfortunately I noticed that I could not get the desired speedup.
So, in order to understand the error in my code, I wrote another code, an easy one, just to double-check that the speedup was in principle obtainable on my hardware.
This is the toy example I wrote:
#include <omp.h>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <algorithm>
#include "mkl.h"

int main () {
    int number_of_threads = 1;
    int n = 600;
    int m = 50;
    int N = n/number_of_threads;
    int time_limit = 600;
    double total_clock = omp_get_wtime();
    int time_flag = 0;

    #pragma omp parallel num_threads(number_of_threads)
    {
        int thread_id = omp_get_thread_num();
        int iteration_number_local = 0;
        double *C = new double[n]; std::fill(C, C+n, 3.0);
        double *D = new double[n]; std::fill(D, D+n, 3.0);
        double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

        while (time_flag == 0){
            for (int i = 0; i < N; i++)
                for (int z = 0; z < m; z++)
                    for (int x = 0; x < n; x++)
                        for (int c = 0; c < n; c++){
                            CD[c] = C[z]*D[x];
                            C[z] = CD[c] + D[x];
                        }
            iteration_number_local++;
            if ((omp_get_wtime() - total_clock) >= time_limit)
                time_flag = 1;
        }

        #pragma omp critical
        std::cout << "I am " << thread_id << " and I got " << iteration_number_local << " iterations." << std::endl;
    }
}
I want to highlight again that this code is only a toy example to observe the speedup: the first for loop becomes shorter when the number of parallel threads increases (since N decreases).
However, when I go from 1 to 2-4 threads the number of iterations doubles up as expected; this is not the case when I use 8-10-20 threads: the number of iterations does not increase linearly with the number of threads.
Could you please help me with this? Is the code correct? Should I expect a near-linear speedup?
Results
Running the code above I got the following results.
1 thread: 23 iterations.
20 threads: 397-401 iterations per thread (instead of 420-460).
Your measurement methodology is wrong, especially for a small number of iterations. Take, for example, a run with a 120 s time limit:
1 thread: 3 iterations.
3 reported iterations actually means that 2 iterations finished in less than 120 s and the third one took longer, so the time of 1 iteration is between 40 and 60 s.
2 threads: 5 iterations per thread (instead of 6).
4 iterations finished in less than 120 s, so the time of 1 iteration is between 24 and 30 s.
20 threads: 40-44 iterations per thread (instead of 60).
40 iterations finished in less than 120 s, so the time of 1 iteration is between 2.9 and 3 s.
As you can see, your results do not actually contradict linear speedup.
It would be much simpler and more accurate to execute and time a single outer loop; you would likely see an almost perfectly linear speedup.
Some reasons (non-exhaustive) why you don't see linear speedup are:
Memory-bound performance. Not the case in your toy example with n = 600. More generally: contention for a shared resource (main memory, caches, I/O).
Synchronization between threads (e.g. critical sections). Not the case in your toy example.
Load imbalance between threads. Not the case in your toy example.
Turbo mode will use lower frequencies when all cores are utilized. This can happen in your toy example.
From your toy example I would say that your approach to OpenMP can be improved by better using the high-level abstractions, e.g. parallel for.
More general advice would be too broad for this format and would require more specific information about the non-toy example.
You make some declarations inside the parallel region, which means the memory will be allocated and filled number_of_threads times. Instead, I recommend:
double *C = new double[n]; std::fill(C, C+n, 3.0);
double *D = new double[n]; std::fill(D, D+n, 3.0);
double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

#pragma omp parallel firstprivate(C,D,CD) num_threads(number_of_threads)
{
    int thread_id = omp_get_thread_num();
    int iteration_number_local = 0;
}
Your hardware has a limited number of threads, which depends on the number of cores in your processor; you may have 2 or 4 cores.
A parallel region on its own does not speed up your code. With OpenMP you should use #pragma omp parallel for to speed up a for loop, or
#pragma omp parallel
{
    #pragma omp for
    for (...)
    {
    }
}
This notation is equivalent to #pragma omp parallel for. It will use several threads (depending on your hardware) to process the for loop faster.
Be careful:
#pragma omp parallel
{
    for (...)
    {
    }
}
will run the entire for loop in each thread, which will not speed up your program.
You should try:
int iteration_number = 0;   // shared; thread 0 reports its count here

#pragma omp parallel num_threads(number_of_threads)
{
    int thread_id = omp_get_thread_num();
    int iteration_number_local = 0;
    double *C = new double[n]; std::fill(C, C+n, 3.0);
    double *D = new double[n]; std::fill(D, D+n, 3.0);
    double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

    while (time_flag == 0){
        #pragma omp for
        for (int i = 0; i < N; i++)
            for (int z = 0; z < m; z++)
                for (int x = 0; x < n; x++)
                    for (int c = 0; c < n; c++)
                        CD[c] = C[z]*D[x];
        iteration_number_local++;
        if ((omp_get_wtime() - total_clock) >= time_limit)
            time_flag = 1;
    }

    if (thread_id == 0)
        iteration_number = iteration_number_local;
}
std::cout << "Iterations= " << iteration_number << std::endl;
I have the following piece of code. I run it on a sample of N=3000, and the sequential C++ code is faster by 3 seconds, which is not good at all.
This code fills the array jsd[N] with calculated values, and I want to locate the maximum value and its location.
So:
1- Is this OpenMP conversion correct, and is there any better suggestion to make it more professional?
2- Why is it slower than the equivalent C++ code? Also, the more threads I create, the slower it gets.
Thanks in advance
double maxval = 0;
int pos = -1;
double jsd[N];
#pragma omp parallel for num_threads(4)
for (int i = 0; i < N; i++) {
    double Hl = obj.function1(sequenceVctr, i, LEFT);
    double Hr = obj.function1(sequenceVctr, i, RIGHT);
    jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);
    if (jsd[i] >= maxval) {
        #pragma omp critical
        {
            maxval = jsd[i];
            pos = i;
        }
    }
} // for
Update:
I updated the code as follows, but it is still slow and gets slower with more threads.
double maxval = 0;
int pos = -1;
double jsd[N];
#pragma omp parallel num_threads(50)
for (int i = 0; i < N; i++) {
    double Hl = obj.function1(sequenceVctr, i, LEFT);
    double Hr = obj.function1(sequenceVctr, i, RIGHT);
    jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);
} // for
#pragma omp master
{
    vector<double> jsd2(jsd, jsd+N);
    vector<double>::iterator jsditer;
    jsditer = std::max_element(jsd2.begin(), jsd2.end());
    maxval = *jsditer;
    pos = std::distance(jsd2.begin(), jsditer);
    // cout<<"pos"<<pos<<endl;
}
#pragma omp barrier
The first optimization I would suggest is to first compute all jsd values in the loop, then find the maximum element via std::max_element().
This way you are not forcing the threads to synchronise.
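Concretely, that two-phase version might look like this (a sketch reusing the names from the question):

double jsd[N];
#pragma omp parallel for num_threads(4)
for (int i = 0; i < N; i++) {
    double Hl = obj.function1(sequenceVctr, i, LEFT);
    double Hr = obj.function1(sequenceVctr, i, RIGHT);
    jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);  // each thread writes only its own jsd[i]
}
double* it = std::max_element(jsd, jsd + N);      // serial, O(N), cheap next to the loop
double maxval = *it;
int pos = static_cast<int>(it - jsd);

The parallel loop now contains no critical section at all, and the serial max_element pass costs almost nothing compared to the objective-function calls.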
The second thing I would do is move over to Intel TBB instead of OpenMP and use parallel_reduce().
But the biggest question is: how complex are the objective functions you are evaluating?
I have a piece of code that I want to parallelize, but the OpenMP version is much slower than the serial one. What is wrong with my implementation? This is the code of the program:
#include <iostream>
#include <gsl/gsl_math.h>
#include "Chain.h"
using namespace std;

int main(){
    int const N = 1000;
    int timeSteps = 100;
    double delta = 0.0001;
    double qq[N];
    Chain ch(N);
    ch.initCond();
    for (int t = 0; t < timeSteps; t++){
        ch.changeQ(delta*t);
        ch.calMag_i();
        ch.calForce001();
    }
    ch.printSomething();
}
Chain.h is:
class Chain{
public:
    int N;
    double *q;
    double *mx;
    double *my;
    double *force;
    Chain(int const Np);
    void initCond();
    void changeQ(double delta);
    void calMag_i();
    void calForce001();
    void printSomething();
};
And Chain.cpp is:
#include <cmath>
#include <iostream>
#include <omp.h>
#include "Chain.h"
using namespace std;

Chain::Chain(int const Np){
    this->N = Np;
    this->q = new double[Np];
    this->mx = new double[Np];
    this->my = new double[Np];
    this->force = new double[Np];
}

void Chain::initCond(){
    for (int i = 0; i < N; i++){
        q[i] = 0.0;
        force[i] = 0.0;
    }
}

void Chain::changeQ(double delta){
    int i = 0;
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; i++){
            q[i] = q[i] + delta*i + 1.0*i/N;
        }
    }
}

void Chain::calMag_i(){
    int i = 0;
    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < N; i++){
            mx[i] = cos(q[i]);
            my[i] = sin(q[i]);
        }
    }
}

void Chain::calForce001(){
    int i;
    int j;
    double fij = 0.0;
    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp for private(j, fij)
        for (i = 0; i < N; i++){
            force[i] = 0.0;
            for (j = 0; j < i; j++){
                fij = my[i]*mx[j] - mx[i]*my[j];
                #pragma omp critical
                {
                    force[i] += fij;
                    force[j] += -fij;
                }
            }
        }
    }
    double time = omp_get_wtime() - start_time;
    cout << "time = " << time << endl;
}
So the methods changeQ() and calMag_i() are in fact faster than the serial code, but my problem is calForce001(). The execution times are:
with OpenMP: 3.939 s
without OpenMP: 0.217 s
Now, clearly I'm doing something wrong, or the code can't be parallelized. Any help would be useful.
Thanks in advance.
Carlos
Edit:
In order to clarify the question, I added calls to omp_get_wtime() to measure the execution time of calForce001(); the times for one execution are:
with omp :0.0376656
without omp: 0.00196766
So with OpenMP the method is about 20 times slower.
I also measured the time for the calMag_i() method:
with omp: 3.3845e-05
without omp: 9.9516e-05
For this method OpenMP is about 3 times faster.
I hope this confirms that the latency problem is in the calForce001() method.
There are three reasons why you don't benefit from any speedup.
you have #pragma omp parallel all over your code. What this pragma does is start the "team of threads". At the end of the block, this team is disbanded. This is quite costly. Removing those and using #pragma omp parallel for instead of #pragma omp for will start the team upon first encounter and put it to sleep after each block. This made the application 4x faster for me.
you use #pragma omp critical. On most platforms, this will force the use of a mutex - which is heavily contended because all threads want to write to that variable at the same time. So, don't use a critical section here. You could use atomic updates, but in this case, that won't make much of a difference - see third item. Just removing the critical section improved the speed by another 3x.
Parallelism only makes sense when you have an actual workload. All of your code is too small to benefit from parallelism. There's simply too little workload to win back the time lost on starting/waking/destroying the threads. If your workload were ten times this, some of the parallel for statements would make sense. But especially Chain::calForce001() will never be worth it if you have to do atomic updates; a race-free rewrite is sketched after this list.
With respect to programming style: you're programming in C++. Please use local-scope variables wherever you can. In e.g. Chain::calForce001(), use a local double fij inside the inner loop. That saves you from having to write private clauses. Compilers are smart enough to optimize that, and correct scoping allows for better optimizations.
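To make the third point concrete, here is one hedged way to parallelize calForce001() with no critical section or atomics at all. It exploits the antisymmetry fij = -fji: instead of updating force[i] and force[j] per pair, each thread computes its own force[i] from all j, so no two threads ever write the same element. This is a sketch, not the original code, and it trades roughly twice the arithmetic for zero synchronization:

void Chain::calForce001(){
    #pragma omp parallel for
    for (int i = 0; i < N; i++){
        double fi = 0.0;                       // local accumulator, no private clause needed
        for (int j = 0; j < N; j++){
            // the j == i term is zero, so no special case is required
            fi += my[i]*mx[j] - mx[i]*my[j];   // equals sin(q[i] - q[j])
        }
        force[i] = fi;                         // each element is written by exactly one thread
    }
}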