I am doing some very simple tests with OpenMP in C++ and I have run into a problem that is probably silly, but I can't figure out what's wrong. Consider the following MWE:
#include <iostream>
#include <ctime>
#include <vector>
#include <omp.h>

int main()
{
    int nthreads=1, threadid=0;
    clock_t tstart, tend;
    const int nx=10, ny=10, nz=10;
    int i, j, k;

    std::vector<std::vector<std::vector<long long int> > > arr_par;
    arr_par.resize(nx);
    for (i=0; i<nx; i++) {
        arr_par[i].resize(ny);
        for (j = 0; j<ny; j++) {
            arr_par[i][j].resize(nz);
        }
    }

    tstart = clock();

    #pragma omp parallel default(shared) private(threadid)
    {
    #ifdef _OPENMP
        nthreads = omp_get_num_threads();
        threadid = omp_get_thread_num();
    #endif

        #pragma omp master
        std::cout<<"OpenMP execution with "<<nthreads<<" threads"<<std::endl;
        #pragma omp end master
        #pragma omp barrier

        #pragma omp critical
        {
            std::cout<<"Thread id: "<<threadid<<std::endl;
        }

        #pragma omp for
        for (i=0; i<nx; i++) {
            for (j=0; j<ny; j++) {
                for (k=0; k<nz; k++) {
                    arr_par[i][j][k] = i*j + k;
                }
            }
        }
    }

    tend = clock();
    std::cout<<"Elapsed time: "<<(tend - tstart)/double(CLOCKS_PER_SEC)<<" s"<<std::endl;

    return 0;
}
If nx, ny and nz are equal to 10, the code runs smoothly. If I increase these numbers to 20, I get a segfault. It runs without problems sequentially or with OMP_NUM_THREADS=1, whatever the number of elements.
I compiled the damn thing with
g++ -std=c++0x -fopenmp -gstabs+ -O0 test.cpp -o test
using GCC 4.6.3.
Any thoughts would be appreciated!
You have a data race in your loop counters:
#pragma omp for
for (i=0; i<nx; i++) {
    for (j=0; j<ny; j++) {       // <--- data race
        for (k=0; k<nz; k++) {   // <--- data race
            arr_par[i][j][k] = i*j + k;
        }
    }
}
Since neither j nor k is given the private data-sharing attribute, their values can exceed the corresponding limits when several threads increment them at once, resulting in out-of-bounds accesses to arr_par. The chance of several threads incrementing j or k at the same time grows with the number of iterations.
The best way to treat those cases is to simply declare the loop variables inside the loop operator itself:
#pragma omp for
for (int i=0; i<nx; i++) {
    for (int j=0; j<ny; j++) {
        for (int k=0; k<nz; k++) {
            arr_par[i][j][k] = i*j + k;
        }
    }
}
The other way is to add the private(j,k) clause to the head of the parallel region:
#pragma omp parallel default(shared) private(threadid) private(j,k)
It is not strictly necessary to make i private in your case, since the loop variable of a parallel loop is implicitly made private. Still, if i is used somewhere else in the code, it might make sense to make it private to prevent other data races.
Also, don't use clock() to measure the time for parallel applications since on most Unix OSes it returns the total CPU time for all threads. Use omp_get_wtime() instead.
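For reference, a minimal sketch of wall-clock timing with omp_get_wtime(), reusing the variables from your MWE (only the timing calls are the point here):

double wstart = omp_get_wtime();   // wall-clock time before the parallel work

#pragma omp parallel for
for (int i = 0; i < nx; i++) {
    // ... parallel work ...
}

double wend = omp_get_wtime();     // wall-clock time after the parallel work
std::cout << "Elapsed time: " << (wend - wstart) << " s" << std::endl;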
Related
I have written the below code to parallelize two 'for' loops.
#include <iostream>
#include <omp.h>

#define SIZE 100

int main()
{
    int arr[SIZE];
    int sum = 0;
    int i, tid, numt, prod;
    double t1, t2;

    for (i = 0; i < SIZE; i++)
        arr[i] = 0;

    t1 = omp_get_wtime();
    #pragma omp parallel private(tid, prod)
    {
        tid = omp_get_thread_num();
        numt = omp_get_num_threads();
        std::cout << "Tid: " << tid << " Thread: " << numt << std::endl;

        #pragma omp for reduction(+: sum)
        for (i = 0; i < 50; i++) {
            prod = arr[i]+1;
            sum += prod;
        }

        #pragma omp for reduction(+: sum)
        for (i = 50; i < SIZE; i++) {
            prod = arr[i]+1;
            sum += prod;
        }
    }
    t2 = omp_get_wtime();

    std::cout << "Time taken: " << (t2 - t1) << ", Parallel sum: " << sum << std::endl;
    return 0;
}
In this case the execution of the 1st 'for' loop is done in parallel by all the threads and the result is accumulated in the sum variable. After the 1st 'for' loop is done, the threads start executing the 2nd 'for' loop in parallel and the result is again accumulated in the sum variable. Clearly, the execution of the 2nd 'for' loop waits for the execution of the 1st 'for' loop to finish.
I want the two 'for' loops to be processed simultaneously across threads. How can I do that? Is there any other way I can write this code more efficiently? Ignore the dummy work that I am doing inside the 'for' loops.
You can mark the loops nowait and move the reduction clause to the parallel region itself, so the reduction happens at the end of the parallel section. Something like this:
#pragma omp parallel private(tid, prod) reduction(+: sum)
{
    #pragma omp for nowait
    for (i = 0; i < 50; i++) {
        prod = arr[i]+1;
        sum += prod;
    }

    #pragma omp for nowait
    for (i = 50; i < SIZE; i++) {
        prod = arr[i]+1;
        sum += prod;
    }
}
If you use #pragma omp for nowait, all threads are first assigned to the first loop; the second loop only starts once at least one thread has finished its share of the first loop. Unfortunately, there is no way to tell the omp for construct to use e.g. only half of the threads.
Fortunately, there is a way to do so (i.e. to run the 2 loops in parallel) by using tasks. The following code uses the taskloop construct and the num_tasks clause to split the work, so that roughly half of the threads run the first loop and the other half run the second one. This will do exactly what you intended, but you have to test which solution is faster in your case.
#pragma omp parallel
#pragma omp single
{
    int n = omp_get_num_threads();

    #pragma omp taskloop num_tasks(n/2)
    for (int i = 0; i < 50; i++) {
        //do something
    }

    #pragma omp taskloop num_tasks(n/2)
    for (int i = 50; i < SIZE; i++) {
        //do something
    }
}
UPDATE: The first paragraph is not entirely correct: by changing the chunk_size you have some control over how many threads will be used in the first loop. This can be done with e.g. the schedule(static, chunk_size) clause. So, I thought setting the chunk_size would do the trick:
#pragma omp parallel
{
    int n = omp_get_num_threads();
    #pragma omp single
    printf("num_threads=%d\n", n);

    #pragma omp for schedule(static,2) nowait
    for (int i = 0; i < 4; i++) {
        printf("thread %d running 1st loop\n", omp_get_thread_num());
    }

    #pragma omp for schedule(static,2)
    for (int i = 4; i < SIZE; i++) {
        printf("thread %d running 2nd loop\n", omp_get_thread_num());
    }
}
BUT at first the result seems surprising:
num_threads=4
thread 0 running 1st loop
thread 0 running 1st loop
thread 0 running 2nd loop
thread 0 running 2nd loop
thread 1 running 1st loop
thread 1 running 1st loop
thread 1 running 2nd loop
thread 1 running 2nd loop
What is going on? Why are threads 2 and 3 not used? With schedule(static,2) and only 4 iterations in the first loop, threads 0 and 1 each receive one chunk of two iterations, so there is nothing left for threads 2 and 3. Moreover, the OpenMP runtime guarantees that if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread will receive exactly the same iteration ranges in both loops.
On the other hand, the result of using the schedule(dynamic,2) clause was quite surprising: only one thread is used (CodeExplorer link is here).
In an OpenMP framework, suppose I have a series of tasks, each of which should be done by a single thread. Each task is different, so they don't fit into a #pragma omp for construct. Inside each single construct, the task updates a variable shared by all tasks. How can I protect the update of such a variable?
A simplified example:
#include <vector>

struct A {
    std::vector<double> x, y, z;
};

int main()
{
    A r;

    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i);
        // DANGER
        r.x = std::move(res);
    }

    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i * i);
        // DANGER
        r.y = std::move(res);
    }

    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i * i + 2);
        // DANGER
        r.z = std::move(res);
    }

    #pragma omp barrier

    return 0;
}
The code lines below // DANGER are problematic because they modify the memory contents of a shared variable.
In the example above, it might be that it still works without issues, because I am effectively modifying different members of r. Still, the problem is: how can I make sure that tasks do not simultaneously update r? Is there a "sort-of" atomic pragma for the single construct?
There is no data race in your original code, because x, y, and z are different vectors in struct A (as already emphasized by #463035818_is_not_a_number), so in this respect you do not have to change anything in your code.
However, a #pragma omp parallel directive is missing from your code, so at the moment it is a serial program. It should look like this:
#pragma omp parallel num_threads(3)
{
    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i);
        // DANGER
        r.x = std::move(res);
    }

    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i * i);
        // DANGER
        r.y = std::move(res);
    }

    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i * i + 2);
        // DANGER
        r.z = std::move(res);
    }
}
In this case #pragma omp barrier is not necessary, as there is an implied barrier at the end of the parallel region. Note that I have used the num_threads(3) clause to make sure that only 3 threads are assigned to this parallel region. If you skip this clause, then all other threads just wait at the implied barrier.
In the case of an actual data race (i.e. more than one single region/section changes the same struct member), you can use #pragma omp critical (name) to rectify this. But keep in mind that this kind of serialization can negate the benefits of multithreading when there is not enough real parallel work besides the critical section.
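For illustration, a minimal sketch of what that could look like if two of the blocks really did append to the same member, say r.x (this is only a hypothetical variant of your example, not something your current code needs):

#pragma omp single nowait
{
    std::vector<double> res;
    for (int i = 0; i < 10; ++i)
        res.push_back(i);
    // all critical sections named update_r exclude each other
    #pragma omp critical (update_r)
    r.x.insert(r.x.end(), res.begin(), res.end());
}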
Note that a much better solution is to use #pragma omp sections (as suggested by #PaulG). If the number of tasks to run in parallel is known at compile time, sections are the typical choice in OpenMP:
#pragma omp parallel sections
{
    #pragma omp section
    {
        // Task 1 here
    }
    #pragma omp section
    {
        // Task 2
    }
    #pragma omp section
    {
        // Task 3
    }
}
For the record, I would like to show that it is easy to do it with #pragma omp for as well:
#pragma omp parallel for
for (int i = 0; i < 3; i++)
{
    if (i == 0)
    {
        // Task 1
    }
    else if (i == 1)
    {
        // Task 2
    }
    else if (i == 2)
    {
        // Task 3
    }
}
each task updates a variable shared by all tasks.
Actually, they don't. Suppose you rewrite the code like this (you don't need the temporary vectors):
void foo(std::vector<double>& x, std::vector<double>& y, std::vector<double>& z) {
    #pragma omp single nowait
    {
        for (int i = 0; i < 10; ++i)
            x.push_back(i);
    }

    #pragma omp single nowait
    {
        for (int i = 0; i < 10; ++i)
            y.push_back(i * i);
    }

    #pragma omp single nowait
    {
        for (int i = 0; i < 10; ++i)
            z.push_back(i * i + 2);
    }

    #pragma omp barrier
}
As long as the caller can ensure that x, y and z do not refer to the same object, there is no data race. Each part of the code modifies a separate vector. No synchronization needed.
Now, it does not matter where those vectors come from. You can call the function like this:
A r;
foo(r.x, r.y, r.z);
PS: I am not familiar with OpenMP anymore; I assumed the annotations correctly do what you want them to do.
Hi, I am new to C++. I wrote some code that runs, but it is slow because of many nested for loops, and I want to speed it up with OpenMP. Can anyone guide me? I tried to use '#pragma omp parallel' before the ip loop, and inside this loop I used '#pragma omp parallel for' before the it loop, but it does not work.
#pragma omp parallel
for(int ip=0; ip !=nparticle; ip++){
    inf14>>r>>xp>>yp>>zp;
    zp/=sqrt(gamma2);
    counter++;
    double para[7]={0,0,Vz,x0-xp,y0-yp,z0-zp,0};
    if(ip>=0 && ip<=43){
        #pragma omp parallel for
        for(int it=0;it<NT;it++){
            para[6]=PosT[it];
            for(int ix=0;ix<NumX;ix++){
                para[3]=PosX[ix]-xp;
                for(int iy=0;iy<NumY;iy++){
                    para[4]=PosY[iy]-yp;
                    for(int iz=0;iz<NumZ;iz++){
                        para[5]=PosZ[iz]-zp;
                        int position=it*NumX*NumY*NumZ+ix*NumY*NumZ+iy*NumZ+iz;
                        rotation(para,&Field[3*position]);
                        MagX[position] +=chg*Field[3*position];
                        MagY[position] +=chg*Field[3*position+1];
                        MagZ[position] +=chg*Field[3*position+2];
                    }
                }
            }
        }
    }
}
My rotation function also contains an open-ended integration loop, as given below:
for(int i=1;;i++){
    gsl_integration_qag(&F, 10*i, 10*i+10, 1.0e-8, 1.0e-8, 100, 2, w, &temp, &error);
    result+=temp;
    if(abs(temp/result)<ACCURACY){
        break;
    }
}
I am using the GSL libraries as well. So how can I speed up this process, or how should I set up OpenMP for it?
If you don't have inter-loop dependences, you can use the collapse clause to parallelize multiple nested loops together. Example:
void scale( int N, int M, float A[N][M], float B[N][M], float alpha ) {
    #pragma omp for collapse(2)
    for( int i = 0; i < N; i++ ) {
        for( int j = 0; j < M; j++ ) {
            A[i][j] = alpha * B[i][j];
        }
    }
}
I suggest you check out the OpenMP C/C++ cheat sheet (PDF), which contains the specifications for loop parallelization.
Do not put parallel pragmas inside another parallel pragma. You might overload the machine by creating more threads than it can handle. I would establish the parallelization in the outer loop (if it is big enough):
#pragma omp parallel for
for(int ip=0; ip !=nparticle; ip++)
Also make sure you do not have any race conditions between threads (e.g. read-after-write hazards).
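For example, if you parallelize over ip, note that position does not depend on ip, so different particles can update the same MagX/MagY/MagZ element concurrently. A minimal sketch, reusing the names from your code, of guarding those accumulations with atomic updates (the inf14 stream reads, counter++, and the shared Field writes would also need to be moved out of the parallel loop or handled similarly):

// each accumulation is performed atomically, so concurrent ip iterations
// hitting the same 'position' cannot lose updates
#pragma omp atomic update
MagX[position] += chg*Field[3*position];
#pragma omp atomic update
MagY[position] += chg*Field[3*position+1];
#pragma omp atomic update
MagZ[position] += chg*Field[3*position+2];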
Advice: if you do not get a great speed-up, a good practice is to iterate in chunks rather than one increment at a time. For instance:
int num_threads = 1;
#pragma omp parallel
{
    #pragma omp single
    {
        num_threads = omp_get_num_threads();
    }
}

int chunkSize = 20; //Define your own chunk here

for (int position = 0; position < total; position += (chunkSize*num_threads)) {
    int endOfChunk = position + (chunkSize*num_threads);
    #pragma omp parallel for
    for (int ip = position; ip < endOfChunk; ip += chunkSize) {
        //Code
    }
}
I am building a plugin for Autodesk Maya 2013 in C++. I have to solve a set of optimization problems as fast as I can, and I am using OpenMP for this task. The problem is that I don't have very much experience with parallel computing. I tried to use:
#pragma omp parallel for schedule (static)
on my for loops (without enough understanding of how it's supposed to work) and it worked very well for some of my code, but crashed another portion of my code.
Here is an example of a function that crashes because of the omp directive:
void PlanarizationConstraint::fillSparseMatrix(const Optimizer& opt, vector<T>& elements, double mu)
{
    int size = 3;
    #pragma omp parallel for schedule (static)
    for(int i = 0; i < opt.FVIc.outerSize(); i++)
    {
        int index = 3*i;
        Eigen::Matrix<double,3,3> Qxyz = Eigen::Matrix<double,3,3>::Zero();
        for(SpMat::InnerIterator it(opt.FVIc,i); it; ++it)
        {
            int face = it.row();
            for(int n = 0; n < size; n++)
            {
                Qxyz.row(n) += N(face,n)*N.row(face);
                elements.push_back(T(index+n,offset+face,(1 - mu)*N(face,n)));
            }
        }
        for(int n = 0; n < size; n++)
        {
            for(int k = 0; k < size; k++)
            {
                elements.push_back(T(index+n,index+k,(1-mu)*Qxyz(n,k)));
            }
        }
    }

    #pragma omp parallel for schedule (static)
    for(int j = 0; j < opt.VFIc.outerSize(); j++)
    {
        elements.push_back(T(offset+j,offset+j,opt.fvi[j]));
        for(SpMat::InnerIterator it(opt.VFIc,j); it; ++it)
        {
            int index = 3*it.row();
            for(int n = 0; n < size; n++)
            {
                elements.push_back(T(offset+j,index+n,N(j,n)));
            }
        }
    }
}
And here is an example of code that works very well with those directives (and is faster because of it)
Eigen::MatrixXd Optimizer::OptimizeLLGeneral()
{
    ConstraintsManager manager;
    SurfaceConstraint surface(1,true);
    PlanarizationConstraint planarization(1,true,3^Nv,Nf);
    manager.addConstraint(&surface);
    manager.addConstraint(&planarization);
    double mu = mu0;
    for(int k = 0; k < iterations; k++)
    {
        #pragma omp parallel for schedule (static)
        for(int j = 0; j < VFIc.outerSize(); j++)
        {
            manager.calcVariableMatrix(*this,j);
        }

        #pragma omp parallel for schedule (static)
        for(int i = 0; i < FVIc.outerSize(); i++)
        {
            Eigen::MatrixXd A = Eigen::Matrix<double, 3, 3>::Zero();
            Eigen::MatrixXd b = Eigen::Matrix<double, 1, 3>::Zero();
            manager.addLocalMatrixComponent(*this,i,A,b,mu);
            Eigen::VectorXd temp = b.transpose();
            Q.row(i) = A.colPivHouseholderQr().solve(temp);
        }
        mu = r*mu;
    }
    return Q;
}
My question is: what makes one function work so well with the omp directive, and what makes the other function crash? What is the difference that makes the omp directive behave differently?
Before using OpenMP, you pushed data into the vector elements one element at a time. However, with OpenMP, there are several threads running the code in the for loop in parallel. When more than one thread pushes data into the vector elements at the same time, and there is no code to ensure that one thread does not start pushing before another one finishes, problems will happen. That's why your code crashes.
To solve this problem, you could use local buffer vectors. Each thread first pushes data into its private local buffer vector; then you can concatenate these buffer vectors into a single vector.
You will notice that this method cannot maintain the original order of the data elements in the vector elements. If you want to do that, you could calculate the expected index of each data element and assign the data to the right position directly. A sketch of the buffer approach is shown below.
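A minimal sketch of the per-thread buffer idea, assuming elements is the std::vector<T> from your function and mirroring its first loop (the merge order across threads is unspecified here):

#pragma omp parallel
{
    std::vector<T> local;                        // private buffer for this thread
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < opt.FVIc.outerSize(); i++)
    {
        // ... build the triplets into 'local' instead of 'elements' ...
        // local.push_back(T(...));
    }
    // one thread at a time appends its buffer to the shared vector
    #pragma omp critical (merge_elements)
    elements.insert(elements.end(), local.begin(), local.end());
}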
UPDATE:
OpenMP provides APIs to let you know how many threads you use and which thread you are using. See omp_get_max_threads() and omp_get_thread_num() for more info.
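For instance, a hedged sketch of using those calls to keep one buffer per thread and merge them in a deterministic order afterwards (again assuming the types from your function):

std::vector<std::vector<T> > buffers(omp_get_max_threads());   // one buffer per possible thread

#pragma omp parallel
{
    std::vector<T>& local = buffers[omp_get_thread_num()];     // this thread's own buffer
    #pragma omp for schedule(static)
    for (int i = 0; i < opt.FVIc.outerSize(); i++)
    {
        // local.push_back(T(...));
    }
}

// merge sequentially, which keeps a fixed per-thread order
for (size_t t = 0; t < buffers.size(); ++t)
    elements.insert(elements.end(), buffers[t].begin(), buffers[t].end());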
At the start of #pragma omp parallel a bunch of threads are created, then when we get to #pragma omp for the workload is distributed. What happens if this for loop has a for loop inside it, and I place a #pragma omp for before it as well? Does each thread create new threads? If not, which threads are assigned this task? What exactly happens in this situation?
By default, no threads are spawned for the inner loop. It is done sequentially using the thread that reaches it.
This is because nesting is disabled by default. However, if you enable nesting via omp_set_nested(), then a new set of threads will be spawned.
However, if you aren't careful, this will result in p^2 threads (since each of the original p threads will spawn another p threads). Therefore nesting is disabled by default.
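A minimal sketch of what enabling nesting looks like (note that omp_set_nested() is deprecated in newer OpenMP versions in favour of omp_set_max_active_levels(), but it matches the discussion here):

#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_nested(1);          // allow nested parallel regions
    omp_set_num_threads(2);     // 2 outer threads, each may spawn 2 inner ones

    #pragma omp parallel
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(2)
        {
            // without omp_set_nested(1) each inner region would run with 1 thread
            printf("outer %d, inner %d\n", outer, omp_get_thread_num());
        }
    }
    return 0;
}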
In a situation like the following:
#pragma omp parallel
{
    #pragma omp for
    for(int ii = 0; ii < n; ii++) {
        /* ... */
        #pragma omp for
        for(int jj = 0; jj < m; jj++) {
            /* ... */
        }
    }
}
what happens is that you trigger undefined behavior, as you violate the OpenMP standard. More precisely, you violate the restrictions appearing in Section 2.5 (worksharing constructs):
The following restrictions apply to worksharing constructs:
Each worksharing region must be encountered by all threads in a team or by none at all.
The sequence of worksharing regions and barrier regions encountered must be the same for every thread in a team.
This is clearly shown in the examples A.39.1c and A.40.1c:
Example A.39.1c: The following example of loop construct nesting is conforming because the inner and outer loop regions bind to different parallel regions:
void work(int i, int j) {}

void good_nesting(int n)
{
    int i, j;
    #pragma omp parallel default(shared)
    {
        #pragma omp for
        for (i=0; i<n; i++) {
            #pragma omp parallel shared(i, n)
            {
                #pragma omp for
                for (j=0; j < n; j++)
                    work(i, j);
            }
        }
    }
}
Example A.40.1c: The following example is non-conforming because the inner and outer loop regions are closely nested:
void work(int i, int j) {}

void wrong1(int n)
{
    #pragma omp parallel default(shared)
    {
        int i, j;
        #pragma omp for
        for (i=0; i<n; i++) {
            /* incorrect nesting of loop regions */
            #pragma omp for
            for (j=0; j<n; j++)
                work(i, j);
        }
    }
}
Notice that this is different from:
#pragma omp parallel for
for(int ii = 0; ii < n; ii++) {
    /* ... */
    #pragma omp parallel for
    for(int jj = 0; jj < m; jj++) {
        /* ... */
    }
}
in which you try to spawn a nested parallel region. Only in this case does the discussion in Mysticial's answer hold.