Every time I try to print out the thread ID, regardless of where I put the print statement, it always prints threadId = 0. It looks like only one thread is being created, but why? I don't see what I'm doing wrong. I've also checked that num_t = 16, and I've made sure to use -fopenmp when compiling.
omp_set_num_threads(num_t);
#pragma omp parallel shared(a,b,c) private(i,j,k) num_threads(num_t)
{
    #pragma omp for schedule(static)
    for (int i = 0; i < m; i++)
    {
        std::cout << omp_get_thread_num() << "\n";
        for (int j = 0; (j < n); j++)
        {
            c[i + j*m] = 0.0;
            for (int k = 0; k < q; k++)
            {
                c[i + j*m] += a[i*q + k]*b[j*q + k];
            }
        }
    }
}
To test it first, I recommend using this:
#pragma omp parallel for private(...) shared(...) schedule(...) num_threads (X)
where "X" is the number of threads to be created. In theory, the previous line must have a similar effect to yours, but C++ can be picky sometimes (specially with the "parallel" clause)
Btw, maybe is not your case, but be careful using "text keys" {}. OpenMP's functionality can be different depending on adding them to the code block or not.
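For reference, applied to your loop the combined directive could look something like this (just a sketch, assuming a, b, c, m, n, q and num_t are declared as in your snippet):
#pragma omp parallel for shared(a, b, c) schedule(static) num_threads(num_t)
for (int i = 0; i < m; i++)
{
    std::cout << omp_get_thread_num() << "\n";   // prints the id of the thread executing this iteration
    for (int j = 0; j < n; j++)
    {
        c[i + j*m] = 0.0;
        for (int k = 0; k < q; k++)
        {
            c[i + j*m] += a[i*q + k]*b[j*q + k];
        }
    }
}
Since i, j and k are declared inside the loops here, no private clause is needed.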
I have written the below code to parallelize two 'for' loops.
#include <iostream>
#include <omp.h>
#define SIZE 100

int main()
{
    int arr[SIZE];
    int sum = 0;
    int i, tid, numt, prod;
    double t1, t2;

    for (i = 0; i < SIZE; i++)
        arr[i] = 0;

    t1 = omp_get_wtime();
    #pragma omp parallel private(tid, prod)
    {
        tid = omp_get_thread_num();
        numt = omp_get_num_threads();
        std::cout << "Tid: " << tid << " Thread: " << numt << std::endl;

        #pragma omp for reduction(+: sum)
        for (i = 0; i < 50; i++) {
            prod = arr[i]+1;
            sum += prod;
        }

        #pragma omp for reduction(+: sum)
        for (i = 50; i < SIZE; i++) {
            prod = arr[i]+1;
            sum += prod;
        }
    }
    t2 = omp_get_wtime();

    std::cout << "Time taken: " << (t2 - t1) << ", Parallel sum: " << sum << std::endl;
    return 0;
}
In this case the execution of the 1st 'for' loop is done in parallel by all the threads and the result is accumulated in the sum variable. Once the 1st 'for' loop is done, the threads start executing the 2nd 'for' loop in parallel, again accumulating the result in the sum variable. So the execution of the 2nd 'for' loop clearly waits for the 1st 'for' loop to finish.
I want the two 'for' loops to be processed simultaneously across the threads. How can I do that? Is there any other way I could write this code more efficiently? Ignore the dummy work that I am doing inside the 'for' loops.
You can declare the loops nowait and move the reduction up to the enclosing parallel region. Something like this:
#pragma omp parallel private(tid, prod) reduction(+: sum)
{
    #pragma omp for nowait
    for (i = 0; i < 50; i++) {
        prod = arr[i]+1;
        sum += prod;
    }

    #pragma omp for nowait
    for (i = 50; i < SIZE; i++) {
        prod = arr[i]+1;
        sum += prod;
    }
}
If you use #pragma omp for nowait, all threads are still assigned to the first loop; a thread only moves on to the second loop once it has finished its share of the first one. Unfortunately, there is no way to tell the omp for construct to use e.g. only half of the threads.
Fortunately, there is a solution (i.e. a way to run the 2 loops in parallel) using tasks. The following code uses half of the threads to run the first loop and the other half to run the second one, using the taskloop construct with the num_tasks clause to control how many tasks are created for each loop. This will do exactly what you intended, but you have to test which solution is faster in your case.
#pragma omp parallel
#pragma omp single
{
    int n = omp_get_num_threads();

    #pragma omp taskloop num_tasks(n/2) nogroup  // nogroup: do not wait here, so tasks of both loops can run concurrently
    for (int i = 0; i < 50; i++) {
        //do something
    }

    #pragma omp taskloop num_tasks(n/2)
    for (int i = 50; i < SIZE; i++) {
        //do something
    }
}
UPDATE: The first paragraph is not entirely correct: by changing the chunk_size you do have some control over how many threads will be used in the first loop. This can be done with e.g. the schedule(static, chunk_size) clause. So I thought setting the chunk_size would do the trick:
#pragma omp parallel
{
    int n = omp_get_num_threads();
    #pragma omp single
    printf("num_threads=%d\n", n);

    #pragma omp for schedule(static,2) nowait
    for (int i = 0; i < 4; i++) {
        printf("thread %d running 1st loop\n", omp_get_thread_num());
    }

    #pragma omp for schedule(static,2)
    for (int i = 4; i < SIZE; i++) {
        printf("thread %d running 2nd loop\n", omp_get_thread_num());
    }
}
BUT at first the result seems surprising:
num_threads=4
thread 0 running 1st loop
thread 0 running 1st loop
thread 0 running 2nd loop
thread 0 running 2nd loop
thread 1 running 1st loop
thread 1 running 1st loop
thread 1 running 2nd loop
thread 1 running 2nd loop
What is going on? Why are threads 2 and 3 not used? The OpenMP runtime guarantees that if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread will receive exactly the same iteration ranges in both loops.
On the other hand, the result of using the schedule(dynamic,2) clause was quite surprising: only one thread is used. The CodeExplorer link is here.
So I started using OpenMP (multithreading) to increase the speed of my matrix multiplication and I witnessed something weird: when I turn off OpenMP support (in Visual Studio 2019), my nested for loop completes 2x faster. So I removed "#pragma omp critical" to test whether it slows down the process significantly, and the process went 4x faster than before (with OpenMP support on).
Here's my question: is "#pragma omp critical" important in a nested loop? Can't I just skip it?
#pragma omp parallel for collapse(3)
for (int i = 0; i < this->I; i++)
{
    for (int j = 0; j < A.J; j++)
    {
        m.matrix[i][j] = 0;
        for (int k = 0; k < A.I; k++)
        {
            #pragma omp critical
            m.matrix[i][j] += this->matrix[i][k] * A.matrix[k][j];
        }
    }
}
Here's my question: is "#pragma omp critical" important in a nested loop? Can't I just skip it?
If the matrices m, this and A are different, you do not need any critical region. Instead, you only need to ensure that each thread writes to a different position of the matrix m, as follows:
#pragma omp parallel for collapse(2)
for (int i = 0; i < this->I; i++)
{
    for (int j = 0; j < A.J; j++)
    {
        m.matrix[i][j] = 0;
        for (int k = 0; k < A.I; k++)
        {
            m.matrix[i][j] += this->matrix[i][k] * A.matrix[k][j];
        }
    }
}
The collapse clause assigns a different (i, j) pair to each thread, so multiple threads will not write to the same position of the matrix m (i.e., there is no race condition).
#pragma omp critical is necessary in the code as posted, because with collapse(3) the innermost k loop is also split across threads, so two threads can update the same m.matrix[i][j] value. It hurts performance because only one thread at a time can execute the protected statement.
This would likely be better without the collapse part (then you can remove the #pragma omp critical): accumulate the sum in a temporary local variable, then store it in m.matrix[i][j] after the k loop finishes.
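A minimal sketch of that suggestion, assuming the same matrices as in the question and a double element type:
#pragma omp parallel for
for (int i = 0; i < this->I; i++)
{
    for (int j = 0; j < A.J; j++)
    {
        double sum = 0.0;                                  // local accumulator, private to the thread
        for (int k = 0; k < A.I; k++)
        {
            sum += this->matrix[i][k] * A.matrix[k][j];
        }
        m.matrix[i][j] = sum;                              // one write per (i, j), no critical needed
    }
}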
I'm trying to make my serial program parallel with OpenMP. Here is the code, where I have a big parallel region with a number of internal "#pragma omp for" sections. In the serial version I have a function fftw_shift() which contains "for" loops too.
The question is how to rewrite the fftw_shift() function properly so that the threads that already exist in the outer parallel region can split the "for" loops inside it, without spawning nested threads.
I'm not sure that my implementation works correctly. There is the option of inlining the whole function into the parallel region, but I'm trying to figure out how to handle it in the situation described.
int fftw_shift(fftw_complex *pulse, fftw_complex *shift_buf, int array_size)
{
    int j = 0; //counter
    if ((pulse != nullptr) && (shift_buf != nullptr)) {   // proceed only if both buffers are valid
        if (omp_in_parallel()) {
            //shift the array
            #pragma omp for private(j) //schedule(dynamic)
            for (j = 0; j < array_size / 2; j++) {
                //left to right
                shift_buf[(array_size / 2) + j][REAL] = pulse[j][REAL]; //real
                shift_buf[(array_size / 2) + j][IMAG] = pulse[j][IMAG]; //imaginary
                //right to left
                shift_buf[j][REAL] = pulse[(array_size / 2) + j][REAL]; //real
                shift_buf[j][IMAG] = pulse[(array_size / 2) + j][IMAG]; //imaginary
            }
            //rewrite the array
            #pragma omp for private(j) //schedule(dynamic)
            for (j = 0; j < array_size; j++) {
                pulse[j][REAL] = shift_buf[j][REAL]; //real
                pulse[j][IMAG] = shift_buf[j][IMAG]; //imaginary
            }
            return 0;
        }
    }
}
....
#pragma omp parallel firstprivate(x, phase) if(array_size >= OMP_THREASHOLD)
{
    // First half-step
    #pragma omp for schedule(dynamic)
    for (x = 0; x < array_size; x++) {
        ..
    }

    // Forward FTW
    fftw_shift(pulse_x, shift_buf, array_size);

    #pragma omp master
    {
        fftw_execute(dft);
    }
    #pragma omp barrier

    fftw_shift(pulse_kx, shift_buf, array_size);
    ...
}
If you call fftw_shift from a parallel region, but not from within a work-sharing construct (i.e. not inside a parallel for), then you can use omp for just as if the loop were written directly inside the parallel region. This is called an orphaned directive.
However, your loops just copy data, so, depending on your system, don't expect a perfect speedup.
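A minimal, self-contained sketch of an orphaned directive (the function and variable names are made up for illustration):
void copy_array(double *dst, const double *src, int n)
{
    // Orphaned worksharing directive: it binds to the parallel region of the caller.
    #pragma omp for
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
}

void caller(double *dst, const double *src, int n)
{
    #pragma omp parallel
    {
        copy_array(dst, src, n);   // the threads of this region share the loop iterations
    }
}
If copy_array is called outside of any parallel region, the omp for simply runs sequentially.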
I have an issue with parallelizing two for loops with OpenMP in C++. I have a member function CallFunction(i, j) which, for every i and j, sets independent member variables to a specific value and returns a weighted sum of these values. Because these calls are independent for different combinations of i and j, I want to parallelize this process. I tried it in the following way:
double optimal_value = 0;
#pragma omp parallel for reduction(+:optimal_value)
for (int i = 0; i < n; i++)
{
    for (int j = 0; j < n; j++)
    {
        if (i == j) continue;
        optimal_value += CallFunction(i,j);
    }
}
The above code does not have a significant effect on my runtime: I achieve almost the same runtime with and without "#pragma omp parallel for". Would it be better to write the nested loops as one loop and parallelize that? I have no idea how to make it work. Do I need further commands or settings besides enabling OpenMP?
My system runs on a dual-core CPU.
Would you please show me how to do it right?
Many thanks in advance!
Here is a parallelization of the two loops:
double optimal_value = 0;
double begin = omp_get_wtime();

#pragma omp parallel for reduction(+:optimal_value)
for (int i = 0; i < n; i++)
{
    num_tr = omp_get_num_threads();
    double optimal_value_in = 0.0;

    #pragma omp parallel for reduction(+:optimal_value_in)
    for (int j = 0; j < n; j++)
    {
        if (i == j) continue;
        optimal_value_in += CallFunction(i,j);
    }
    optimal_value += optimal_value_in;
}

double end = omp_get_wtime();
double elapsed_secs = double(end - begin);
cout << "############# " << "Using #Threads " << num_tr << endl;
cout << "############# " << optimal_value << " Time For Parallel Execution :: " << elapsed_secs << endl;
The thing here is (as also mentioned above in the comments by others) that I am not sure you will see much speedup with just n = 25 when the body of CallFunction is as simple as:
double CallFunction(int i, int j) {
    return i*j;
}
With n = 250000 and 8 threads I got a speedup of 4.43, so it will strongly depend on what is done inside CallFunction.
I am building a plugin for Autodesk Maya 2013 in C++. I have to solve a set of optimization problems as fast as I can. I am using OpenMP for this task; the problem is that I don't have very much experience with parallel computing. I tried to use:
#pragma omp parallel for schedule (static)
on my for loops (without enough understanding of how it's supposed to work), and it worked very well for some of my code but crashed another portion of it.
Here is an example of a function that crashes because of the omp directive:
void PlanarizationConstraint::fillSparseMatrix(const Optimizer& opt, vector<T>& elements, double mu)
{
    int size = 3;

    #pragma omp parallel for schedule (static)
    for (int i = 0; i < opt.FVIc.outerSize(); i++)
    {
        int index = 3*i;
        Eigen::Matrix<double,3,3> Qxyz = Eigen::Matrix<double,3,3>::Zero();
        for (SpMat::InnerIterator it(opt.FVIc,i); it; ++it)
        {
            int face = it.row();
            for (int n = 0; n < size; n++)
            {
                Qxyz.row(n) += N(face,n)*N.row(face);
                elements.push_back(T(index+n,offset+face,(1 - mu)*N(face,n)));
            }
        }
        for (int n = 0; n < size; n++)
        {
            for (int k = 0; k < size; k++)
            {
                elements.push_back(T(index+n,index+k,(1-mu)*Qxyz(n,k)));
            }
        }
    }

    #pragma omp parallel for schedule (static)
    for (int j = 0; j < opt.VFIc.outerSize(); j++)
    {
        elements.push_back(T(offset+j,offset+j,opt.fvi[j]));
        for (SpMat::InnerIterator it(opt.VFIc,j); it; ++it)
        {
            int index = 3*it.row();
            for (int n = 0; n < size; n++)
            {
                elements.push_back(T(offset+j,index+n,N(j,n)));
            }
        }
    }
}
And here is an example of code that works very well with those directives (and is faster because of them):
Eigen::MatrixXd Optimizer::OptimizeLLGeneral()
{
    ConstraintsManager manager;
    SurfaceConstraint surface(1,true);
    PlanarizationConstraint planarization(1,true,3^Nv,Nf);
    manager.addConstraint(&surface);
    manager.addConstraint(&planarization);
    double mu = mu0;
    for (int k = 0; k < iterations; k++)
    {
        #pragma omp parallel for schedule (static)
        for (int j = 0; j < VFIc.outerSize(); j++)
        {
            manager.calcVariableMatrix(*this,j);
        }

        #pragma omp parallel for schedule (static)
        for (int i = 0; i < FVIc.outerSize(); i++)
        {
            Eigen::MatrixXd A = Eigen::Matrix<double, 3, 3>::Zero();
            Eigen::MatrixXd b = Eigen::Matrix<double, 1, 3>::Zero();
            manager.addLocalMatrixComponent(*this,i,A,b,mu);
            Eigen::VectorXd temp = b.transpose();
            Q.row(i) = A.colPivHouseholderQr().solve(temp);
        }
        mu = r*mu;
    }
    return Q;
}
My question is: what makes one function work so well with the omp directive, and what makes the other function crash? What is the difference that makes the omp directive behave differently?
Before using OpenMP, you pushed data onto the vector elements one item at a time. With OpenMP, however, several threads run the code in the for loop in parallel. When more than one thread pushes data onto the vector elements at the same time, and there is nothing to ensure that one thread does not start pushing before another one finishes, problems will happen. That's why your code crashes.
To solve this problem, you could use local buffer vectors. Each thread first pushes its data onto its own private buffer vector; afterwards you concatenate these buffer vectors into a single vector.
You will notice that this method cannot maintain the original order of the data elements in the vector elements. If you want that, you could calculate the expected index of each data element and assign the data to the right position directly.
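A minimal sketch of the local-buffer approach described above, using a placeholder element type and loop body rather than your actual code:
#include <vector>

void fill_in_parallel(std::vector<int> &elements)
{
    #pragma omp parallel
    {
        std::vector<int> local;                     // private buffer, one per thread
        #pragma omp for nowait
        for (int i = 0; i < 1000; i++)
            local.push_back(i);                     // no race: each thread only touches its own buffer
        #pragma omp critical
        elements.insert(elements.end(), local.begin(), local.end());   // merge the buffers one thread at a time
    }
}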
UPDATE: OpenMP provides APIs to let you know how many threads are used and which thread you are currently on. See omp_get_max_threads() and omp_get_thread_num() for more info.
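For example, they could be used to keep one buffer per thread in a pre-allocated array and merge the buffers afterwards (again just a sketch with placeholder types):
#include <omp.h>
#include <vector>

void fill_with_per_thread_buffers(std::vector<int> &elements)
{
    std::vector<std::vector<int>> buffers(omp_get_max_threads());   // one buffer per possible thread

    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        buffers[omp_get_thread_num()].push_back(i);                  // each thread writes only to its own slot

    for (const auto &buf : buffers)                                  // sequential merge after the parallel loop
        elements.insert(elements.end(), buf.begin(), buf.end());
}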