It seems to compile, but I just wanted to ask if there are other considerations or reasons why this might not work as expected:
std::vector<int> myvec(100, 1);
#pragma omp parallel for schedule(static)
for (int i = 0; i < (int) myvec.size(); ++i)
{
    assert(myvec[i] == 1);
}
Related
So I started using OpenMP (multithreading) to increase the speed of my matrix multiplication, and I noticed something odd: when I turn off OpenMP support (in Visual Studio 2019), my nested for-loop completes 2x faster. So I removed "#pragma omp critical" to test whether it slows down the process significantly, and the process ran 4x faster than before (with OpenMP support on).
Here's my question: is "#pragma omp critical" important in a nested loop? Can't I just skip it?
#pragma omp parallel for collapse(3)
for (int i = 0; i < this->I; i++)
{
    for (int j = 0; j < A.J; j++)
    {
        m.matrix[i][j] = 0;
        for (int k = 0; k < A.I; k++)
        {
            #pragma omp critical
            m.matrix[i][j] += this->matrix[i][k] * A.matrix[k][j];
        }
    }
}
Here's my question: is "#pragma omp critical" important in a nested loop? Can't I just skip it?
If the matrices m, this, and A are all different, you do not need any critical region. Instead, you only need to ensure that each thread writes to a different position of the matrix m, as follows:
#pragma omp parallel for collapse(2)
for (int i = 0; i < this->I; i++)
{
    for (int j = 0; j < A.J; j++)
    {
        m.matrix[i][j] = 0;
        for (int k = 0; k < A.I; k++)
        {
            m.matrix[i][j] += this->matrix[i][k] * A.matrix[k][j];
        }
    }
}
The collapse clause assigns each (i, j) pair to exactly one thread, so multiple threads never write to the same position of the matrix m (i.e., there is no race condition).
#pragma omp critical is necessary here because, with collapse(3), iterations that share the same (i, j) but have different k values can be assigned to different threads, so two threads could update the same m.matrix[i][j]. It hurts performance because only one thread at a time can execute the protected statement.
This would likely be better without the collapse clause (then you can remove the #pragma omp critical): accumulate the sum into a temporary local variable, then store it in m.matrix[i][j] after the k loop finishes, as sketched below.
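A minimal sketch of that suggestion, reusing the member names from the question (I, J, matrix) and assuming the element type is double:
#pragma omp parallel for
for (int i = 0; i < this->I; i++)
{
    for (int j = 0; j < A.J; j++)
    {
        double sum = 0.0;  // thread-private accumulator, no shared writes in the k loop
        for (int k = 0; k < A.I; k++)
        {
            sum += this->matrix[i][k] * A.matrix[k][j];
        }
        m.matrix[i][j] = sum;  // one write per (i, j), so no critical section is needed
    }
}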
I'm trying to make my serial program parallel with OpenMP. Here is the code, where I have a big parallel region with a number of internal "#pragma omp for" sections. In the serial version I have a function fftw_shift() which has "for" loops inside it too.
The question is how to rewrite the fftw_shift() function properly so that the threads that already exist in the outer parallel region can split the "for" loops inside it, without creating nested threads.
I'm not sure my implementation works correctly. There is the option of inlining the whole function into the parallel region, but I'm trying to work out how to handle it in the situation described.
int fftw_shift(fftw_complex *pulse, fftw_complex *shift_buf, int array_size)
{
    int j = 0; // counter
    if ((pulse != nullptr) || (shift_buf != nullptr)) {
        if (omp_in_parallel()) {
            // shift the array
            #pragma omp for private(j) // schedule(dynamic)
            for (j = 0; j < array_size / 2; j++) {
                // left to right
                shift_buf[(array_size / 2) + j][REAL] = pulse[j][REAL]; // real
                shift_buf[(array_size / 2) + j][IMAG] = pulse[j][IMAG]; // imaginary
                // right to left
                shift_buf[j][REAL] = pulse[(array_size / 2) + j][REAL]; // real
                shift_buf[j][IMAG] = pulse[(array_size / 2) + j][IMAG]; // imaginary
            }
            // rewrite the array
            #pragma omp for private(j) // schedule(dynamic)
            for (j = 0; j < array_size; j++) {
                pulse[j][REAL] = shift_buf[j][REAL]; // real
                pulse[j][IMAG] = shift_buf[j][IMAG]; // imaginary
            }
            return 0;
        }
    }
....
#pragma omp parallel firstprivate(x, phase) if(array_size >= OMP_THREASHOLD)
{
    // First half-step
    #pragma omp for schedule(dynamic)
    for (x = 0; x < array_size; x++) {
        ..
    }
    // Forward FFT
    fftw_shift(pulse_x, shift_buf, array_size);
    #pragma omp master
    {
        fftw_execute(dft);
    }
    #pragma omp barrier
    fftw_shift(pulse_kx, shift_buf, array_size);
    ...
}
If you call fftw_shift from a parallel region - but not from within a work-sharing construct (i.e. not inside a parallel for) - then you can use omp for just as if you were lexically inside the parallel region. This is called an orphaned directive.
However, your loops just copy data, so they are limited by memory bandwidth; depending on your system, don't expect a perfect speedup.
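For illustration, here is a minimal, self-contained sketch of the same orphaned-directive pattern (the swap_halves() function and the data are made up, not taken from the question):
#include <omp.h>
#include <iostream>
#include <vector>

// Orphaned work-sharing: the omp for has no enclosing parallel region in this
// function; it binds to whatever parallel region is active at the call site.
void swap_halves(std::vector<double> &dst, const std::vector<double> &src)
{
    const int n = static_cast<int>(src.size());
    #pragma omp for
    for (int j = 0; j < n / 2; j++) {
        dst[j] = src[n / 2 + j];
        dst[n / 2 + j] = src[j];
    }
    // implicit barrier at the end of the omp for
}

int main()
{
    std::vector<double> a = {0, 1, 2, 3, 4, 5, 6, 7}, b(a.size());

    #pragma omp parallel
    {
        swap_halves(b, a);  // the threads of the enclosing region share the loop
    }

    for (double v : b)
        std::cout << v << ' ';
    std::cout << '\n';
}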
What (if any) differences are there between using:
#pragma omp parallel
{
    #pragma omp for simd
    for (int i = 0; i < 100; ++i)
    {
        c[i] = a[i] ^ b[i];
    }
}
and:
#pragma omp parallel for simd
for (int i = 0; i < 100; ++i)
{
    c[i] = a[i] ^ b[i];
}
Or does the compiler (ICC) care?
I know that the first one defines a parallel region and then a for loop to be divided up, and that you can do multiple things after the loop. Please do correct me if I'm wrong, I'm still learning the ways of OpenMP.
But when would you use one way or the other?
Simply put, if you only have one for-loop that you want to parallelise, use #pragma omp parallel for simd.
If you want to parallelise multiple for-loops or add any other parallel routines before or after the current for-loop, use:
#pragma omp parallel
{
    // Other parallel code
    #pragma omp for simd
    for (int i = 0; i < 100; ++i)
    {
        c[i] = a[i] ^ b[i];
    }
    // Other parallel code
}
This way you don't have to open a new parallel region for each additional loop or routine, which reduces the thread-management overhead.
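As a concrete sketch (the arrays and the second loop body are made up for illustration), two work-shared loops can reuse the same team of threads like this:
#include <vector>

void combine(const std::vector<int> &a, const std::vector<int> &b,
             std::vector<int> &c, std::vector<int> &d)
{
    const int n = static_cast<int>(a.size());

    #pragma omp parallel
    {
        // First work-shared loop; there is an implicit barrier at its end.
        #pragma omp for simd
        for (int i = 0; i < n; ++i)
            c[i] = a[i] ^ b[i];

        // Second loop reuses the same team of threads: no new parallel
        // region is created, so there is no extra fork/join overhead.
        #pragma omp for simd
        for (int i = 0; i < n; ++i)
            d[i] = c[i] | a[i];
    }
}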
Every time I try to print out the thread ID, and regardless of where I put the print statement, it always prints threadId = 0. It looks like only one thread is being created, but why? I don't see what I'm doing wrong. Also, I've checked that num_t = 16, and I've made sure to use -fopenmp when compiling.
omp_set_num_threads(num_t);
#pragma omp parallel shared(a,b,c) private(i,j,k) num_threads(num_t)
{
    #pragma omp for schedule(static)
    for (int i = 0; i < m; i++)
    {
        std::cout << omp_get_thread_num() << "\n";
        for (int j = 0; (j < n); j++)
        {
            c[i + j*m] = 0.0;
            for (int k = 0; k < q; k++)
            {
                c[i+j*m] += a[i*q + k]*b[j*q + k];
            }
        }
    }
}
To test it first, I recommend you use this:
#pragma omp parallel for private(...) shared(...) schedule(...) num_threads(X)
where "X" is the number of threads to be created. In theory, the previous line should have a similar effect to yours, but C++ compilers can be picky sometimes (especially with the "parallel" directive).
By the way, it may not be your case, but be careful with the curly braces {}: OpenMP's behaviour can differ depending on whether or not they enclose the code block.
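Applied to the loop from the question, a minimal sketch could look like this (the multiply() wrapper and the layout comment are assumptions inferred from the question's indexing, not part of the original code):
#include <omp.h>
#include <iostream>

// Assumed layout, inferred from the indexing in the question:
// a is m x q (row-major), b is n x q (row-major), c is m x n (column-major).
void multiply(const double *a, const double *b, double *c,
              int m, int n, int q, int num_t)
{
    // Combined directive: creates the team and distributes the i iterations.
    // i, j and k are declared inside the loops, so they are private automatically.
    #pragma omp parallel for schedule(static) num_threads(num_t)
    for (int i = 0; i < m; i++)
    {
        std::cout << omp_get_thread_num() << "\n";  // should now print several different IDs
        for (int j = 0; j < n; j++)
        {
            c[i + j*m] = 0.0;
            for (int k = 0; k < q; k++)
                c[i + j*m] += a[i*q + k] * b[j*q + k];
        }
    }
}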
I'm using OpenMP to improve my program's efficiency on loops.
But recently I discovered that on small loops the use of this library decreases performance and that the normal (serial) way is better.
In fact, I'd like to use OpenMP only if a condition is satisfied. My code is:
#pragma omp parallel for
for (unsigned i = 0; i < size; ++i)
    do_some_stuff ();
But what I want to do is to disable the #pragma if size is small enough, i.e.:
if (size > OMP_MIN_VALUE)
    #pragma omp parallel for
    for (unsigned i = 0; i < size; ++i)
        do_some_stuff ();
But this does not work. The workaround is to write the loop twice, but I don't want to do it that way...
if (size > OMP_MIN_VALUE)
{
    #pragma omp parallel for
    for (unsigned i = 0; i < size; ++i)
        do_some_stuff ();
}
else
{
    for (unsigned i = 0; i < size; ++i)
        do_some_stuff ();
}
What is the best way to do that?
I think you should be able to achieve the effect you're looking for by using the optional schedule clause on your parallel for directive:
#pragma omp parallel for schedule(static, OMP_MIN_VALUE)
for (unsigned i = 0; i < size; ++i)
    do_some_stuff ();
You might want to play around with different kinds of scheduling and different chunk sizes, though, to see what suits your library routines best.
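For example (the chunk size 64 and the "static,256" setting below are just illustrative guesses, not values from the answer):
// Same loop with a dynamic schedule and an explicit chunk size:
// each idle thread grabs the next chunk of 64 iterations.
#pragma omp parallel for schedule(dynamic, 64)
for (unsigned i = 0; i < size; ++i)
    do_some_stuff ();

// Or defer the choice to run time and experiment through the
// OMP_SCHEDULE environment variable (e.g. OMP_SCHEDULE="static,256").
#pragma omp parallel for schedule(runtime)
for (unsigned i = 0; i < size; ++i)
    do_some_stuff ();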