I am learning how to use OpenMP with C++ using GCC 6.2.1, and I tested the following code:
#include <stdio.h>
#include <omp.h>
#include <iostream>

int b = 10;

int main()
{
    int array[8];

    std::cout << "Test with #pragma omp simd linear:\n";
    #pragma omp simd linear(b)
    for (int n = 0; n < 8; ++n) array[n] = b;
    for (int n = 0; n < 8; ++n) printf("Iteration %d: %d\n", n, array[n]);

    std::cout << "Test with #pragma omp parallel for linear:\n";
    #pragma omp parallel for linear(b)
    for (int n = 0; n < 8; ++n) array[n] = b;
    for (int n = 0; n < 8; ++n) printf("Iteration %d: %d\n", n, array[n]);
}
In both cases I expected a list of numbers going from 10 to 17; however, this was not the case. The #pragma omp simd linear(b) is outright ignored, printing only 10 for every element of array. For #pragma omp parallel for linear(b) the program outputs 10,10,12,12,14,14,16,16.
I compile the file using g++ -fopenmp -Wall main.cpp -o main.o. How can I fix this?
EDIT: Reading the specification more carefully, I found that the linear clause overwrites the initial value with the last value obtained (i.e. if we start with b=10, after the first loop we have b=17).
However, the program runs correctly if I add schedule(dynamic) to the parallel for loops. Why would I have to specify that clause in order to get a correct execution?
The OpenMP specification says:
The linear clause declares one or more list items to be private and to
have a linear relationship with respect to the iteration space of a
loop associated with the construct on which the clause appears.
This is only a hint to the compiler, indicating the linear behaviour of a variable in a loop, but in your code b is not incremented at all. That is the reason you always get 10 in the first loop. So the strange results are not the compiler's fault. To correct it you have to use
array[n]=b++;
On the other hand, for the #pragma omp parallel for linear(b) loop, OpenMP calculates the starting b value for each thread (based on the linear relationship), but this value is still not increased within a given thread. In your case, with 4 threads each handling 2 consecutive iterations, thread 0 starts with b=10, thread 1 with b=12, and so on, which explains the 10,10,12,12,14,14,16,16 output. So, depending on the number of threads used, you will see a different number of "steps".
In the case of the schedule(dynamic) clause, the chunk_size is 1, so each iteration forms its own chunk. The starting b value is then recalculated by OpenMP for every chunk, so you only get correct values.
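Putting the pieces together, here is a minimal sketch of the corrected first loop (same variables as in the question); with the increment in place, the loop body matches the linear(b) declaration and array receives 10 through 17:

#include <stdio.h>

int b = 10;

int main()
{
    int array[8];

    // b is incremented by 1 per iteration, which is exactly the (default)
    // linear step declared by linear(b), so array gets 10, 11, ..., 17.
    #pragma omp simd linear(b)
    for (int n = 0; n < 8; ++n)
        array[n] = b++;

    for (int n = 0; n < 8; ++n)
        printf("Iteration %d: %d\n", n, array[n]);
}

The same change (array[n] = b++;) applies to the parallel for linear(b) loop, where each thread then continues the sequence from the starting value OpenMP computes for it.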
I read here that sequential memory consistency (seq_cst) "might be needed" to make sure an atomic update is viewed by all threads consistently in an OpenMP parallel region.
Consider the following MWE, which is admittedly trivial and could be realized with a reduction rather than atomics, but which illustrates my question that arose in a more complex piece of code:
#include <iostream>

int main()
{
    double a = 0;

    #pragma omp parallel for
    for (int i = 0; i < 10000000; ++i)
    {
        #pragma omp atomic
        a += 5.5;
    }

    std::cout.precision(17);
    std::cout << a << std::endl;
    return 0;
}
I compiled this with g++ -fopenmp -O3 using GCC versions 6 to 12 on an Intel Core i9-9880H CPU, and then ran it using 4 or 8 threads, which always correctly prints:
55000000
When adding seq_cst to the atomic directive, the result is exactly the same. I would have expected the code without seq_cst to (occasionally) produce smaller results due to race conditions / outdated memory view. Is this hardware dependent? Is the code guaranteed to be free of race conditions even without seq_cst, and if so, why? Would the answer be different when using a compiler that was still based on OpenMP 3.1, as that apparently worked somewhat differently?
This code that I have written creates 2 threads and a for loop that iterates 10000 times, but the value of x at the end comes out near 5000 instead of 10000. Why is that happening?
#include <unistd.h>
#include <stdio.h>
#include <sys/time.h>
#include "omp.h"

using namespace std;

int x = 0;

int main(){
    omp_set_num_threads(2);

    #pragma omp parallel for
    for (int i = 0; i < 10000; i++){
        x += 1;
    }

    printf("x is: %d\n", x);
}
x is not an atomic type and is read and written in different threads. (Thinking that int is an atomic type is a common misconception.)
The behaviour of your program is therefore undefined.
Using std::atomic<int> x; is the fix.
The reason is that when multiple threads access the same variable, race conditions can occur.
The operation x+=1 can be understood as x = x + 1: you first read the value of x and then write x + 1 back to x. When two threads run and operate on the same value of x, the following can happen: thread A reads the value of x, which is 0. Thread B reads the value of x, which is still 0. Then thread A writes 0+1 to x, and then thread B writes 0+1 to x. Now you have missed one increment and x is just 1 instead of 2. A fix for this problem might be to use an atomic_int.
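As a minimal sketch of that suggestion, the counter from the question could be declared as std::atomic<int> (note that a later answer argues for an OpenMP reduction or OpenMP atomics instead):

#include <atomic>
#include <stdio.h>
#include <omp.h>

std::atomic<int> x{0};   // every += on x is now an atomic read-modify-write

int main(){
    omp_set_num_threads(2);

    #pragma omp parallel for
    for (int i = 0; i < 10000; i++){
        x += 1;          // no lost updates, regardless of thread interleaving
    }

    printf("x is: %d\n", x.load());   // prints 10000
}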
Modifying one (shared) value by multiple threads is a race condition and leads to wrong results. If multiple threads work with one value, all of them must only read the value.
The idiomatic solution is to use an OpenMP reduction as follows:
#pragma omp parallel for reduction(+:x)
for (int i = 0; i < 10000; i++){
    x += 1;
}
Internally, each thread has its own x and they are added together after the loop.
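Conceptually, the reduction behaves roughly like this hand-written sketch (not what the compiler literally generates, just an illustration of the per-thread copies):

int x = 0;
#pragma omp parallel
{
    int x_local = 0;              // private partial sum for this thread
    #pragma omp for
    for (int i = 0; i < 10000; i++){
        x_local += 1;             // no sharing, so no race
    }
    #pragma omp atomic            // combine partial sums, once per thread
    x += x_local;
}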
Using atomics is an alternative, but it will perform significantly worse. Atomic operations are more costly in themselves and also very bad for caches.
If you use atomics, you should use OpenMP atomics, which are applied to the operation, not the variable, i.e.
#pragma omp parallel for
for (int i = 0; i < 10000; i++){
    #pragma omp atomic
    x += 1;
}
You should not, as other answers suggest, use C++11 atomics. Using them is explicitly unspecified behavior in OpenMP. See this question for details.
I've looked at the official definitions, but I'm still quite confused.
firstprivate: Specifies that each thread should have its own instance of a variable, and that the variable should be initialized with the value of the variable, because it exists before the parallel construct.
To me, that sounds a lot like private. I've looked for examples, but I don't seem to understand how it's special or how it can be used.
lastprivate: Specifies that the enclosing context's version of the variable is set equal to the private version of whichever thread executes the final iteration (for-loop construct) or last section (#pragma sections).
I feel like I understand this one a bit better because of the following example:
#pragma omp parallel
{
    #pragma omp for lastprivate(i)
    for (i = 0; i < n-1; i++)
        a[i] = b[i] + b[i+1];
}
a[i] = b[i];
So, in this example, I understand that lastprivate allows i to keep, outside of the loop, the last value it had inside it.
I just started learning OpenMP today.
private variables are not initialised, i.e. they start with random values like any other local automatic variable (and they are often implemented using automatic variables on the stack of each thread). Take this simple program as an example:
#include <stdio.h>
#include <omp.h>

int main (void)
{
    int i = 10;

    #pragma omp parallel private(i)
    {
        printf("thread %d: i = %d\n", omp_get_thread_num(), i);
        i = 1000 + omp_get_thread_num();
    }

    printf("i = %d\n", i);
    return 0;
}
With four threads it outputs something like:
thread 0: i = 0
thread 3: i = 32717
thread 1: i = 32717
thread 2: i = 1
i = 10
(another run of the same program)
thread 2: i = 1
thread 1: i = 1
thread 0: i = 0
thread 3: i = 32657
i = 10
This clearly demonstrates that the value of i is random (not initialised) inside the parallel region and that any modifications to it are not visible after the parallel region (i.e. the variable keeps its value from before entering the region).
If i is made firstprivate, then it is initialised with the value that it has before the parallel region:
thread 2: i = 10
thread 0: i = 10
thread 3: i = 10
thread 1: i = 10
i = 10
Still modifications to the value of i inside the parallel region are not visible after it.
You already know about lastprivate (and it is not applicable to the simple demonstration program as it lacks worksharing constructs).
So yes, firstprivate and lastprivate are just special cases of private. The first one results in bringing in values from the outside context into the parallel region while the second one transfers values from the parallel region to the outside context. The rationale behind these data-sharing classes is that inside the parallel region all private variables shadow the ones from the outside context, i.e. it is not possible to use an assignment operation to modify the outside value of i from inside the parallel region.
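Putting the two together, here is a minimal sketch with a worksharing loop, so that lastprivate applies as well (the variables t and last are made up for the example): firstprivate brings the outer value of t into every thread, and lastprivate copies the value of last from the sequentially final iteration back out:

#include <stdio.h>

int main (void)
{
    int t = 10;      // copied into each thread by firstprivate
    int last = -1;   // written back from the last iteration by lastprivate

    #pragma omp parallel for firstprivate(t) lastprivate(last)
    for (int i = 0; i < 8; i++)
    {
        last = t + i;   // every thread starts from t == 10
    }

    printf("last = %d\n", last);   // prints 17 (value from iteration i == 7)
    return 0;
}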
Note that using a local variable before it is initialised is undefined behaviour in C++, which is why the private copies of i above hold indeterminate values; compilers will typically warn about it.
Does the OpenMP standard guarantee #pragma omp simd to work, i.e. should the compilation fail if the compiler can't vectorize the code?
#include <cstdint>

void foo(uint32_t r[8], uint16_t* ptr)
{
    const uint32_t C = 1000;
    #pragma omp simd
    for (int j = 0; j < 8; ++j)
        if (r[j] < C)
            r[j] = *(ptr++);
}
gcc and clang fail to vectorize this but do not complain at all (unless you use -fopt-info-vec-optimized-missed and the like).
No, it is not guaranteed. Relevant portions of the OpenMP 4.5 standard that I could find (emphasis mine):
(1.3) When any thread encounters a simd construct, the iterations of the loop associated with the construct may be executed concurrently using the SIMD lanes that are available to the thread.
(2.8.1) The simd construct can be applied to a loop to indicate that the loop can be transformed into a SIMD loop (that is, multiple iterations of the loop can be executed concurrently using SIMD instructions).
(Appendix C) The number of iterations that are executed concurrently at any given time is implementation defined.
(1.2.7) implementation defined: Behavior that must be documented by the implementation, and is allowed to vary among different compliant implementations. An implementation is allowed to define this behavior as unspecified.
I have the following fragment of code that is giving me a stack overflow error:
#pragma omp parallel shared(Mo1, Mo2, sum_normalized_p_gn, Data, Mean_Out, Covar_Out, Prior_Out, det) private(i) num_threads( number_threads )
{
    // every thread has a new copy
    double* normalized_p_gn = (double*)malloc(NMIX*sizeof(double));

    #pragma omp critical
    {
        int id = omp_get_thread_num();
        int threads = omp_get_num_threads();
        mexEvalString("drawnow");
    }

    #pragma omp for
    // some parallel process.....
}
The variables listed in the shared clause are allocated with malloc, and they consume a large amount of memory.
There are 2 questions regarding the above code:
1) Why would this generate the stack overflow error (i.e. segmentation fault) before it goes into the parallel for loop? It works fine when it runs in sequential mode.
2) Am I right to dynamically allocate memory for each thread like "normalized_p_gn" above?
In place of malloc, use mxMalloc in mex files (see here). Don't forget to mxFree when you're finished with the memory.
One possibility which we can't exclude, since your code snippet doesn't reveal any numbers, is that you are simply trying to allocate too much memory when you run this in parallel. If you can confirm that this is not your problem, comment or edit your question and I'll take another look.
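If it helps, here is a rough sketch of that suggestion for the snippet above (assumed to live inside mexFunction; NMIX and number_threads are taken from the question). The per-thread buffers are allocated with mxMalloc on the main thread before the parallel region, since the MEX API is generally not safe to call from worker threads, and released with mxFree afterwards:

// one buffer per thread, allocated up front on the main thread
double** normalized_p_gn = (double**)mxMalloc(number_threads * sizeof(double*));
for (int t = 0; t < number_threads; ++t)
    normalized_p_gn[t] = (double*)mxMalloc(NMIX * sizeof(double));

#pragma omp parallel num_threads(number_threads)
{
    double* my_buf = normalized_p_gn[omp_get_thread_num()];
    // ... the parallel work uses my_buf instead of a thread-local malloc ...
}

// release the buffers once the parallel region is done
for (int t = 0; t < number_threads; ++t)
    mxFree(normalized_p_gn[t]);
mxFree(normalized_p_gn);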