Is this a data race? OpenMP - C++

I am a bit confused about whether there is a data race on variable k. To my understanding, only one thread will execute the single construct, but since nowait is specified, the other threads will start executing the for construct immediately. Is atomic here enough to prevent any potential data race?
#include <stdio.h>
#include <omp.h>
#define Nthreads 8

void main()
{
    int n = 9, l, k = n, i, j;
    k += n + 1;
    omp_set_num_threads(Nthreads);
    #pragma omp parallel default(none) shared(n, k) private(j)
    {
        #pragma omp single nowait
        {
            k = k + 5;
        }
        #pragma omp for nowait
        for (i = 0; i < n; i++)
        {
            #pragma omp atomic
            k += n + i + 1;
        }
    }
}

Assuming the rest of the code expresses your intended algorithm correctly, it almost is enough. As you probably suspected yourself, you need to protect k's update in the single region as well: because of the nowait, other threads may already be executing the atomic updates in the for loop while the single thread modifies k unprotected.
j is an unused variable; I do not know whether you simply forgot to delete it or were trying to implement something else.
I would have used a reduction clause for k instead of a synchronization construct. I do not know whether it would have been better or faster, though.
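For illustration, here is a minimal sketch of that reduction variant (my rewrite, not benchmarked): reduction(+:k) gives every thread a private copy of k initialized to 0 and adds all copies back into the original k at the end of the parallel region, so neither the atomic nor any protection in the single region is needed.
#pragma omp parallel default(none) shared(n) reduction(+:k)
{
    #pragma omp single nowait
    k += 5;                 /* updates this thread's private copy of k */

    #pragma omp for nowait
    for (int i = 0; i < n; i++)
        k += n + i + 1;     /* private copy again, so no atomic is needed */
}
/* after the region: k == its old value + 5 + the sum from the loop */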

Related

OpenMP nested loop task parallelism, counter not giving correct result

I am pretty new to OpenMP. I am trying to parallelize the nested loop using tasking, but it doesn't give me the correct counter output. The sequential output is "Total pixel = 100000000". Can anyone help me with that?
Note: I have done this using #pragma omp parallel for reduction (+:pixels_inside) private(i,j). This works fine; now I want to use tasking.
What I have tried so far:
#include <iostream>
#include <omp.h>
using namespace std;

int main() {
    int total_steps = 10000;
    int i, j;
    int pixels_inside = 0;
    omp_set_num_threads(4);
    //#pragma omp parallel for reduction (+:pixels_inside) private(i,j)
    #pragma omp parallel
    #pragma omp single private(i)
    for (i = 0; i < total_steps; i++) {
        #pragma omp task private(j)
        for (j = 0; j < total_steps; j++) {
            pixels_inside++;
        }
    }
    cout << "Total pixel = " << pixels_inside << endl;
    return 0;
}
First of all, you need to declare for OpenMP which variables you are using and what protection they have. Generally speaking, your code has default(shared) because you didn't specify otherwise. This makes all variables accessible at the same memory location for all threads.
You should use something like this:
#pragma omp parallel default(none) shared(total_steps, pixels_inside)
[...]
#pragma omp task private(j) default(none) shared(total_steps, pixels_inside)
Now only the variables that are actually needed are accessible to the threads.
Secondly, the main problem is that you don't have critical section protection. While the threads are running, they may want to use the shared variable at the same time, and then a race condition happens. For example, say you have threads A and B and a variable x accessible to both (a.k.a. a shared variable). Now let's say A adds 2 and B adds 3 to the variable. Threads aren't the same speed, so this may happen: A reads x=0, B reads x=0, A computes 0+2, B computes 0+3, B writes x=3 back to memory, A writes x=2 back to memory. In the end x = 2. The same happens with pixels_inside: a thread reads the variable, adds 1 and writes the result back to where it got it from. To overcome this you need to ensure critical section protection:
#pragma omp critical
{
    //Code with shared memory
    pixels_inside++;
}
You didn't need critical section protection with reduction, because variables listed in the reduction clause get this protection automatically.
Now your code should look like this:
#include <iostream>
#include <omp.h>
using namespace std;

int main() {
    int total_steps = 10000;
    int i, j;
    int pixels_inside = 0;
    omp_set_num_threads(4);
    //#pragma omp parallel for reduction (+:pixels_inside) private(i,j)
    #pragma omp parallel default(none) shared(total_steps, pixels_inside)
    #pragma omp single private(i)
    for (i = 0; i < total_steps; i++) {
        #pragma omp task private(j) default(none) shared(total_steps, pixels_inside)
        for (j = 0; j < total_steps; j++) {
            #pragma omp critical
            {
                pixels_inside++;
            }
        }
    }
    cout << "Total pixel = " << pixels_inside << endl;
    return 0;
}
That said, I would suggest using reduction, as it performs better and the runtime has ways to optimize that kind of calculation.
As @tartarus already explained, you have a race condition in your code and it is much better to avoid it by using reduction. If you want to do the same as #pragma omp parallel for reduction (+:pixels_inside) private(i,j) does, but using tasks, you have to use the following:
#pragma omp parallel
#pragma omp single
#pragma omp taskloop reduction (+:pixels_inside) private(i,j)
for (i = 0; i < total_steps; i++) {
    for (j = 0; j < total_steps; j++) {
        pixels_inside++;
    }
}
In this version fewer tasks are created and reduction is used instead of a critical section, therefore the performance will be much better (similar to what you can obtain by using #pragma omp parallel for).
UPDATE (comment on performance): I guess this is just a simplified example, not the real code you want to parallelize. If the performance gain is not good enough, it most probably means that the parallel overhead is bigger than the work to be done. In that case try to parallelize a bigger part of your code. Note that parallel overheads are typically bigger with tasks (compared to #pragma omp parallel for).

replacement for #pragma omp critical (C++)

I am using OpenMP on a nested loop which works like this:
#pragma omp parallel shared(vector1) private(i,j)
{
    #pragma omp for schedule(dynamic)
    for (i = 0; i < vector1.size(); ++i) {
        //some code here
        for (j = 0; j < vector1.size(); ++j) {
            //some other code goes here
            #pragma omp critical
            A += B;
        }
        C += A;
    }
}
The problem here is that my code does a lot of the computation in the A += B part. Therefore, by making it critical, I am not achieving the speedup I would like. (In fact there appears to be some overhead, since my program takes longer to execute than the sequential version.)
I tried using
#pragma omp reduction private(B) reduction(+:A)
A+=B
This speeds up the execution time; however, it seems that it does not take care of race conditions the way the critical clause does, since I am not getting the same results for A.
Is there an alternative to this I can try?
Unless you want to go through the trouble of making your Vector3 class thread-safe or rewriting your operations to use std::atomic<Vector3>, both of which would still suffer from performance drawbacks (although not as serious as a critical section), you can mimic the behaviour of an OpenMP reduction yourself:
#pragma omp parallel // no need to declare variables declared outside/inside as shared/private
{
    Vector3 A{}, LocalC{}; // both thread-private
    #pragma omp for schedule(dynamic)
    for (int i = 0; i < vector1.size(); ++i) { // loop variables declared here so they are private
        //some code here
        for (int j = 0; j < vector1.size(); ++j) {
            //some other code goes here
            A += B; // does not need a barrier
        }
        LocalC += A; // does not need a barrier
    }
    #pragma omp critical
    C += LocalC; // only the final accumulation into the shared C is protected
}
NB: this assumes that you don't read A within your "some code" sections, but you shouldn't anyway if you ever thought of using a reduction clause.
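As a side note (my addition, not part of the original answer): if your compiler supports OpenMP 4.0 or later, a user-defined reduction can express the same idea more directly. This is only a sketch and assumes Vector3 has an operator+= and that C is a plain variable (class members cannot appear in a reduction clause):
// Sketch: declare '+' as a reduction operation for the (assumed) Vector3 type.
#pragma omp declare reduction(+ : Vector3 : omp_out += omp_in) \
    initializer(omp_priv = Vector3{})

#pragma omp parallel reduction(+ : C)
{
    Vector3 A{}; // still thread-private, as above
    #pragma omp for schedule(dynamic)
    for (int i = 0; i < vector1.size(); ++i) {
        //some code here
        for (int j = 0; j < vector1.size(); ++j) {
            //some other code goes here
            A += B;
        }
        C += A; // C is a per-thread reduction copy here; copies are combined at the region's end
    }
}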

OpenMP/C++: Parallel for loop with reduction afterwards - best practice?

Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
    const std::set<int>& cluster = clusters[i];
    // ... expensive calculations ...
    for (int j : cluster)
        velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f does not use velocity.
A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.
I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j); operation. This gives me a compile error (it might have something to do with the elements being of type Eigen::Vector3d, or with velocity being a class member). Also, I read that atomic operations are very slow compared to having a private variable for each thread and doing a reduction at the end. So that's what I would like to do, I think.
I have come up with this:
#pragma omp parallel
{
    // these variables are local to each thread
    std::vector<Eigen::Vector3d> velocity_local(velocity.size());
    std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));

    #pragma omp for
    for (size_t i = 0; i < clusters.size(); ++i)
    {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster)
            velocity_local[j] += f(j); // save results from the previous calculations
    }

    // now each thread can save its results to the global variable
    #pragma omp critical
    {
        for (size_t i = 0; i < velocity_local.size(); ++i)
            velocity[i] += velocity_local[i];
    }
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for loop. Also the question whether this is the best approach still holds.
Edit: As requested in a comment: the reduction clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
    velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++)
You're doing an array reduction. I have described this several times (e.g. "reducing an array in openmp" and "fill histograms array reduction in parallel with openmp without using a critical section"). You can do this with and without a critical section.
You have already done this correctly with a critical section (in your recent edit) so let me describe how to do this without a critical section.
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();
    const int vsize = velocity.size();

    #pragma omp single
    velocitya.resize(vsize*nthreads);

    std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
              Eigen::Vector3d(0,0,0));

    #pragma omp for schedule(static)
    for (size_t i = 0; i < clusters.size(); i++) {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
    }

    #pragma omp for schedule(static)
    for(int i=0; i<vsize; i++) {
        for(int t=0; t<nthreads; t++) {
            velocity[i] += velocitya[vsize*t + i];
        }
    }
}
This method requires extra care/tuning due to false sharing, which I have not addressed here.
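One possible mitigation (my own sketch, not from the original answer) is to pad each thread's slice so the slices of neighbouring threads are unlikely to share a cache line; the padding amount below is an assumption, not a tuned value:
// Sketch: a 64-byte cache line holds a bit less than three 24-byte Vector3d's,
// so a few extra elements per slice keep the slice boundaries apart.
const int pad = 8;                        // assumed padding, tune as needed
const int stride = vsize + pad;           // per-thread slice length including padding
#pragma omp single
velocitya.resize(static_cast<size_t>(stride)*nthreads, Eigen::Vector3d(0,0,0));
// ...then write to velocitya[ithread*stride + j] and merge from velocitya[stride*t + i]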
As to which method is better you will have to test.

Influence on the static scheduling overhead in OpenMP

I thought about which factors would influence the static scheduling overhead in OpenMP.
In my opinion it is influenced by:
CPU performance
specific implementation of the OpenMP run-time library
the number of threads
But am I missing further factors? Maybe the size of the tasks, ...?
And furthermore: Is the overhead linearly dependent on the number of iterations?
In this case I would expect that, with static scheduling and 4 cores, the overhead increases linearly with 4*i iterations. Correct so far?
EDIT:
I am only interested in the static (!) scheduling overhead itself. I am not talking about thread start-up overhead or the time spent in synchronisation and critical sections.
You need to separate the overhead for OpenMP to create a team/pool of threads from the overhead for each thread to operate on its separate set of iterations in a for loop.
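As a rough illustration (my addition, not from the original answer), you can separate the two with omp_get_wtime(); the numbers are only indicative, and an optimizing compiler may simplify the empty loop:
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int N = 1 << 20;

    #pragma omp parallel
    {}                                 // warm-up: create the thread pool once

    double t0 = omp_get_wtime();
    #pragma omp parallel
    {}                                 // cost of (re)entering a parallel region
    double t1 = omp_get_wtime();

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {}     // region entry plus static scheduling of N iterations
    double t2 = omp_get_wtime();

    printf("region: %g s, region + static loop: %g s\n", t1 - t0, t2 - t1);
    return 0;
}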
Static scheduling is easy to implement by hand (which is sometimes very useful). Let's consider the two most important static schedules, schedule(static) and schedule(static,1), and then compare them to schedule(dynamic,chunk).
#pragma omp parallel for schedule(static)
for(int i=0; i<N; i++) foo(i);
is equivalent to (but not necessarily equal to)
#pragma omp parallel
{
    int start  = omp_get_thread_num()*N/omp_get_num_threads();
    int finish = (omp_get_thread_num()+1)*N/omp_get_num_threads();
    for(int i=start; i<finish; i++) foo(i);
}
and
#pragma omp parallel for schedule(static,1)
for(int i=0; i<N; i++) foo(i);
is equivalent to
#pragma omp parallel
{
    int ithread  = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    for(int i=ithread; i<N; i+=nthreads) foo(i);
}
From this you can see that it's quite trivial to implement static scheduling and so the overhead is negligible.
On the other hand, if you want to implement schedule(dynamic) (which is the same as schedule(dynamic,1)) by hand, it's more complicated:
int cnt = 0;
#pragma omp parallel
for(int i=0;;) {
    #pragma omp atomic capture
    i = cnt++;
    if(i>=N) break;
    foo(i);
}
This requires OpenMP >=3.1. If you wanted to do this with OpenMP 2.0 (for MSVC) you would need to use critical like this
int cnt = 0;
#pragma omp parallel
for(int i=0;;) {
    #pragma omp critical
    i = cnt++;
    if(i>=N) break;
    foo(i);
}
Here is an equivalent to schedule(dynamic,chunk) (I have not optimized this by using atomic accesses):
int cnt = 0;
int chunk = 5;
#pragma omp parallel
{
    int start, finish;
    do {
        #pragma omp critical
        {
            start = cnt;
            finish = cnt+chunk < N ? cnt+chunk : N;
            cnt += chunk;
        }
        for(int i=start; i<finish; i++) foo(i);
    } while(finish<N);
}
Clearly, dynamic scheduling needs atomic (or critical) access to a shared counter, and that is going to cause more overhead. This also shows why using larger chunks for schedule(dynamic,chunk) can reduce the overhead.
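For completeness, here is a sketch (my addition) of the same chunked schedule using an atomic capture block instead of the critical section; it requires OpenMP >= 3.1:
int cnt = 0;
int chunk = 5;
#pragma omp parallel
{
    int start, finish;
    do {
        // grab the next chunk of the shared counter atomically
        #pragma omp atomic capture
        { start = cnt; cnt += chunk; }
        finish = start + chunk < N ? start + chunk : N;
        for(int i=start; i<finish; i++) foo(i);
    } while(finish < N);
}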

OpenMP: having a complete 'for' loop into each thread

I have this code:
#pragma omp parallel
{
    #pragma omp single
    {
        for (int i=0; i<given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
    }
    #pragma omp single
    {
        for (int i=0; i<given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
    }
}
// and so on... up to 5 or 6 of myObject_x

// Then I sum up the buffers and do something with them
float result;
for (int i=0; i<given_number; ++i)
    result = myBuffer_1[i] + myBuffer_2[i];
// do something with result
If I run this code, I get what I expect but the CPU usage looks quite high. Instead, if I run it normally without OpenMP I get the same results but the CPU usage is much lower, despite running in a single thread.
I don't want to specify a number of threads; I would like the program to pick the maximum number of threads according to the CPU's capabilities, but I want each for loop to run entirely in its own thread. How can I do that?
Also, my expectation is that the for loop for myBuffer_1 runs in one thread, the other for loop runs in another thread, and the rest runs in the 'master' thread. Is this correct?
#pragma omp single has an implicit barrier at the end; you need to use #pragma omp single nowait if you want the two single blocks to run concurrently.
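A minimal sketch of that nowait variant, reusing the buffers and objects from the question (assumption: the buffers are already allocated and nothing else writes to them):
#pragma omp parallel
{
    #pragma omp single nowait   // one thread picks this up...
    {
        for (int i=0; i<given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
    }
    #pragma omp single nowait   // ...while another thread can pick this up immediately
    {
        for (int i=0; i<given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
    }
}   // implicit barrier at the end of the parallel region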
However, for your requirement, using sections might be a better idea:
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            for (int i=0; i<given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
        }
        #pragma omp section
        {
            for (int i=0; i<given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
        }
    }
}