openmp critical section inside for loop - c++

I have the following code that updates something inside a for loop, with another for loop coming after it. However, I get the error "expected a declaration" at the beginning of the second loop. The problem seems to be the "critical" part, because if I delete it, the error goes away. I'm brand new to OpenMP and I was following an example here: http://www.viva64.com/en/a/0054/#ID0EBUEM (refer to "5. Too many entries to critical sections"). Does anybody have any idea what I'm doing wrong here?
Besides, is it true that "If the comparison is performed before the critical section, the critical section will not be entered during all iterations of the loop"?
Another thing is that I actually want to parallelize the two loops at the same time, but since the operations inside the loops are different, I use two thread teams here, hoping that if there are threads that are not needed in the first loop, they can start executing the second loop immediately. Will this work?
double maxValue = 0.0;
#pragma omp parallel for schedule (dynamic) //first loop
for (int i = 0; i < n; i++){
    if (some condition satisfied)
    {
        #pragma omp atomic
        count++;
        continue;
    }
    double tmp = getValue(i);
    #pragma omp flush(maxValue)
    if (tmp > maxValue){
        #pragma omp critical(updateMaxValue){
            if (tmp > maxValue){
                maxValue = tmp;
                //update some other variables
                ...
            }
        }
    }
}
#pragma omp parallel for schedule (dynamic) //second loop
for (int i = 0; i < m; i++){
    //some operations...
}
#pragma omp barrier
Sorry that I have so many questions and thanks in advance!

However, I got the error: "expected a declaration" at the beginning of the second loop.
You have a syntax error - an opening brace, if present, must be moved to a new line:
#pragma omp critical(updateMaxValue){
//                                 ~^~
should be changed to:
#pragma omp critical(updateMaxValue)
{
(You don't actually need the braces, since the if statement that follows is already a structured block.)
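For illustration, the corrected fragment could look like this (a sketch based on the question's own code and variable names):
double tmp = getValue(i);
#pragma omp flush(maxValue)
if (tmp > maxValue){
    #pragma omp critical(updateMaxValue)
    if (tmp > maxValue){   // re-check inside the critical section
        maxValue = tmp;
        //update some other variables
    }
}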
Another thing is that I actually want to parallelize the two loops at the same time, but since the operations inside the loops are different, I use two thread teams here, hoping that if there are threads that are not needed in the first loop, they can start executing the second loop immediately.
Use a single parallel region, and then a nowait clause on the first for-loop:
#pragma omp parallel
{
    #pragma omp for schedule(dynamic) nowait
    //                                ~~~~~^
    for (int i = 0; i < n; i++)
    {
        // ...
    }

    #pragma omp for schedule(dynamic)
    for (int i = 0; i < m; i++)
    {
        // ...
    }
}
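The nowait clause removes the implicit barrier at the end of the first worksharing loop, so a thread that finishes its share of the first loop can move straight on to iterations of the second one, which is the overlap you were hoping to get from two separate thread teams.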

Related

openmp tasks in nested loops

I am trying to write the following piece of code.
#pragma omp parallel
{
    int ...;  // some variables
    for (int x : map)
    {
        int ...;
        #pragma omp single
        {
            #pragma omp task firstprivate(x, ...) depend(out:a)
            {
                // assigning the variables some values
            }
            for (/* loop over j */)
            {
                #pragma omp task firstprivate(j) depend(in:a) depend(out:b)
                {
                }
                // third loop over k
                #pragma omp task depend(in:a,b)
                {
                }
            }
        }
    }
}
Is this valid? The threads are being created, but they are not entering either the second loop (the loop over j) or the third loop (I checked with print statements).
Please suggest how to correct this.
I tried printing inside the two loops and saw that nothing is printed, which implies the threads don't enter the loops at all. I was expecting the work to be distributed among the threads, but unfortunately I am unable to achieve this.
Minimum reproducible example:
(as asked in the comments I am making an example)
int a, b;
#pragma omp parallel
{
    #pragma omp single
    {
        for (auto &x : map)
        {
            #pragma omp task
            for (int i = 0; i < x.second; ++i)
            {
                vector<int> val = m2[i];
                for (int j = 0; j < val.size(); ++j)
                {
                    #pragma omp critical
                    {
                        // update a global map m3
                    }
                }
            }
        }
    }
}

replacement for #pragma omp critical (C++)

I am using OpenMP on a nested loop which works like this:
#pragma omp parallel shared(vector1) private(i,j)
{
    #pragma omp for schedule(dynamic)
    for (i = 0; i < vector1.size(); ++i){
        //some code here
        for (j = 0; j < vector1.size(); ++j){
            //some other code goes here
            #pragma omp critical
            A += B;
        }
        C += A;
    }
}
The problem here is that my code does a lot of the computation in the A += B part. Therefore, by making it critical, I am not achieving the speedup I would like. (In fact there appears to be some overhead, since my program takes longer to execute than the sequential version.)
I tried using
#pragma omp reduction private(B) reduction(+:A)
A+=B
This speeds up the execution time, however it seems that it does not take care of the race conditions the way the critical clause does, since I am not getting the same results for A.
Is there an alternative to this I can try?
Unless you want to go through the trouble of making your Vector3 class thread-safe or rewriting your operations for use with an std::atomic<Vector3>, both of which would still suffer from performance drawbacks (although not as serious as using a critical section), you can actually mimic the behaviour of OpenMP reduction:
#pragma omp parallel // no need to declare variables declared outside/inside as shared/private
{
    Vector3 A{}, LocalC{}; // both thread-private
    #pragma omp for schedule(dynamic)
    for (i = 0; i < vector1.size(); ++i){
        //some code here
        for (j = 0; j < vector1.size(); ++j){
            //some other code goes here
            A += B; // does not need a barrier
        }
        LocalC += A; // does not need a barrier
    }
    #pragma omp critical
    C += LocalC;
}
Note that this assumes you don't read A within your "some code" sections, but you shouldn't be doing that anyway if you were considering a reduction clause.
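A side note beyond the original answer: compilers supporting OpenMP 4.0 user-defined reductions let you express the same pattern with a declared reduction instead of the final critical section. A sketch, assuming Vector3 is default-constructible and provides operator+= (the vplus name is just made up here):
#pragma omp declare reduction(vplus : Vector3 : omp_out += omp_in) \
    initializer(omp_priv = Vector3{})

#pragma omp parallel reduction(vplus : C)
{
    Vector3 A{};                        // thread-private accumulator, as above
    #pragma omp for schedule(dynamic)
    for (i = 0; i < vector1.size(); ++i){
        //some code here
        for (j = 0; j < vector1.size(); ++j){
            //some other code goes here
            A += B;
        }
        C += A;                         // per-thread copies of C are combined at the end of the region
    }
}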

OpenMP/C++: Parallel for loop with reduction afterwards - best practice?

Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
    const std::set<int>& cluster = clusters[i];
    // ... expensive calculations ...
    for (int j : cluster)
        velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f does not use velocity.
A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.
I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j); operation. This gives me a compile error (it might have something to do with the elements being of type Eigen::Vector3d, or with velocity being a class member). Also, I read that atomic operations are very slow compared to having a private variable for each thread and doing a reduction at the end. So that's what I would like to do, I think.
I have come up with this:
#pragma omp parallel
{
    // these variables are local to each thread
    std::vector<Eigen::Vector3d> velocity_local(velocity.size());
    std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));

    #pragma omp for
    for (size_t i = 0; i < clusters.size(); ++i)
    {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster)
            velocity_local[j] += f(j); // save results from the previous calculations
    }

    // now each thread can save its results to the global variable
    #pragma omp critical
    {
        for (size_t i = 0; i < velocity_local.size(); ++i)
            velocity[i] += velocity_local[i];
    }
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for loop. Also the question whether this is the best approach still holds.
Edit: As requested in a comment, the reduction clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
    velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++)
You're doing an array reduction. I have described this several times (e.g. in "reducing an array in openmp" and "fill histograms array reduction in parallel with openmp without using a critical section"). You can do this with and without a critical section.
You have already done this correctly with a critical section (in your recent edit) so let me describe how to do this without a critical section.
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread  = omp_get_thread_num();
    const int vsize    = velocity.size();

    #pragma omp single
    velocitya.resize(vsize*nthreads);

    std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
              Eigen::Vector3d(0,0,0));

    #pragma omp for schedule(static)
    for (size_t i = 0; i < clusters.size(); i++) {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
    }

    #pragma omp for schedule(static)
    for (int i = 0; i < vsize; i++) {
        for (int t = 0; t < nthreads; t++) {
            velocity[i] += velocitya[vsize*t + i];
        }
    }
}
This method requires extra care/tuning due to false sharing, which I have not done here.
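(A possible mitigation, not part of the original answer: pad each thread's stripe by at least one cache line so neighbouring stripes never share a line. A rough sketch, adapting the code above:)
// hypothetical tweak: pad the per-thread stripe to keep stripes on separate 64-byte cache lines
const int pad    = (64 + sizeof(Eigen::Vector3d) - 1) / sizeof(Eigen::Vector3d);
const int stride = vsize + pad;
// then resize velocitya to stride*nthreads and index it with ithread*stride + j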
As to which method is better, you will have to test.

OpenMP: having a complete 'for' loop into each thread

I have this code:
#pragma omp parallel
{
    #pragma omp single
    {
        for (int i=0; i<given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
    }
    #pragma omp single
    {
        for (int i=0; i<given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
    }
}
// and so on... up to 5 or 6 of myObject_x
// Then I sum up the buffers and do something with them
float result;
for (int i=0; i<given_number; ++i)
    result = myBuffer_1[i] + myBuffer_2[i];
// do something with result
If I run this code, I get what I expect, but the CPU usage is quite high. In contrast, if I run it normally without OpenMP I get the same results, but the CPU usage is much lower, despite running in a single thread.
I don't want to specify a number of threads; I would like the program to pick the maximum number of threads according to the CPU's capabilities, but I want each for loop to run entirely in its own thread. How can I do that?
Also, my expectation is that the for loop for myBuffer_1 runs in one thread, the other for loop runs in another thread, and the rest runs in the 'master' thread. Is this correct?
#pragma omp single has an implicit barrier at the end; you need to use #pragma omp single nowait if you want the two single blocks to run concurrently.
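A sketch of what that would look like with your buffers (same code, just a nowait added to the first single):
#pragma omp parallel
{
    #pragma omp single nowait  // no barrier here, so another thread can take the next block
    {
        for (int i=0; i<given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
    }
    #pragma omp single         // picked up by whichever thread reaches it first
    {
        for (int i=0; i<given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
    }
}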
However, for your requirement, using sections might be a better idea:
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            for (int i=0; i<given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
        }
        #pragma omp section
        {
            for (int i=0; i<given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
        }
    }
}

How to determine if a loop using "task" is parallelized?

I'm totally new to OpenMP and learning how to parallelize loops using task. I made the following loop:
#pragma omp parallel default(none) firstprivate(left) private(i) shared(length, pivot, data)
{
    #pragma omp for
    for (i = 1; i < length-1; i++)
    {
        #pragma omp task
        {
            if (data[left] > pivot)
            {
                i = length;
            }
            else
            {
                left = i;
            }
        }
    }
    #pragma omp taskwait
}
I'm not sure if it's parallelized properly as it's taking more time than it's supposed to. How can I improve my code?
In this case, the task directive is totally irrelevant, since #pragma omp for already does the job of distributing the iterations across the threads.
task is used for unbounded loops, where the number of iterations is not known in advance.
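(For contrast, a minimal sketch of the kind of loop where task does pay off: a hypothetical linked-list traversal whose length is not known up front.)
#include <omp.h>

struct Node { int value; Node* next; };

void process_list(Node* head)
{
    #pragma omp parallel
    {
        #pragma omp single        // one thread walks the list and spawns the tasks
        for (Node* p = head; p != nullptr; p = p->next)
        {
            #pragma omp task firstprivate(p)   // the other threads execute the tasks
            {
                // do some expensive work on p->value here
            }
        }
    }   // the implicit barrier here also waits for all outstanding tasks
}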