I know that you cannot have a break statement in an OpenMP loop, but I was wondering if there is any workaround that still benefits from parallelism. Basically, I have a 'for' loop that goes through the elements of a large vector looking for one element that satisfies a certain condition. Only one element will satisfy the condition, so once it is found we can break out of the loop. Thanks in advance.
for(int i = 0; i <= 100000; ++i)
{
if(element[i] ...)
{
....
break;
}
}
See this snippet:
volatile bool flag=false;
#pragma omp parallel for shared(flag)
for(int i=0; i<=100000; ++i)
{
if(flag) continue;
if(element[i] ...)
{
...
flag=true;
}
}
This situation is more suitable for pthread.
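For what it's worth, here is a minimal sketch of the flag idea above that avoids relying on volatile, since volatile alone does not make concurrent reads and writes well-defined in C++. It assumes an int vector and an equality test as a stand-in for the elided condition; find_index and target are illustrative names, not from the question. The result index doubles as the flag.

```cpp
#include <cassert>
#include <vector>

// Returns the index of the single element equal to `target`, or -1 if none.
// Iterations cannot break out, so once a match is stored the remaining
// iterations only read the shared index and skip the real work.
int find_index(const std::vector<int>& element, int target) {
    const int n = static_cast<int>(element.size());
    int found_at = -1;                 // doubles as the "stop" flag
    #pragma omp parallel for shared(found_at)
    for (int i = 0; i < n; ++i) {
        int seen;
        #pragma omp atomic read
        seen = found_at;
        if (seen >= 0) continue;       // cheap skip instead of break
        if (element[i] == target) {
            #pragma omp atomic write
            found_at = i;
        }
    }
    return found_at;
}
```

Every iteration is still visited, so this only pays off when the condition itself is expensive compared to one atomic read.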
You could try to do manually what the OpenMP for loop does, using a while loop:
const int N = 100000;
std::atomic<bool> go(true);
uint give = 0;
#pragma omp parallel
{
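uint i, stop;
#pragma omp critical
{
const uint nthreads = omp_get_num_threads();
const uint chunk = N/nthreads;
i = give;
give += chunk;
stop = give;
// Chunks are handed out in arrival order, so the thread that takes the
// last chunk (not necessarily the highest thread number) gets the remainder.
if(give == chunk*nthreads)
stop = N;
}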
while(i < stop && go)
{
...
if(element[i]...)
{
go = false;
}
i++;
}
}
This way you have to test "go" on every iteration, but that should not matter much. More important, this corresponds to a "static" omp for loop, which is only useful if you can expect all iterations to take a similar amount of time. Otherwise, three threads may already be finished while one still has half its work to go...
I would probably do (copied a bit from yyfn)
volatile bool flag=false;
for(int j=0; j<=100 && !flag; ++j) {
int base = 1000*j;
#pragma omp parallel for shared(flag)
for(int i = 0; i < 1000; ++i)
{
if(flag || i+base > 100000) continue; // already found, or past the last element
if(element[i+base] ...)
{
....
flag=true;
}
}
}
Here is a simpler version of the accepted answer.
int ielement = -1;
#pragma omp parallel
{
int i = omp_get_thread_num()*n/omp_get_num_threads();
int stop = (omp_get_thread_num()+1)*n/omp_get_num_threads();
for(;i <stop && ielement<0; ++i){
if(element[i]) {
ielement = i;
}
}
}
bool foundCondition = false;
#pragma omp parallel for
for(int i = 0; i <= 100000; i++)
{
// We can't break out of a parallel for loop, so this is the next best thing.
if (foundCondition == false && satisfiesComplicatedCondition(element[i]))
{
// This is definitely needed if more than one element could satisfy the
// condition and you are looking for the first one. Probably still a
// good idea even if there can only be one.
#pragma omp critical
{
// do something, store element[i], or whatever you need to do here
....
foundCondition = true;
}
}
}
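If more than one element can match and you really need the first one, a critical section alone does not tell you which match came first. A minimal sketch of one alternative is a min reduction over the index. The predicate satisfies() is a hypothetical stand-in for satisfiesComplicatedCondition(), and first_match is an illustrative name.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical predicate standing in for satisfiesComplicatedCondition().
static bool satisfies(int x) { return x < 0; }

// Returns the smallest index whose element satisfies the predicate, or -1.
int first_match(const std::vector<int>& element) {
    const int n = static_cast<int>(element.size());
    int first = INT_MAX;
    #pragma omp parallel for reduction(min : first)
    for (int i = 0; i < n; ++i) {
        // `first` is this thread's private copy here; skip the test once
        // the thread already holds a smaller matching index.
        if (i < first && satisfies(element[i])) first = i;
    }
    return first == INT_MAX ? -1 : first;
}
```

This does not stop the other threads early; each thread only stops evaluating the predicate after its own first match.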
My question pertains to nested parallelism and OpenMP. Let's start with the following single threaded code snippet:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Now let's say we want to make our calls to performAnotherTask in parallel utilizing OpenMP.
So we get the following code:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
My understanding is that the calls to performAnotherTask will be performed in parallel and that, by default, OpenMP will try to use all available threads on the machine (perhaps this assumption is incorrect).
Let's say we now also want to parallelize the calls to performTask such that we get the following code:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
How will this work? Will both for loops still be multithreaded? Can we say anything about the number of threads each loop will use? Is there a way to force the inner for loop (within performTask) to use only a single thread while the outer for loop uses all available threads?
In your last example, the execution behavior depends on a few environmental settings.
First, OpenMP does support such patterns, but by default it disables parallel execution in a nested parallel region. To enable it, you must set OMP_NESTED=true or call omp_set_nested(1) in your code.
#include <omp.h> // for omp_set_nested
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Second, when OpenMP reaches the outer parallel region, it might grab all the available cores and assume that it can execute a thread on each of them, so you might want to reduce the number of threads for the outer level so that some cores remain available for the nested region. Say, if you have 32 cores, you could do this:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for num_threads(8)
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel for num_threads(4)
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
The outer parallel region will execute using 4 threads, each of which will execute the inner region with 8 threads. Note that each of the 4 outer threads becomes the master thread of one of the four concurrently executing nested parallel regions. If you want to be more flexible, you can set the number of threads for each level through the environment variable OMP_NUM_THREADS. With OMP_NUM_THREADS=4,8, the first code snippet I posted (the one without num_threads clauses) behaves the same as the snippet above.
The problem with this coding pattern is that you need to balance each level carefully so that you neither overload the system nor get load imbalances between the nested parallel regions. An alternative solution is to use OpenMP tasks instead:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp taskloop
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel
#pragma omp single
#pragma omp taskloop
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Here each of the taskloop constructs generates OpenMP tasks that are scheduled to execute on the threads created by the single parallel region in the code. The caveat is that tasks are inherently dynamic in their behavior, so you might lose locality properties, as you do not know where exactly in the system the tasks will execute.
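On the last part of the question (keeping the inner loop on a single thread while the outer loop uses the available threads), one simple option is to put num_threads(1) on the inner region. The sketch below is only an illustration under that assumption: the inner parameter, the run() wrapper and the calls counter are made up here so that the effect can be checked, they are not part of the original code.

```cpp
#include <cassert>

static int calls = 0;  // counts performAnotherTask() invocations

void performAnotherTask() {
    #pragma omp atomic
    ++calls;
}

void performTask(int inner) {
    // num_threads(1) keeps this region on a team of one thread (the caller),
    // whether or not nested parallelism is enabled.
    #pragma omp parallel for num_threads(1)
    for (int i = 0; i < inner; ++i) {
        performAnotherTask();
    }
}

// Runs the outer loop in parallel and returns how many inner calls happened.
int run(int outer, int inner) {
    calls = 0;
    #pragma omp parallel for
    for (int i = 0; i < outer; ++i) {
        performTask(inner);
    }
    return calls;
}
```

Simply dropping the inner pragma has the same effect. Also note that omp_set_nested has been deprecated since OpenMP 5.0 in favor of omp_set_max_active_levels.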
Hi,
I'm trying to multithread the function below, but I can't get the counter to be shared properly among the OpenMP threads. I tried both an atomic and a plain int, and neither seems to work. I'm lost; how can I solve this?
std::vector<myStruct> _myData(100);
int counter = 0;
int index;
#pragma omp parallel for private(index)
for (index = 0; index < 500; ++index) {
if (data[index].type == "xx") {
myStruct s;
s.innerData = data[index].rawData;
processDataA(s); // processDataA(myStruct &data)
processDataB(s);
_myData[counter++] = s; // each thread should have unique int not going over 100 of initially allocated items in _myData
}
}
Edit: fixed bad syntax and added missing parts.
If you cannot use OpenMP atomic capture, I would try:
std::vector<myStruct> _myData(100);
int counter = 0;
#pragma omp parallel for schedule(dynamic)
for (int index = 0; index < 500; ++index) {
if (data[index].type == "xx") {
myStruct s;
s.innerData = data[index].rawData;
processDataA(s);
processDataB(s);
int temp;
#pragma omp critical
temp = counter++;
assert(temp < _myData.size());
_myData[temp] = s;
}
}
Or:
#pragma omp parallel for schedule(dynamic,c)
and experiment with chunk size c.
However, atomics would likely be more efficient than critical sections. There should be some form of atomics supported by your compiler.
Note that your solution is fragile, since it works only if the condition inside the loop evaluates to true at most 100 times. That's why I added the assertion to the code. Maybe a better solution:
std::vector<myStruct> _myData;
size_t size = 0;
#pragma omp parallel for reduction(+:size)
for (int index = 0; index < data.size(); ++index)
if (data[index].type == "xx") size++;
_myData.resize(size);
...
Then, you don't need to care about the vector size and also don't waste memory space.
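For completeness, a minimal sketch of the atomic capture variant mentioned at the top of this answer. The Item struct and the collect function are hypothetical stand-ins for the poster's data, myStruct and processing calls; the output is sized for the worst case so a slot can never go out of range.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Item { std::string type; int rawData; };   // hypothetical input record

// Copies rawData of every "xx" item into a preallocated output vector.
// The order of the results depends on thread timing.
std::vector<int> collect(const std::vector<Item>& data) {
    const int n = static_cast<int>(data.size());
    std::vector<int> out(data.size());   // worst case: every item matches
    int counter = 0;
    #pragma omp parallel for schedule(dynamic)
    for (int index = 0; index < n; ++index) {
        if (data[index].type == "xx") {
            int slot;
            #pragma omp atomic capture
            slot = counter++;            // read and increment in one atomic step
            out[slot] = data[index].rawData;
        }
    }
    out.resize(counter);                 // drop the unused tail
    return out;
}
```

Each matching iteration gets its own slot, so no two threads write to the same element of out.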
My code looks like this:
#pragma omp parallel for num_threads(5)
for(int i = 0; i < N; i++)
{
//some code
//#pragma omp parallel for reduction(+ : S_x,S_y,S_theta)
for(int j = 0; j < N; j++)
{
if (j==i) continue;
// some code
for(int ky = -1; ky<= 1; ky++)
{
for(int kx = -1; kx<= 1; kx++)
{
//some code
if (r_ij_square > l0_two)
{
//some code
}
}
}
}
//some code
}
I'm not sure whether the continue in the above code could cause a problem. To be safe, I have commented out the second #pragma with //, but I'm still not sure whether the continue is a problem. So my question is: can the above code cause a problem, and if so, how can I fix it?
When searching, I found these two sentences: (1) loops with "restricted" continue statements can be parallelized; (2) "Only an iteration of the innermost associated loop may be curtailed by a continue statement." But I don't know exactly what they mean.
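To make the question concrete, here is a small sketch of the pattern being asked about, reduced to something checkable. A continue only ends the current iteration and never leaves the loop, which is why it is permitted where break is not. The function name and the sum computed are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// For each i, sums all values except the i-th one, skipping j == i
// with a continue inside the (serial) inner loop.
std::vector<long> sums_excluding_self(const std::vector<int>& x) {
    const int n = static_cast<int>(x.size());
    std::vector<long> result(n, 0);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        long s = 0;
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;   // only ends this j iteration
            s += x[j];
        }
        result[i] = s;              // each i writes its own slot, no race
    }
    return result;
}
```

Here the inner j loop is an ordinary serial loop executed by whichever thread owns iteration i, so its continue is plain C++ control flow.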
I'm trying to learn parallel programming with OpenMP and I'm interested in parallelizing the following do while loop with several while loops inside it:
do {
while(left < (length - 1) && data[left] <= pivot) left++;
while(right > 0 && data[right] >= pivot) right--;
/* swap elements */
if(left < right){
temp = data[left];
data[left] = data[right];
data[right] = temp;
}
} while(left < right);
I haven't figured out how to parallelize while and do while loops, and I couldn't find any resource that specifically describes how to do it. I have found instructions for for loops, but I couldn't work out from them how to handle while and do while loops. So, could you please describe how I can parallelize the loops I provided here?
EDIT
I have transformed the do while loop to the following code where only for loop is used.
for(i = 1; i<length-1; i++)
{
if(data[left] > pivot)
{
i = length;
}
else
{
left = i;
}
}
for(j=length-1; j > 0; j--)
{
if(data[right] < pivot)
{
j = 0;
}
else
{
right = j;
}
}
/* swap elements */
if(left < right)
{
temp = data[left];
data[left] = data[right];
data[right] = temp;
}
int leftCopy = left;
int rightCopy = right;
for(int leftCopy = left; leftCopy<right;leftCopy++)
{
for(int new_i = left; new_i<length-1; new_i++)
{
if(data[left] > pivot)
{
new_i = length;
}
else
{
left = new_i;
}
}
for(int new_j=right; new_j > 0; new_j--)
{
if(data[right] < pivot)
{
new_j = 0;
}
else
{
right = new_j;
}
}
leftCopy = left;
/* swap elements */
if(left < right)
{
temp = data[left];
data[left] = data[right];
data[right] = temp;
}
}
This code works fine and produces the correct result. However, I then tried to parallelize parts of it by changing the first two for loops to the following:
#pragma omp parallel default(none) firstprivate(left) private(i,tid) shared(length, pivot, data)
{
#pragma omp for
for(i = 1; i<length-1; i++)
{
if(data[left] > pivot)
{
i = length;
}
else
{
left = i;
}
}
}
#pragma omp parallel default(none) firstprivate(right) private(j) shared(length, pivot, data)
{
#pragma omp for
for(j=length-1; j > 0; j--)
{
if(data[right] < pivot)
{
j = 0;
}
else
{
right = j;
}
}
}
The speed is worse than the non-parallelized code. Please help me identify my problem.
Thanks
First of all, sorting algorithms are very hard to parallelize with OpenMP parallel loops. This is because the loop trip count is not deterministic but depends on the input set values that are read every iteration.
I don't think loop conditions such as data[left] <= pivot are going to work well, since the OpenMP runtime does not know how to partition the iteration space among the threads.
If you are still interested in parallel sorting algorithms, I suggest you read the literature first to find the algorithms that are really worth implementing because of their scalability. If you just want to learn OpenMP, I suggest you start with easier algorithms such as bucket sort, where the number of buckets is known in advance and does not change frequently.
Regarding the example you try to parallelize, while loops are not directly supported by OpenMP because the number of iterations (loop trip count) is not deterministic (otherwise, it is easy to transform them into for loops). Therefore, it is not possible to distribute the iterations among the threads. In addition, it is common for while loops to check for a condition using last iteration's result. This is called Read-after-Write or true-dependency and cannot be parallelized.
Your slowdown might be alleviated if you minimize the number of omp parallel regions. In addition, try to move them out of all your loops. Entering and leaving a parallel region may create and join the additional threads used in the parallel parts of the code, which is expensive.
You can still synchronize threads inside parallel blocks, so the outcome is similar. In fact, all threads wait for each other at the end of an omp for construct by default, which makes things even easier.
#pragma omp parallel default(none) firstprivate(right,left) private(i,j) shared(length, pivot, data)
{
#pragma omp for
for(i = 1; i<length-1; i++)
{
if(data[left] > pivot)
{
i = length;
}
else
{
left = i;
}
}
#pragma omp for
for(j=length-1; j > 0; j--)
{
if(data[right] < pivot)
{
j = 0;
}
else
{
right = j;
}
}
} // end omp parallel
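As a rough sketch of the same structure (one parallel region, two worksharing loops) that does not assign to the loop variable inside the loop, the two scans can be written as min/max reductions. This is a simplification under my own assumptions: it scans the whole array rather than stopping early, and scan_for_swap is an illustrative name, not the poster's code.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns {left, right}: the first index whose value exceeds the pivot and
// the last index whose value is below it (length and -1 when none exists).
std::pair<int, int> scan_for_swap(const std::vector<int>& data, int pivot) {
    const int length = static_cast<int>(data.size());
    int left = length;
    int right = -1;
    #pragma omp parallel
    {
        #pragma omp for reduction(min : left)
        for (int i = 0; i < length; ++i) {
            if (data[i] > pivot && i < left) left = i;
        }   // implicit barrier: all threads finish before the next loop
        #pragma omp for reduction(max : right)
        for (int j = 0; j < length; ++j) {
            if (data[j] < pivot && j > right) right = j;
        }
    }
    return {left, right};
}
```

The threads are created once for both loops, and the reductions combine the per-thread results at the end of each loop.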
I have a parallel for in a C++ program that has to loop up to some number of iterations. Each iteration computes a possible solution for an algorithm, and I want to exit the loop once I find a valid one (it is OK if a few extra iterations are done). I know the number of iterations should be fixed from the beginning in the parallel for, but since I'm not increasing the number of iterations in the following code, is there any guarantee that threads check the condition before proceeding with their current iteration?
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
...
if(some condition)
max_its = t; // valid to make threads exit the for?
}
}
Modifying the loop counter works for most implementations of OpenMP worksharing constructs, but the program will no longer be conforming to OpenMP and there is no guarantee that the program works with other compilers.
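Since the OP is OK with some extra iterations, OpenMP cancellation is the way to go. OpenMP 4.0 introduced the "cancel" construct exactly for this purpose. It requests termination of the worksharing construct and teleports the threads to the end of it. Note that cancellation only takes effect when it is enabled at run time by setting the environment variable OMP_CANCELLATION=true; otherwise the cancel directive is ignored.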
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
...
if(some condition) {
#pragma omp cancel for
}
#pragma omp cancellation point for
}
}
Be aware that there might be a price to pay in terms of performance, but you might want to accept this if the overall performance is better when aborting the loop.
In pre-4.0 implementations of OpenMP, the only OpenMP-compliant solution is to use an if statement to reach the regular end of the loop as quickly as possible without executing the actual loop body:
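A compilable sketch of the cancellation approach, assuming the "condition" is simply finding a given value in a vector (find_with_cancel and target are illustrative names). It returns the right index whether or not cancellation is active; with OMP_CANCELLATION unset, the loop just runs to completion.

```cpp
#include <cassert>
#include <vector>

// Returns the index of the single element equal to `target`, or -1.
int find_with_cancel(const std::vector<int>& v, int target) {
    const int n = static_cast<int>(v.size());
    int found = -1;
    #pragma omp parallel for schedule(dynamic, 1) shared(found)
    for (int t = 0; t < n; ++t) {
        if (v[t] == target) {
            #pragma omp atomic write
            found = t;
            // Ask the runtime to end the loop for the whole team.
            #pragma omp cancel for
        }
        // Other threads notice a pending cancellation here and leave the loop.
        #pragma omp cancellation point for
    }
    return found;
}
```

Threads only react to the request at a cancellation point, so a few extra iterations may still run after the match is found.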
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
if(!some condition) {
... loop body ...
}
}
}
Hope that helps!
Cheers,
-michael
You can't modify max_its as the standard says it must be a loop invariant expression.
What you can do, though, is use a shared boolean variable as a flag:
void fun()
{
int max_its = 100;
bool found = false;
#pragma omp parallel for schedule(dynamic, 1) shared(found)
for(int t = 0; t < max_its; ++t)
{
if( ! found ) {
...
}
if(some condition) {
#pragma omp atomic write
found = true; // later iterations skip their work
}
}
}
A logic of this kind may be also implemented with tasks instead of a work-sharing construct. A sketch of the code would be something like the following:
void algorithm(int t, bool& found) {
#pragma omp task shared(found)
{
if( !found ) {
// Do work
if ( /* condition */ ) {
#pragma omp atomic write
found = true;
}
}
} // task
} // function
void fun()
{
int max_its = 100;
bool found = false;
#pragma omp parallel
{
#pragma omp single
{
for(int t = 0; t < max_its; ++t)
{
algorithm(t,found);
}
} // single
} // parallel
}
The idea is that a single thread creates max_its tasks. Each task will be assigned to a waiting thread. If one of the tasks finds a valid solution, all the others will be informed through the shared variable found.
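Here is a self-contained variant of the task sketch that can actually be compiled, assuming the work is a lookup of one value in a vector. The names search_with_tasks and target are made up for illustration, and an int index is used in place of the bool flag so the result can be returned.

```cpp
#include <cassert>
#include <vector>

// One task per candidate; a task does nothing once a solution is recorded.
int search_with_tasks(const std::vector<int>& v, int target) {
    const int n = static_cast<int>(v.size());
    int found_at = -1;   // shared result, also acts as the "found" flag
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int t = 0; t < n; ++t) {
                #pragma omp task shared(found_at) firstprivate(t)
                {
                    int seen;
                    #pragma omp atomic read
                    seen = found_at;
                    if (seen < 0 && v[t] == target) {
                        #pragma omp atomic write
                        found_at = t;
                    }
                }
            }
        }   // all tasks are complete after the barrier that ends single
    }
    return found_at;
}
```

All n tasks are still created; the flag only lets the later ones return without doing any work.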
If some_condition is a logical expression that is "always valid", then you could do:
for(int t = 0; t < max_its && !some_condition; ++t)
That way, it's very clear that !some_condition is required to continue the loop, and there is no need to read the rest of the code to find out that "if some_condition, loop ends"
Otherwise (for example, if some_condition is the result of some calculation inside the loop and it's complicated to "move" it into the for-loop condition), using break is clearly the right thing to do.