My code looks like this:
#pragma omp parallel for num_threads(5)
for(int i = 0; i < N; i++)
{
//some code
//#pragma omp parallel for reduction(+ : S_x,S_y,S_theta)
for(int j = 0; j < N; j++)
{
if (j==i) continue;
// some code
for(int ky = -1; ky<= 1; ky++)
{
for(int kx = -1; kx<= 1; kx++)
{
//some code
if (r_ij_square > l0_two)
{
//some code
}
}
}
}
//some code
}
I'm not sure whether the continue in the code above could cause any problem. To play it safe, I have commented out the second #pragma with //. But I'm still not sure whether the code could cause a problem because of the continue. My question is: can the code above cause a problem, and if so, how can I fix it?
When searching, I found these two sentences: "loops with 'restricted' continue statements can be parallelized" and "Only an iteration of the innermost associated loop may be curtailed by a continue statement." But I don't know what exactly they mean.
I need to iterate over an array and assign each element according to a calculation that requires some iteration itself. Removing all unnecessary details the program boils down to something like this.
float output[n];
const float input[n] = ...;
for (int i = 0; i < n; ++i) {
output[i] = 0.0f;
for (int j = 0; j < n; ++j) {
output[i] += some_calculation(i, input[j]);
}
}
some_calculation does not alter its arguments, nor does it have internal state, so it's thread-safe. Looking at the loops, I understand that the outer loop is thread-safe because different iterations write to different memory locations (different output[i]) and the shared elements of input are never altered while the loop runs, but the inner loop is not thread-safe because it has a race condition on output[i], which is altered in every iteration.
Consequently, I'd like to spawn threads and get them working for different values of i but the whole iteration over input should be local to each thread so as not to introduce a race condition on output[i]. I think the following achieves this.
std::array<float, n> output;
const std::array<float, n> input = ...;
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
output[i] = 0.0f;
for (int j = 0; j < n; ++j) {
output[i] += some_calculation(i, input[j]);
}
}
I'm not sure how this handles the inner loop. Threads working on different values of i should be able to run the loop in parallel, but I don't understand whether I'm allowing them to without another #pragma omp directive. On the other hand, I don't want to accidentally allow threads to run different values of j over the same i, because that introduces a race condition. I'm also not sure if I need some extra specification on how the two arrays should be handled.
Lastly, if this code is in a function that is going to get called repeatedly, does it need the parallel directive, or can that be set up once before my main loop begins, like so?
void iterative_step(const std::array<float, n> &input, std::array<float, n> &output) {
// Threads have already been spawned
#pragma omp for
for (int i = 0; i < n; ++i) {
output[i] = 0.0f;
for (int j = 0; j < n; ++j) {
output[i] += some_calculation(i, input[j]);
}
}
}
int main() {
...
// spawn threads once, but not for this loop
#pragma omp parallel
while (...) {
iterative_step(input, output);
}
...
}
I looked through various other questions but they were about different problems with different race conditions and I'm confused as to how to generalize the answers.
You don't want the omp parallel in main. A #pragma omp parallel for on the for (int i ...) loop will only create/reuse threads for that loop, and for any particular value of i, the j loop will run entirely on one thread.
One other thing that would help a little is to compute your output[i] result into a local variable, then store that into output[i] once you're done with the j loop.
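For example, a minimal sketch of that suggestion, reusing n, input, output, and some_calculation from the question (so it assumes those are already defined):
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    float sum = 0.0f;                        // thread-local accumulator
    for (int j = 0; j < n; ++j)
        sum += some_calculation(i, input[j]);
    output[i] = sum;                         // single store once the j loop is done
}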
I am designing a program that will test whether a valid sudoku puzzle solution has been given to it. I first designed it in C++, but now I want to try to make it parallel. The program compiles fine without errors.
First I had to figure out a way to deal with using a return statement inside of a structured block. I just decided to make an array of bools that are initialized to true. However, the output from this function is false, and I know for a fact that the solution I am submitting is valid. I am new to OpenMP and was wondering if anyone could help me out?
I have a feeling the issue is with my variable a getting set back to 0 and maybe also with my other variable nextSudokuNum getting set back to 1.
bool test_rows(int sudoku[9][9])
{
int i, j, a;
int nextSudokuNum = 1;
bool rowReturn[9];
#pragma omp parallel for private(i)
for(i = 0; i < 9; i++)
{
rowReturn[i] = true;
}
#pragma omp parallel for private(i,j) \
reduction(+: a, nextSudokuNum)
for(i = 0; i < 9; i++)
{
for(j = 0; j < 9; j++)
{
a = 0;
while(sudoku[i][a] != nextSudokuNum) {
a++;
if(a > 9) {
rowReturn[i] = false;
}
}
nextSudokuNum++;
}
nextSudokuNum = 1;
}
for(i = 0; i < 9; i++)
{
if(rowReturn[i] == false) {
cout << "Invalid Sudoku Solution(Next Valid Sudoku Number Not Found)" << endl;
cout << "Check row " << (i+1) << endl;
return false;
}
}
cout << "Valid sudoku rows(Returning true)" << endl;
return true;
}
Disclaimer:
First off, do not parallelize very small loops or loops which execute nearly instantaneously. The overhead of creating the threads will dominate the benefit you would get by executing the inner statements of the loop in parallel. So unless each iteration you are parallelizing performs thousands to millions of FLOPs, the serial version of the code will run faster than the parallel version.
Therefore, a better plan for parallelizing your (probable) tasks is to parallelize at a higher level. That is, presumably you are calling test_rows(sudoku), test_columns(sudoku), and test_box(sudoku) from one function somewhere else. What you can do is call these three serial functions in parallel using OpenMP sections, where each of the three calls is a separate OpenMP section (see the sketch below). This will only use 3 cores of your CPU, but presumably you are doing this on your laptop, so you probably only have 2 or 4 anyway.
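A rough sketch of that higher-level approach; it assumes your other two checks really are named test_columns and test_box and take the same sudoku array (adjust the names to whatever you actually use):
bool rows_ok, cols_ok, boxes_ok;
#pragma omp parallel sections
{
    #pragma omp section
    rows_ok = test_rows(sudoku);        // each section runs one serial check on its own thread
    #pragma omp section
    cols_ok = test_columns(sudoku);
    #pragma omp section
    boxes_ok = test_box(sudoku);
}
bool all_valid = rows_ok && cols_ok && boxes_ok;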
Now to your actual problems:
You are not parallelizing over j, but merely over i. Therefore, you can see that your variable nextSudokuNum is not really being reduced; for every i iteration, nextSudokuNum is self-contained. Thus it should be initialized inside the loop and made private in the #pragma omp parallel for clause.
Likewise, a is not really a reduction variable either. For every iteration of i, a is set, compared, and incremented internally. Again, it should be a private variable.
Therefore, your new code should look like:
#pragma omp parallel for private(i,j,a,nextSudokuNum)
for(i = 0; i < 9; i++)
{
// all private variables must be set internal to parallel region before being used
nextSudokuNum = 1;
for(j = 0; j < 9; j++)
{
a = 0;
while(sudoku[i][a] != nextSudokuNum) {
a++;
if(a >= 9) {
rowReturn[i] = false;
break; // stop before reading past the end of the row
}
}
nextSudokuNum++;
}
}
Every time I try to print out the thread ID, and regardless of where I put the print statement, it always prints threadID = 0. It looks like there is only one thread being created, but why? I don't see what I'm doing wrong. Also, I've checked that num_t = 16, and I've made sure to use -fopenmp when compiling.
omp_set_num_threads(num_t);
#pragma omp parallel shared(a,b,c) private(i,j,k) num_threads(num_t)
{
#pragma omp for schedule(static)
for (int i = 0; i < m; i++)
{
std::cout << omp_get_thread_num()<< "\n";
for (int j = 0; (j < n); j++)
{
c[i + j*m] = 0.0;
for (int k = 0; k < q; k++)
{
c[i+j*m] += a[i*q + k]*b[j*q + k];
}
}
}
}
As a first test, I recommend you use this:
#pragma omp parallel for private(...) shared(...) schedule(...) num_threads (X)
where "X" is the number of threads to be created. In theory, the previous line must have a similar effect to yours, but C++ can be picky sometimes (specially with the "parallel" clause)
Btw, maybe is not your case, but be careful using "text keys" {}. OpenMP's functionality can be different depending on adding them to the code block or not.
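For illustration only, a sketch of the combined directive applied to your loop (same a, b, c, m, n, q as in your code; private(i,j,k) is dropped because the loop counters are declared inside the loops and are therefore private automatically):
omp_set_num_threads(num_t);
#pragma omp parallel for shared(a, b, c) schedule(static) num_threads(num_t)
for (int i = 0; i < m; i++)            // the combined pragma binds to this loop only
{
    std::cout << omp_get_thread_num() << "\n";
    for (int j = 0; j < n; j++)
    {
        c[i + j*m] = 0.0;
        for (int k = 0; k < q; k++)
            c[i + j*m] += a[i*q + k] * b[j*q + k];
    }
}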
I know that you cannot have a break statement in an OpenMP loop, but I was wondering if there is any workaround while still benefiting from parallelism. Basically I have a 'for' loop that loops through the elements of a large vector looking for the one element that satisfies a certain condition. Only one element will satisfy the condition, so once it is found we can break out of the loop. Thanks in advance.
for(int i = 0; i <= 100000; ++i)
{
if(element[i] ...)
{
....
break;
}
}
See this snippet:
volatile bool flag=false;
#pragma omp parallel for shared(flag)
for(int i=0; i<=100000; ++i)
{
if(flag) continue;
if(element[i] ...)
{
...
flag=true;
}
}
This situation is more suitable for pthread.
You could try to manually do what the openmp for loop does, using a while loop:
const int N = 100000;
std::atomic<bool> go(true);
uint give = 0;
#pragma omp parallel
{
uint i, stop;
#pragma omp critical
{
i = give;
give += N/omp_get_num_threads();
stop = give;
if(omp_get_thread_num() == omp_get_num_threads()-1)
stop = N;
}
while(i < stop && go)
{
...
if(element[i]...)
{
go = false;
}
i++;
}
}
This way you have to test "go" each cycle, but that should not matter much. More important is that this corresponds to a "static" omp for loop, which is only useful if you can expect all iterations to take a similar amount of time. Otherwise, 3 threads may already be finished while one still has half of its range left to go...
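If the iterations do vary a lot, one alternative is to keep the flag idea from the accepted answer but let OpenMP do the chunking with a dynamic schedule. This is only a sketch, with element[] and the condition as placeholders from the question; it needs #include <atomic>:
std::atomic<bool> found(false);
#pragma omp parallel for schedule(dynamic, 1000)
for (int i = 0; i <= 100000; ++i)
{
    if (found.load(std::memory_order_relaxed)) continue;  // cheap early-out for remaining iterations
    if (element[i] /* ...condition... */)
    {
        // ... handle the match ...
        found = true;
    }
}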
I would probably do something like this (copied a bit from yyfn):
volatile bool flag=false;
for(int j=0; j<=100 && !flag; ++j) {
int base = 1000*j;
#pragma omp parallel for shared(flag)
for(int i = 0; i <= 1000; ++i)
{
if(flag) continue;
if(element[i+base] ...)
{
....
flag=true;
}
}
}
Here is a simpler version of the accepted answer.
int ielement = -1;
#pragma omp parallel
{
int i = omp_get_thread_num()*n/omp_get_num_threads();
int stop = (omp_get_thread_num()+1)*n/omp_get_num_threads();
for(; i < stop && ielement < 0; ++i) {
if(element[i]) {
ielement = i;
}
}
}
bool foundCondition = false;
#pragma omp parallel for
for(int i = 0; i <= 100000; i++)
{
// We can't break out of a parallel for loop, so this is the next best thing.
if (foundCondition == false && satisfiesComplicatedCondition(element[i]))
{
// This is definitely needed if more than one element could satisfy the
// condition and you are looking for the first one. Probably still a
// good idea even if there can only be one.
#pragma omp critical
{
// do something, store element[i], or whatever you need to do here
....
foundCondition = true;
}
}
}
I am using OpenMP to parallelize loops. In normal case, one would use:
#pragma omp for schedule(static, N_CHUNK)
for(int i = 0; i < N; i++) {
// ...
}
For nested loops, I can put the pragma on the inner or outer loop:
#pragma omp for schedule(static, N_CHUNK) // can be here...
for(int i = 0; i < N; i++) {
#pragma omp for schedule(static, N_CHUNK) // or here...
for(int k = 0; k < N; k++) {
// both loops have a constant number of iterations
// ...
}
}
But! I have two loops, where the number of iterations in the 2nd loop depends on the 1st loop:
for(int i = 0; i < N; i++) {
for(int k = i; k < N; k++) {
// k starts from i, not from 0...
}
}
What is the best way to balance CPU usage for this kind of loop?
As always:
it depends,
profile.
In this case: see also the OMP_NESTED environment variable.
The things that are going to make the difference here are not being shown:
(non)linear memory addressing (also watch the order of the loops),
use of shared variables.
As to your last scenario:
for(int i = 0; i < N; i++) {
for(int k = i; k < N; k++) {
// k starts from i, not from 0...
}
}
I suggest parallelizing the outer loop for the following reasons:
all other things being equal, coarse-grained parallelizing usually leads to better performance due to
increased cache locality and
reduced frequency of locking required
(note that this hinges on assumptions about the loop contents that I can't really make; I'm basing it on my experience of /usual/ parallelized code);
the inner loop might become so short as to be inefficient to parallelize (IOW: the outer loop's range is predictable, the inner loop's less so, or it doesn't lend itself to static scheduling as well);
nested parallelism rarely scales well.
sehe's points -- especially "it depends" and "profile" -- are extremely to the point.
Normally, though, you wouldn't want nested parallel loops as long as the outer loop is big enough to keep all cores busy. The added overhead of another parallel section inside a loop probably costs more than the benefit from the additional small pieces of work.
The usual way to tackle this is just to schedule the outer loop dynamically, so that the fact that each iteration takes a different amount of time doesn't cause load-balancing issues (the i==N-1 iteration completes almost immediately while the i==0 iteration takes forever):
#pragma omp parallel for default(none) shared(N) schedule(dynamic)
for(int i = 0; i < N; i++) {
for(int k = i; k < N; k++) {
// k starts from i, not from 0...
}
}
The collapse clause is very useful for essentially getting rid of the nesting and is particularly valuable if the outer loop is small (e.g., N < num_threads):
#pragma omp parallel for default(none) shared(N) collapse(2)
for(int i = 0; i < N; i++) {
for(int k = 0 ; k < N; k++) {
}
}
This way the two loops are folded into one and there is less chunking, which means less overhead. But that won't work in this case, because the loop ranges aren't fixed; you can't collapse a loop whose bounds change (e.g., with i).