I have a for loop which accesses many memory pointers in each iteration. For each of these memory pointers, I created an index. My problem is that when I try to use OpenMP to parallelize this loop, I get the following error:
error: expected iteration declaration or initialization
I thought the cause of this error would be one of the following:
- OpenMP does not accept an increment other than ++ or --
- OpenMP does not accept multiple initializations in a loop
For performance reasons, it is important for me to use these multiple indexes. Does anybody know the answer to my problem?
Here is the code:
#pragma omp parallel default(shared)
{
int tID = omp_get_thread_num();
int i, iCF, iPF, iNF, iPJG, iCJG, iNJG, iPRJG, iCRJG;
#pragma omp for nowait
for(i=0, iCF=0, iPF=0, iNF=sqrBcksDim, iPJG=0, iCJG=0, iNJG=sqrBcksSize, iPRJG=0, iCRJG=0 ; iCF<RHSArraySize ; iPF=iCF, iCF=iNF, iNF+=sqrBcksDim, iPJG=iCJG, iCJG=iNJG, iNJG+=sqrBcksSize, iPRJG=iCRJG, iCRJG+=rectBcksSize, ++i)
{
}
}
Well, looking at that third clause, you’re doing a lot of inherently sequential computations that depend on the program state at the end of the previous iteration of the loop. You could move all of those operations but the += and ++ updates inside the body of the loop, and from the look of things possibly make the loop condition depend on iNF, correct? But some of them look like they still might be ordered. For a parallel algorithm, are there closed-form initializers you could use inside the loop body that depend only on i or something loop-invariant?
If not, and the inputs to each iteration really do depend on the results of previous iterations of the loop, then it's not a parallel algorithm.
One suggestion:
Here’s how I would try to fix this. You can only initialize i and increment it by a constant within the loop; however, you can equivalently move all the rest of those operations inside the loop. For example, I don’t know what else goes on inside the loop body, but if iCF is initialized to 0, iNF to sqrBcksDim and at the end of each iteration, iCF is set to the previous value of iNF and iNF is incremented by sqrBcksDim, it looks like you could rewrite the loop into something like:
int i;
#pragma omp for nowait
for ( i=0; i < RHSArraySize/sqrBcksDim; ++i )
{
const int iCF = i*sqrBcksDim;
const int iNF = iCF + sqrBcksDim;
// ...
}
Can you do that for your other variables? If you really have a parallel algorithm here, you should be able to, because each run of the loop should only depend on i and loop invariants, which you can use in your initializers. You’ll need to declare a variable outside the loop if you’re going to refer to it outside the body of the loop, but for the time being, just declare a new local variable and don’t read any variable outside the loop that you also write to inside the loop. If there are no implicit sequential dependencies, you should be able to initialize them all at the start of the loop body.
You might not end up doing it that way, but it might help you think about how to refactor.
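As a rough sketch of that refactoring, assuming the loop body reads the indices but never modifies them, every index from the original for statement can be derived from i alone. The struct BlockIdx and the functions block_indices and all_block_indices are hypothetical names introduced only for this illustration. The trip count is rounded up so that it matches the original condition iCF < RHSArraySize even when RHSArraySize is not a multiple of sqrBcksDim.

```cpp
#include <cassert>
#include <vector>

// Hypothetical bundle of the eight derived indices from the question.
struct BlockIdx {
    int iPF, iCF, iNF;     // previous / current / next offset, step sqrBcksDim
    int iPJG, iCJG, iNJG;  // previous / current / next offset, step sqrBcksSize
    int iPRJG, iCRJG;      // previous / current offset, step rectBcksSize
};

// Closed-form value of every index at iteration i.
BlockIdx block_indices(int i, int sqrBcksDim, int sqrBcksSize, int rectBcksSize)
{
    // In the first iteration "previous" and "current" are both 0.
    const int prev = (i > 0) ? i - 1 : 0;
    BlockIdx b;
    b.iPF   = prev * sqrBcksDim;
    b.iCF   = i * sqrBcksDim;
    b.iNF   = (i + 1) * sqrBcksDim;
    b.iPJG  = prev * sqrBcksSize;
    b.iCJG  = i * sqrBcksSize;
    b.iNJG  = (i + 1) * sqrBcksSize;
    b.iPRJG = prev * rectBcksSize;
    b.iCRJG = i * rectBcksSize;
    return b;
}

// Runs the loop in canonical form; each iteration depends only on i.
std::vector<BlockIdx> all_block_indices(int RHSArraySize, int sqrBcksDim,
                                        int sqrBcksSize, int rectBcksSize)
{
    assert(sqrBcksDim > 0);
    // Same number of iterations as the condition iCF < RHSArraySize.
    const int n = RHSArraySize > 0
                      ? (RHSArraySize + sqrBcksDim - 1) / sqrBcksDim
                      : 0;
    std::vector<BlockIdx> out(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        out[i] = block_indices(i, sqrBcksDim, sqrBcksSize, rectBcksSize);
    return out;
}
```

In real code the body would use the indices directly instead of storing them; the vector is only there to make the result easy to inspect.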
Related
I am new to using OpenMP. I am trying to parallelize a nested loop, and so far I have something of this form...
#pragma omp parallel for
for (j=0;j <m; j++) {
some work;
for (i= 0; i < n ; i++) {
p = b[i];
if (p < 0 && k < m) {
a[k] = c[i]; k++ ;
} else {
x=c[i];
}
}
some work
}
The outer loop is in parallel, and the inner loop updates k. The current value of k is needed for the other threads to update a[k] correctly. The problem is that all of the threads are updating a[k], but the proper order of k is not kept.
Some threads will update k and a[k], and some will not. How do I communicate the latest k between threads to update a[k] properly, since c[i] will have different values for each thread?
For example, when it runs serially, the program might set the first seven values of a to {1,3,5,7,3,9,13} and terminate with k equal to 7, but when run in parallel it produces different results, or results in a different (and therefore wrong) order.
How do I keep the same order and ensure parallelism at the same time?
Note: this answer was completely rewritten in light of OP clarifications. The original answer text is at the end.
> How do I keep the same order and ensure parallelism at the same time?
Order dependency is antithetical to parallelism, as running operations in parallel inherently entails relaxing the relative order in which they are performed. Not all computations can be effectively parallelized.
Your case is not an exception. The second and each subsequent iteration of your outer loop needs to use the final value of k (among other things) computed by the previous iteration. How can it get that? Only by performing the previous iteration first. What room does that leave for concurrent operation? None. Concurrency is not the same thing as parallelism, but it is one of the main motivations for parallelism, because that's how parallelism yields improvements in elapsed time.
With no scope for concurrency, parallelism is actively counterproductive for you. Suppose you made the whole body of the outer loop a critical section, so that there was no concurrency in fact (as your present code requires) and no data races involving k. Then you would still pay the overhead for parallelism, get no speedup in return, and probably still get the wrong results because of evaluations of the outer-loop body being performed in the wrong order.
It may be that the whole thing can be rewritten to reduce or remove the data dependencies that prevent effective parallelization of the computation, or it may not. We haven't enough information to determine, as it depends in part on the details of "some work" and on the significance of the data. Probably you would need an altogether different algorithm for producing the desired results.
> Instead of giving a[n]={0,1,2,3,.......n} , it gives me garbage values for a when I use the reduction clause. I need the total sum of K, hence the reduction clause.
There is a closed-form equation for the sum of consecutive integers, and it has especially simple form when the first integer in the list is 0 or 1. In particular, the sum of the integers from 0 to n, inclusive, is n * (n + 1) / 2. You do not need a reduction for this.
If you wanted to use a reduction anyway, then you need to understand that it doesn't work the way you seem to think it does. What you get is a separate, private copy of the reduction variable for each thread executing the parallel construct, with the per-thread (not per-iteration) final values of those independent variables combined according to the reduction operator. Thus, if you really want to do the computation via an OpenMP reduction, then you would need to restructure the loop something like this:
#pragma omp parallel for reduction (+:k)
for (i = 0; i < 10; i++) {
a[i] = i;
k += i;
}
That assumes that the value of k is 0 immediately prior to the loop, as you indeed seem to be doing. If that were not a safe assumption then you would need something like
type_of_k k0 = k;
k = 0;
#pragma omp parallel for reduction (+:k)
for (i = 0; i < 10; i++) {
a[k0 + i] = i;
k += k0 + i;
}
Note that in either case, not only does that set up the reduction correctly, but it also breaks the data dependency between loop iterations that was previously carried by the expression k++.
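A small self-contained sketch of the first variant, assuming k starts at 0 and the array is a std::vector (the function name fill_and_sum is made up for this example). Because a[i] is indexed by i rather than by k, the iterations are independent, and the reduced total can be checked against the closed form for the sum of 0 to n-1, which is n * (n - 1) / 2.

```cpp
#include <cassert>
#include <vector>

// Fills a[i] = i and returns the sum of all i via an OpenMP reduction.
long fill_and_sum(std::vector<int>& a)
{
    const int n = static_cast<int>(a.size());
    long k = 0;  // each thread gets a private copy initialized to 0
    #pragma omp parallel for reduction(+:k)
    for (int i = 0; i < n; ++i) {
        a[i] = i;  // indexed by i, so no dependency on k
        k += i;    // per-thread partial sums are combined after the loop
    }
    // Sanity check against the closed form for 0 + 1 + ... + (n - 1).
    assert(k == static_cast<long>(n) * (n - 1) / 2);
    return k;
}
```

Compiled without OpenMP support the pragma is ignored and the function computes the same result serially.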
It sounds like you're essentially filling in a with a filter of entries from c, and want to preserve their order. If this is the only use k has, some other methods spring to mind:
1. Always write a[i], but use a marker value for the entries where the P predicate wasn't satisfied. This preserves order, but requires a larger a that you then compact in a second pass.
2. Write an a_i array storing which index each entry belonged to. This still requires a #pragma omp atomic capture k_local = k++ access to k, and a second sort to restore order. You'd also need both a and a_i to be the full size again, or you might miss entries, so all in all it is a terrible workaround.
Even with some sequential dependencies you can do optimizations, e.g. a scan to calculate what k would be for each i could be done in O(log n) rather than O(n) steps. See, for example, parallel prefix sum, which has been discussed for OpenMP on Stack Overflow. This sort of thing is what OpenMP's ordered depend is for, I believe. Anyhow, this leads to the third solution:
3. Generate a k array holding the value k will have for each iteration, so that the threads that do write will write to the correct places. This requires scanning the predicate.
It is useful to have higher level constructs like map, scan and reduce when planning out algorithms.
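The third approach could look roughly like the following for a single pass over b and c. This is only a sketch under assumptions: the predicate is b[i] < 0 as in the question, the arrays are std::vector<int>, the function name ordered_filter is hypothetical, and the prefix sum is done serially for brevity (a parallel scan could replace it). The capacity argument plays the role of the k < m check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies c[i] into a for every i with b[i] < 0, preserving order and
// writing at most `capacity` entries. Returns the number written.
std::size_t ordered_filter(const std::vector<int>& b, const std::vector<int>& c,
                           std::vector<int>& a, std::size_t capacity)
{
    assert(b.size() == c.size());
    assert(a.size() >= capacity);
    const long n = static_cast<long>(b.size());

    // Pass 1 (parallel): evaluate the predicate for every element.
    std::vector<char> keep(b.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        keep[i] = (b[i] < 0) ? 1 : 0;

    // Pass 2 (serial here): exclusive prefix sum; pos[i] is the value
    // k would have had at iteration i of the serial loop.
    std::vector<std::size_t> pos(b.size() + 1, 0);
    for (long i = 0; i < n; ++i)
        pos[i + 1] = pos[i] + (keep[i] ? 1u : 0u);

    // Pass 3 (parallel): every kept element already knows its slot.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        if (keep[i] && pos[i] < capacity)
            a[pos[i]] = c[i];

    return pos[n] < capacity ? pos[n] : capacity;
}
```

The extra arrays cost memory proportional to the input, which is the price of removing the dependency on a shared k.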
I am currently working on parallelizing a nested for loop using C++ and OpenMP. Without going into the actual details of the program, I have constructed a basic example on the concepts I am using below:
float var = 0.f;
float distance = some float array;
float temp[] = some float array;
for(int i=0; i < distance.size; i++){
// some work
for(int j=0; j < temp.size; j++){
var += temp[i]/distance[j];
}
}
I attempted to parallelize the above code in the following way:
float var = 0.f;
float distance = some float array;
float temp[] = some float array;
#pragma omp parallel for default(shared)
for(int i=0; i < distance.size; i++){
// some work
#pragma omp parallel for reduction(+:var)
for(int j=0; j < temp.size; j++){
var += temp[i]/distance[j];
}
}
I then compared the serial program output with the parallel program output and got an incorrect result. I know that this is mainly due to the fact that floating point arithmetic is not associative. But are there any workarounds that give exact results?
Although the lack of associativity of floating point arithmetic might be an issue in some cases, the code you show here exposes a much more essential problem which you need to address first: the status of the var variable in the outer loop.
Indeed, since var is modified inside the i loop, even if only in the j part of the i loop, it needs to be "privatized" somehow. Now the exact status it needs to get depends on the value you expect it to store upon exit of the enclosing parallel region:
If you don't care about its value at all, just declare it private (or better, declare it inside the parallel region).
If you need its final value at the end of the i loop, and considering it accumulates a sum of values, most likely you'll need to declare it reduction(+:), although lastprivate might also be what you want (impossible to say without further details).
If private or lastprivate was all you needed, but you also need its initial value upon entrance of the parallel region, then you'll have to consider adding firstprivate too (no need for that if you went for reduction, as it has already been taken care of).
That should be enough for fixing your issue.
Now, in your snippet, you also parallelized the inner loop. That is usually a bad idea to go for nested parallelism. So unless you have a very compelling reason for doing so, you will likely get much better performance by only parallelizing the outer loop, and leaving the inner loop alone. That won't mean the inner loop won't benefit from the parallelization, but rather that several instances of the inner loop will be computed in parallel (each one being sequential admittedly, but the whole process is parallel).
A nice side effect of removing the inner loop's parallelization (in addition to making the code faster) is that now all accumulations inside the private var variables are done in the same order as when not in parallel. Therefore, your (hypothetical) floating point arithmetic issues inside the outer loop will now have disappeared, and only if you needed the final reduction upon exit of the parallel region might you still face them there.
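Put together, the suggestion might look like the sketch below. It assumes the final sum is what is wanted (hence reduction on the outer loop), uses std::vector<float> in place of the unspecified arrays, and lets i run over temp and j over distance; the function name accumulate_ratios is invented for the example.

```cpp
#include <cassert>
#include <vector>

// Sums temp[i] / distance[j] over all pairs (i, j).
float accumulate_ratios(const std::vector<float>& temp,
                        const std::vector<float>& distance)
{
    for (float d : distance)
        assert(d != 0.0f);  // precondition: no division by zero

    const int n = static_cast<int>(temp.size());
    const int m = static_cast<int>(distance.size());
    float var = 0.f;

    // Only the outer loop is parallel; var is privatized per thread
    // and the per-thread sums are combined when the loop ends.
    #pragma omp parallel for reduction(+:var)
    for (int i = 0; i < n; ++i) {
        // some work
        for (int j = 0; j < m; ++j)  // plain sequential inner loop
            var += temp[i] / distance[j];
    }
    return var;
}
```

The order in which the per-thread partial sums are combined is still not fixed, so results can differ from the serial run in the last bits for general inputs.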
I'm coding a simple function using std::vector, shown below, where the input is an integer vector and the function iterates over the elements of the vector.
In terms of space and time efficiency, which of the following versions is more suitable?
HugeClass is actually a Big Integer which contains complex arithmetic while I put a simple arithmetic below for simplicity.
1) Gives a dimension of vector
void (HugeClass& huge, std::vector<int>& vec, int dim){
for(int i=0;i<dim;i++){
huge+=vec[i];
}
}
2) Calls a std::vector.size() to iterate
void (HugeClass& huge, std::vector<int>& vec){
for(int i=0;i<vec.size();i++){
huge+=vec[i];
}
}
dim can range in [100,1000000]
The syntax of a for loop in C++ is:
for ( init; condition; increment ) {
statement(s);
}
Here is the flow of control in a for loop:
The init step is executed first, and only once. This step allows you to declare and initialize any loop control variables. You are not required to put a statement here, as long as a semicolon appears.
Next, the condition is evaluated. If it is true, the body of the loop is executed. If it is false, the body of the loop does not execute and flow of control jumps to the next statement just after the for loop.
After the body of the for loop executes, the flow of control jumps back up to the increment statement. This statement allows you to update any loop control variables. This statement can be left blank, as long as a semicolon appears after the condition.
So in the case of
for(int i=0;i<vec.size();i++) {
huge+=vec[i];
}
vec.size() is called each time, but it is probably inlined and is probably a simple function.
On top of which:
A smart enough optimizer may be able to deduce that it is a loop invariant with no side effects and elide it entirely (this is easier if the code is inlined, but may be possible even if it is not, if the compiler does global optimization).
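If you would rather not rely on the optimizer, one option is to read the size once in the init step. The sketch below assumes a stand-in HugeClass with a trivial operator+= (the real class is not shown in the question), and the function name add_all is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the real big-integer class, for illustration only.
struct HugeClass {
    long long value = 0;
    HugeClass& operator+=(int x) { value += x; return *this; }
};

// Reads vec.size() once, in the init step, instead of on every test.
void add_all(HugeClass& huge, const std::vector<int>& vec)
{
    for (std::size_t i = 0, n = vec.size(); i < n; ++i)
        huge += vec[i];
}
```

This also avoids the signed/unsigned comparison between an int counter and vec.size().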
Performance wise, which of the following is more efficient?
Assigning in the master thread and copying the value to all threads:
int i = 0;
#pragma omp parallel for firstprivate(i)
for( ; i < n; i++){
...
}
Declaring and assigning the variable in each thread
#pragma omp parallel for
for(int i = 0; i < n; i++){
...
}
Declaring the variable in the master thread but assigning it in each thread.
int i;
#pragma omp parallel for private(i)
for(i = 0; i < n; i++){
...
}
It may seem a silly question and/or the performance impact may be negligible. But I'm parallelizing a loop that does a small amount of computation and is called a large number of times, so any optimization I can squeeze out of this loop is helpful.
I'm looking for a more low level explanation and how OpenMP handles this.
For example, when parallelizing over a large number of threads, I assume the second implementation would be more efficient, since initializing a variable using xor is far more efficient than copying the variable to all the threads.
There is not much difference in performance among the three versions you presented, since each of them uses #pragma omp parallel for. OpenMP automatically distributes the loop iterations among the threads, so the variable i becomes private to each thread, and each thread has a different range of iterations to work with. The variable i is made private automatically in order to avoid race conditions when it is updated. Since i will be private in the parallel for anyway, there is no need to put private(i) on the #pragma omp parallel for.
Nevertheless, your first version will produce an error, since OpenMP expects the loop right underneath #pragma omp parallel for to have the following format:
for(init-expr; test-expr;incr-expr)
in order to precompute the range of work.
The for directive places restrictions on the structure of all associated for-loops. Specifically, all associated for-loops must have the following canonical form:
for (init-expr; test-expr;incr-expr) structured-block (OpenMP Application Program Interface, pp. 39-40.)
Edit: I tested your last two versions and inspected the generated assembly. Both versions produce the same assembly (compare version 2 and version 3).
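If the first version was meant to start from a value computed earlier, one way to keep that while satisfying the canonical form is to use the value in the init-expr. A minimal sketch, with the function name visit_range and the start parameter introduced only for this example:

```cpp
#include <cassert>
#include <vector>

// Increments visits[i] once for every index in [start, n).
std::vector<int> visit_range(int start, int n)
{
    assert(start >= 0 && start <= n);
    std::vector<int> visits(n, 0);
    // init-expr; test-expr; incr-expr are all present, so the
    // iteration range can be computed before the loop starts.
    #pragma omp parallel for
    for (int i = start; i < n; ++i)
        visits[i] += 1;
    return visits;
}
```

Here start is loop-invariant and shared, while i is private to each thread without any explicit clause.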
I was trying to compile the following code:
#pragma omp parallel shared (j)
{
#pragma omp for schedule(dynamic)
for(i = 0; i != j; i++)
{
// do something
}
}
but I got the following error: error: invalid controlling predicate.
The OpenMP standard states that the parallel for construct "only" allows one of the following operators: <, <=, >, >=.
I do not understand the rationale for not allowing i != j. I could understand, in the case of the static schedule, since the compiler needs to pre-compute the number of iterations assigned to each thread. But I can't understand why this limitation in such case for example. Any clues?
EDIT: I get the same error even if I write for(i = 0; i != 100; i++), although there I could just have used "<" or "<=".
I sent an email to OpenMP developers about this subject, the answer I got:
For signed int, the wrap-around behavior is undefined. If we allow !=, programmers may get an unexpected trip count. The problem is whether the compiler can generate code to compute a trip count for the loop.
For a simple loop, like:
for( i = 0; i < n; ++i )
the compiler can determine that there are 'n' iterations, if n>=0, and zero iterations if n < 0.
For a loop like:
for( i = 0; i != n; ++i )
again, a compiler should be able to determine that there are 'n' iterations, if n >= 0; if n < 0, we don't know how many iterations it has.
For a loop like:
for( i = 0; i < n; i += 2 )
the compiler can generate code to compute the trip count (loop iteration count) as floor((n+1)/2) if n >= 0, and 0 if n < 0.
For a loop like:
for( i = 0; i != n; i += 2 )
the compiler can't determine whether 'i' will ever hit 'n'. What if 'n' is an odd number?
For a loop like:
for( i = 0; i < n; i += k )
the compiler can generate code to compute the trip count as floor((n+k-1)/k) if n >= 0, and 0 if n < 0, because the compiler knows that the loop must count up; in this case, if k < 0, it's not a legal OpenMP program.
For a loop like:
for( i = 0; i != n; i += k )
the compiler doesn't even know if i is counting up or down. It doesn't know if 'i' will ever hit 'n'. It may be an infinite loop.
Credits: OpenMP ARB
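The arithmetic in that reply can be written down directly. The sketch below uses hypothetical helper names, assumes a loop starting at 0 with a positive step k, and ignores integer overflow; it computes the trip count for the < form and shows that the != form only terminates when n is a non-negative multiple of k.

```cpp
#include <cassert>

// Trip count of: for (i = 0; i < n; i += k), assuming k > 0.
long trip_count_less(long n, long k)
{
    assert(k > 0);
    return n > 0 ? (n + k - 1) / k : 0;
}

// Same count obtained by actually running the loop, for comparison.
long trip_count_by_running(long n, long k)
{
    assert(k > 0);
    long count = 0;
    for (long i = 0; i < n; i += k)
        ++count;
    return count;
}

// Whether for (i = 0; i != n; i += k) ever stops, assuming k > 0 and
// no overflow: i must land exactly on n.
bool not_equal_loop_terminates(long n, long k)
{
    assert(k > 0);
    return n >= 0 && n % k == 0;
}
```

For example, with n = 11 and k = 2 the < loop runs six times, while the != loop would step from 10 to 12 and never see 11.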
Contrary to what it may look like, schedule(dynamic) does not work with dynamic number of elements. Rather the assignment of iteration blocks to threads is what is dynamic. With static scheduling this assignment is precomputed at the beginning of the worksharing construct. With dynamic scheduling iteration blocks are given out to threads on the first come, first served basis.
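To illustrate, here is a sketch of a loop where dynamic scheduling can help even though the iteration count n is fixed before the loop starts: later iterations do more work than earlier ones. The function name triangular_numbers and the chunk size of 4 are arbitrary choices for the example.

```cpp
#include <cassert>
#include <vector>

// out[i] = 0 + 1 + ... + i, computed the slow way so that the cost
// of an iteration grows with i.
std::vector<long> triangular_numbers(int n)
{
    assert(n >= 0);
    std::vector<long> out(n);
    // The trip count n is known up front; only the assignment of
    // chunks of 4 iterations to threads is decided at run time.
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; ++i) {
        long acc = 0;
        for (int s = 0; s <= i; ++s)
            acc += s;
        out[i] = acc;
    }
    return out;
}
```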
The OpenMP standard is pretty clear that the number of iterations is precomputed once the worksharing construct is encountered, hence the loop counter may not be modified inside the body of the loop (OpenMP 3.1 specification, §2.5.1 - Loop Construct):
The iteration count for each associated loop is computed before entry to the outermost loop. If execution of any associated loop changes any of the values used to compute any of the iteration counts, then the behavior is unspecified.
The integer type (or kind, for Fortran) used to compute the iteration count for the collapsed loop is implementation defined.
A worksharing loop has logical iterations numbered 0,1,...,N-1 where N is the number of loop iterations, and the logical numbering denotes the sequence in which the iterations would be executed if the associated loop(s) were executed by a single thread. The schedule clause specifies how iterations of the associated loops are divided into contiguous non-empty subsets, called chunks, and how these chunks are distributed among threads of the team. Each thread executes its assigned chunk(s) in the context of its implicit task. The chunk_size expression is evaluated using the original list items of any variables that are made private in the loop construct. It is unspecified whether, in what order, or how many times, any side-effects of the evaluation of this expression occur. The use of a variable in a schedule clause expression of a loop construct causes an implicit reference to the variable in all enclosing constructs.
The rationale behind this restriction on relational operators is quite simple: it provides a clear indication of the direction of the loop, it allows easy computation of the number of iterations, and it provides similar semantics for the OpenMP worksharing directive in C/C++ and Fortran. Also, other relational operations would require close inspection of the loop body in order to understand how the loop proceeds, which would be unacceptable in many cases and would make the implementation cumbersome.
OpenMP 3.0 introduced the explicit task construct which allows for parallelisation of loops with unknown number of iterations. There is a catch though: tasks introduce some severe overhead and the one task per loop iteration only makes sense if these iterations take quite some time to be executed. Otherwise the overhead would dominate the execution time.
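A minimal sketch of the task approach, assuming a container whose length is not known up front (std::forward_list here) and a trivial stand-in for the per-element work; the function name sum_of_squares is made up. One thread walks the list and creates a task per element, and the other threads of the team pick the tasks up.

```cpp
#include <cassert>
#include <forward_list>

// Sums the squares of all elements, one OpenMP task per element.
long sum_of_squares(const std::forward_list<int>& items)
{
    long total = 0;
    #pragma omp parallel
    #pragma omp single
    {
        // Only one thread runs this loop; its trip count is never
        // needed in advance.
        for (auto it = items.begin(); it != items.end(); ++it) {
            #pragma omp task firstprivate(it) shared(total)
            {
                // Stand-in for real per-element work.
                const long r = static_cast<long>(*it) * (*it);
                #pragma omp atomic
                total += r;
            }
        }
    }  // all tasks have finished once the parallel region ends
    return total;
}
```

With work this small the task overhead mentioned above would dominate, so this only pays off when each element takes a substantial amount of time.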
The answer is simple.
OpenMP does not allow premature termination of a team of threads.
With == or !=, OpenMP has no way of determining when the loop stops.
1. One or more threads could hit the termination condition, which might not be unique.
2. OpenMP has no way to shut down the other threads that might never detect the condition.
If I were to see the statement
for(i = 0; i != j; i++)
used instead of the statement
for(i = 0; i < j; i++)
I would be left wondering why the programmer had made that choice, never mind that it can mean the same thing. It may be that OpenMP is making a hard syntactic choice in order to force a certain clarity of code.
Here's code which raises challenges for the use of != and may help explain why it isn't allowed.
#include <cstdio>
int main(){
int j=10;
#pragma omp parallel for
for(int i = 0; i < j; i++){
printf("%d\n",i++);
}
}
Notice that i is incremented both in the for statement and within the loop body, which, with a != predicate, leads to the possibility (but not the guarantee) of an infinite loop.
If the predicate is < then the loop's behavior can still be well-defined in a parallel context without the compiler having to check within the loop for changes to i and determining how those changes will affect the loop's bounds.
If the predicate is != then the loop's behavior is no longer well-defined and it may be infinite in extent, preventing easy parallel subdivision.
I think there is perhaps no good reason other than having extended existing functionality to get this far.
IIRC originally these had to be static so that it could determine at compile time how to generate the loop code... it could just be a hangover from that.