I am new to parallel programming with OpenMP and I am just learning how task and data dependencies work.
I am developing a simple blocked matrix multiplication program where I define a struct as follows:
struct matrix {
    int ncols;
    int nrows;
    double* mat;
};
Now, for each matrix I call malloc to obtain a linear vector, i.e. a linearised matrix.
The parallelized code that I'm trying to write is this:
#pragma omp parallel
#pragma omp single
for(i=0; i<m1->nrows; i+=BS){
    for(j=0; j<m2->ncols; j+=BS){
        for(k=0; k<m3->ncols; k+=BS){
            #pragma omp task depend(in: m1->mat[i:BS*BS], m2->mat[k:BS*BS]) depend(inout: m3->mat[i:BS*BS])
            for (ii = i; ii < i+BS; ii++) {
                for (jj = j; jj < j+BS; jj++) {
                    for (kk = k; kk < k+BS; kk++) {
                        m3->mat[ii * m3->ncols + jj] += m1->mat[ii*m1->ncols+kk] * m2->mat[kk*m2->ncols+jj];
                    }
                }
            }
        }
    }
}
The problem is that the compiler reports some errors, yet I am sure that it is possible to express dependencies with array sections...
mat_mul_blocks.c:67:42: error: expected ‘]’ before ‘:’ token
67 | #pragma omp task depend(in: m1->mat[i:BS*BS], m2->mat[k:BS*BS]) depend(inout: m3->mat[i:BS*BS])
| ^
| ]
mat_mul_blocks.c:67:60: error: expected ‘]’ before ‘:’ token
67 | #pragma omp task depend(in: m1->mat[i:BS*BS], m2->mat[k:BS*BS]) depend(inout: m3->mat[i:BS*BS])
| ^
| ]
mat_mul_blocks.c:67:92: error: expected ‘]’ before ‘:’ token
67 | in: m1->mat[i:BS*BS], m2->mat[k:BS*BS]) depend(inout: m3->mat[i:BS*BS])
Based on Section 2.19.11 and Section 2.1 of the OpenMP 5.1 specification:
The syntax of the depend clause is as follows:
depend([depend-modifier,] dependence-type: locator-list)
[...]
A locator-list consists of a comma-separated collection of one or more locator list items
[...]
The list items that appear in the depend clause may include array sections or the omp_all_memory reserved locator.
Thus, to put it shortly: it is fully conforming for a compiler not to implement array-section parsing/support. This is actually the case for GCC, while Clang parses them correctly.
Several compilers and runtimes do not support array sections in the depend clause. AFAIK, all mainstream OpenMP implementations (including GOMP of GCC and IOMP of Clang/ICC) simply ignore them at runtime so far... The rationale is that the dependency analysis would clearly be too expensive to perform at runtime (some research projects tried to implement it, but the performance results were not great). Because the OpenMP runtime is tightly bound to the compiler, and because of the previous point, some compilers may not support array sections at all in the depend clause (which means you get parsing errors, as in your case).
That being said, based on Section 2.1.5, the array-section syntax you use looks conforming to the OpenMP standard, but be aware that locators/array sections must not overlap. In your case they seem to overlap, which breaks the OpenMP standard and results in undefined behaviour for OpenMP runtimes that do support array sections.
I advise you not to use array sections in the depend clause. Instead, you can use pointers as dependency locators, computed outside the directive, to avoid compiler parsing issues:
const double* dep1 = &m3->mat[i * m3->ncols + j];
const double* dep2 = &m1->mat[i * m1->ncols + k];
const double* dep3 = &m2->mat[k * m2->ncols + j];
#pragma omp task depend(in: *dep2, *dep3) depend(inout: *dep1)
This code should work on most compilers, including GCC, Clang and ICC (MSVC only supports OpenMP 2.0 so far). Note that since C++17 you can use the attribute [[maybe_unused]] to prevent compilers from emitting useless warnings about the pointer variables being unused when OpenMP is not enabled/supported by the target compiler (or when they are wrongly detected as unused).
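Put together, a minimal sketch of the blocked loops with such pointer locators might look like this (my illustration, not the OP's code; it assumes the struct matrix from the question, that BS divides all dimensions, and that the k loop runs over the inner dimension m1->ncols):
/* One representative address per block serves as the dependency locator. */
#pragma omp parallel
#pragma omp single
for (int i = 0; i < m1->nrows; i += BS)
    for (int j = 0; j < m2->ncols; j += BS)
        for (int k = 0; k < m1->ncols; k += BS) {
            double* depC = &m3->mat[i * m3->ncols + j];       /* output block  */
            const double* depA = &m1->mat[i * m1->ncols + k]; /* input block A */
            const double* depB = &m2->mat[k * m2->ncols + j]; /* input block B */
            /* i, j, k and the locator pointers are declared in the single region,
               so they become firstprivate in the task by default. */
            #pragma omp task depend(in: *depA, *depB) depend(inout: *depC)
            for (int ii = i; ii < i + BS; ii++)
                for (int jj = j; jj < j + BS; jj++)
                    for (int kk = k; kk < k + BS; kk++)
                        m3->mat[ii * m3->ncols + jj] +=
                            m1->mat[ii * m1->ncols + kk] * m2->mat[kk * m2->ncols + jj];
        }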
Related
I've come across an unusual problem in trying to make my C/C++ project support both gcc-8 (with OpenMP 4.5) and gcc-9 (with OpenMP 5.0). It is caused by variables defined as const which are to be shared between threads.
For example, here's some code that is compatible with OpenMP 5 but fails with OpenMP 4.5 (with error: 'x' is predetermined 'shared' for 'shared'):
const int x = 10;
int i;
# pragma omp parallel \
    default (none) \
    shared (x) \
    private (i)
{
    # pragma omp for schedule (static)
    for (i=0; i<x; i++)
        // etc
}
The above turns out also to be compatible with clang-10, though not clang-3.7.
Here's the same code (just excluding x from shared), which is compatible with OpenMP 4.5 but fails with OpenMP 5 (with error: 'x' not specified in enclosing 'parallel'):
const int x = 10;
int i;
# pragma omp parallel \
    default (none) \
    private (i)
{
    # pragma omp for schedule (static)
    for (i=0; i<x; i++)
        // etc
}
It seems like OpenMP 5 no longer intelligently assumes that const variables are shared. The only code which compiles with both is one which needlessly puts x in firstprivate:
const int x = 10;
int i;
# pragma omp parallel \
    default (none) \
    firstprivate (x) \
    private (i)
{
    # pragma omp for schedule (static)
    for (i=0; i<x; i++)
        // etc
}
This seems a poor solution, since now I'm paying to copy x into every thread when it's never going to be modified! Note also I cannot make x non-const, since it's actually a const argument to a calling function.
What's the deal, and what's the correct/efficient way to make agnostic code?
Porting to gcc-9 is addressed here. To maintain compatibility with both versions of OpenMP, I'll have to either:
remove default (none)
remove all const
Does the OpenMP standard guarantee that #pragma omp simd works, i.e. should the compilation fail if the compiler can't vectorize the code?
#include <cstdint>

void foo(uint32_t r[8], uint16_t* ptr)
{
    const uint32_t C = 1000;
    #pragma omp simd
    for (int j = 0; j < 8; ++j)
        if (r[j] < C)
            r[j] = *(ptr++);
}
gcc and clang fail to vectorize this but do not complain at all (unless you use -fopt-info-vec-optimized-missed and the like).
No, it is not guaranteed. Relevant portions of the OpenMP 4.5 standard that I could find (emphasis mine):
(1.3) When any thread encounters a simd construct, the iterations of the loop associated with the construct may be executed concurrently using the SIMD lanes that are available to the thread.
(2.8.1) The simd construct can be applied to a loop to indicate that the loop can be transformed into a SIMD loop (that is, multiple iterations of the loop can be executed concurrently using SIMD instructions).
(Appendix C) The number of iterations that are executed concurrently at any given time is implementation defined.
(1.2.7) implementation defined: Behavior that must be documented by the implementation, and is allowed to vary among different compliant implementations. An implementation is allowed to define this behavior as unspecified.
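For illustration (my addition, not part of the quoted standard text): the *(ptr++) in the question creates a genuine cross-iteration dependence, since how far ptr has advanced depends on how many earlier elements passed the test, so the loop is hard to vectorize regardless of the directive. A variant without that dependence, which gcc and clang typically do vectorize, might look like this (note the semantics differ, as every lane reads ptr[j]):
#include <cstdint>

void foo_dense(uint32_t r[8], const uint16_t* ptr)
{
    const uint32_t C = 1000;
    #pragma omp simd
    for (int j = 0; j < 8; ++j)
        if (r[j] < C)
            r[j] = ptr[j];   // indexed by j, no dependence on earlier iterations
}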
I have a simple program with three std::vectors that I use in for loops. After turning on the optimization flag, I am testing whether these loops are vectorized or not. But Visual Studio reports that the loops are not vectorized due to reason 1200. My sample code is below.
#include <iostream>
#include <vector>
#include <time.h>

int main(int argc, char *argv[])
{
    clock_t t = clock();
    int tempSize = 100;
    std::vector<double> tempVec(tempSize);
    std::vector<double> tempVec1(tempSize);
    std::vector<double> tempVec2(tempSize);
    for(int i=0; i<tempSize; i++)
    {
        tempVec1[i] = 20;
        tempVec2[i] = 30;
    }
    for(int i=0, imax=tempSize; i<imax; i++)
        tempVec[i] = tempVec1[i] + tempVec2[i];
    t = clock() - t; // stop the clock
    std::cout << "Time in seconds = " << t/double(CLOCKS_PER_SEC) << std::endl;
    return 0;
}
And below is the compiler output with the option /Qvec-report:2 enabled.
2> --- Analyzing function: main
2> d:\test\ssetestonvectors\main.cpp(12) : info C5002: loop not vectorized due to reason '1200'
2> d:\test\ssetestonvectors\main.cpp(18) : info C5002: loop not vectorized due to reason '1200'
When I read about the error code 1200 on the MSDN page:
https://msdn.microsoft.com/en-us/library/jj658585.aspx
It specifies that error code 1200 is due to "Loop contains loop carried data dependence"
I am unable to understand how this loop contains a loop-carried data dependence. I have some code that I need to optimize so that it can use the auto-vectorization feature of Visual Studio and be optimized for SSE2. This code contains vector operations, but I am unable to do that, because each time Visual Studio shows some error code like this.
I think your problem is that:
for(int i=0,imax=tempSize;i<imax;i++)
tempVec[i] = tempVec1[i] + tempVec2[i];
Is actually
for(int i=0,imax=tempSize;i<imax;i++)
tempVec.operator[](i) = tempVec1.operator[](i) + tempVec2.operator[](i);
... and the vectorizer is failing to look inside the function calls. The first fix for that is:
const double* t1 = &tempVec1.front();
const double* t2 = &tempVec2.front();
double *t = &tempVec.front();
for(int i=0,imax=tempSize;i<imax;i++)
t[i] = t1[i] + t2[i];
The problem with that is that the vectoriser can't see that t, t1, and t2 don't overlap. You have to promise the compiler they don't:
const double* __restrict t1 = &tempVec1.front();
const double* __restrict t2 = &tempVec2.front();
double * __restrict t = &tempVec.front();
for(int i=0,imax=tempSize;i<imax;i++)
t[i] = t1[i] + t2[i];
Obviously (I hope) use of the __restrict keyword (which is not part of standard C++) means this code will not be portable to other C++ compilers.
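If portability across compilers matters, one common trick (a sketch under the assumption that your compilers are MSVC, GCC or Clang; these keyword spellings are real, but verify them for your versions) is to hide the non-standard keyword behind a macro:
/* Map a RESTRICT macro onto each compiler's spelling of the restrict qualifier. */
#if defined(_MSC_VER)
#  define RESTRICT __restrict
#elif defined(__GNUC__) || defined(__clang__)
#  define RESTRICT __restrict__
#else
#  define RESTRICT
#endif

double* RESTRICT t = &tempVec.front();
const double* RESTRICT t1 = &tempVec1.front();
const double* RESTRICT t2 = &tempVec2.front();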
Edit: The OP has clarified that replacing calls to operator[] with calls to at produces a different failure message (although that might be because at is more complex).
If the problem is not the function calls, my next hypothesis is that operator[] boils down to something like return this->__begin[i]; and the vectorizer doesn't know that different std::vectors have non-overlapping memory. If so, the final code block is still the solution.
Auto-vectorization is a rather new feature of MSVC, and you're using an older version of MSVC. So it's far from perfect. Microsoft knows that, so they've decided to only vectorize code when it is absolutely safe.
The particular error message is a bit terse. In reality, it should say "Loop might contain loop-carried data dependence". Since MSVC can't prove their absence, it doesn't vectorize.
I am learning how to use OpenMP with C++ using the GNU compiler 6.2.1, and I tested the following code:
#include <stdio.h>
#include <omp.h>
#include <iostream>

int b = 10;

int main()
{
    int array[8];

    std::cout << "Test with #pragma omp simd linear:\n";
    #pragma omp simd linear(b)
    for (int n=0; n<8; ++n) array[n] = b;
    for (int n=0; n<8; ++n) printf("Iteration %d: %d\n", n, array[n]);

    std::cout << "Test with #pragma omp parallel for linear:\n";
    #pragma omp parallel for linear(b)
    for (int n=0; n<8; ++n) array[n] = b;
    for (int n=0; n<8; ++n) printf("Iteration %d: %d\n", n, array[n]);
}
In both cases I expected a list of numbers going from 10 to 17, however, this was not the case. The #pragma omp simd linear(b) is outright ignored, printing only 10 for each value in array. For #pragma omp parallel for linear(b) the program outputs 10,10,12,12,14,14,16,16.
I compile the file using g++ -fopenmp -Wall main.cpp -o main.o. How can I fix this?
EDIT: Reading the specification more carefully, I found that the linear clause overwrites the initial value with the last value obtained (i.e. if we start with b=10, after the first loop we have b=17).
However, the program runs correctly if I add schedule(dynamic) to the parallel for loops. Why would I have to specify that parameter in order to have a correct execution?
The OpenMP specification says:
The linear clause declares one or more list items to be private and to
have a linear relationship with respect to the iteration space of a
loop associated with the construct on which the clause appears.
This is information for the compiler only, indicating the linear behaviour of a variable in a loop, but in your code b is not increased at all. That is why you always get 10 in the first loop. So the strange results obtained are not the compiler's fault. To correct it you have to use
array[n]=b++;
On the other hand, for the #pragma omp parallel for linear(b) loop, OpenMP calculates the starting value of b for each thread (based on the linear relationship), but this value is still not increased within a given thread. So, depending on the number of threads used, you will see a different number of "steps".
In the case of the schedule(dynamic) clause, the chunk_size is 1, so each loop iteration runs in a different thread. In this case the initial b value is always calculated by OpenMP, so you get correct values only.
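Putting the fix together, a minimal corrected sketch of the first loop (my illustration, following the fix above) would be:
// With b actually incremented each iteration, its behaviour matches what
// linear(b) declares, and array is filled with 10, 11, ..., 17.
#pragma omp simd linear(b)
for (int n = 0; n < 8; ++n) array[n] = b++;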
I am doing some image processing, for which I benefit from vectorization.
I have a function that vectorizes fine, but for which I am not able to convince the compiler that the input and output buffers have no overlap, and so that no alias checking is necessary.
I should be able to do so using __restrict__, but if the buffers are not declared __restrict__ when they arrive as function arguments, there is no way to convince the compiler that I am absolutely sure the two buffers will never overlap.
This is the function:
__attribute__((optimize("tree-vectorize","tree-vectorizer-verbose=6")))
void threshold(const cv::Mat& inputRoi, cv::Mat& outputRoi, const unsigned char valueTh) {
    const int height = inputRoi.rows;
    const int width = inputRoi.cols;
    for (int j = 0; j < height; j++) {
        const uint8_t* __restrict in = (const uint8_t* __restrict) inputRoi.ptr(j);
        uint8_t* __restrict out = (uint8_t* __restrict) outputRoi.ptr(j);
        for (int i = 0; i < width; i++) {
            out[i] = (in[i] < valueTh) ? 255 : 0;
        }
    }
}
The only way I can convince the compiler not to perform the alias checking is to put the inner loop in a separate function in which the pointers are declared as __restrict__ arguments. If I declare this inner function as inline, the alias checking is activated again.
You can see the effect also with this example, which I think is consistent: http://goo.gl/7HK5p7
(Note: I know there might be better ways of writing the same function, but in this case I am just trying to understand how to avoid alias check)
Edit:
Problem is solved!! (See answer below)
Using gcc 4.9.2, here is the complete example. Note the use of the compiler flag -fopt-info-vec-optimized in place of the superseded -ftree-vectorizer-verbose=N.
So, for gcc, use #pragma GCC ivdep and enjoy! :)
If you are using the Intel compiler, you can try to include the line:
#pragma ivdep
The following paragraph is quoted from Intel compiler user manual:
The ivdep pragma instructs the compiler to ignore assumed vector
dependencies. To ensure correct code, the compiler treats an assumed
dependence as a proven dependence, which prevents vectorization. This
pragma overrides that decision. Use this pragma only when you know
that the assumed loop dependencies are safe to ignore.
In gcc, one should add the line:
#pragma GCC ivdep
inside the function, right before the loop you want to vectorize (see the documentation). This is only supported starting from gcc 4.9 and, by the way, it makes the use of __restrict__ redundant.
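As a minimal sketch (my adaptation of the question's loop, with plain pointers instead of cv::Mat), the pragma goes directly before the loop:
void threshold_row(const unsigned char* in, unsigned char* out,
                   int width, unsigned char valueTh) {
    /* Ask gcc to ignore assumed (unproven) dependences between in and out. */
    #pragma GCC ivdep
    for (int i = 0; i < width; i++)
        out[i] = (in[i] < valueTh) ? 255 : 0;
}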
Another approach to this specific issue, which is standardised and fully portable across (reasonably modern) compilers, is to use the OpenMP simd directive, which has been part of the standard since version 4.0. The code then becomes:
void threshold(const unsigned char* inputRoi, const unsigned char valueTh,
               unsigned char* outputRoi, const int width,
               const int stride, const int height) {
    #pragma omp simd
    for (int i = 0; i < width; i++) {
        outputRoi[i] = (inputRoi[i] < valueTh) ? 255 : 0;
    }
}
When compiled with OpenMP support enabled (either full support, or simd-only support such as -qopenmp-simd for the Intel compiler), the code is fully vectorised.
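For reference, here are hedged examples of how the simd-only subset is typically enabled (the flags exist, but check availability for your compiler version):
gcc   -O3 -fopenmp-simd -c threshold.c
clang -O3 -fopenmp-simd -c threshold.c
icc   -O3 -qopenmp-simd -c threshold.c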
In addition, this gives you the opportunity to indicate the possible alignment of vectors, which can come in handy in some circumstances. For example, had your input and output arrays been allocated with an alignment-aware memory allocator, such as posix_memalign() with an alignment requirement of 32 bytes (256 bits), then the code could become:
void threshold(const unsigned char* inputRoi, const unsigned char valueTh,
               unsigned char* outputRoi, const int width,
               const int stride, const int height) {
    #pragma omp simd aligned(inputRoi, outputRoi : 32)
    for (int i = 0; i < width; i++) {
        outputRoi[i] = (inputRoi[i] < valueTh) ? 255 : 0;
    }
}
This should then permit the compiler to generate an even faster binary. Moreover, this feature isn't readily available using the ivdep directives, which is all the more reason to use the OpenMP simd directive.
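For completeness, a hedged sketch of a matching allocation (my illustration, not from the original answer; error handling shortened):
#include <stdlib.h>

/* 32-byte (256-bit) aligned buffers to match aligned(inputRoi, outputRoi : 32). */
unsigned char *inputRoi, *outputRoi;
if (posix_memalign((void**)&inputRoi,  32, (size_t)width * height) != 0 ||
    posix_memalign((void**)&outputRoi, 32, (size_t)width * height) != 0) {
    /* allocation failed */
}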
The Intel compiler, at least as of version 14, does not generate aliasing checks for threshold2 in the code you linked, indicating that your approach should work. However, the gcc auto-vectorizer misses this optimization opportunity, although it does generate vectorized code, tests for proper alignment, tests for aliasing, and non-vectorized fallback/cleanup code.