Does the OpenMP standard guarantee that #pragma omp simd works, i.e. should compilation fail if the compiler can't vectorize the code?
#include <cstdint>

void foo(uint32_t r[8], uint16_t* ptr)
{
    const uint32_t C = 1000;
    #pragma omp simd
    for (int j = 0; j < 8; ++j)
        if (r[j] < C)
            r[j] = *(ptr++);
}
gcc and clang fail to vectorize this but do not complain at all (unless you use -fopt-info-vec-optimized-missed and the like).
No, it is not guaranteed. Relevant portions of the OpenMP 4.5 standard that I could find (emphasis mine):
(1.3) When any thread encounters a simd construct, the iterations of the loop associated with the construct may be executed concurrently using the SIMD lanes that are available to the thread.
(2.8.1) The simd construct can be applied to a loop to indicate that the loop can be transformed into a SIMD loop (that is, multiple iterations of the loop can be executed concurrently using SIMD instructions).
(Appendix C) The number of iterations that are executed concurrently at any given time is implementation defined.
(1.2.7) implementation defined: Behavior that must be documented by the implementation, and is allowed to vary among different compliant implementations. An implementation is allowed to define this behavior as unspecified.
I read here that sequential memory consistency (seq_cst) "might be needed" to make sure an atomic update is viewed by all threads consistently in an OpenMP parallel region.
Consider the following MWE, which is admittedly trivial and could be realized with a reduction rather than atomics, but which illustrates my question that arose in a more complex piece of code:
#include <iostream>

int main()
{
    double a = 0;
    #pragma omp parallel for
    for (int i = 0; i < 10000000; ++i)
    {
        #pragma omp atomic
        a += 5.5;
    }
    std::cout.precision(17);
    std::cout << a << std::endl;
    return 0;
}
I compiled this with g++ -fopenmp -O3 using GCC versions 6 to 12 on an Intel Core i9-9880H CPU, and then ran it using 4 or 8 threads, which always correctly prints:
55000000
When adding seq_cst to the atomic directive, the result is exactly the same. I would have expected the code without seq_cst to (occasionally) produce smaller results due to race conditions / outdated memory view. Is this hardware dependent? Is the code guaranteed to be free of race conditions even without seq_cst, and if so, why? Would the answer be different when using a compiler that was still based on OpenMP 3.1, as that apparently worked somewhat differently?
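For reference, a sketch of the variant referred to above (the seq_cst clause on the atomic construct is available from OpenMP 4.0 onwards):

#pragma omp atomic seq_cst
a += 5.5;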
I am new to parallel programming with OpenMP and I am just learning how tasks and data dependencies work.
I am developing a simple blocked matrix multiplication program, in which I define a struct as follows:
struct matrix {
    int ncols;
    int nrows;
    double* mat;
};
Now, for each matrix I do a malloc to obtain a single linear buffer, i.e. a linearised matrix.
The parallelised code that I'm trying to write is this:
#pragma omp parallel
#pragma omp single
for (i = 0; i < m1->nrows; i += BS) {
    for (j = 0; j < m2->ncols; j += BS) {
        for (k = 0; k < m3->ncols; k += BS) {
            #pragma omp task depend(in: m1->mat[i:BS*BS], m2->mat[k:BS*BS]) depend(inout: m3->mat[i:BS*BS])
            for (ii = i; ii < i+BS; ii++) {
                for (jj = j; jj < j+BS; jj++) {
                    for (kk = k; kk < k+BS; kk++) {
                        m3->mat[ii * m3->ncols + jj] += m1->mat[ii*m1->ncols+kk] * m2->mat[kk*m2->ncols+jj];
                    }
                }
            }
        }
    }
}
The problem is that the compiler reports the following errors, although I am sure that it is possible to express dependencies with array sections:
mat_mul_blocks.c:67:42: error: expected ‘]’ before ‘:’ token
67 | #pragma omp task depend(in: m1->mat[i:BS*BS], m2->mat[k:BS*BS]) depend(inout: m3->mat[i:BS*BS])
| ^
| ]
mat_mul_blocks.c:67:60: error: expected ‘]’ before ‘:’ token
67 | #pragma omp task depend(in: m1->mat[i:BS*BS], m2->mat[k:BS*BS]) depend(inout: m3->mat[i:BS*BS])
| ^
| ]
mat_mul_blocks.c:67:92: error: expected ‘]’ before ‘:’ token
67 | in: m1->mat[i:BS*BS], m2->mat[k:BS*BS]) depend(inout: m3->mat[i:BS*BS])
Based on Section 2.19.11 and Section 2.1 of the OpenMP 5.1 specification:
The syntax of the depend clause is as follows:
depend([depend-modifier,] dependence-type: locator-list)
[...]
A locator-list consists of a comma-separated collection of one or more locator list items
[...]
The list items that appear in the depend clause may include array sections or the omp_all_memory reserved locator.
Thus, put shortly: it is fully conforming for a compiler not to implement array-section parsing/support in the depend clause. This is actually the case for GCC, while Clang parses them correctly.
Several compilers and runtimes do not care about / do not support array sections in the depend clause. AFAIK, all mainstream OpenMP implementations (including GOMP of GCC and IOMP of Clang/ICC) simply ignore them at runtime so far. The rationale is that the dependency analysis would clearly be too expensive to perform at runtime (some research projects have tried to implement it, but the performance results were not great). Because the OpenMP runtime is tightly bound to the compiler, and because of the previous point, some compilers may not support array sections in the depend clause at all (which means they result in parsing errors, as in your case).
That being said, based on Section 2.1.5, the array-section syntax you use looks conforming to the OpenMP standard, but be aware that locators/array sections must not overlap. In your case, they seem to overlap, which breaks the OpenMP standard and results in undefined behaviour for OpenMP runtimes that do support array sections.
I advise you not to use array sections in the depend clause. Instead, you can use pointers as dependency locators, defined outside the directive, to avoid compiler parsing issues:
const double* dep1 = &m1->mat[i * m1->ncols + k];
const double* dep2 = &m2->mat[k * m2->ncols + j];
const double* dep3 = &m3->mat[i * m3->ncols + j];
#pragma omp task depend(in: *dep1, *dep2) depend(inout: *dep3)
This code should work on most compilers, including GCC, Clang and ICC (MSVC only supports OpenMP 2.0 so far). Note that since C++17 you can use the [[maybe_unused]] attribute to prevent compilers from emitting useless unused-variable warnings when OpenMP is not enabled/supported by the target compiler (the pointers are then wrongly detected as unused).
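For illustration, applying that attribute to one of the dependency pointers above (a sketch only):

[[maybe_unused]] const double* dep1 = &m1->mat[i * m1->ncols + k];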
This code that I have written creates 2 threads and a for loop that iterates 10000 times, but the value of x at the end comes out near 5000 instead of 10000. Why is that happening?
#include <unistd.h>
#include <stdio.h>
#include <sys/time.h>
#include "omp.h"

using namespace std;

int x = 0;

int main() {
    omp_set_num_threads(2);
    #pragma omp parallel for
    for (int i = 0; i < 10000; i++) {
        x += 1;
    }
    printf("x is: %d\n", x);
}
x is not an atomic type and is read and written in different threads. (Thinking that int is an atomic type is a common misconception.)
The behaviour of your program is therefore undefined.
Using std::atomic<int> x; is the fix.
The reason is that when multiple threads access the same variable, race conditions can occur.
The operation x += 1 can be understood as x = x + 1: you first read the value of x and then write x + 1 back to x. When you have two threads running and operating on the same value of x, the following happens: thread A reads the value of x, which is 0. Thread B reads the value of x, which is still 0. Then thread A writes 0 + 1 to x, and then thread B also writes 0 + 1 to x. Now you have missed one increment and x is just 1 instead of 2. A fix for this problem might be to use an atomic_int.
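A minimal sketch of that fix using a C++11 atomic (note that a later answer below argues against mixing C++11 atomics with OpenMP):

#include <atomic>
#include <cstdio>

std::atomic<int> x{0};

int main() {
    #pragma omp parallel for
    for (int i = 0; i < 10000; i++) {
        x.fetch_add(1);  // atomic read-modify-write, so no increments are lost
    }
    printf("x is: %d\n", x.load());
}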
Modifying one (shared) value by multiple threads is a race condition and leads to wrong results. If multiple threads work with one value, all of them must only read the value.
The idiomatic solution is to use an OpenMP reduction, as follows:
#pragma omp parallel for reduction(+:x)
for (int i = 0; i < 10000; i++) {
    x += 1;
}
Internally, each thread has its own x and they are added together after the loop.
Using atomics is an alternative, but will perform significantly worse. Atomic operations are more costly in themselves and also very bad for caches.
If you use atomics, you should use OpenMP atomics which are applied to the operation, not the variable. I.e.
#pragma omp parallel for
for (int i = 0; i < 10000; i++) {
    #pragma omp atomic
    x += 1;
}
You should not, as other answers suggest, use C++11 atomics. Using them is explicitly unspecified behavior in OpenMP. See this question for details.
I am doing some image processing, for which I benefit from vectorization.
I have a function that vectorizes fine, but for which I am not able to convince the compiler that the input and output buffers have no overlap, so that no alias checking is necessary.
I should be able to do so using __restrict__, but if the buffers are not declared __restrict__ when they arrive as function arguments, there seems to be no way to convince the compiler that I am absolutely sure the two buffers will never overlap.
This is the function:
__attribute__((optimize("tree-vectorize","tree-vectorizer-verbose=6")))
void threshold(const cv::Mat& inputRoi, cv::Mat& outputRoi, const unsigned char valueTh) {
    const int height = inputRoi.rows;
    const int width = inputRoi.cols;
    for (int j = 0; j < height; j++) {
        const uint8_t* __restrict in = (const uint8_t* __restrict) inputRoi.ptr(j);
        uint8_t* __restrict out = (uint8_t* __restrict) outputRoi.ptr(j);
        for (int i = 0; i < width; i++) {
            out[i] = (in[i] < valueTh) ? 255 : 0;
        }
    }
}
The only way I can convince the compiler not to perform the alias checking is if I put the inner loop in a separate function in which the pointers are declared as __restrict__ arguments. If I declare this inner function as inline, the alias checking is activated again.
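For illustration, a sketch of what that separate-function workaround might look like (the helper name is hypothetical; the point is that the __restrict__ qualifiers sit on the parameters and the function must not be inlined):

#include <cstdint>

// Hypothetical helper: the alias information is carried by the parameter types.
// Marking it noinline keeps the __restrict__ guarantee visible to the vectorizer.
__attribute__((noinline))
static void threshold_row(const uint8_t* __restrict in, uint8_t* __restrict out,
                          const int width, const unsigned char valueTh) {
    for (int i = 0; i < width; i++) {
        out[i] = (in[i] < valueTh) ? 255 : 0;
    }
}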
You can see the effect also with this example, which I think is consistent: http://goo.gl/7HK5p7
(Note: I know there might be better ways of writing the same function, but in this case I am just trying to understand how to avoid alias check)
Edit:
Problem is solved!! (See answer below)
Using gcc 4.9.2, here is the complete example. Note the use of the compiler flag -fopt-info-vec-optimized in place of the superseded -ftree-vectorizer-verbose=N.
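(The complete example itself is not reproduced here; below is a minimal sketch of the fixed function, assuming the same cv::Mat-based signature as in the question.)

__attribute__((optimize("tree-vectorize")))
void threshold(const cv::Mat& inputRoi, cv::Mat& outputRoi, const unsigned char valueTh) {
    const int height = inputRoi.rows;
    const int width = inputRoi.cols;
    for (int j = 0; j < height; j++) {
        const uint8_t* in = inputRoi.ptr(j);
        uint8_t* out = outputRoi.ptr(j);
        // Tell GCC to ignore assumed vector dependencies for the loop below.
        #pragma GCC ivdep
        for (int i = 0; i < width; i++) {
            out[i] = (in[i] < valueTh) ? 255 : 0;
        }
    }
}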
So, for gcc, use #pragma GCC ivdep and enjoy! :)
If you are using the Intel compiler, you can try to include the line:
#pragma ivdep
The following paragraph is quoted from the Intel compiler user manual:
The ivdep pragma instructs the compiler to ignore assumed vector
dependencies. To ensure correct code, the compiler treats an assumed
dependence as a proven dependence, which prevents vectorization. This
pragma overrides that decision. Use this pragma only when you know
that the assumed loop dependencies are safe to ignore.
In gcc, one should add the line:
#pragma GCC ivdep
inside the function and right before the loop you want to vectorize (see documentation). This is only supported starting from gcc 4.9 and, by the way, makes the use of __restrict__ redundant.
Another approach for this specific issue, which is standardised and fully portable across (reasonably modern) compilers, is to use the OpenMP simd directive, which has been part of the standard since version 4.0. The code then becomes:
void threshold(const unsigned char* inputRoi, const unsigned char valueTh,
               unsigned char* outputRoi, const int width,
               const int stride, const int height) {
    #pragma omp simd
    for (int i = 0; i < width; i++) {
        outputRoi[i] = (inputRoi[i] < valueTh) ? 255 : 0;
    }
}
When compiled with OpenMP support enabled (either full support, or simd-only support such as -qopenmp-simd for the Intel compiler), the code is fully vectorised.
In addition, this gives you the opportunity to indicate possible alignment of the vectors, which can come in handy in some circumstances. For example, had your input and output arrays been allocated with an alignment-aware memory allocator, such as posix_memalign() with a 32-byte (256-bit) alignment requirement, then the code could become:
void threshold(const unsigned char* inputRoi, const unsigned char valueTh,
               unsigned char* outputRoi, const int width,
               const int stride, const int height) {
    #pragma omp simd aligned(inputRoi, outputRoi : 32)
    for (int i = 0; i < width; i++) {
        outputRoi[i] = (inputRoi[i] < valueTh) ? 255 : 0;
    }
}
This should then permit generating an even faster binary, and this feature isn't readily available with the ivdep directives. All the more reason to use the OpenMP simd directive.
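For completeness, a sketch of how such aligned buffers might be allocated before calling the threshold() version above (the driver function is hypothetical; posix_memalign() is POSIX, and the 32-byte alignment matches the aligned clause):

#include <cstdlib>   // posix_memalign, free

// Hypothetical driver: allocate 32-byte-aligned buffers, then call threshold().
void run_threshold(int width, int height, unsigned char valueTh) {
    void* inBuf = nullptr;
    void* outBuf = nullptr;
    if (posix_memalign(&inBuf, 32, width * height) != 0) return;
    if (posix_memalign(&outBuf, 32, width * height) != 0) { free(inBuf); return; }

    // ... fill inBuf with image data ...
    threshold(static_cast<unsigned char*>(inBuf), valueTh,
              static_cast<unsigned char*>(outBuf), width, width, height);

    free(inBuf);
    free(outBuf);
}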
The Intel compiler, at least as of version 14, does not generate aliasing checks for threshold2 in the code you linked, which indicates that your approach should work. However, the gcc auto-vectorizer misses this optimization opportunity: it does generate vectorized code, but also tests for proper alignment, tests for aliasing, and emits non-vectorized fall-back/clean-up code.
I’m using OpenMP and need to use the fetch-and-add operation. However, OpenMP doesn’t provide an appropriate directive/call. I’d like to preserve maximum portability, hence I don’t want to rely on compiler intrinsics.
Rather, I’m searching for a way to harness OpenMP’s atomic operations to implement this but I’ve hit a dead end. Can this even be done? N.B., the following code almost does what I want:
#pragma omp atomic
x += a
Almost – but not quite, since I really need the old value of x. fetch_and_add should be defined to produce the same result as the following (only non-locking):
template <typename T>
T fetch_and_add(volatile T& value, T increment) {
    T old;
    #pragma omp critical
    {
        old = value;
        value += increment;
    }
    return old;
}
(An equivalent question could be asked for compare-and-swap but one can be implemented in terms of the other, if I’m not mistaken.)
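(As an aside, not part of the question: the equivalence mentioned above can be sketched with C++11 atomics, implementing fetch-and-add on top of compare-and-swap; the names below are purely illustrative.)

#include <atomic>

template <typename T>
T fetch_and_add_via_cas(std::atomic<T>& value, T increment) {
    T old = value.load();
    // Retry until no other thread modified 'value' between the load and the swap.
    while (!value.compare_exchange_weak(old, old + increment)) {
        // On failure, 'old' is refreshed with the current value of 'value'.
    }
    return old;
}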
As of OpenMP 3.1 there is support for capturing atomic updates; you can capture either the old value or the new value. Since we have to bring the value in from memory to increment it anyway, it only makes sense that we should be able to access it from, say, a CPU register and put it into a thread-private variable.
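A brief sketch of the two capture forms (variable names are illustrative):

int x = 0, oldval, newval;

// Capture the value *before* the update (fetch-and-add semantics).
#pragma omp atomic capture
{ oldval = x; x += 1; }

// Capture the value *after* the update.
#pragma omp atomic capture
{ x += 1; newval = x; }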
There's a nice work-around if you're using gcc (or g++), look up atomic builtins:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
I think Intel's C/C++ compiler also has support for this, but I haven't tried it.
For now (until OpenMP 3.1 is implemented), I've used inline wrapper functions in C++ where you can choose which version to use at compile time:
template <class T>
inline T my_fetch_add(T *ptr, T val) {
#ifdef GCC_EXTENSION
    return __sync_fetch_and_add(ptr, val);
#endif
#ifdef OPENMP_3_1
    T t;
    #pragma omp atomic capture
    { t = *ptr; *ptr += val; }
    return t;
#endif
}
Update: I just tried Intel's C++ compiler; it currently has support for OpenMP 3.1 (atomic capture is implemented). Intel offers free use of its compilers on Linux for non-commercial purposes:
http://software.intel.com/en-us/articles/non-commercial-software-download/
GCC 4.7 will support OpenMP 3.1 when it is eventually released... hopefully soon :)
If you want to get the old value of x and a is not changed, use (x - a) as the old value:
int fetch_and_add(int *x, int a) {
    #pragma omp atomic
    *x += a;
    return (*x - a);
}
UPDATE: this was not really an answer, because x can be modified by another thread after the atomic update.
So it seems to be impossible to build a universal "fetch-and-add" using OpenMP pragmas. By universal I mean an operation that can easily be used from any place in OpenMP code.
You can use the omp_*_lock functions to simulate atomics:
typedef struct { omp_lock_t lock; int value; } atomic_simulated_t;

int fetch_and_add(atomic_simulated_t *x, int a)
{
    int ret;
    omp_set_lock(&x->lock);   /* the lock must have been initialised with omp_init_lock() */
    ret = x->value;           /* capture the old value */
    x->value += a;
    omp_unset_lock(&x->lock);
    return ret;
}
This is ugly and slow (doing 2 atomic ops instead of 1). But if you want your code to be very portable, it will not be the fastest in all cases.
You say "as the following (only non-locking)". But what is the difference between "non-locking" operations (using CPU's "LOCK" prefix, or LL/SC or etc) and locking operations (which are implemented itself with several atomic instructions, busy loop for short wait of unlock and OS sleeping for long waits)?