Results of OpenMP target directives on PGI - c++

I'm using PGI to compile the following program which uses OpenMP's target directives to offload work to a GPU:
#include <iostream>
#include <cmath>
int main(){
    const int SIZE = 400000;
    double *m;
    m = new double[SIZE];
    #pragma omp target teams distribute parallel for
    for(int i = 0; i < SIZE; i++)
        m[i] = std::sin((double)i);
    for(int i = 0; i < SIZE; i++)
        std::cout << m[i] << "\n";
}
My compilation string is as follows:
pgc++ -omp -ta=tesla,pinned,cc60 -Minfo=accel -fast test2.cpp
Compilation succeeds, but it lacks the series of outputs that I get with OpenACC that tell me what the compiler actually did with the directive, like so:
main:
8, Accelerator kernel generated
Generating Tesla code
11, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
8, Generating implicit copyout(m[:400000])
How can I get similar information for OpenMP? -Minfo by itself didn't seem to yield anything useful.

"-Minfo" (which is the same as "-Minfo=all"), or "-Minfo=mp" will give you compiler feedback messages for OpenMP compilation.
Though, PGI only supports OpenMP 4.5 directives with our LLVM back-end compilers. These are available by default on IBM POWER based systems, or as part of our LLVM beta compilers on x86. The x86 beta compilers can be found at http://www.pgroup.com/support/download_llvm.php but do require a Professional Edition license.
Also, our current OpenMP 4.5 support only targets multicore CPUs. We're working on GPU target offload as well, but that support won't be available for a while.
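For example, once you have a compiler with OpenMP 4.5 support, a multicore build with compiler feedback might look like this (a sketch; -mp is PGI's OpenMP flag, and the file name is taken from the question):
pgc++ -mp -Minfo=mp -fast test2.cpp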

Related

clang 10 & OpenMP on range-based for error (docs say should be ok)

I am testing clang 10.0 on a C++17 & OpenMP project and get errors when #pragma omp parallel for is used on a range-based for.
Release notes for clang 10, in the OpenMP Support in Clang section, say quite clearly:
Added support for range-based loops.
When I compile a MWE with clang++-10 -fopenmp -std=c++17 (see https://godbolt.org/z/fdTeMo for online compiler):
#include <vector>
#include <iostream>
int main(int argc, char** argv){
    std::vector<int> ii{0, 11, 22, 33, 44, 55, 66};
    #pragma omp parallel for
    for(int& i: ii){
        std::cerr << i << std::endl;
    }
}
I get:
<source>:6:5: error: statement after '#pragma omp parallel for' must be a for loop
for(int& i: ii){
^
1 error generated.
Compiler returned: 1
What's up?
Support for range-based for loops was added to OpenMP 5.0, and, as is also described in the Clang 10 Release Notes that you link to, you need to explicitly use the -fopenmp-version=50 option to activate support for it:
OpenMP Support in Clang
Use -fopenmp-version=50 option to activate support for OpenMP 5.0.
Thus, if we expand your compilation command to clang++-10 -fopenmp -fopenmp-version=50 -std=c++17, the OMP pragma accepts the range-based for loop that follows it.

OpenACC compatible with GNU Scientific Library (GSL)?

I am testing to see if I can even use GSL functions within OpenACC compute regions. In Main.c I try the following (silly) for loop which uses GSL functions,
#pragma acc kernels
for(int i = 0; i < 100; i++){
    gsl_matrix *C = gsl_matrix_calloc(10, 10);
    gsl_matrix_free(C);
}
which allocates memory for a 10x10 matrix of zeroes and then frees the memory, 100 times. However, when I compile,
pgcc -pg -fast -acc -Minfo=all,intensity -lgsl -lgslcblas -lm -o Main Main.c
I get the following messages,
PGC-S-0155-Procedures called in a compute region must have acc routine information: gsl_matrix_calloc (Main.c: 60)
PGC-S-0155-Accelerator region ignored; see -Minfo messages (Main.c: 57)
main:
57, Accelerator region ignored
58, Intensity = 1.00
Loop not vectorized/parallelized: contains call
60, Accelerator restriction: call to 'gsl_matrix_calloc' with no acc routine information
In particular, do the first and last messages, regarding "acc routine information", mean it is not possible to use GSL functions within acc compute regions?
I haven't seen direct support for the GSL libraries.
You will need to obtain the source code for the GSL routines that you are using and insert "acc routine" directives where the functions are defined (in C the form is "#pragma acc routine"; "!$acc routine" is the Fortran spelling).
This will instruct the compiler to generate device versions of those routines. Following those directive insertions, you should compile the GSL libraries using the
-acc flag during compilation.
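A minimal sketch of what that edit looks like in the GSL source (the signature matches the documented gsl_matrix_calloc; the body is elided, and whether each routine is actually compilable for the device is a separate question):
/* added above the definition of each routine called from a compute region */
#pragma acc routine seq
gsl_matrix *gsl_matrix_calloc(const size_t n1, const size_t n2)
{
    /* ... original GSL implementation ... */
}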

c++ openmp and threadprivate

I'm in a situation where on one computer (a cluster with high-performance nodes) some code compiles, but on my personal computer it doesn't.
The error is
'var' declared 'threadprivate' after first use.
#pragma omp threadprivate(var)
The relevant lines in the code are in a header file and look like this:
extern const int var;
#pragma omp threadprivate(var);
I haven't written the code so it is difficult to give a minimal example of the
problem.
Here are some specifications of the computers I use:
cluster (compiles):
Red Hat 7.5
gcc 4.8.3 (edit: actually Intel 15.0.0)
OpenMP version date: 2011.07 (I don't have permission to access yum/apt/...)
personal computer (doesn't compile):
Debian 8.0
gcc 4.9.2
OpenMP version date: 2013.07 (libgomp1 v4.9.2-10)
I know there is not enough information, but does anyone have an idea?
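For what it's worth, here is a minimal sketch (hypothetical, not taken from the original code) that reproduces the diagnostic: the threadprivate directive must appear before the variable is first referenced in the translation unit.
extern const int var;
inline int use_var() { return var; }  // first use of var
#pragma omp threadprivate(var)        // error: 'var' declared 'threadprivate' after first use
Moving the directive so that it comes immediately after the declaration, before any use, avoids the error.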

__m256 unknown type (clang 5.1/i5 CPU)?

I just started to experiment with intrinsics. I managed to successfully compile a program using __m128 on a Mac using Clang 5.1. The CPU of this Mac is an Intel Core i5 540M.
When I tried to compile the same code with __m256, I get the following message:
simple.cpp:4:2: error: unknown type name '__m256'
__m256 A;
The code looks like this:
#include <immintrin.h>
int main()
{
    __m256 A;
    return 0;
}
And here is the command used to compile it:
c++ -o simple simple.cpp -march=native -O3
Is it just that my CPU is too old to support the AVX instruction set? Are all the options I use on the command line correct? I checked the immintrin.h include file, and it does include another header which seems to define the AVX intrinsics. Apologies if the question is naive or if the terminology is misused; as I said, I am new to this topic.
The Intel 540M CPU belongs to the Westmere microarchitecture, which predates Sandy Bridge, where AVX was introduced, so it doesn't support AVX. The term "Core i5" covers a wide range of microarchitectures, from Nehalem to Haswell (current at the time of writing), so using a Core i5 CPU doesn't mean that you'll have support for all instruction sets, including the latest ones.
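If you want the same source to build on pre-AVX machines, one option is to guard the AVX code on the __AVX__ macro, which compilers define when AVX code generation is enabled (a sketch, using real intrinsics but a trivial payload):
#include <immintrin.h>
int main()
{
#ifdef __AVX__
    __m256 A = _mm256_set1_ps(1.0f);  // AVX path, compiled only when available
    (void)A;
#else
    __m128 A = _mm_set1_ps(1.0f);     // SSE fallback for pre-AVX CPUs like Westmere
    (void)A;
#endif
    return 0;
}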

Ignore OpenMP on machine that does not have it

I have a C++ program using OpenMP, which will run on several machines that may or may not have OpenMP installed.
How could I make my program know if a machine has no OpenMP and ignore those #include <omp.h>, OpenMP directives (like #pragma omp parallel ...) and/or library functions (like tid = omp_get_thread_num();) ?
OpenMP compilation adds the preprocessor definition "_OPENMP", so you can do:
#if defined(_OPENMP)
#pragma omp ...
#endif
For some examples, see http://bisqwit.iki.fi/story/howto/openmp/#Discussion and the code which follows.
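Here is a sketch that also guards the include and a library call (using the names from the question):
#if defined(_OPENMP)
#include <omp.h>
#endif

int main()
{
    int tid = 0;                   // sensible serial default
#if defined(_OPENMP)
    tid = omp_get_thread_num();    // compiled only when OpenMP is enabled
#endif
    return tid;
}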
Compilers are supposed to ignore #pragma directives they don't understand; that's the whole point of the syntax. And the OpenMP library functions have simple, well-defined meanings on a non-parallel system -- in particular, the wrapper header linked below checks whether the compiler defines ENABLE_OPENMP and, if it doesn't, provides the right fallbacks.
So all you need is a copy of such a wrapper header to include. Here's one: http://cms.mcc.uiuc.edu/qmcdev/docs/html/OpenMP_8h-source.html .
The relevant portion of the code, though, is just this:
#if defined(ENABLE_OPENMP)
#include <omp.h>
#else
typedef int omp_int_t;
inline omp_int_t omp_get_thread_num() { return 0; }
inline omp_int_t omp_get_max_threads() { return 1; }
#endif
At worst, you can just take those three fallback lines, put them in a dummy header of your own, and use that. The rest will just work.
OpenMP is a compiler/runtime thing, not a platform thing.
I.e., if you compile your app using Visual Studio 2005 or higher, then you always have OpenMP available, as the runtime supports it (and if the end user doesn't have the Visual Studio C runtime installed, then your app won't work at all).
So you don't need to worry: if you can use it, it will always be there, just like functions such as strcmp. To make sure users have the CRT, you can install the Visual Studio redistributable.
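(For reference, MSVC enables OpenMP with the /openmp switch; the file name here is just a placeholder:)
cl /openmp /EHsc app.cpp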
edit:
OK, but GCC 4.1 will not be able to compile your OpenMP app, so the issue is not the target machine but the target compiler. Since all compilers have pre-defined macros giving their version, wrap your OpenMP calls in #ifdef blocks. For example, GCC uses three macros to identify the compiler version: __GNUC__, __GNUC_MINOR__ and __GNUC_PATCHLEVEL__.
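A sketch of that approach (GCC gained OpenMP support in 4.2, hence the version test; note that checking _OPENMP, as in the first answer above, is usually simpler because it also confirms -fopenmp was actually passed):
#include <cstdio>

int main()
{
#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 2))
    #pragma omp parallel for   // kept only on GCC 4.2+, which understands OpenMP
#endif
    for (int i = 0; i < 4; ++i)
        std::printf("%d\n", i);
}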
How could I make my program know if a machine has no OpenMP and ignore those #include <omp.h>, OpenMP directives (like #pragma omp parallel ...) and/or library functions (like tid = omp_get_thread_num();) ?
Here's a late answer, but we just got a bug report due to use of #pragma omp simd on Microsoft compilers.
According to OpenMP Specification, section 2.2:
Conditional Compilation
In implementations that support a preprocessor, the _OPENMP macro name
is defined to have the decimal value yyyymm where yyyy and mm are the
year and month designations of the version of the OpenMP API that the
implementation supports.
It appears modern Microsoft compilers only support OpenMP from sometime between 2000 and 2005. I can only say "sometime between" because OpenMP 2.0 was released in 2000, and OpenMP 2.5 was released in 2005. But Microsoft advertises a version from 2002.
Here are some _OPENMP numbers...
Visual Studio 2012 - OpenMP 200203
Visual Studio 2017 - OpenMP 200203
IBM XLC 13.01 - OpenMP 201107
Clang 7.0 - OpenMP 201107
GCC 4.8 - OpenMP 201107
GCC 8.2 - OpenMP 201511
So if you want to guard a loop with, say, #pragma omp simd, which became available in OpenMP 4.0, then:
#if _OPENMP >= 201307
#pragma omp simd
for (size_t i = 0; i < 16; ++i)
    data[i] += x[i];
#else
for (size_t i = 0; i < 16; ++i)
    data[i] += x[i];
#endif
which will run on several machines that may have or not have OpenMP installed.
And to be clear, you probably need to build your program on each of those machines. The x86_64 ABI does not guarantee OpenMP is available on x86, x32 or x86_64 machines, and I have not seen any guarantee that you can build on one machine and then run on another.
There is another approach that I like, borrowed from Bisqwit:
#if defined(_OPENMP)
#include <omp.h>
extern const bool parallelism_enabled = true;
#else
extern const bool parallelism_enabled = false;
#endif
Then, start your OpenMP parallel for loops like this:
#pragma omp parallel for if(parallelism_enabled)
Note: there are also valid reasons for not using pragmas at all, since their effects are compiler-specific, which is why Google and others do not allow them.