OpenACC Library Interoperability: how to get a device pointer? - fortran

We have a project that is written in Fortran, and I am trying to see whether we could use OpenACC in it.
I know this can be done using the PGI compilers, but I don't want to get stuck with licenses.
I got GCC 5.2 installed using the instructions here:
https://github.com/olcf/OLCFHack15
Now I want to do something similar to what is stated here.
https://gcc.gnu.org/onlinedocs/libgomp/OpenACC-Library-Interoperability.html
More specifically, what is stated in section 8.3. I am trying to reproduce it exactly using gfortran, but unfortunately I don't see how to do it in Fortran. In the example,
d_X = acc_copyin(&h_X[0], N * sizeof (float));
This allows d_X to be directly used in
s = cublasSaxpy(h, N, &alpha, d_X, 1, d_Y, 1);
But in Fortran, acc_copyin does not return anything.
So how would I replicate the case in Fortran?

Are you looking to interface with cuBLAS, or is this more general? cuBLAS does provide an F77-style interface (see: http://docs.nvidia.com/cuda/cublas/#appendix-b-cublas-fortran-bindings).
The OpenACC solution is to manage your data as you normally would using the "data" directive, and then call the CUDA C routine from within a "host_data" region. "host_data" specifies that device pointers should be used within the region, so when "d_X" is passed to cublasSaxpy, the device pointer is what gets passed in.
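In Fortran, the pattern looks roughly like the sketch below. This is hypothetical: "cublas_saxpy_wrapper" stands in for a CUDA C wrapper you would write yourself (it is not a real cuBLAS entry point), and the code assumes a compiler that supports "host_data":

```fortran
! Sketch only: assumes a compiler with "host_data" support.
! cublas_saxpy_wrapper is a hypothetical CUDA C wrapper you would
! write yourself; it would call cublasSaxpy with the device
! pointers it receives.
program saxpy_openacc
  implicit none
  integer, parameter :: n = 1024
  real :: x(n), y(n), alpha
  x = 1.0; y = 2.0; alpha = 3.0
  !$acc data copyin(x) copy(y)
  !$acc host_data use_device(x, y)
  call cublas_saxpy_wrapper(n, alpha, x, 1, y, 1)
  !$acc end host_data
  !$acc end data
end program saxpy_openacc
```

Inside the "host_data" region, the names x and y refer to the device copies created by the "data" directive, so the wrapper receives device pointers without acc_copyin ever having to return one.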
One caveat with cuBLAS is that the F77 interface mentioned above expects host arrays and will manage the data movement for you. Hence you'll need to write CUDA C wrapper functions that call the correct device routines. (CUDA Fortran does provide a cublas module for this, but it is PGI-only.)
However, GNU 5.2 doesn't support "host_data", and I just looked at their status page (https://gcc.gnu.org/wiki/OpenACC); it doesn't look like it will be supported in Fortran in 6.0 either. That is unfortunate, since "host_data" is the best solution for you.
Note that NVIDIA does give a free PGI license to students and academics for teaching purposes as part of the OpenACC Toolkit (see: https://developer.nvidia.com/openacc).

Related

How to mix Coarray and MPI code in Fortran

I would like to combine Fortran coarrays with MPI inside my code. I plan to use third-party software (HYPRE), which uses MPI, for the linear system solvers. For the rest of my work, I want to use Fortran coarrays (OpenCoarrays). I've already searched for a solution on the Internet, but there aren't any clues about how to make it work. I wonder whether it is possible to mix Fortran coarrays and MPI. If yes, should I use the OpenCoarrays or the MPI wrapper compilers?
OpenCoarrays sits on top of MPI-3 RMA (at least by default; I don’t recall the latest status of the GASNet port) so this should work, even if neither standard guarantees this. You’ll be using process-parallel execution and they should interoperate fine.
Intel Fortran also uses MPI for coarrays. Cray Fortran coarrays use DMAPP, which is compatible with MPI. Thus, the interoperability you want should cover all the widely available implementations.
In all cases, there may be some implementation quirks, particularly with respect to initialization and termination. You may find that you can’t finalize MPI until all your coarrays are deallocated, for example.
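A sketch of how the two can coexist, assuming an MPI-based coarray runtime such as OpenCoarrays with gfortran. Note the MPI_Initialized check (the coarray runtime may already have initialized MPI) and the deliberately omitted MPI_Finalize, per the termination caveat:

```fortran
! Sketch: coarrays and MPI in the same program (assumes an MPI-based
! coarray runtime such as OpenCoarrays).
program mixed
  use mpi_f08
  implicit none
  logical :: already_up
  integer :: rank
  integer :: counter[*]                 ! a coarray
  call MPI_Initialized(already_up)      ! the coarray runtime may have
  if (.not. already_up) call MPI_Init() ! initialized MPI already
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)
  counter = rank                        ! MPI and coarrays side by side
  sync all
  if (this_image() == 1) print *, counter[num_images()]
  ! MPI_Finalize is left to the coarray runtime at program end.
end program mixed
```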
I’m sure the developers of OpenCoarrays would appreciate bug reports on this topic if you run into problems.

Difference between C++ MEX and C MEX

I wrote a TCP/IP socket connection with server and client in C++, which works quite nicely in Visual Studio. Now I want to use the C++ client in MATLAB/Simulink through MEX files, and later in an S-function.
I found two descriptions about MEX-Files.
C++ MEX File Application Just for C++
C/C++ MEX Files C/C++
Now I am confused about which one to take. I wrote some simple programs with the second, but always ran into problems with data types. I think it is because the given examples and functions are only for C, not C++.
I appreciate any help! Thank you very much!
The differences:
The C interface described in the second link is much, much older (I used this interface way back in 1998). You can create a MEX-file with this interface and have it run on a large set of different versions of MATLAB. You can use it from C as well as C++ code.
The C++-only interface described in the first link is new in MATLAB R2018a (the C++ classes to manipulate MATLAB arrays were introduced in R2017b, but the ability to write a MEX-file was new in R2018a). MEX-files you write with this interface will not run on prior versions of MATLAB.
Additionally, this interface (finally!) allows for creating shared-data copies, in-place operations, etc. (the stuff we have been asking for for many years, but they didn't want to put into the old C interface because they worried it would be too complex for the average MEX-file writer).
Another change to be aware of:
In R2018a, MATLAB also changed the way that complex arrays are stored in memory. In older versions of MATLAB, the real and imaginary components are stored in separate memory blocks. In R2018a and on, they are stored in the same memory block, in the same fashion as you would likely use in your own code.
This affects MEX-files! If your MEX-file uses complex data, it needs to read and write it in the way that MATLAB stores it. If you run a MEX-file compiled for an older version of MATLAB, or compile a MEX-file using the current default build options in R2018a, a complex array will be copied to the old storage model before being passed to the MEX-file. A new compile option to the mex command, -R2018a, creates MEX-files that receive the data in the new storage model unchanged. But those MEX-files will not be compatible with previous versions of MATLAB.
How to choose?
If you need your MEX-files to run on versions of MATLAB prior to the newest R2018a, use the old C interface; you don't have a choice.
If you want to program in C, use the old C interface.
If you need to use complex data, and don't want to incur the cost of the copy, you need to target R2018a and newer, and R2017b and older, separately. You need to write separate code for these two "platforms". The older versions can only be targeted with the C interface. For the newer versions you can use either interface.
If you appreciate the advantages of modern C++ and would like to take advantage of them, and are targeting only the latest and greatest MATLAB version, then use the new C++ interface. I haven't tried it out yet, but from the documentation it looks to be very well designed.

How is control flow extracted from a SYCL kernel?

Using SYCL to run code on any OpenCL device doesn't require a custom compiler, as everything is done in a library (full of template magic), and a standard GCC/Clang will do just fine. Is this correct? (Especially in the case of triSYCL, which I'm using...)
If so... I know that simple expression trees can be extracted by overloading a bunch of operators on custom "handle" or "wrapper" classes, but this is not the case with control flow. Am I wrong?
Section 3.1 of this paper discusses the pros and cons of a few different approaches to adding EDSLs to C++, but I'm more interested in the actual technical implementation of the method SYCL uses.
I tried to look at the source at some SYCL-related projects (Eigen, TensorFlow, triSYCL, ComputeCpp, etc.) but so far I could not find the answer in them.
So: How can a SYCL library(?) discover the full control flow graph of a kernel, given as an ordinary C++ lambda, without needing a custom/extended compiler?
I think you are right.
If you compile SYCL for CPU, since SYCL is a pure C++ executable DSEL, you can have an implementation that just uses a normal C++ compiler.
This is how triSYCL works for example. https://github.com/triSYCL/triSYCL
I do not know the details of ComputeCpp. On https://github.com/triSYCL/triSYCL/blob/master/doc/about-sycl.rst there is a link to a very interesting but old presentation:
Implementing the OpenCL SYCL Shared Source C++ Programming Model using
Clang/LLVM, Gordon Brown. November 17, 2014, Workshop on the LLVM
Compiler Infrastructure in HPC, SuperComputing 2014
http://www.codeplay.com/public/uploaded/publications/SC2014_LLVM_HPC.pdf
When triSYCL targets a device, there is also a device compiler. I have to push a new version with a design document... In the meantime, you can look at https://github.com/triSYCL/triSYCL/tree/device https://github.com/triSYCL/llvm https://github.com/triSYCL/clang
sycl-gtx uses some SYCL syntax extensions based on macros to build a meta-representation of the control flow in the kernel, as shown, for example, in: https://github.com/ProGTX/sycl-gtx/blob/master/tests/regression/work_efficient_prefix_sum.cpp
And the answer is: This is not how it's done, and I still don't think it's possible.
Even my first assumption was wrong.
If all you have is an ordinary C++ compiler, then any SYCL kernel can only be executed "in software", by the host device (CPU) running the "controller" code.
To translate the kernels to OpenCL (or SPIR-V) for execution on any other device, either an "augmented" compiler is necessary; or two compilers, one for the host, and one for the compute device.
A good explanation can be found here: https://www.codeplay.com/portal/introduction-to-sycl
The most related section is "What Would A SYCL Work Flow Look Like?"

How to run code on a GPU?

LLVM has a back end for both AMD and NVIDIA GPUs. Is it currently possible to compile C++ (or a subset) to GPU code with Clang and run it? Obviously things like the standard library would be unavailable, as well as operator new and delete. I'm not looking for OpenCL or CUDA; I'm thinking of a fully ahead-of-time compiled program, even a trivial one.
No, you need some language like OpenCL or CUDA, because a GPGPU is not an ordinary computer and has a different programming model (roughly speaking, SIMD-like). GPGPU compute kernels have specific constraints.
You might want to consider using OpenACC pragmas in your C++ code (and use a recent GCC compiler).

Is it possible to write OpenCL kernels in C++ rather than C?

I understand there's an OpenCL C++ API, but I'm having trouble compiling my kernels... Do the kernels have to be written in C, with only the host code allowed to be written in C++? Or is there some way to write the kernels in C++ that I'm not finding? Specifically, I'm trying to compile my kernels using pyopencl, and it seems to be failing because it compiles them as C code.
OpenCL C is a subset of C99.
There is also OpenCL C++ (in the OpenCL 2.1 and OpenCL 2.2 specs), which is a subset of C++14, but it's not implemented by any vendor yet (OpenCL 2.1 is partially implemented by Intel, but not the C++ kernels).
Host code can be written in C, C++, Python, etc.
In short, you can read about OpenCL on Wikipedia, where there is a description of each OpenCL version. In pyopencl you can use OpenCL 1.2 (as far as I'm aware, there isn't support for OpenCL 2.0 yet).
More details about OpenCL on Khronos website.
I would add SYCL, e.g. ComputeCpp from Codeplay. They have been very active at IWOCL.org promoting the use of single-source C++ for host and kernel code. SYCL has the OpenCL execution model "under the hood" (https://en.wikipedia.org/wiki/SYCL). Wikipedia does make this statement about SYCL: "The open standards SYCL and OpenCL are similar to vendor-specific CUDA from Nvidia," which could not be further from the intent of SYCL and OpenCL: portable (though not performance-portable) code.
You can find information, news, blogs, videos and resources on SYCL on the sycl.tech website.
For reference, there's also Boost.Compute. It doesn't help you with pyopencl, but it addresses many of the issues that pyopencl does, and has some metaprogramming magic that facilitates writing OpenCL kernels in C++.
This SO question (referenced in the Boost.Compute FAQ) also contains a nice discussion of some of the relevant design constraints that OpenCL poses to devs.
This is an old question, and the work to "solve" it has been ongoing for some time...
There is a community-driven C++ for OpenCL kernel language that is implemented by Clang (see Clang's C++ for OpenCL documentation), and there is a Khronos extension, cl_ext_cxx_for_opencl, that adds online compilation of this language to OpenCL drivers as well. Arm has just announced support for this extension. It is also possible to compile kernels in this language offline using the upstream tools into a machine binary, SPIR-V, or any other IR, and then load the precompiled code into OpenCL drivers without any extension.
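To give a flavor of the kernel language, here is a sketch of a kernel written in C++ for OpenCL (compiled offline with, e.g., clang -cl-std=clc++; the kernel body and names are illustrative, not from a real project):

```cpp
// Sketch: a kernel in C++ for OpenCL. A template works across element
// types inside the kernel, which plain OpenCL C cannot do -- it would
// need a separate copy per type.
template <typename T>
static T twice(T v) { return v + v; }

__kernel void scale(__global float* buf) {
    size_t i = get_global_id(0);
    buf[i] = twice(buf[i]);  // template instantiated for float
}
```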