Is it possible to write OpenCL kernels in C++ rather than C? - c++

I understand there's an OpenCL C++ API, but I'm having trouble compiling my kernels... do the kernels have to be written in C, with only the host code allowed to be written in C++? Or is there some way to write the kernels in C++ that I'm not finding? Specifically, I'm trying to compile my kernels using pyopencl, and it seems to be failing because it's compiling them as C code.

OpenCL C is based on C99 (roughly a subset, with extensions for parallelism).
There is also OpenCL C++ (in the OpenCL 2.1 and OpenCL 2.2 specs), which is a subset of C++14, but it is not implemented by any vendor yet (Intel has partially implemented OpenCL 2.1, but not C++ kernels).
Host code can be written in C, C++, Python, etc.
In short, you can read about OpenCL on Wikipedia, which has a description of each OpenCL version. In pyopencl you can use OpenCL 1.2 (as far as I'm aware there is no support for OpenCL 2.0 yet).
More details about OpenCL are on the Khronos website.
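To make the split concrete: the kernel below is plain OpenCL C, while the host side uses the Khronos C++ bindings (cl2.hpp). A minimal, untested sketch with error handling omitted; the kernel name `square` and the sizes are just illustrative:

```cpp
// Host code in C++ via the Khronos C++ bindings; the kernel itself is OpenCL C.
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/cl2.hpp>
#include <vector>

// The kernel source is OpenCL C (C99-based), handed to the driver as a string.
static const char* source = R"CLC(
__kernel void square(__global float* data) {
    size_t i = get_global_id(0);
    data[i] *= data[i];
}
)CLC";

int main() {
    std::vector<float> host(1024, 2.0f);
    cl::Context ctx(CL_DEVICE_TYPE_DEFAULT);
    cl::CommandQueue queue(ctx);
    cl::Program program(ctx, source, /*build=*/true);   // compiled as OpenCL C
    cl::Buffer buf(ctx, host.begin(), host.end(), /*readOnly=*/false);
    cl::Kernel kernel(program, "square");
    kernel.setArg(0, buf);
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(host.size()));
    cl::copy(queue, buf, host.begin(), host.end());     // read results back
}
```

pyopencl does the same thing underneath: whatever the host language, the source handed to the driver has to be OpenCL C (on OpenCL 1.2).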

I would add SYCL, in the form of ComputeCpp from Codeplay. They have been very active at IWOCL.org promoting the use of single-source C++ for host and kernel code. SYCL uses the OpenCL execution model "under the hood": https://en.wikipedia.org/wiki/SYCL. Wikipedia does make this statement about SYCL: "The open standards SYCL and OpenCL are similar to vendor-specific CUDA from Nvidia," which could not be further from the intent of portable (though not performance-portable) code that SYCL and OpenCL share.
You can find information, news, blogs, videos and resources on SYCL on the sycl.tech website.
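For a taste of what single-source SYCL looks like, here is a rough sketch against the SYCL 1.2.1 API (the kernel name `scale` is illustrative). Host and kernel code sit in the same C++ file, and the kernel is an ordinary lambda:

```cpp
#include <CL/sycl.hpp>
#include <vector>

int main() {
    std::vector<float> data(1024, 1.0f);
    {
        cl::sycl::queue q;  // picks a default device
        cl::sycl::buffer<float, 1> buf(data.data(),
                                       cl::sycl::range<1>(data.size()));
        q.submit([&](cl::sycl::handler& cgh) {
            auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
            // The kernel is a plain C++ lambda, named by the template parameter.
            cgh.parallel_for<class scale>(
                cl::sycl::range<1>(data.size()),
                [=](cl::sycl::id<1> i) { acc[i] *= 2.0f; });
        });
    }  // buffer goes out of scope: waits and copies results back into `data`
}
```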

For reference, there's also Boost.Compute. It doesn't help you with pyopencl, but it addresses many of the issues that pyopencl does, and has some metaprogramming magic that facilitates writing OpenCL kernels in C++.
This SO question (referenced in the Boost.Compute FAQ) also contains a nice discussion of some of the relevant design constraints that OpenCL poses to devs.
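As a rough illustration of that metaprogramming magic, Boost.Compute can turn a placeholder expression into an OpenCL kernel behind the scenes. An untested sketch (device selection and sizes are illustrative):

```cpp
#include <boost/compute/core.hpp>
#include <boost/compute/algorithm/copy.hpp>
#include <boost/compute/algorithm/transform.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/lambda.hpp>
#include <vector>

namespace compute = boost::compute;

int main() {
    compute::device dev = compute::system::default_device();
    compute::context ctx(dev);
    compute::command_queue queue(ctx, dev);

    std::vector<float> host(1024, 3.0f);
    compute::vector<float> d(host.begin(), host.end(), queue);

    // `_1 * 2.0f` is an expression template that Boost.Compute
    // translates into OpenCL C and compiles at run time.
    using compute::lambda::_1;
    compute::transform(d.begin(), d.end(), d.begin(), _1 * 2.0f, queue);

    compute::copy(d.begin(), d.end(), host.begin(), queue);
}
```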

This is an old question, and the work to "solve" it has been ongoing for some time...
There is a community-driven C++ for OpenCL kernel language that is implemented by Clang (see "C++ for OpenCL" in the Clang documentation), and there is a Khronos extension, cl_ext_cxx_for_opencl, that adds online compilation of this language to OpenCL drivers too; Arm has just announced support for this extension. It is also possible to compile kernels in this language offline using upstream tools into a machine binary, SPIR-V, or any other IR, and then load the precompiled code into OpenCL drivers without any extension.
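As a sketch of what that kernel language allows, templates (and other C++ features) can be used directly in kernel code. The exact clang invocation and `-cl-std` spelling vary between releases, so treat the compile line as an assumption to check against your toolchain:

```cpp
// C++ for OpenCL kernel; compile offline with something like:
//   clang -cl-std=clc++ -target spir64 -emit-llvm -c avg.clcpp
template <typename T>
T average(T a, T b) { return (a + b) / 2; }

__kernel void avg(__global const float* a,
                  __global const float* b,
                  __global float* out) {
    size_t i = get_global_id(0);
    out[i] = average(a[i], b[i]);  // template instantiated for float
}
```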

Related

How is control flow extracted from a SYCL kernel?

Using SYCL to run code on any OpenCL device doesn't require a custom compiler, as everything is done in a library (full of template magic), and a standard GCC/Clang will do just fine. Is this correct? (Especially in the case of triSYCL, which I'm using...)
If so... I know that simple expression trees can be extracted by overloading a bunch of operators on custom "handle" or "wrapper" classes, but this is not the case with control flow. Am I wrong?
Section 3.1 of this paper discusses the pros and cons of a few different approaches to adding EDSLs to C++, but I'm more interested in the actual technical implementation of the method SYCL uses.
I tried to look at the source of some SYCL-related projects (Eigen, TensorFlow, triSYCL, ComputeCpp, etc.), but so far I could not find the answer in them.
So: How can a SYCL library(?) discover the full control flow graph of a kernel, given as an ordinary C++ lambda, without needing a custom/extended compiler?
I think you are right.
If you compile SYCL for the CPU, then since SYCL is a pure C++ executable DSEL (domain-specific embedded language), an implementation can just use a normal C++ compiler.
This is how triSYCL works for example. https://github.com/triSYCL/triSYCL
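A toy illustration of why a pure-library implementation can work on the CPU with any C++ compiler: `parallel_for` can simply invoke the kernel lambda from ordinary host code (real implementations use threads, OpenMP, etc.; `toy_parallel_for` is a made-up name):

```cpp
#include <cstddef>

// The "kernel" is just an ordinary callable, so no special compiler is needed.
template <typename Kernel>
void toy_parallel_for(std::size_t n, Kernel k) {
    for (std::size_t i = 0; i < n; ++i)
        k(i);
}

int main() {
    float data[16] = {};
    toy_parallel_for(16, [&](std::size_t i) { data[i] = 2.0f * i; });
}
```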
I do not know the details of ComputeCpp. On https://github.com/triSYCL/triSYCL/blob/master/doc/about-sycl.rst there is a link to a very interesting but old presentation:
Implementing the OpenCL SYCL Shared Source C++ Programming Model using Clang/LLVM, Gordon Brown. November 17, 2014, Workshop on the LLVM Compiler Infrastructure in HPC, SuperComputing 2014. http://www.codeplay.com/public/uploaded/publications/SC2014_LLVM_HPC.pdf
When triSYCL targets a device, there is also a device compiler. I have to push a new version with a design document... In the meantime, you can look at https://github.com/triSYCL/triSYCL/tree/device, https://github.com/triSYCL/llvm and https://github.com/triSYCL/clang.
sycl-gtx uses some macro-based SYCL syntax extensions to build a meta-representation of the control flow in the kernel, as shown for example in this test: https://github.com/ProGTX/sycl-gtx/blob/master/tests/regression/work_efficient_prefix_sum.cpp
And the answer is: This is not how it's done, and I still don't think it's possible.
Even my first assumption was wrong.
If all you have is an ordinary C++ compiler, then any SYCL kernel can only be executed "in software", by the host device (CPU) running the "controller" code.
To translate the kernels to OpenCL (or SPIR-V) for execution on any other device, either an "augmented" compiler is necessary, or two compilers: one for the host and one for the compute device.
A good explanation can be found here: https://www.codeplay.com/portal/introduction-to-sycl
The most related section is "What Would A SYCL Work Flow Look Like?"
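To illustrate the limitation with a self-contained sketch (`Expr` is a made-up type): operator overloading on "handle" types can record an expression tree, but C++ offers no way to overload `if`, `for` or `?:` themselves, so control flow runs once on the host while the kernel is traced and leaves nothing behind in the recorded tree.

```cpp
#include <iostream>
#include <string>

// A "handle" type that records expressions instead of evaluating them.
struct Expr {
    std::string repr;
};

Expr operator+(const Expr& a, const Expr& b) { return {"(" + a.repr + " + " + b.repr + ")"}; }
Expr operator*(const Expr& a, const Expr& b) { return {"(" + a.repr + " * " + b.repr + ")"}; }

int main() {
    Expr x{"x"}, y{"y"};
    Expr tree = x + y * x;           // overloads build "(x + (y * x))"
    std::cout << tree.repr << "\n";  // a library could translate this to OpenCL C

    // Control flow is invisible: the condition is evaluated on the host, once,
    // at trace time, and only the branch actually taken is recorded.
    bool flag = true;
    Expr z = flag ? x : y;
    std::cout << z.repr << "\n";     // prints "x"; the `y` branch left no trace
}
```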

How to run code on a GPU?

LLVM has back ends for both AMD and NVIDIA GPUs. Is it currently possible to compile C++ (or a subset) to GPU code with Clang and run it? Obviously things like the standard library would be unavailable, as well as operator new and delete. I'm not looking for OpenCL or CUDA; I'm thinking of a fully ahead-of-time compiled program, even a trivial one.
No, you need a language like OpenCL or CUDA, because a GPGPU is not an ordinary computer and has a different programming model (roughly speaking, SIMD-like). GPGPU compute kernels have specific constraints.
You might want to consider using OpenACC pragmas in your C++ code (and use a recent GCC compiler).
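For example, a SAXPY loop annotated with OpenACC pragmas. A sketch; note that GCC must be built with an offloading backend for the loop to actually run on a GPU rather than falling back to the host:

```cpp
// Build with a recent GCC, e.g.:  g++ -fopenacc -O2 saxpy.cpp
#include <vector>

void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const float* xp = x.data();
    float*       yp = y.data();
    const int    n  = static_cast<int>(x.size());

    // The compiler generates device code for this loop and manages the copies.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
}
```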

C++ MPI standard 3

MPI standard 3 was released in 2011
with no C++ bindings!
My question is: how do you program distributed computing in C++ without MPI? (Note that we also need OpenMP, CUDA and OpenACC.)
Is there an alternative to MPI in C++ (not MPI 2.2 or Boost.MPI)?
Is MPI built on TCP/IP, so that I could build my own equivalent on top of TCP/IP in C++?
Is there an open-source C++ binding for MPI 3?
Or must you just stick to C with GTK+, CUDA, OpenMP, OpenGL and MPI 3?
What if you want C++ with Qt, CUDA, OpenMP, OpenGL, plus a distributed-computing API?
Ubuntu and many Linux distros are seeking to replace the X server with Wayland or Mir; both will provide a special API and layer to create an OpenGL desktop context to replace GLX, and GTK+ will get Mir and Wayland integration. So on Linux, when something changes, people and groups try to fix it and develop new solutions.
But for an MPI 3 C++ binding, I can't find a solution.
The official recommendation is to use the C bindings, for the reasons given in the comments. The only loss of functionality pertains to exceptions, and you won't miss it, because no implementation was fault-tolerant in the MPI-2 era anyway.
Boost.MPI is nice but supports very few features (the most popular ones).
Rolling your own C++ wrappers is encouraged (a sketch follows below). Elemental (libelemental.org) has a nice set of wrappers that do magic with type inference.
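A minimal sketch of such a wrapper over the plain C bindings (`MpiEnvironment` is a made-up name): object lifetime handles init/finalize, and errors become exceptions, which is most of what the old C++ bindings bought you.

```cpp
#include <mpi.h>
#include <stdexcept>

class MpiEnvironment {
public:
    MpiEnvironment(int& argc, char**& argv) {
        if (MPI_Init(&argc, &argv) != MPI_SUCCESS)
            throw std::runtime_error("MPI_Init failed");
    }
    ~MpiEnvironment() { MPI_Finalize(); }
    MpiEnvironment(const MpiEnvironment&) = delete;
    MpiEnvironment& operator=(const MpiEnvironment&) = delete;

    int rank(MPI_Comm comm = MPI_COMM_WORLD) const {
        int r = 0; MPI_Comm_rank(comm, &r); return r;
    }
    int size(MPI_Comm comm = MPI_COMM_WORLD) const {
        int s = 0; MPI_Comm_size(comm, &s); return s;
    }
};

int main(int argc, char** argv) {
    MpiEnvironment mpi(argc, argv);
    // ... use mpi.rank() / mpi.size() and the C API as usual ...
}
```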
I have some personal interest in developing a new set of C++ bindings but haven't had time to make progress. There's a StackExchange Computational Science post with a detailed discussion to which you might contribute.

Confusion on CUDA/openCL and C++ AMP

I read that Microsoft is working closely with Nvidia to improve AMP performance.
But my question is: is AMP a CUDA replacement by Microsoft? Or does AMP use the CUDA drivers when an NVIDIA CUDA video card is available? Is AMP an OpenCL substitute?
I'm still pretty confused...
C++ AMP is a library (and, as part of it, a key language extension was also introduced). Since C++ AMP is an open specification, it can be implemented on top of other low-level languages and APIs. Microsoft's implementation builds on DirectCompute (and hence on HLSL), but that is completely hidden from you when you use C++ AMP (which is why C++ AMP can be an open specification: it does not expose DirectX in the API surface). For more on C++ AMP, please follow the resources on the right of our blog (we'll keep adding to them):
http://blogs.msdn.com/b/nativeconcurrency/
You made a statement about Microsoft working with NVIDIA to improve C++ AMP performance; that is not true. Microsoft has worked with NVIDIA, AMD and other partners to create the C++ AMP open specification. Microsoft also works with hardware vendors to make sure they have stable video card drivers, which are required for any GPU compute technology to work correctly.
You also expressed confusion and threw some terms out. OpenCL is an approach to GPU computing (by Khronos), as is DirectCompute (by Microsoft), as is CUDA (by NVIDIA). These are all separate technologies, each with its own path to the GPU (always via a driver of some sort), each with its own merits, strengths, and disadvantages. One does not replace the other, and one is not universally better than the other. You now also have C++ AMP in that mix, as one more choice, and the same statements apply to that. The choice is yours as to which you decide to use.
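For reference, here is what minimal C++ AMP code looks like (a sketch; it requires MSVC, and the sizes are illustrative). Note that nothing in it mentions DirectCompute or HLSL:

```cpp
#include <amp.h>
#include <vector>

int main() {
    std::vector<int> v(1024, 1);
    concurrency::array_view<int, 1> av(static_cast<int>(v.size()), v);

    // The restrict(amp) lambda is compiled for the accelerator; Microsoft's
    // implementation lowers it to DirectCompute, invisibly to this code.
    concurrency::parallel_for_each(av.extent,
        [=](concurrency::index<1> idx) restrict(amp) {
            av[idx] *= 2;
        });

    av.synchronize();  // copy results back into the host vector
}
```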
C++ AMP is a set of language extensions and APIs to support parallel-programming technology of the same kind as CUDA.
Since Microsoft also has a direct competitor to CUDA (DirectCompute) and has generally preferred its own proprietary graphics standards, we will have to see what actually happens with it.
For Microsoft's view on it, see these lectures.

OpenCL or CUDA Which way to go?

I'm investigating ways of using the GPU to process streaming data. I have two choices but can't decide which way to go.
My criteria are as follows:
Ease of use (good API)
Community and Documentation
Performance
Future
I'll code in C and C++ under Linux.
OpenCL
interfaced from your production code
portable between different graphics hardware
limited operations, but pre-prepared shortcuts
CUDA
separate language (CUDA C)
NVIDIA hardware only
almost full control over the code (coding in a C-like language)
lots of profiling and debugging tools
Bottom line -- OpenCL is portable, CUDA is NVIDIA-only. However, being an independent language, CUDA is much more powerful and has a bunch of really good tools.
Ease of use -- OpenCL is easier to use out of the box, but once you set up the CUDA coding environment it's almost like coding in C.
Community and Documentation -- both have extensive documentation and examples; however, I think CUDA's are better.
Performance -- CUDA allows for greater control, hence can be better fine-tuned for higher performance.
Future -- hard to say really.
My personal experiences were:
API: OpenCL has a slightly more complex API. However, most of your time will be spent writing kernel code, and there the two are almost identical (see the sketch after this list).
Community: CUDA has had a much bigger community than OpenCL up to now, but this will probably even out.
Documentation: Both are very well documented.
Performance: In our experience, OpenCL drivers are not yet fully optimized.
Future: The future lies with OpenCL, as it is an open standard not restricted to one vendor or specific hardware!
This assessment is from 2010, so it is probably outdated.
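To make the "almost identical kernel code" point concrete, here is the same vector add in both dialects (an illustrative sketch):

```cpp
// OpenCL C: compiled at run time from a source string by the driver.
static const char* ocl_src = R"CLC(
__kernel void vadd(__global const float* a,
                   __global const float* b,
                   __global float* c) {
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}
)CLC";

// The CUDA C equivalent, compiled ahead of time by nvcc:
//
//   __global__ void vadd(const float* a, const float* b, float* c) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       c[i] = a[i] + b[i];
//   }
//
// The bodies differ essentially only in how the global index is obtained.
```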
OpenCL all the way unless you have a specific reason to use CUDA. OpenCL runs well on multicores like Intel i7 in addition to running on GPUs. By using OpenCL you can run it on a much wider range of hardware from Droid cell phones to the IBM Power7 compute nodes of the world's largest supercomputer, Blue Waters, which is supposed to come online next year.