How to run code on a GPU? - c++

LLVM has back ends for both AMD and NVIDIA GPUs. Is it currently possible to compile C++ (or a subset) to GPU code with clang and run it? Obviously things like the standard library would be unavailable, as well as operator new and delete. I'm not looking for OpenCL or CUDA; I'm thinking of a fully ahead-of-time compiled program, even a trivial one.

No, you need some language like OpenCL or CUDA, because a GPGPU is not an ordinary computer and has a different programming model (roughly speaking, SIMD-like). GPGPU compute kernels have specific constraints.
You might want to consider using OpenACC pragmas in your C++ code (and use a recent GCC compiler).

Related

Is compiling user made c++ code at runtime as an extension a good idea? [duplicate]

I am writing a research application that will utilise GPGPU using C++ and CUDA. I want to allow users of the application to tailor the program by writing kernel code that will be executed on the GPU.
My only thought so far is outputting the user code into a .cu file, then calling the platform's compiler to create a dynamic library, which can then be loaded at runtime by the host application. Is this viable? Even if it is, I'm very concerned that doing this will make my program unstable and a nightmare to make cross-platform.
Any thoughts/alternatives or comments would be greatly appreciated.
Theoretically it is possible. I would recommend OpenCL instead of CUDA, though. It is not as optimized as CUDA on the Nvidia platform, but it is designed to support run-time compilation (every OpenCL runtime driver includes a compiler that, as the first step of executing a kernel, compiles it).
Another advantage would be that OpenCL is more portable than Cuda, as OpenCL runs also on ATI (GPU and CPU) and Intel.
You could do it, it's viable, but IMO you would need to have a really good reason for allowing users to edit the CUDA kernel. I'm not sure what you have in mind for a user interface, and how the code that the user runs in the CUDA kernel will interface with the outside world, but this could get tricky. It might be better if you pre-implement a set of CUDA kernels and allow users to set a known set of parameters for each kernel.
Have you looked at PyCUDA? It basically implements a similar idea to allow Python users to write C++ CUDA kernels inside Python applications. PyCUDA provides functionality that helps users integrate their Python code with the kernels that they write, so that when they run the Python script, the kernel compiles and runs as part of it.
I haven't looked at the inner workings of pycuda but I assume that at its core it is doing something similar to what you are trying to achieve. Looking at pycuda might give you an idea of what's needed to write your own implementation.
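To make the approach from the question a bit more concrete (write the user code to a .cu file, then ask the compiler for a dynamic library), here is a minimal host-side sketch in plain C++. The helper names write_kernel_source and nvcc_shared_lib_command are made up for illustration, and the flags assume a Linux-like host where nvcc is on the PATH; on Windows the flags and the library suffix would differ.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: saves user-supplied kernel source to `cu_path`.
// Returns false if the file could not be written.
bool write_kernel_source(const std::string& cu_path, const std::string& source) {
    std::ofstream out(cu_path);
    if (!out) return false;
    out << source;
    out.flush();
    return out.good();
}

// Hypothetical helper: builds the command line that asks nvcc to compile
// `cu_path` into a shared library at `lib_path`.
// Assumption: Linux-like host, so -fPIC is forwarded to the host compiler
// via -Xcompiler. Paths containing spaces would need quoting, which this
// sketch does not attempt.
std::string nvcc_shared_lib_command(const std::string& cu_path,
                                    const std::string& lib_path) {
    assert(!cu_path.empty() && !lib_path.empty());
    return "nvcc --shared -Xcompiler -fPIC -o " + lib_path + " " + cu_path;
}
```

The host application would then pass the command to something like std::system, check the exit status, and load the resulting library with dlopen (or LoadLibrary on Windows). Most of the stability concerns raised in the question live in exactly those steps: compiler errors in user code, a missing toolchain, and platform-specific loading.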

Is it possible to write OpenCL kernels in C++ rather than C?

I understand there's an OpenCL C++ API, but I'm having trouble compiling my kernels... do the kernels have to be written in C? And then it's just the host code that's allowed to be written in C++? Or is there some way to write the kernels in C++ that I'm not finding? Specifically, I'm trying to compile my kernels using pyopencl, and it seems to be failing because it's compiling them as C code.
OpenCL C is based on C99, with some restrictions and extensions.
There is also OpenCL C++ (OpenCL 2.1 and OpenCL 2.2 specs) which is a subset of C++14 but it's not implemented by any vendor yet (OpenCL 2.1 partially implemented by Intel but not C++ kernels).
Host code can be written in C, C++, Python, etc.
In short you can read about OpenCL on wikipedia. There is a description about each OpenCL version. In pyopencl you can use OpenCL1.2 (as far as I'm aware there isn't support for OpenCL2.0 yet).
More details about OpenCL on Khronos website.
I would add SYCL on ComputeCpp from Codeplay. They have been very active at IWOCL.org promoting the use of single-source C++ host and kernel code. SYCL has the OpenCL execution model "under the hood": https://en.wikipedia.org/wiki/SYCL. Wikipedia does, however, state: "The open standards SYCL and OpenCL are similar to vendor-specific CUDA from Nvidia." That could not be further from the intent of SYCL and OpenCL, which is portable code (though not performance-portable code).
You can find information, news, blogs, videos and resources on SYCL on the sycl.tech website.
For reference, there's also Boost.Compute. It doesn't help you with pyopencl, but it addresses many of the issues that pyopencl does, and has some metaprogramming magic that facilitates writing OpenCL kernels in C++.
This SO question (referenced in the Boost.Compute FAQ) also contains a nice discussion of some of the relevant design constraints that OpenCL poses to devs.
This is an old question, and the work to "solve" it has been ongoing for some time...
There is a community-driven C++ for OpenCL kernel language that is implemented by Clang (see Clang's C++ for OpenCL documentation), and there is a Khronos extension, cl_ext_cxx_for_opencl, that adds online compilation of this language to OpenCL drivers too. Arm has just announced support for this extension. It is also possible to compile kernels in this language offline, using upstream tools, into a machine binary, SPIR-V, or any other IR, and then load the precompiled code in OpenCL drivers without any extension.

Using libraries like boost in cuda device code

I am learning CUDA at the moment and I am wondering if it is possible to use functions from different libraries and APIs like Boost in CUDA device code.
Note: I tried using std::cout and that did not work. I got printf working after changing the code generation to compute_20,sm_20. I'm using Visual Studio 2010, CUDA 5.0, and an Nvidia GTX 570 GPU. Nsight is installed.
Here's an answer. And here's the CUDA documentation about language support. Boost won't make it for sure.
Since the purpose of using CUDA is to speed up kernels in your code, you'll typically want to limit the language complexities used, because they add overhead. This means that you'll typically stay very close to plain C, with just a few sprinkles of C++ where that's really handy.
Constructs in for example Boost can result in large amounts of assembly code (and C++ in general has been criticised for this and is a reason to not use certain constructs in real-time software). This is all fine for most applications, but not for kernels you'll want to run on a GPU, where every instruction counts.
For CUDA (or OpenCL), people typically write intensive algorithms that work on data in arrays, for example specialised image processing. You only use these techniques for the calculation-intensive tasks of your application. Then you have a 'regular' program that interacts with the user/network/database, creates these CUDA tasks (i.e. selects the data and parameters), and starts them. Here are CUDA samples.
Boost uses the expression templates technique so as not to lose performance while enabling a simpler syntax.
BlueBird and Newton are libraries using metaprogramming, similarly to Boost, enabling CUDA computations.
ArrayFire is another library using Just in Time compilation and exploiting the CUDA language underneath.
Finally, Thrust, as suggested by Njuffa, is a template library enabling CUDA calculations (but not using metaprogramming, see Evaluating expressions consisting of elementwise matrix operations in Thrust).
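For readers who have not met expression templates before, here is a minimal plain C++ sketch of the idea. It is not taken from Boost or from any of the libraries above; the types Vec and Sum are invented for illustration. The point is that a + b + c builds a lightweight expression object, and the element-wise work only happens in a single loop at assignment, without temporary vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Node representing "lhs + rhs"; nothing is computed until operator[] is called.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning from an expression evaluates it in one fused loop.
    template <typename E>
    Vec& operator=(const E& expr) {
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = expr[i];
        return *this;
    }
};

// Vec + Vec yields an unevaluated Sum node.
inline Sum<Vec, Vec> operator+(const Vec& l, const Vec& r) { return {l, r}; }

// (expression) + Vec nests the nodes, so a + b + c is a single expression tree.
template <typename L, typename R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& l, const Vec& r) { return {l, r}; }
```

With Vec a, b, c, d of equal size, d = a + b + c; runs one loop computing a[i] + b[i] + c[i] per element. The nodes hold references, so an expression should be assigned in the same statement that builds it rather than stored with auto.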

Explanation of CUDA C and C++

Can anyone give me a good explanation as to the nature of CUDA C and C++? As I understand it, CUDA is supposed to be C with NVIDIA's GPU libraries. As of right now CUDA C supports some C++ features but not others.
What is NVIDIA's plan? Are they going to build upon C and add their own libraries (e.g. Thrust vs. STL) that parallel those of C++? Are they eventually going to support all of C++? Is it bad to use C++ headers in a .cu file?
CUDA C is a programming language with C syntax. Conceptually it is quite different from C.
The problem it is trying to solve is coding multiple (similar) instruction streams for multiple processors.
CUDA offers more than Single Instruction Multiple Data (SIMD) vector processing, but there must be far more data streams than instruction streams (data streams >> instruction streams), or there is much less benefit.
CUDA gives some mechanisms to do that, and hides some of the complexity.
CUDA is not optimised for multiple diverse instruction streams like a multi-core x86.
CUDA is not limited to a single instruction stream like x86 vector instructions, or limited to specific data types like x86 vector instructions.
CUDA supports 'loops' which can be executed in parallel. This is its most critical feature. The CUDA system will partition the execution of 'loops', and run the 'loop' body simultaneously across an array of identical processors, while providing some of the illusion of a normal sequential loop (specifically CUDA manages the loop "index"). The developer needs to be aware of the GPU machine structure to write 'loops' effectively, but almost all of the management is handled by the CUDA run-time. The effect is hundreds (or even thousands) of 'loops' complete in the same time as one 'loop'.
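As a rough illustration of the 'loop' idea, here is a host-only C++ sketch that stands in for a CUDA launch. It is not CUDA code and nothing runs in parallel; emulate_launch is an invented name. It only shows how the "index" is derived from a block number and a thread number (in real CUDA, blockIdx.x * blockDim.x + threadIdx.x) and why the body needs a bounds guard.

```cpp
#include <cassert>
#include <vector>

// Host-only stand-in for a kernel launch: calls `body` once per
// (block, thread) pair with the global index. On a GPU these calls would
// run in parallel; here they simply run one after another.
template <typename Body>
void emulate_launch(int num_blocks, int threads_per_block, Body body) {
    for (int block = 0; block < num_blocks; ++block)
        for (int thread = 0; thread < threads_per_block; ++thread)
            body(block * threads_per_block + thread);
}

// y = a * x + y, written as a per-index body instead of an explicit loop.
std::vector<float> saxpy(float a, const std::vector<float>& x, std::vector<float> y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const int threads_per_block = 4;  // arbitrary size for the sketch
    const int num_blocks = (n + threads_per_block - 1) / threads_per_block;
    emulate_launch(num_blocks, threads_per_block, [&](int i) {
        // The launch may cover more indices than elements, hence the guard.
        if (i < n) y[i] = a * x[i] + y[i];
    });
    return y;
}
```

The body never sees a loop counter it controls itself; it is handed an index and is responsible for exactly one element, which is the mental shift the paragraph above describes.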
CUDA supports what looks like if branches. Only processors running code which match the if test can be active, so a subset of processors will be active for each 'branch' of the if test. As an example this if... else if ... else ..., has three branches. Each processor will execute only one branch, and be 're-synched' ready to move on with the rest of the processors when the if is complete. It may be that some of the branch conditions are not matched by any processor. So there is no need to execute that branch (for that example, three branches is the worst case). Then only one or two branches are executed sequentially, completing the whole if more quickly.
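The branch behaviour can be modelled on the host with a small C++ sketch. This is a simplified model, not how any particular GPU is implemented, and run_divergent_if is an invented name. Each "lane" holds one value, the three branches are visited one after another, lanes that did not select the current branch are masked off, and a branch that no lane selected is not counted as executed.

```cpp
#include <cstddef>
#include <vector>

struct DivergenceResult {
    std::vector<int> out;  // per-lane result
    int passes;            // number of branches actually executed, in sequence
};

// Models: if (v < 0) out = -v; else if (v == 0) out = 1; else out = 2 * v;
DivergenceResult run_divergent_if(const std::vector<int>& in) {
    DivergenceResult r{std::vector<int>(in.size(), 0), 0};
    for (int branch = 0; branch < 3; ++branch) {
        bool any_active = false;
        for (std::size_t lane = 0; lane < in.size(); ++lane) {
            const int v = in[lane];
            const int taken = v < 0 ? 0 : (v == 0 ? 1 : 2);
            if (taken != branch) continue;  // lane is masked off in this pass
            any_active = true;
            if (branch == 0) r.out[lane] = -v;
            else if (branch == 1) r.out[lane] = 1;
            else r.out[lane] = 2 * v;
        }
        if (any_active) ++r.passes;  // branches nobody selected cost nothing here
    }
    return r;
}
```

With all-positive input only one pass is counted; with a mix of negative, zero, and positive values all three are, which is the worst case mentioned above.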
There is no 'magic'. The programmer must be aware that the code will be run on a CUDA device, and write code consciously for it.
CUDA does not take old C/C++ code and auto-magically run the computation across an array of processors. CUDA can compile and run ordinary C and much of C++ sequentially, but there is very little (nothing?) to be gained by that because it will run sequentially, and more slowly than a modern CPU. This means the code in some libraries is not (yet) a good match with CUDA capabilities. A CUDA program could operate on multi-kByte bit-vectors simultaneously. CUDA isn't able to auto-magically convert existing sequential C/C++ library code into something which would do that.
CUDA does provide a relatively straightforward way to write code using familiar C/C++ syntax, adds a few extra concepts, and generates code which will run across an array of processors. It has the potential to give much more than a 10x speedup vs. e.g. multi-core x86.
Edit - Plans: I do not work for NVIDIA
For the very best performance CUDA wants information at compile time.
So template mechanisms are the most useful because they give the developer a way to say things at compile time, which the CUDA compiler can use. As a simple example, if a matrix is defined (instantiated) at compile time to be 2D and 4 x 8, then the CUDA compiler can work with that to organise the program across the processors. If that size is dynamic, and changes while the program is running, it is much harder for the compiler or run-time system to do a very efficient job.
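A plain C++ sketch of the 4 x 8 example, to show what "known at compile time" buys. The names Matrix, blocks_for, and blocks_for_dynamic are invented for illustration, and the block size of 32 is just an assumed figure; the same template mechanics apply in CUDA C++, but this sketch is host code only.

```cpp
#include <array>
#include <cstddef>

// Dimensions are template parameters, so they are fixed when the type is instantiated.
template <std::size_t Rows, std::size_t Cols>
struct Matrix {
    static_assert(Rows > 0 && Cols > 0, "dimensions must be positive");
    std::array<float, Rows * Cols> data{};
    static constexpr std::size_t size() { return Rows * Cols; }
    float& at(std::size_t r, std::size_t c) { return data[r * Cols + c]; }
};

// Number of fixed-size blocks needed to give every element its own thread,
// computed entirely by the compiler.
template <typename M, std::size_t BlockSize>
constexpr std::size_t blocks_for() {
    return (M::size() + BlockSize - 1) / BlockSize;
}

// For a 4 x 8 matrix and 32 threads per block the answer is known before
// the program ever runs.
static_assert(blocks_for<Matrix<4, 8>, 32>() == 1, "4 x 8 fits in one block of 32");

// Dynamic counterpart: the same arithmetic, but only known at run time,
// so nothing can be specialised or checked ahead of time.
inline std::size_t blocks_for_dynamic(std::size_t rows, std::size_t cols,
                                      std::size_t block_size) {
    return (rows * cols + block_size - 1) / block_size;
}
```

In the templated version, loop bounds and index arithmetic are constants the compiler can unroll and check; in the dynamic version they are ordinary run-time values.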
EDIT:
CUDA has class and function templates.
I apologise if people read this as saying CUDA does not. I agree I was not clear.
I believe the CUDA GPU-side implementation of templates is not complete w.r.t. C++.
User harrism has commented that my answer is misleading. harrism works for NVIDIA, so I will wait for advice. Hopefully this is already clearer.
The hardest stuff to do efficiently across multiple processors is dynamic branching down many alternate paths because that effectively serialises the code; in the worst case only one processor can execute at a time, which wastes the benefit of a GPU. So virtual functions seem to be very hard to do well.
There are some very smart whole-program-analysis tools which can deduce much more type information than the developer might understand. Existing tools might deduce enough to eliminate virtual functions, and hence move analysis of branching to compile time. There are also techniques for instrumenting program execution which feeds directly back into recompilation of programs which might reach better branching decisions.
AFAIK (modulo feedback) the CUDA compiler is not yet state-of-the-art in these areas.
(IMHO it is worth a few days for anyone interested, with a CUDA or OpenCL-capable system, to investigate them, and do some experiments. I also think, for people interested in these areas, it is well worth the effort to experiment with Haskell, and have a look at Data Parallel Haskell)
CUDA is a platform (architecture, programming model, assembly virtual machine, compilation tools, etc.), not just a single programming language. CUDA C is just one of a number of language systems built on this platform (CUDA C++, CUDA Fortran, and PyCUDA are others).
CUDA C++
Currently CUDA C++ supports the subset of C++ described in Appendix D ("C/C++ Language Support") of the CUDA C Programming Guide.
To name a few:
Classes
__device__ member functions (including constructors and destructors)
Inheritance / derived classes
virtual functions
class and function templates
operators and overloading
functor classes
Edit: As of CUDA 7.0, CUDA C++ includes support for most language features of the C++11 standard in __device__ code (code that runs on the GPU), including auto, lambda expressions, range-based for loops, initializer lists, static assert, and more.
Examples and specific limitations are also detailed in the same appendix linked above. As a very mature example of C++ usage with CUDA, I recommend checking out Thrust.
Future Plans
(Disclosure: I work for NVIDIA.)
I can't be explicit about future releases and timing, but I can illustrate the trend: almost every release of CUDA has added language features to bring CUDA C++ support to its current (in my opinion very useful) state. We plan to continue this trend of improving support for C++, but naturally we prioritize features that are useful and performant on a massively parallel computational architecture (GPU).
Not realized by many, CUDA is actually two new programming languages, both derived from C++. One is for writing code that runs on GPUs and is a subset of C++. Its function is similar to HLSL (DirectX) or Cg (OpenGL) but with more features and compatibility with C++. Various GPGPU/SIMT/performance-related concerns apply to it that I need not mention. The other is the so-called "Runtime API," which is hardly an "API" in the traditional sense. The Runtime API is used to write code that runs on the host CPU. It is a superset of C++ and makes it much easier to link to and launch GPU code. It requires the NVCC pre-compiler which then calls the platform's C++ compiler. By contrast, the Driver API (and OpenCL) is a pure, standard C library, and is much more verbose to use (while offering few additional features).
Creating a new host-side programming language was a bold move on NVIDIA's part. It makes getting started with CUDA easier and writing code more elegant. However, truly brilliant was not marketing it as a new language.
Sometimes you hear that CUDA is C and C++, but I don't think it is, for the simple reason that this is impossible. To cite from their programming guide:
"For the host code, nvcc supports whatever part of the C++ ISO/IEC 14882:2003 specification the host c++ compiler supports.
For the device code, nvcc supports the features illustrated in Section D.1 with some restrictions described in Section D.2; it does not support run time type information (RTTI), exception handling, and the C++ Standard Library."
As far as I can see, it only refers to C++, and only supports C where this happens to be in the intersection of C and C++. So it is better to think of it as C++ with extensions for the device part rather than as C. That saves you a lot of headaches if you are used to C.
What is NVIDIA's plan?
I believe the general trend is that CUDA and OpenCL are regarded as too low-level for many applications. Right now, Nvidia is investing heavily in OpenACC, which could roughly be described as OpenMP for GPUs. It follows a declarative approach and tackles the problem of GPU parallelization at a much higher level. So that is my totally subjective impression of what Nvidia's plan is.
