How is control flow extracted from a SYCL kernel? - c++

Using SYCL to run code on any OpenCL device doesn't require a custom compiler, as everything is done in a library (full of template magic), and a standard GCC/Clang will do just fine. Is this correct? (Especially in the case of triSYCL, which I'm using...)
If so... I know that simple expression trees can be extracted by overloading a bunch of operators on custom "handle" or "wrapper" classes, but this is not the case with control flow. Am I wrong?
Section 3.1 of this paper discusses the pros and cons of a few different approaches to adding EDSLs to C++, but I'm more interested in the actual technical implementation of the method SYCL uses.
I tried to look at the source at some SYCL-related projects (Eigen, TensorFlow, triSYCL, ComputeCpp, etc.) but so far I could not find the answer in them.
So: How can a SYCL library(?) discover the full control flow graph of a kernel, given as an ordinary C++ lambda, without needing a custom/extended compiler?

I think you are right.
If you compile SYCL for CPU, since SYCL is a pure C++ executable DSEL, you can have an implementation that just uses a normal C++ compiler.
This is how triSYCL works for example. https://github.com/triSYCL/triSYCL
I do not know the detail about ComputeCpp. On https://github.com/triSYCL/triSYCL/blob/master/doc/about-sycl.rst there is a link about a very interesting but old presentation:
Implementing the OpenCL SYCL Shared Source C++ Programming Model using
Clang/LLVM, Gordon Brown. November 17, 2014, Workshop on the LLVM
Compiler Infrastructure in HPC, SuperComputing 2014
http://www.codeplay.com/public/uploaded/publications/SC2014_LLVM_HPC.pdf
In the case triSYCL is targeting a device, there is also a device compiler. I have to push a new version with a design document... In the meantime, you can look at https://github.com/triSYCL/triSYCL/tree/device https://github.com/triSYCL/llvm https://github.com/triSYCL/clang
sycl-gtx is using some SYCL syntax extensions based on macros to have a meta-representation of the control flow in the kernel, as shown for example on this example: https://github.com/ProGTX/sycl-gtx/blob/master/tests/regression/work_efficient_prefix_sum.cpp

And the answer is: This is not how it's done, and I still don't think it's possible.
Even my first assumption was wrong.
If all you have is an ordinary C++ compiler, then any SYCL kernel can only be executed "in software", by the host device (CPU) running the "controller" code.
To translate the kernels to OpenCL (or SPIR-V) for execution on any other device, either an "augmented" compiler is necessary; or two compilers, one for the host, and one for the compute device.
A good explanation can be found here: https://www.codeplay.com/portal/introduction-to-sycl
The most related section is "What Would A SYCL Work Flow Look Like?"

Related

Use clang as a library to parse OpenCL code extended with some C++ elements

I am currently working on a Source-to-source compiler that transforms code wirtten in an OpenCL superset to "ordinary" OpenCL. I would really like to use clang as a library to parse and analyze the source code. Especially, I really need all the available type information and I would like to have an AST to make use of clang's Rewrite capabilities.
Fortunately, the OpenCL superset that needs to be parsed is really a "mixture" between OpenCL and C++, i.e. the code is basically OpenCL extended with some C++ stuff. In detail, there are possibly template annotations before a function definition and there may be structs containing methods (including operator definitions).
I was hoping that I can use clang to parse this language, since the clang parser is capable of parsing all these constructs. However, I am not sure how (if possible) to tell clang to parse OpenCL and C++ constructs at once. If possible, I really want to avoid touching the clang code base, but I would prefer using clang as a library instead. Maybe it is possible to setup an appropriate instance of clang's LangOptions class that tells clang to parse all these constructs?
Any ideas on how to make clang parse this mixture between OpenCL and C++? Any help is appreciated, and thanks in advance!
You're trying to mix two different front ends, involving both parsing and name resolution.
I think you are in for a rough trip. The key problem is you are trying to glue together things that had no effort expended, to make them gluable. This usually leads to integration hell. You don't see people doing this with Fortran and C++ for the same reasons.
To start with, you'll discover you will have to define the semantics of how the C++ extensions interact with those of OpenCL. If you check out the C++ standard, you'll discover 600 pages of results from committee arguments on how C++ interacts with itself. So unless you can define a radically simple interaction, you'll have a tough time knowing what your mixed OpenCL/C++ program means.
Your second problem will be interleaving the Clang parsing machinery for C++ (AFAIK hand written code) with the Clang parsing machinery for OpenCL (don't know anything about it, but assumed it follows the C++ style). There's no obviously good reason to believe you can just pick and choose these to interleave easily. It may work out fine; just not a bet I'd care to make.
The next place you are likely to have trouble is in building an AST for the joint language. Maybe it is the case that Clang has defined AST nodes for both C++ and OpenCL in a way that easily composes to a joint Clang/OpenCL tree. Since the node types are chosen by hand, and there was no specific reason to design them to work together, it is also not obvious they will compose nicely.
Your last task, given a "valid" OpenCL/C++ tree, to transform it to OpenCL. How in fact will you expand a C++ template (or any general C++ code) to OpenCL code?
[Check my bio for another system, DMS, that might be a bit better for this task; it provides uniform infrastructure for multiple languages that would make some of this easier. Somewhat similar to what you are trying to do, we have used DMS to mix C++ with F90 and APL concepts for easy expression of vector operations in a prototype Vector C++, but we did not try to preserve F90 and APL syntax and semantics exactly for all the above reasons].
It isn't my purpose to rain on your parade; progress is made by the unreasonable man. Just be sure you understand how big a task you are taking on.

Using libraries like boost in cuda device code

I am learning cuda at the moment and I am wondering if it is possible to use functions from different libraries and api's like boost in cuda device code.
Note: I tried using std::cout and that did not work I got printf working after changing the code generation to compute_20,sm_20. I'm using Visual Studio 2010. Cuda 5.0. GPU Nvidia GTX 570. NSIght is installed.
Here's an answer. And here's the CUDA documentation about language support. Boost won't make it for sure.
Since the purpose of using CUDA is to speed up kernels in your code, you'll typically want to limit the language complexities used, because of adding overhead. This will mean that you'll typically stay very close to plain C, with just a few sprinkles of C++ if that's really handy.
Constructs in for example Boost can result in large amounts of assembly code (and C++ in general has been criticised for this and is a reason to not use certain constructs in real-time software). This is all fine for most applications, but not for kernels you'll want to run on a GPU, where every instruction counts.
For CUDA (or OpenCL), people typically write intense algorithms that work on data in arrays. For example special image processing. You only use these techniques to do the calculation intensive tasks of you application. Then you have a 'regular' program that interacts with the user/network/database which creates these CUDA tasks (i.e. selects the data and parameters) and gives starts them. Here are CUDA samples.
Boost uses the expression templates technique not to loose performance while enabling a simpler syntax.
BlueBird and Newton are libraries using metaprogramming, similarly to Boost, enabling CUDA computations.
ArrayFire is another library using Just in Time compilation and exploiting the CUDA language underneath.
Finally, Thrust, as suggested by Njuffa, is a template library enabling CUDA calculations (but not using metaprogramming, see Evaluating expressions consisting of elementwise matrix operations in Thrust).

Explanation of CUDA C and C++

Can anyone give me a good explanation as to the nature of CUDA C and C++? As I understand it, CUDA is supposed to be C with NVIDIA's GPU libraries. As of right now CUDA C supports some C++ features but not others.
What is NVIDIA's plan? Are they going to build upon C and add their own libraries (e.g. Thrust vs. STL) that parallel those of C++? Are they eventually going to support all of C++? Is it bad to use C++ headers in a .cu file?
CUDA C is a programming language with C syntax. Conceptually it is quite different from C.
The problem it is trying to solve is coding multiple (similar) instruction streams for multiple processors.
CUDA offers more than Single Instruction Multiple Data (SIMD) vector processing, but data streams >> instruction streams, or there is much less benefit.
CUDA gives some mechanisms to do that, and hides some of the complexity.
CUDA is not optimised for multiple diverse instruction streams like a multi-core x86.
CUDA is not limited to a single instruction stream like x86 vector instructions, or limited to specific data types like x86 vector instructions.
CUDA supports 'loops' which can be executed in parallel. This is its most critical feature. The CUDA system will partition the execution of 'loops', and run the 'loop' body simultaneously across an array of identical processors, while providing some of the illusion of a normal sequential loop (specifically CUDA manages the loop "index"). The developer needs to be aware of the GPU machine structure to write 'loops' effectively, but almost all of the management is handled by the CUDA run-time. The effect is hundreds (or even thousands) of 'loops' complete in the same time as one 'loop'.
CUDA supports what looks like if branches. Only processors running code which match the if test can be active, so a subset of processors will be active for each 'branch' of the if test. As an example this if... else if ... else ..., has three branches. Each processor will execute only one branch, and be 're-synched' ready to move on with the rest of the processors when the if is complete. It may be that some of the branch conditions are not matched by any processor. So there is no need to execute that branch (for that example, three branches is the worst case). Then only one or two branches are executed sequentially, completing the whole if more quickly.
There is no 'magic'. The programmer must be aware that the code will be run on a CUDA device, and write code consciously for it.
CUDA does not take old C/C++ code and auto-magically run the computation across an array of processors. CUDA can compile and run ordinary C and much of C++ sequentially, but there is very little (nothing?) to be gained by that because it will run sequentially, and more slowly than a modern CPU. This means the code in some libraries is not (yet) a good match with CUDA capabilities. A CUDA program could operate on multi-kByte bit-vectors simultaneously. CUDA isn't able to auto-magically convert existing sequential C/C++ library code into something which would do that.
CUDA does provides a relatively straightforward way to write code, using familiar C/C++ syntax, adds a few extra concepts, and generates code which will run across an array of processors. It has the potential to give much more than 10x speedup vs e.g. multi-core x86.
Edit - Plans: I do not work for NVIDIA
For the very best performance CUDA wants information at compile time.
So template mechanisms are the most useful because it gives the developer a way to say things at compile time, which the CUDA compiler could use. As a simple example, if a matrix is defined (instantiated) at compile time to be 2D and 4 x 8, then the CUDA compiler can work with that to organise the program across the processors. If that size is dynamic, and changes while the program is running, it is much harder for the compiler or run-time system to do a very efficient job.
EDIT:
CUDA has class and function templates.
I apologise if people read this as saying CUDA does not. I agree I was not clear.
I believe the CUDA GPU-side implementation of templates is not complete w.r.t. C++.
User harrism has commented that my answer is misleading. harrism works for NVIDIA, so I will wait for advice. Hopefully this is already clearer.
The hardest stuff to do efficiently across multiple processors is dynamic branching down many alternate paths because that effectively serialises the code; in the worst case only one processor can execute at a time, which wastes the benefit of a GPU. So virtual functions seem to be very hard to do well.
There are some very smart whole-program-analysis tools which can deduce much more type information than the developer might understand. Existing tools might deduce enough to eliminate virtual functions, and hence move analysis of branching to compile time. There are also techniques for instrumenting program execution which feeds directly back into recompilation of programs which might reach better branching decisions.
AFAIK (modulo feedback) the CUDA compiler is not yet state-of-the-art in these areas.
(IMHO it is worth a few days for anyone interested, with a CUDA or OpenCL-capable system, to investigate them, and do some experiments. I also think, for people interested in these areas, it is well worth the effort to experiment with Haskell, and have a look at Data Parallel Haskell)
CUDA is a platform (architecture, programming model, assembly virtual machine, compilation tools, etc.), not just a single programming language. CUDA C is just one of a number of language systems built on this platform (CUDA C, C++, CUDA Fortran, PyCUDA, are others.)
CUDA C++
Currently CUDA C++ supports the subset of C++ described in Appendix D ("C/C++ Language Support") of the CUDA C Programming Guide.
To name a few:
Classes
__device__ member functions (including constructors and destructors)
Inheritance / derived classes
virtual functions
class and function templates
operators and overloading
functor classes
Edit: As of CUDA 7.0, CUDA C++ includes support for most language features of the C++11 standard in __device__ code (code that runs on the GPU), including auto, lambda expressions, range-based for loops, initializer lists, static assert, and more.
Examples and specific limitations are also detailed in the same appendix linked above. As a very mature example of C++ usage with CUDA, I recommend checking out Thrust.
Future Plans
(Disclosure: I work for NVIDIA.)
I can't be explicit about future releases and timing, but I can illustrate the trend that almost every release of CUDA has added additional language features to get CUDA C++ support to its current (In my opinion very useful) state. We plan to continue this trend in improving support for C++, but naturally we prioritize features that are useful and performant on a massively parallel computational architecture (GPU).
Not realized by many, CUDA is actually two new programming languages, both derived from C++. One is for writing code that runs on GPUs and is a subset of C++. Its function is similar to HLSL (DirectX) or Cg (OpenGL) but with more features and compatibility with C++. Various GPGPU/SIMT/performance-related concerns apply to it that I need not mention. The other is the so-called "Runtime API," which is hardly an "API" in the traditional sense. The Runtime API is used to write code that runs on the host CPU. It is a superset of C++ and makes it much easier to link to and launch GPU code. It requires the NVCC pre-compiler which then calls the platform's C++ compiler. By contrast, the Driver API (and OpenCL) is a pure, standard C library, and is much more verbose to use (while offering few additional features).
Creating a new host-side programming language was a bold move on NVIDIA's part. It makes getting started with CUDA easier and writing code more elegant. However, truly brilliant was not marketing it as a new language.
Sometimes you hear that CUDA would be C and C++, but I don't think it is, for the simple reason that this impossible. To cite from their programming guide:
For the host code, nvcc supports whatever part of the C++ ISO/IEC
14882:2003 specification the host c++ compiler supports.
For the device code, nvcc supports the features illustrated in Section
D.1 with some restrictions described in Section D.2; it does not
support run time type information (RTTI), exception handling, and the
C++ Standard Library.
As I can see, it only refers to C++, and only supports C where this happens to be in the intersection of C and C++. So better think of it as C++ with extensions for the device part rather than C. That avoids you a lot of headaches if you are used to C.
What is NVIDIA's plan?
I believe the general trend is that CUDA and OpenCL are regarded as too low level techniques for many applications. Right now, Nvidia is investing heavily into OpenACC which could roughly be described as OpenMP for GPUs. It follows a declarative approach and tackles the problem of GPU parallelization at a much higher level. So that is my totally subjective impression of what Nvidia's plan is.

how to use my existing .cpp code to write cuda code [duplicate]

I hv code in c++ and wanted to use it along with cuda.Can anyone please help me? Should I provide my code?? Actually I tried doing so but I need some starting code to proceed for my code.I know how to do simple square program (using cuda and c++)for windows(visual studio) .Is it sufficient to do the things for my program?
The following are both good places to start. CUDA by Example is a good tutorial that gets you up and running pretty fast. Programming Massively Parallel Processors includes more background, e.g. chapters on the history of GPU architecture, and generally more depth.
CUDA by Example: An Introduction to General-Purpose GPU Programming
Programming Massively Parallel Processors: A Hands-on Approach
These both talk about CUDA 3.x so you'll want to look at the new features in CUDA 4.x at some point.
Thrust is definitely worth a look if your problem maps onto it well (see comment above). It's an STL-like library of containers, iterators and algorithms that implements data-parallel algorithms on top of CUDA.
Here are two tutorials on getting started with CUDA and Visual C++ 2010:
http://www.ademiller.com/blogs/tech/2011/03/using-cuda-and-thrust-with-visual-studio-2010/
http://blog.cuvilib.com/2011/02/24/how-to-run-cuda-in-visual-studio-2010/
There's also a post on the NVIDIA forum:
http://forums.nvidia.com/index.php?showtopic=184539
Asking very general how do I get started on ... on Stack Overflow generally isn't the best approach. Typically the best reply you'll get is "go read a book or the manual". It's much better to ask specific questions here. Please don't create duplicate questions, it isn't helpful.
It's a non-trivial task to convert a program from straight C(++) to CUDA. As far as I know, it is possible to use C++ like stuff within CUDA (esp. with the announced CUDA 4.0), but I think it's easier to start with only C stuff (i.e. structs, pointers, elementary data types).
Start by reading the CUDA programming guide and by examining the examples coming with the CUDA SDK or available here. I personally found the vector addition sample quite enlightening. It can be found over here.
I can not tell you how to write your globals and shareds for your specific program, but after reading the introductory material, you will have at least a vague idea of how to do.
The problem is that it is (as far as I know) not possible to tell a generic way of transforming pure C(++) into code suitable for CUDA. But here are some corner stones for you:
Central idea for CUDA: Loops can be transformed into different threads executed multiple times in parallel on the GPU.
Therefore, the single iterations optimally are independent of other iterations.
For optimal execution, the single execution branches of the threads should be (almost) the same, i.e. the single threads sould do almost the same.
You can have multiple .cpp and .cu files in your project. Unless you want your .cu files to contain only device code, it should be fairly easy.
For your .cu files you specify a header file, containing host functions in it. Then, include that header file in other .cu or .cpp files. The linker will do the rest. It is nothing different than having multiple plain C++ .cpp files in your project.
I assume you already have CUDA rule files for your Visual Studio.

c++ device driver development in linux

I wanted to get more details for writing Graphics device drivers and audio device drivers using c++ for Linux box.
I am newbie at developing device drivers , Please provide me development/documentation details for the same.
Thanks
-Pravin
Coming to this page late, the question itself has been answered by Chris Stratton, but it's important to correct a couple of things Chris Becke put here that are common misconceptions with people that are not familiar with C++:
C++ does not create implicit code or data, just what you request. Even for an average C++ programmer, there will be no extra code or data. I found it out through knowing the asm behind C++, but just read Scott Meyers books it's good enough.
Not only are exceptions optional in C++, their entire code can be excluded in linkage for mostly every tool out there. This is in fact done in RT apps.
This is to address the misconceptions posted here. To add more however:
1) A novice C++ programmer may do nonsense, but a novice C programmer trying to implement by himself polymorphism and inheritance as done time and time again in the kernel just without calling it as such, will do lots more inefficient undebuggable nonsense.
2) Saying that, the only thing that may be created in base C++ is a virtual pointer IF YOU NEED IT and specify "virtual", and then also C programmers usually just create such a pointer manipulate it by themselves add lookup tables and get much harder bugs down the line due to it. As always in C++, if you don't mention "virtual" then you don't even get this pointer. Inheritance and encapsulation are of course completely free of overhead.
3) C++ creates the same amount of asm and memory as C if you don't EXPLICITLY request special features, but there is a common case when C++ is more efficient - when passing function pointers. If you use C++'s functors you can inline the pointed function. This is EXTREMELY useful in embedded apps.
4) If embedded RT uses C++ why linux doesn't? Just because of myths, so please do read this message carefully, and refer to scott meyers or better yet the asm itself. I am 20 years in RT and had the same disbelief in C++ when I switch 14 years ago, but the facts do not confirm any such distrust.
TL;DR - it's very easy to write as efficient and in a common case more efficient code in C++, studies, much industry experience and books are abound on this subject.
Linux kernel device drivers are written in C rather than C++.
Most device drivers are accessed via a special device file (/dev/yourdevice0) on which control as well as read and write operations can be performed.
User mode client programs and user mode drivers open the device file and use it as a pathway to talk to the kernel mode driver. These user mode drivers could conceivably be written in C++ or any other language.
Generally the best way to get started is to have a device which needs a driver, and learn what you need to in order to write it. And often the best way to do that is to find an existing driver for either a related device, or one with similar interface paradigms, and start by modifying that until it works for your new device instead or as well.
As there is no C++ runtime in the kernel, you will run into problems quickly. I suppose you could make a C++ runtime to run inside the kernel, but it would require some pretty good skills. Much greater skills than writing the driver in C.
Also, you would be put down instantly by Linux kernel developers. I mean REALLY put down. They'd flame you so bad, you'd never recover from it. Chances are that you would say "Screw Linux and their elitist bastards".
I don't want to sound negative, but I'm a mild and suitable voice in comparison to what you'd hear from others.
Linux drivers are developed in C. If you want to learn more about Linux drivers development, you should read this free eBook: http://lwn.net/Kernel/LDD3/
A tarball of all pdf chapters is also available: http://lwn.net/images/pdf/LDD3/ldd3_pdf.tar.bz2
C, not C++ is the language for writing (kernel mode) device drivers, and the reason ultimately is simple: C++ is an inappropriate language to use to write driver software. As a side effect of this, there is no c++ runtime available in kernel mode.
As to why c++ is inappropriate: There are at least two reasons:
Device drivers on all OS require strict code placement - some code needs to be in non pageable blocks, and non pageable memory is a limited resource. c++ generates lots of implicit code, being implicit its impossible to (a) audit, and (b) bracket with the necessary directives to guarantee placement.
exceptions have become non optional in c++. c++ exceptions are typically implemented in terms of CPU exceptions and a lot of driver code on is executed at levels where (cpu) exceptions cannot be tolerated (hence the requirement for non pageable blocks of code).
I think there are some other aspects I am forgetting, but, ideomatic c++ violates a number of constraints placed on drivers. Which is why C is preferred.
This is an old post, but I decide to write an answer since there is no answer explaining about how to do it without getting backfired.
Short Answer:
The answer is "yes, you can" … with tremendous effort. Let's just ignore Linus Trovald's opinion about C++
Long Answer:
My suggestion is writing it in C++ when you REALLY need it.
There are many pitfalls in the way of running C++ in kernel.
I recently study this subject and here’s my summarize:
There is no implementation for new and delete
It is the easiest thing to overcome.
The stack size on kernel is less than 8 kb (for 32bit) and 16kb (for 64bit).
Opps...not getting stack overrun would be challenging.
Following are not allowed
Global non-trivial variables ( there is no C++ runtime initializes it for your, using singleton would be better )
STL ( One of super powers of C++, you need to port C++ stdlib to make it working in kernel )
RTTI
Exceptions
It’s PIA to let C++ read a kernel header. If you like challenges, please at least read C++ in the Linux kernel before you go. Don't forget it has to be done every time while upgrading old code to a newer kernel.
If you’d like to know more in-depth knowledge, check out following articles.
C++ in the Linux kernel (2016)
Porting C++ code to Linux kernel (2009)
C++ article in OSDev.org
Also korisk has a demo repo in github for a barebone kernel module.
Conclusion
Again, my sincere opinion, assess the effort for running C++ in kernel module before you go.
Third time, it's better to assess the effort for running C++ in kernel module before you go!
Unfortunately, the current Linux header files are not compilable in C++ as they use new as a variable name, declare false, true and have some other issues.
You can use C++ in a Linux module but it's useless without including Linux headers. So you can't go far away from the simplest hello-world module.