Using Multiple GPUs with C++ CNTK

I'm trying to gradually move over from BrainScript to the C++ interface for CNTK. The complete lack of documentation doesn't help. My latest project is multi-GPU training; there's an example for single-GPU training. What is the best strategy for doing multi-GPU training? Is there a C++ equivalent of the Python data_parallel_distributed_learner (or other parallelisation methods), or do you have to code it yourself at the low level (data selection, model parameter combination, etc.)? How does this work with MPI? Are threads/OpenMP an option, as with evaluation (in which case, how do you select the GPU and combine the distributed models)?

The Python API mostly follows the C++ API, so if you understand how to train on multiple GPUs with Python, the C++ code is a straight translation from Python. For distributed training you will need CreateDataParallelDistributedLearner, and you need to specify the number of workers in the MinibatchSource constructor as well as make sure every worker reads a different part of the data, which can be done with the workerRank argument of GetNextMinibatch. As with Python, you will need an MPI implementation and must invoke your C++ program with mpirun.
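A minimal, untested C++ sketch of that recipe is below, for orientation only. It assumes the model, loss, metric, minibatch source, stream descriptions, and input variables are already set up as in the single-GPU example; the exact overloads used here (MPICommunicator, CreateDataParallelDistributedLearner, the GetNextMinibatch variant taking numberOfWorkers/workerRank) should be checked against the CNTKLibrary.h shipped with your CNTK version.

    #include "CNTKLibrary.h"
    using namespace CNTK;

    // Sketch only: model, loss, metric, minibatchSource, featureStreamInfo,
    // labelStreamInfo, features and labels are assumed to be created exactly
    // as in the single-GPU example.
    void TrainDataParallel(FunctionPtr model, FunctionPtr loss, FunctionPtr metric,
                           MinibatchSourcePtr minibatchSource,
                           StreamInformation featureStreamInfo,
                           StreamInformation labelStreamInfo,
                           Variable features, Variable labels)
    {
        auto communicator = MPICommunicator();
        auto localLearner = SGDLearner(model->Parameters(),
                                       LearningRatePerSampleSchedule(0.001));
        // Wrap the ordinary learner in a data-parallel distributed learner.
        auto learner = CreateDataParallelDistributedLearner(
            communicator, localLearner, 0 /* distributeAfterSamples */);
        auto trainer = CreateTrainer(model, loss, metric, { learner });

        const size_t minibatchSize = 64;
        auto device = DeviceDescriptor::UseDefaultDevice();
        size_t numWorkers = communicator->Workers().size();
        size_t workerRank = communicator->CurrentWorker().m_globalRank;

        while (true)
        {
            // Each MPI worker reads a different slice of the global minibatch.
            auto minibatch = minibatchSource->GetNextMinibatch(
                0 /* sequences */, minibatchSize, numWorkers, workerRank, device);
            if (minibatch.empty())
                break;
            trainer->TrainMinibatch({ { features, minibatch.at(featureStreamInfo) },
                                      { labels,   minibatch.at(labelStreamInfo) } },
                                    device);
        }
        DistributedCommunicator::Finalize();
    }

    // Launched with, e.g.:  mpirun -np 4 ./my_trainer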

Related

Is compiling user-made C++ code at runtime as an extension a good idea? [duplicate]

I am writing a research application that will utilise GPGPU using C++ and CUDA. I want to allow users of the application to tailor the program by writing kernel code that will be executed on the GPU.
My only thought so far is outputting the user code into a .cu file, then calling the platform's compiler to create a dynamic library, which can then be loaded at runtime by the host application. Is this viable? Even if it is, I'm very concerned that doing this will make my program unstable and a nightmare to make cross-platform.
Any thoughts/alternatives or comments would be greatly appreciated.
Theoretically it is possible. I would recommend OpenCL instead of CUDA. It is not as optimized as CUDA on the Nvidia platform, but it is designed to support run-time compilation (every OpenCL runtime driver includes a compiler that, as the first step of executing a kernel, compiles it).
Another advantage is that OpenCL is more portable than CUDA, as OpenCL also runs on ATI (GPU and CPU) and Intel hardware.
You could do it, and it's viable, but IMO you would need a really good reason for allowing users to edit the CUDA kernel. I'm not sure what you have in mind for a user interface, or how the code the user runs in the CUDA kernel will interface with the outside world, but this could get tricky. It might be better to pre-implement a set of CUDA kernels and let users choose from a known set of parameters for each kernel.
Have you looked at PyCUDA? It basically implements a similar idea, allowing Python users to write C++ CUDA kernels inside Python applications. PyCUDA provides functionality that helps users integrate their Python code with the kernels they write, so that when they run the Python script, the kernel compiles and runs as part of it.
I haven't looked at the inner workings of PyCUDA, but I assume that at its core it is doing something similar to what you are trying to achieve. Looking at PyCUDA might give you an idea of what's needed to write your own implementation.
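One concrete alternative to writing out a .cu file and building a shared library is NVRTC, NVIDIA's runtime compilation library, combined with the CUDA driver API. The following is a minimal, illustrative sketch (the kernel source and names are made up, and all error checking is omitted):

    #include <cuda.h>
    #include <nvrtc.h>
    #include <string>

    // Compile user-supplied kernel source at run time with NVRTC and load the
    // resulting PTX through the CUDA driver API. Error handling omitted.
    int main() {
        const char* userSource =
            "extern \"C\" __global__ void scale(float* data, float factor, int n) {\n"
            "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
            "    if (i < n) data[i] *= factor;\n"
            "}\n";

        nvrtcProgram prog;
        nvrtcCreateProgram(&prog, userSource, "user_kernel.cu", 0, nullptr, nullptr);
        nvrtcCompileProgram(prog, 0, nullptr);      // pass -arch=... options as needed

        size_t ptxSize;
        nvrtcGetPTXSize(prog, &ptxSize);
        std::string ptx(ptxSize, '\0');
        nvrtcGetPTX(prog, &ptx[0]);
        nvrtcDestroyProgram(&prog);

        cuInit(0);
        CUdevice dev;   cuDeviceGet(&dev, 0);
        CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
        CUmodule mod;   cuModuleLoadData(&mod, ptx.c_str());
        CUfunction fn;  cuModuleGetFunction(&fn, mod, "scale");
        // ...allocate device memory, set up arguments, launch with cuLaunchKernel...
        return 0;
    }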

Is it possible to do inference using a pre-trained TensorFlow model from inside a CUDA kernel?

I need to do inference using a TensorFlow model from inside a CUDA kernel. For that I will need device functions for inference that can be called from inside the CUDA kernel. I did not find anything like that in the TensorFlow C++ API.
I'm by no means an expert on TensorFlow. But consider that running GPU inference on a non-trivial network will generally involve multiple kernel calls. It would seem highly unlikely that the kind of API you're looking for exists.
Even given that launching kernels from within other kernels is possible in theory (e.g., using Dynamic Parallelism), the whole point of TensorFlow is to describe your computation at a level of abstraction far above anything to do with CUDA. You use TensorFlow to do the mapping to CUDA for you. TensorFlow is basically a kind of compiler that translates your computational graph into whatever it considers the best way of performing the computation described by the graph on the given target hardware. The details of this mapping are highly target-specific and subject to change. Exposing any of this in a public API would seem to go against the very nature of what TensorFlow aims to be.
Of course, TensorFlow is open source, so one can always go look and figure out exactly what the device code generated by TensorFlow looks like and how it has to be invoked. However, the amount of effort required to do so is most likely prohibitive, and the whole thing would have to be expected to break with every new version…
Rather than asking how to manually call the internals of a TensorFlow session, a more fruitful approach would seem to be to have TensorFlow call you instead. It would seem that, e.g., by adding a custom operation, you could have TensorFlow invoke your GPU code…
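To make the custom-operation suggestion concrete, here is a rough, untested skeleton of a custom TensorFlow op registered for the GPU; the op name, the LaunchMyGpuStage helper, and the omitted stream handling are all placeholders, and the separate .cu.cc file containing the actual CUDA kernel is not shown:

    #include "tensorflow/core/framework/op.h"
    #include "tensorflow/core/framework/op_kernel.h"

    using namespace tensorflow;

    REGISTER_OP("MyGpuStage")
        .Input("in: float")
        .Output("out: float");

    // Hypothetical launcher, defined in a separate .cu.cc file compiled by nvcc.
    void LaunchMyGpuStage(const float* in, float* out, int n);

    class MyGpuStageOp : public OpKernel {
     public:
      explicit MyGpuStageOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
      void Compute(OpKernelContext* ctx) override {
        const Tensor& input = ctx->input(0);
        Tensor* output = nullptr;
        OP_REQUIRES_OK(ctx, ctx->allocate_output(0, input.shape(), &output));
        // TensorFlow schedules this op on the GPU; we hand the device pointers
        // to our own CUDA code instead of calling into TensorFlow from a kernel.
        LaunchMyGpuStage(input.flat<float>().data(),
                         output->flat<float>().data(),
                         static_cast<int>(input.NumElements()));
      }
    };

    REGISTER_KERNEL_BUILDER(Name("MyGpuStage").Device(DEVICE_GPU), MyGpuStageOp);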

Using libraries like Boost in CUDA device code

I am learning CUDA at the moment and I am wondering whether it is possible to use functions from libraries and APIs like Boost in CUDA device code.
Note: I tried using std::cout and that did not work; I got printf working after changing the code generation to compute_20,sm_20. I'm using Visual Studio 2010, CUDA 5.0, and an Nvidia GTX 570; Nsight is installed.
Here's an answer, and here's the CUDA documentation about language support. Boost definitely won't work.
Since the purpose of using CUDA is to speed up kernels in your code, you'll typically want to limit the language complexity you use, because it adds overhead. This means you'll typically stay very close to plain C, with just a few sprinkles of C++ where that's really handy.
Constructs in, for example, Boost can result in large amounts of assembly code (C++ in general has been criticised for this, and it is a reason not to use certain constructs in real-time software). This is all fine for most applications, but not for kernels you want to run on a GPU, where every instruction counts.
For CUDA (or OpenCL), people typically write computation-intensive algorithms that operate on data in arrays, for example specialised image processing. You only use these techniques for the calculation-intensive tasks of your application. Then you have a 'regular' program that interacts with the user/network/database, creates these CUDA tasks (i.e. selects the data and parameters), and starts them. Here are the CUDA samples.
Boost uses the expression template technique to avoid losing performance while enabling a simpler syntax.
BlueBird and Newton are libraries using metaprogramming, similarly to Boost, enabling CUDA computations.
ArrayFire is another library using Just in Time compilation and exploiting the CUDA language underneath.
Finally, Thrust, as suggested by Njuffa, is a template library enabling CUDA calculations (but not using metaprogramming; see Evaluating expressions consisting of elementwise matrix operations in Thrust).
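As a small illustration of the template-library approach (Thrust here, since it ships with the CUDA toolkit), the following adds two vectors on the GPU without writing any kernel code by hand:

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>
    #include <iostream>

    int main() {
        thrust::device_vector<float> a(1 << 20, 1.0f);
        thrust::device_vector<float> b(1 << 20, 2.0f);
        thrust::device_vector<float> c(1 << 20);

        // c = a + b, executed on the GPU; Thrust generates the kernel.
        thrust::transform(a.begin(), a.end(), b.begin(), c.begin(),
                          thrust::plus<float>());

        std::cout << "c[0] = " << c[0] << std::endl;  // prints 3
        return 0;
    }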

Explanation of CUDA C and C++

Can anyone give me a good explanation as to the nature of CUDA C and C++? As I understand it, CUDA is supposed to be C with NVIDIA's GPU libraries. As of right now CUDA C supports some C++ features but not others.
What is NVIDIA's plan? Are they going to build upon C and add their own libraries (e.g. Thrust vs. STL) that parallel those of C++? Are they eventually going to support all of C++? Is it bad to use C++ headers in a .cu file?
CUDA C is a programming language with C syntax. Conceptually it is quite different from C.
The problem it is trying to solve is coding multiple (similar) instruction streams for multiple processors.
CUDA offers more than Single Instruction Multiple Data (SIMD) vector processing, but it still needs far more data streams than instruction streams, or there is much less benefit.
CUDA gives some mechanisms to do that, and hides some of the complexity.
CUDA is not optimised for multiple diverse instruction streams like a multi-core x86.
CUDA is not limited to a single instruction stream like x86 vector instructions, or limited to specific data types like x86 vector instructions.
CUDA supports 'loops' which can be executed in parallel. This is its most critical feature. The CUDA system will partition the execution of 'loops', and run the 'loop' body simultaneously across an array of identical processors, while providing some of the illusion of a normal sequential loop (specifically CUDA manages the loop "index"). The developer needs to be aware of the GPU machine structure to write 'loops' effectively, but almost all of the management is handled by the CUDA run-time. The effect is hundreds (or even thousands) of 'loops' complete in the same time as one 'loop'.
CUDA supports what looks like if branches. Only processors running code which match the if test can be active, so a subset of processors will be active for each 'branch' of the if test. As an example this if... else if ... else ..., has three branches. Each processor will execute only one branch, and be 're-synched' ready to move on with the rest of the processors when the if is complete. It may be that some of the branch conditions are not matched by any processor. So there is no need to execute that branch (for that example, three branches is the worst case). Then only one or two branches are executed sequentially, completing the whole if more quickly.
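A short illustrative kernel (names invented here) shows both points: the body of the sequential loop becomes the kernel, CUDA supplies the 'loop index' from the thread and block indices, and the if/else branches are handled by the divergence-and-resync mechanism described above.

    __global__ void clampAndScale(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // the managed 'loop index'
        if (i >= n) return;                              // guard for the last partial block

        // Divergent branch: threads taking different paths are briefly serialised,
        // then re-synched, as described above.
        if (in[i] < 0.0f)
            out[i] = 0.0f;
        else if (in[i] > 1.0f)
            out[i] = 1.0f;
        else
            out[i] = in[i] * 2.0f;
    }

    // Host-side launch that replaces the sequential loop over n elements:
    //   int threads = 256, blocks = (n + threads - 1) / threads;
    //   clampAndScale<<<blocks, threads>>>(d_in, d_out, n);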
There is no 'magic'. The programmer must be aware that the code will be run on a CUDA device, and write code consciously for it.
CUDA does not take old C/C++ code and auto-magically run the computation across an array of processors. CUDA can compile and run ordinary C and much of C++ sequentially, but there is very little (nothing?) to be gained by that because it will run sequentially, and more slowly than a modern CPU. This means the code in some libraries is not (yet) a good match with CUDA capabilities. A CUDA program could operate on multi-kByte bit-vectors simultaneously. CUDA isn't able to auto-magically convert existing sequential C/C++ library code into something which would do that.
CUDA does provide a relatively straightforward way to write code, using familiar C/C++ syntax; it adds a few extra concepts and generates code which will run across an array of processors. It has the potential to give much more than a 10x speedup vs., e.g., multi-core x86.
Edit - Plans: I do not work for NVIDIA
For the very best performance CUDA wants information at compile time.
So template mechanisms are the most useful, because they give the developer a way to say things at compile time which the CUDA compiler can use. As a simple example, if a matrix is defined (instantiated) at compile time to be 2D and 4 x 8, then the CUDA compiler can work with that to organise the program across the processors. If that size is dynamic, and changes while the program is running, it is much harder for the compiler or run-time system to do a very efficient job.
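For instance, a hypothetical kernel with its shape fixed by non-type template parameters lets the compiler size shared memory and unroll loops statically (illustrative only):

    // ROWS and COLS are known at compile time, so the shared memory tile is
    // statically sized and the compiler can fully unroll any loops over it.
    template <int ROWS, int COLS>
    __global__ void scaleMatrix(float (*m)[COLS], float factor) {
        __shared__ float tile[ROWS][COLS];
        int r = threadIdx.y, c = threadIdx.x;
        if (r < ROWS && c < COLS) {
            tile[r][c] = m[r][c] * factor;
            m[r][c] = tile[r][c];
        }
    }

    // Instantiated for the 4 x 8 case mentioned above:
    //   dim3 block(8, 4);
    //   scaleMatrix<4, 8><<<1, block>>>(d_matrix, 2.0f);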
EDIT:
CUDA has class and function templates.
I apologise if people read this as saying CUDA does not. I agree I was not clear.
I believe the CUDA GPU-side implementation of templates is not complete w.r.t. C++.
User harrism has commented that my answer is misleading. harrism works for NVIDIA, so I will wait for advice. Hopefully this is already clearer.
The hardest thing to do efficiently across multiple processors is dynamic branching down many alternate paths, because that effectively serialises the code; in the worst case only one processor can execute at a time, which wastes the benefit of a GPU. So virtual functions seem very hard to do well.
There are some very smart whole-program-analysis tools which can deduce much more type information than the developer might understand. Existing tools might deduce enough to eliminate virtual functions, and hence move analysis of branching to compile time. There are also techniques for instrumenting program execution which feeds directly back into recompilation of programs which might reach better branching decisions.
AFAIK (modulo feedback) the CUDA compiler is not yet state-of-the-art in these areas.
(IMHO it is worth a few days for anyone interested, with a CUDA or OpenCL-capable system, to investigate them, and do some experiments. I also think, for people interested in these areas, it is well worth the effort to experiment with Haskell, and have a look at Data Parallel Haskell)
CUDA is a platform (architecture, programming model, assembly virtual machine, compilation tools, etc.), not just a single programming language. CUDA C is just one of a number of language systems built on this platform (CUDA C++, CUDA Fortran, and PyCUDA are others).
CUDA C++
Currently CUDA C++ supports the subset of C++ described in Appendix D ("C/C++ Language Support") of the CUDA C Programming Guide.
To name a few:
Classes
__device__ member functions (including constructors and destructors)
Inheritance / derived classes
virtual functions
class and function templates
operators and overloading
functor classes
Edit: As of CUDA 7.0, CUDA C++ includes support for most language features of the C++11 standard in __device__ code (code that runs on the GPU), including auto, lambda expressions, range-based for loops, initializer lists, static assert, and more.
Examples and specific limitations are also detailed in the same appendix linked above. As a very mature example of C++ usage with CUDA, I recommend checking out Thrust.
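A tiny, illustrative sketch of a couple of the features listed above (a class with a __device__ member function used from a function template kernel); the names here are made up and not from the Programming Guide:

    struct Complex {
        float re, im;
        __device__ float norm2() const { return re * re + im * im; }
    };

    // Function template kernel calling a __device__ member function on the GPU.
    template <typename T>
    __global__ void sumNorms(const T* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(out, in[i].norm2());
    }

    // Launched as, e.g.:
    //   sumNorms<Complex><<<blocks, threads>>>(d_in, d_out, n);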
Future Plans
(Disclosure: I work for NVIDIA.)
I can't be explicit about future releases and timing, but I can point to the trend: almost every release of CUDA has added language features to bring CUDA C++ support to its current (in my opinion, very useful) state. We plan to continue this trend of improving support for C++, but naturally we prioritize features that are useful and performant on a massively parallel computational architecture (GPU).
Something not realized by many is that CUDA is actually two new programming languages, both derived from C++. One is for writing code that runs on GPUs and is a subset of C++. Its function is similar to HLSL (DirectX) or Cg (OpenGL), but with more features and compatibility with C++. Various GPGPU/SIMT/performance-related concerns apply to it that I need not mention here. The other is the so-called "Runtime API," which is hardly an "API" in the traditional sense. The Runtime API is used to write code that runs on the host CPU. It is a superset of C++ and makes it much easier to link to and launch GPU code. It requires the NVCC pre-compiler, which then calls the platform's C++ compiler. By contrast, the Driver API (and OpenCL) is a pure, standard C library, and is much more verbose to use (while offering few additional features).
Creating a new host-side programming language was a bold move on NVIDIA's part. It makes getting started with CUDA easier and writing code more elegant. The truly brilliant part, however, was not marketing it as a new language.
Sometimes you hear the claim that CUDA is C and C++, but I don't think it is, for the simple reason that this is impossible. To cite from their programming guide:
For the host code, nvcc supports whatever part of the C++ ISO/IEC 14882:2003 specification the host C++ compiler supports.
For the device code, nvcc supports the features illustrated in Section D.1 with some restrictions described in Section D.2; it does not support run-time type information (RTTI), exception handling, and the C++ Standard Library.
As far as I can see, it only refers to C++, and only supports C where it happens to fall in the intersection of C and C++. So better to think of it as C++ with extensions for the device part rather than C. That saves you a lot of headaches if you are used to C.
What is NVIDIA's plan?
I believe the general trend is that CUDA and OpenCL are regarded as too low-level for many applications. Right now, Nvidia is investing heavily in OpenACC, which could roughly be described as OpenMP for GPUs. It follows a declarative approach and tackles the problem of GPU parallelization at a much higher level. So that is my totally subjective impression of what Nvidia's plan is.
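For flavour, here is a small, illustrative OpenACC version of a SAXPY loop: a directive asks the compiler (invoked with an OpenACC flag such as -acc) to parallelise and offload the loop to the GPU, with no CUDA kernel written by hand.

    // Illustrative OpenACC sketch: the pragma handles parallelisation and data
    // movement; the rest is ordinary C/C++.
    void saxpy(int n, float a, const float* x, float* y) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }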
