Using libraries like boost in cuda device code

Using libraries like boost in cuda device code - c++

I am learning cuda at the moment and I am wondering if it is possible to use functions from different libraries and api's like boost in cuda device code.
Note: I tried using std::cout and that did not work I got printf working after changing the code generation to compute_20,sm_20. I'm using Visual Studio 2010. Cuda 5.0. GPU Nvidia GTX 570. NSIght is installed.

Here's an answer. And here's the CUDA documentation about language support. Boost won't make it for sure.
Since the purpose of using CUDA is to speed up kernels in your code, you'll typically want to limit the language complexities used, because of adding overhead. This will mean that you'll typically stay very close to plain C, with just a few sprinkles of C++ if that's really handy.
Constructs in for example Boost can result in large amounts of assembly code (and C++ in general has been criticised for this and is a reason to not use certain constructs in real-time software). This is all fine for most applications, but not for kernels you'll want to run on a GPU, where every instruction counts.
For CUDA (or OpenCL), people typically write intense algorithms that work on data in arrays. For example special image processing. You only use these techniques to do the calculation intensive tasks of you application. Then you have a 'regular' program that interacts with the user/network/database which creates these CUDA tasks (i.e. selects the data and parameters) and gives starts them. Here are CUDA samples.

Boost uses the expression templates technique not to loose performance while enabling a simpler syntax.
BlueBird and Newton are libraries using metaprogramming, similarly to Boost, enabling CUDA computations.
ArrayFire is another library using Just in Time compilation and exploiting the CUDA language underneath.
Finally, Thrust, as suggested by Njuffa, is a template library enabling CUDA calculations (but not using metaprogramming, see Evaluating expressions consisting of elementwise matrix operations in Thrust).

Related

Parallel computation with unreal engine 4

I am currently researching the usability of Unreal Engine for a computational intensive project and Google have not been terribly helpful.
I need to do some heavy computation, in the background of the game loop, for the project, which is currently implemented using OpenMP. A fluid simulation.
Unreal Engine on windows compiles and builds using visual studio and should as such have support for what the visual studio compiler has, but it is unclear to me if you can make UE compile with the proper settings for OpenMP.
Alternatively, I could easily make this work with Intels threading building blocks TBB, which is apparently already used in UE... though perhaps only the memory allocator part?
In general I guess my question is what support there is for cross platform heavily parallelized computation with UE. If you need to do a large number of identical computations running on all available cores, how do you do that in UE?
I am not interested in manually creating some number of threads, run them on blocks code, wait for completion by joining and then move on. I would much prefer a parallel_for loop for my purpose.
Further more, I wonder how easy it is to use other frameworks such as CUDA or OpenCL from within a UE "script".
I come from Unity where I made c# wrappers around native dlls, written in c++, for high performance, but I am getting a bit tired of this wrapping native in c#, so I would much prefer a c++ based engine... but if that itself is limited in what I can do in my c++ code, then it is not much help... I assume it can all work, when I learn how, but I am a bit confused by not being able to find any info about any of the above questions. Anything with the word "parallel" in it generally lead to blueprints.

There's a ParallelFor() in UE, defined here: "Engine\Source\Runtime\Core\Public\Async\ParallelFor.h"
Signature is void ParallelFor(int32 Num, TFunctionRef<void(int32)> Body, bool bForceSingleThread = false)
Example use can be found easily in the source codes.

Convert CPU library to CUDA

I have a CPU library that I am using (ICA-Independent Component Analysis). I need to convert my project to CUDA, but I don't seem to find a library for CUDA. Is it possible to easily convert the library and use it in CUDA or do I have to rewrite the library by myself?
Thanks.

You will have to modify that library. It might be easy, it might not. (I cannot tell, because I do not know what is in that library.)
NVIDIA offers an OpenACC compiler here.
There are many libraries available that use CUDA to perform computation on the GPU: https://developer.nvidia.com/gpu-accelerated-libraries
I suggest using Thrust since it comes bundled with the CUDA SDK.

I'd suggest to look into OpenACC. In case your code is sufficiently GPU friendly (storage order, high parallel task granularity) you should be able to convert quite quickly. If you already have OpenMP directives you can basically add OpenACC kernel directives there, test it, then profile your data movement and probably you will need to manually specify a GPU data region to avoid too many copy operations.

Sharing multi-threaded code between standard C++ and OpenCL

I am trying to develop a C++ application that will have to run many calculations in parallel. The algorithms will be pretty large but they will be purely mathematical and on floating point integers. They should therefore work on the GPU using OpenCL. I would like the program to work on systems that do not support OpenCL but I would like for it to also be able to use the GPU for extra speed on systems that do.
My problem is that I do not want to have to maintain two sets of code (standard C++ probably using std::thread and OpenCL). I do realize that I will be able to share a lot of the actual code but the only obvious way to do so is to manually copy the shareable parts over between the files and that is really not something that I want to do for maintainability reasons and to prevent errors.
I would like to be able to use only one set of files. Is there any proper way to do this? Would OpenCL emulation be an option in a way?
PS. I know of the AMD OpenCL emulator but it seems to be only for development and also only for Windows. Or am I wrong?

OpenCL can use the CPU as a compute device, so OpenCL code can run on platforms without a GPU. However, the architectural differences between a GPU and a CPU will most likely require you to still maintain two code bases to get optimal performance in both situations.

C++ container and openCL

I am looking into some C++ code which extensively uses boost::multi_array<double>.
The next step is to port the code to use openCL. As I'm quite new to openCL I don't know what I should do with the multi_array. Should I rewrite it into a nested-openCL-vector or nested-c-array.
What would you do?

There are boost like libraries already exist for OpenCL, you may want to look at the following libraries from GPU Vendors
Thurst from NVIDIA: Thrust is a powerful library of parallel algorithms and data structures. Thrust provides a flexible, high-level interface for GPU programming that greatly enhances developer productivity. Using Thrust, C++ developers can write just a few lines of code to perform GPU-accelerated sort, scan, transform, and reduction operations orders of magnitude faster than the latest multi-core CPUs. For example, the thrust::sort algorithm delivers 5x to 100x faster sorting performance than STL and TBB.
AMD offers Math Libraries are software libraries containing FFT and BLAS functions written in OpenCL and designed to run on AMD GPUs for more info look here:
http://developer.amd.com/libraries/appmathlibs/Pages/default.aspx

Explanation of CUDA C and C++

Can anyone give me a good explanation as to the nature of CUDA C and C++? As I understand it, CUDA is supposed to be C with NVIDIA's GPU libraries. As of right now CUDA C supports some C++ features but not others.
What is NVIDIA's plan? Are they going to build upon C and add their own libraries (e.g. Thrust vs. STL) that parallel those of C++? Are they eventually going to support all of C++? Is it bad to use C++ headers in a .cu file?

CUDA C is a programming language with C syntax. Conceptually it is quite different from C.
The problem it is trying to solve is coding multiple (similar) instruction streams for multiple processors.
CUDA offers more than Single Instruction Multiple Data (SIMD) vector processing, but data streams >> instruction streams, or there is much less benefit.
CUDA gives some mechanisms to do that, and hides some of the complexity.
CUDA is not optimised for multiple diverse instruction streams like a multi-core x86.
CUDA is not limited to a single instruction stream like x86 vector instructions, or limited to specific data types like x86 vector instructions.
CUDA supports 'loops' which can be executed in parallel. This is its most critical feature. The CUDA system will partition the execution of 'loops', and run the 'loop' body simultaneously across an array of identical processors, while providing some of the illusion of a normal sequential loop (specifically CUDA manages the loop "index"). The developer needs to be aware of the GPU machine structure to write 'loops' effectively, but almost all of the management is handled by the CUDA run-time. The effect is hundreds (or even thousands) of 'loops' complete in the same time as one 'loop'.
CUDA supports what looks like if branches. Only processors running code which match the if test can be active, so a subset of processors will be active for each 'branch' of the if test. As an example this if... else if ... else ..., has three branches. Each processor will execute only one branch, and be 're-synched' ready to move on with the rest of the processors when the if is complete. It may be that some of the branch conditions are not matched by any processor. So there is no need to execute that branch (for that example, three branches is the worst case). Then only one or two branches are executed sequentially, completing the whole if more quickly.
There is no 'magic'. The programmer must be aware that the code will be run on a CUDA device, and write code consciously for it.
CUDA does not take old C/C++ code and auto-magically run the computation across an array of processors. CUDA can compile and run ordinary C and much of C++ sequentially, but there is very little (nothing?) to be gained by that because it will run sequentially, and more slowly than a modern CPU. This means the code in some libraries is not (yet) a good match with CUDA capabilities. A CUDA program could operate on multi-kByte bit-vectors simultaneously. CUDA isn't able to auto-magically convert existing sequential C/C++ library code into something which would do that.
CUDA does provides a relatively straightforward way to write code, using familiar C/C++ syntax, adds a few extra concepts, and generates code which will run across an array of processors. It has the potential to give much more than 10x speedup vs e.g. multi-core x86.
Edit - Plans: I do not work for NVIDIA
For the very best performance CUDA wants information at compile time.
So template mechanisms are the most useful because it gives the developer a way to say things at compile time, which the CUDA compiler could use. As a simple example, if a matrix is defined (instantiated) at compile time to be 2D and 4 x 8, then the CUDA compiler can work with that to organise the program across the processors. If that size is dynamic, and changes while the program is running, it is much harder for the compiler or run-time system to do a very efficient job.
EDIT:
CUDA has class and function templates.
I apologise if people read this as saying CUDA does not. I agree I was not clear.
I believe the CUDA GPU-side implementation of templates is not complete w.r.t. C++.
User harrism has commented that my answer is misleading. harrism works for NVIDIA, so I will wait for advice. Hopefully this is already clearer.
The hardest stuff to do efficiently across multiple processors is dynamic branching down many alternate paths because that effectively serialises the code; in the worst case only one processor can execute at a time, which wastes the benefit of a GPU. So virtual functions seem to be very hard to do well.
There are some very smart whole-program-analysis tools which can deduce much more type information than the developer might understand. Existing tools might deduce enough to eliminate virtual functions, and hence move analysis of branching to compile time. There are also techniques for instrumenting program execution which feeds directly back into recompilation of programs which might reach better branching decisions.
AFAIK (modulo feedback) the CUDA compiler is not yet state-of-the-art in these areas.
(IMHO it is worth a few days for anyone interested, with a CUDA or OpenCL-capable system, to investigate them, and do some experiments. I also think, for people interested in these areas, it is well worth the effort to experiment with Haskell, and have a look at Data Parallel Haskell)

CUDA is a platform (architecture, programming model, assembly virtual machine, compilation tools, etc.), not just a single programming language. CUDA C is just one of a number of language systems built on this platform (CUDA C, C++, CUDA Fortran, PyCUDA, are others.)
CUDA C++
Currently CUDA C++ supports the subset of C++ described in Appendix D ("C/C++ Language Support") of the CUDA C Programming Guide.
To name a few:
Classes
__device__ member functions (including constructors and destructors)
Inheritance / derived classes
virtual functions
class and function templates
operators and overloading
functor classes
Edit: As of CUDA 7.0, CUDA C++ includes support for most language features of the C++11 standard in __device__ code (code that runs on the GPU), including auto, lambda expressions, range-based for loops, initializer lists, static assert, and more.
Examples and specific limitations are also detailed in the same appendix linked above. As a very mature example of C++ usage with CUDA, I recommend checking out Thrust.
Future Plans
(Disclosure: I work for NVIDIA.)
I can't be explicit about future releases and timing, but I can illustrate the trend that almost every release of CUDA has added additional language features to get CUDA C++ support to its current (In my opinion very useful) state. We plan to continue this trend in improving support for C++, but naturally we prioritize features that are useful and performant on a massively parallel computational architecture (GPU).

Not realized by many, CUDA is actually two new programming languages, both derived from C++. One is for writing code that runs on GPUs and is a subset of C++. Its function is similar to HLSL (DirectX) or Cg (OpenGL) but with more features and compatibility with C++. Various GPGPU/SIMT/performance-related concerns apply to it that I need not mention. The other is the so-called "Runtime API," which is hardly an "API" in the traditional sense. The Runtime API is used to write code that runs on the host CPU. It is a superset of C++ and makes it much easier to link to and launch GPU code. It requires the NVCC pre-compiler which then calls the platform's C++ compiler. By contrast, the Driver API (and OpenCL) is a pure, standard C library, and is much more verbose to use (while offering few additional features).
Creating a new host-side programming language was a bold move on NVIDIA's part. It makes getting started with CUDA easier and writing code more elegant. However, truly brilliant was not marketing it as a new language.

Sometimes you hear that CUDA would be C and C++, but I don't think it is, for the simple reason that this impossible. To cite from their programming guide:
For the host code, nvcc supports whatever part of the C++ ISO/IEC
14882:2003 specification the host c++ compiler supports.
For the device code, nvcc supports the features illustrated in Section
D.1 with some restrictions described in Section D.2; it does not
support run time type information (RTTI), exception handling, and the
C++ Standard Library.
As I can see, it only refers to C++, and only supports C where this happens to be in the intersection of C and C++. So better think of it as C++ with extensions for the device part rather than C. That avoids you a lot of headaches if you are used to C.

What is NVIDIA's plan?
I believe the general trend is that CUDA and OpenCL are regarded as too low level techniques for many applications. Right now, Nvidia is investing heavily into OpenACC which could roughly be described as OpenMP for GPUs. It follows a declarative approach and tackles the problem of GPU parallelization at a much higher level. So that is my totally subjective impression of what Nvidia's plan is.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js