Sharing multi-threaded code between standard C++ and OpenCL

I am trying to develop a C++ application that will have to run many calculations in parallel. The algorithms will be pretty large, but they will be purely mathematical, operating on floating-point numbers. They should therefore work well on the GPU using OpenCL. I would like the program to work on systems that do not support OpenCL, but I would also like it to be able to use the GPU for extra speed on systems that do.
My problem is that I do not want to maintain two sets of code (standard C++, probably using std::thread, and OpenCL). I realize that I will be able to share a lot of the actual code, but the only obvious way to do so is to manually copy the shareable parts between the files, and that is really not something I want to do, for maintainability reasons and to prevent errors.
I would like to be able to use only one set of files. Is there any proper way to do this? Would some form of OpenCL emulation be an option?
PS. I know of the AMD OpenCL emulator, but it seems to be intended only for development, and only for Windows. Or am I wrong?

OpenCL can use the CPU as a compute device, so OpenCL code can run on platforms without a GPU. However, the architectural differences between a GPU and a CPU mean you will most likely still have to maintain two code bases to get optimal performance in both situations.
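One way to exploit this is to prefer a GPU device and silently fall back to a CPU device when no GPU is present. Below is a minimal sketch, assuming the OpenCL headers and at least one installed platform; for brevity it only queries the first platform and skips most error checking, where a real program would iterate over all platforms and check every return code.

```cpp
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);

    cl_device_id device;
    // Prefer a GPU; if none is available, fall back to a CPU device.
    cl_int err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    if (err != CL_SUCCESS)
        err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, nullptr);
    if (err != CL_SUCCESS) {
        std::fprintf(stderr, "no OpenCL device found\n");
        return 1;
    }

    char name[256];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, nullptr);
    std::printf("using device: %s\n", name);

    // The same kernel source can now be built for whichever device was found.
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    clReleaseContext(ctx);
    return 0;
}
```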

Related


Using libraries like boost in cuda device code

I am learning CUDA at the moment and I am wondering whether it is possible to use functions from other libraries and APIs, such as Boost, in CUDA device code.
Note: I tried using std::cout and that did not work; I got printf working after changing the code generation to compute_20,sm_20. I'm using Visual Studio 2010, CUDA 5.0, an NVIDIA GTX 570, and Nsight is installed.
Here's an answer, and here's the CUDA documentation about language support. Boost certainly won't work.
Since the purpose of using CUDA is to speed up kernels in your code, you'll typically want to limit the language complexity you use, because it adds overhead. This means that you'll typically stay very close to plain C, with just a few sprinkles of C++ where that's really handy.
Constructs in, for example, Boost can result in large amounts of assembly code (C++ in general has been criticised for this, and it is a reason to avoid certain constructs in real-time software). This is all fine for most applications, but not for kernels you want to run on a GPU, where every instruction counts.
For CUDA (or OpenCL), people typically write computation-intensive algorithms that operate on data in arrays, for example specialised image processing. You only use these techniques for the calculation-intensive tasks of your application. You then have a 'regular' program that interacts with the user/network/database, creates these CUDA tasks (i.e. selects the data and parameters) and starts them. Here are the CUDA samples.
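As an illustration of that kind of array kernel, here is a minimal CUDA sketch of SAXPY (y = a*x + y), one thread per element, compiled with nvcc; filling the buffers with real input and error checking are omitted.

```cpp
#include <cuda_runtime.h>

// Element-wise array kernel: each thread updates one element of y.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    // ... copy input data into x and y with cudaMemcpy ...
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // 256 threads per block
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```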
Boost uses the expression-templates technique to enable a simpler syntax without losing performance.
BlueBird and Newton are libraries that use metaprogramming, similarly to Boost, to enable CUDA computations.
ArrayFire is another library, using just-in-time compilation and the CUDA language underneath.
Finally, Thrust, as suggested by Njuffa, is a template library enabling CUDA calculations (but without metaprogramming; see Evaluating expressions consisting of elementwise matrix operations in Thrust).
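For a flavour of that style, here is a minimal Thrust sketch of the same element-wise SAXPY operation, written as a single thrust::transform over device vectors with no hand-written kernel (compiled with nvcc).

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Functor applied element-wise on the device: returns a*x + y.
struct saxpy_op {
    float a;
    __host__ __device__ float operator()(float x, float y) const {
        return a * x + y;
    }
};

int main() {
    thrust::device_vector<float> x(1 << 20, 1.0f);
    thrust::device_vector<float> y(1 << 20, 2.0f);
    // y = 2*x + y, computed on the GPU without an explicit kernel launch.
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_op{2.0f});
    return 0;
}
```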

Convert an MPICH2-enabled code to OpenCL code

I have a utility written in C++ that uses MPICH2 to do some heavy computation. I am not happy with its performance, and there is a lot of scope for improvement.
Firstly, MPICH2 only works with executables, so I have to write my data to a file and pass that file as an argument to the utility, which in turn reads all the data and writes the output back to a file.
If I could have this as a DLL, I could save a lot of the time spent passing the data. Also, if I could somehow run it on the GPU, that might give a boost (I am not so sure).
I am wondering how much effort it will take to convert the utility code to OpenCL, and whether there are any tools that will do 60-70% of the conversion task.
My best guess is that it will take as much work to convert the code to OpenCL as it would to transform a comparable serial code into a parallel one. I know of no tools that can automate the process of transforming an MPI code into an OpenCL code; I'd be very interested to learn from others on SO of any such tools.
There has been some research done, and results published, on running MPI on a GPU. My impression is that all of this work is still research-grade and probably neither reliable nor portable.
Finally, though it won't help you use your GPU, why not correct the faults in your MPI code? I'm a little unclear on the details, but it seems that one of the problems is that your MPI code writes and reads files as a way of passing data around. This is not a necessary feature of MPI programs and could be revised.
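For example, here is a minimal sketch of passing data between ranks in memory rather than through files: rank 0 sends a buffer straight to rank 1 with MPI_Send/MPI_Recv. The buffer contents here are placeholders.

```cpp
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> data(1000);
    if (rank == 0) {
        // ... fill data with the real input ...
        MPI_Send(data.data(), static_cast<int>(data.size()), MPI_DOUBLE,
                 1, 0, MPI_COMM_WORLD);  // straight to rank 1, no file involved
    } else if (rank == 1) {
        MPI_Recv(data.data(), static_cast<int>(data.size()), MPI_DOUBLE,
                 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %zu values\n", data.size());
    }
    MPI_Finalize();
    return 0;
}
```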

User defined CUDA code in C++

I am writing a research application that will utilise GPGPU using C++ and CUDA. I want to allow users of the application to be able to tailor the program by writing kernel code that will be executed on the GPU.
My only thought so far is outputting the user code into a .cu file, then calling the platform's compiler to create a dynamic library, which can then be loaded at runtime by the host application. Is this viable? Even if it is, I'm very concerned that doing this will make my program unstable and a nightmare to make cross-platform.
Any thoughts/alternatives or comments would be greatly appreciated.
Theoretically it is possible. I would recommend OpenCL instead of CUDA. It is not as optimized as CUDA on the NVIDIA platform, but it is designed to support run-time compilation (every OpenCL runtime driver includes a compiler that, as the first step of executing a kernel, compiles it).
Another advantage is that OpenCL is more portable than CUDA, as OpenCL also runs on ATI (GPU and CPU) and Intel hardware.
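To make the run-time compilation point concrete, here is a minimal sketch: the kernel source is a plain string (hard-coded here, but it could just as well be typed in by a user), and the driver's built-in compiler turns it into a kernel via clBuildProgram. Setup is deliberately minimal and most error handling is omitted.

```cpp
#include <CL/cl.h>
#include <cstdio>

int main() {
    // Minimal setup: first platform, first device of any type.
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr);
    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

    // Kernel source as a string; this could come from the user at run time.
    const char* src =
        "__kernel void scale(__global float* v, float f) {\n"
        "    v[get_global_id(0)] *= f;\n"
        "}\n";

    cl_program program = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    if (clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr) != CL_SUCCESS) {
        // The build log reports errors in the user-supplied code.
        char log[4096];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, nullptr);
        std::fprintf(stderr, "build failed:\n%s\n", log);
        return 1;
    }
    cl_kernel kernel = clCreateKernel(program, "scale", &err);
    std::puts("kernel compiled at run time");

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseContext(ctx);
    return 0;
}
```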
You could do it, it's viable, but IMO you would need a really good reason to allow users to edit the CUDA kernel. I'm not sure what you have in mind for a user interface, or how the code that the user runs in the CUDA kernel will interface with the outside world, but this could get tricky. It might be better if you pre-implement a set of CUDA kernels and allow users to choose from a known set of parameters for each kernel.
Have you looked at PyCUDA? It implements a similar idea, allowing Python users to write CUDA C++ kernels inside Python applications. PyCUDA provides functionality that helps users integrate their Python code with the kernels they write, so that when they run the Python script, the kernel compiles and runs as part of it.
I haven't looked at the inner workings of PyCUDA, but I assume that at its core it is doing something similar to what you are trying to achieve. Looking at PyCUDA might give you an idea of what's needed to write your own implementation.

Package for distributing calculations

Do you know of any package for distributing calculations across several computers and/or several cores on each computer? The calculation code is in C++; the package needs to be able to cope with data over 2 GB and work on a Windows x64 machine. Shareware would be nice, but isn't a requirement.
A suitable solution would depend on the type of calculation and data you wish to process, the granularity of parallelism you wish to achieve, and how much effort you are willing to invest in it.
The simplest approach would be to just use a suitable solver/library that supports parallelism (e.g. ScaLAPACK). Or, if you wish to roll your own solvers, you can squeeze some parallelisation out of your current code using OpenMP, or compilers that provide automatic parallelisation (e.g. the Intel C/C++ compiler). All of these will give you a reasonable performance boost without requiring massive restructuring of your code.
At the other end of the spectrum, you have the MPI option. It can afford you the greatest performance boost if your algorithm parallelises well. It will, however, require a fair bit of re-engineering.
Another alternative would be to go down the threading route. There are libraries and tools out there that will make this less of a nightmare. These are worth a look: the Boost C++ parallel programming library and Threading Building Blocks.
You may want to look at OpenMP.
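To show how little intrusion OpenMP requires, here is a minimal sketch: a single pragma spreads an existing loop across all cores, and the reduction clause safely combines the per-thread partial sums. Compile with -fopenmp (GCC/Clang) or /openmp (MSVC).

```cpp
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> data(10000000, 1.0);
    double sum = 0.0;

    // Each thread handles a chunk of the iterations; OpenMP merges the
    // per-thread partial sums at the end of the parallel region.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < static_cast<long>(data.size()); ++i)
        sum += data[i] * data[i];

    std::printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```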
There's an MPI library and the DVM system working on top of MPI. These are generic tools widely used for parallelizing a variety of tasks.