Convert an MPICH2-enabled code to OpenCL code - C++

I have a utility written in C++ that uses MPICH2 to do some heavy computation. I am not happy with its performance, and there is a lot of scope for improvement.
Firstly, MPICH2 only works with executables, so I have to write my data to a file and pass that file as an argument to the utility, which then reads all the data back in and writes its output to another file.
If I could have this as a DLL, I could save a lot of time in passing the data. Also, if I could somehow run this on the GPU, it might give a boost (I am not sure).
I am wondering how much effort it will take to convert the utility code to OpenCL, and whether there are any tools that will do 60-70% of the conversion task.

My best guess is that converting the code to OpenCL will require about as much work as transforming a comparable serial code into a parallel code. I know of no tools that can automate the process of transforming an MPI code into an OpenCL code. I'd be very interested to learn from others on SO of any such tools.
There has been some research done, and results published, on running MPI on a GPU. My impression is that any of this work is still research grade and probably neither reliable nor portable.
Finally, though it won't help you use your GPU, why not correct the faults in your MPI code? I'm a little unclear, but it seems that one of the problems is that your MPI code writes and reads files as a way of passing data around. This is not a necessary feature of MPI programs and could be revised: MPI can pass data between processes directly, in memory.
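As a minimal sketch of that revision, two ranks can exchange the data entirely in memory with MPI_Send/MPI_Recv (the compute() call is hypothetical; the MPI calls are standard):

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        std::vector<double> data(1024);
        int n = static_cast<int>(data.size());
        if (rank == 0) {
            // ... fill 'data' in memory; no input file needed ...
            MPI_Send(data.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(data.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            // 'data' now holds the results; no output file needed either
        } else if (rank == 1) {
            MPI_Recv(data.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            // compute(data);  // hypothetical heavy computation
            MPI_Send(data.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }

Run with e.g. mpiexec -n 2 ./utility; no intermediate files are involved.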

Related

Is compiling user-made C++ code at runtime as an extension a good idea? [duplicate]

I am writing a research application that will utilise GPGPU using C++ and CUDA. I want to allow users of the application to tailor the program by writing kernel code that will be executed on the GPU.
My only thought so far is outputting the user code into a .cu file, then calling the platform's compiler to create a dynamic library, which can then be loaded at runtime by the host application. Is this viable? Even if it is, I'm very concerned that doing this will make my program unstable and a nightmare to make cross-platform.
Any thoughts/alternatives or comments would be greatly appreciated.
Theoretically it is possible. I would recommend OpenCL instead of CUDA. It is not as optimized as CUDA on the NVIDIA platform, but it is designed to support runtime compilation (every OpenCL runtime driver includes a compiler that, as a first step of executing a kernel, compiles it).
Another advantage is that OpenCL is more portable than CUDA, as OpenCL also runs on ATI (GPU and CPU) and Intel hardware.
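A minimal sketch of that compile-at-run-time flow through the standard OpenCL host API (error checking omitted; the scale kernel stands in for user-supplied source):

    #include <CL/cl.h>
    #include <cstdio>

    // Kernel source as a plain string; in practice this would come
    // from the user at run time.
    static const char* src =
        "__kernel void scale(__global float* a, float f) {"
        "    a[get_global_id(0)] *= f;"
        "}";

    int main() {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);
        cl_context ctx = clCreateContext(nullptr, 1, &device,
                                         nullptr, nullptr, nullptr);

        // The driver's built-in compiler turns the string into device code.
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
        clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
        cl_kernel k = clCreateKernel(prog, "scale", nullptr);
        std::printf("kernel compiled at run time and ready to enqueue\n");

        clReleaseKernel(k);
        clReleaseProgram(prog);
        clReleaseContext(ctx);
        return 0;
    }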
You could do it, and it's viable, but IMO you would need a really good reason for allowing users to edit the CUDA kernel. I'm not sure what you have in mind for a user interface, or how the code that the user runs in the CUDA kernel will interface with the outside world, but this could get tricky. It might be better to pre-implement a set of CUDA kernels and allow users to choose from a known set of parameters for each kernel.
Have you looked at PyCUDA? It implements a similar idea, allowing Python users to write C++ CUDA kernels inside Python applications. PyCUDA provides functionality that helps users integrate their Python code with the kernels they write, so that when they run the Python script, the kernel compiles and runs as part of it.
I haven't looked at the inner workings of PyCUDA, but I assume that at its core it is doing something similar to what you are trying to achieve. Looking at PyCUDA might give you an idea of what's needed to write your own implementation.

Sharing multi-threaded code between standard C++ and OpenCL

I am trying to develop a C++ application that will have to run many calculations in parallel. The algorithms will be pretty large, but they will be purely mathematical and operate on floating-point numbers. They should therefore work on the GPU using OpenCL. I would like the program to work on systems that do not support OpenCL, but I would also like it to be able to use the GPU for extra speed on systems that do.
My problem is that I do not want to maintain two sets of code (standard C++ probably using std::thread, and OpenCL). I realize that I will be able to share a lot of the actual code, but the only obvious way to do so is to manually copy the shareable parts between the files, and that is really not something I want to do, for maintainability reasons and to prevent errors.
I would like to be able to use only one set of files. Is there any proper way to do this? Would OpenCL emulation be an option in a way?
PS. I know of the AMD OpenCL emulator but it seems to be only for development and also only for Windows. Or am I wrong?
OpenCL can use the CPU as a compute device, so OpenCL code can run on platforms without a GPU. However, the architectural differences between a GPU and a CPU will most likely require you to still maintain two code bases to get optimal performance in both situations.
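A small sketch of that fallback, using the standard clGetDeviceIDs call: ask for a GPU device first, and fall back to a CPU device on systems without an OpenCL-capable GPU:

    #include <CL/cl.h>

    // Prefer a GPU; fall back to the CPU so the same kernels still run.
    cl_device_id pick_device(cl_platform_id platform) {
        cl_device_id dev;
        if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr)
                == CL_SUCCESS)
            return dev;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &dev, nullptr);
        return dev;
    }

The kernels themselves stay identical; only the device (and the performance tuning) changes.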

How to use my existing .cpp code to write CUDA code [duplicate]

I have code in C++ and want to use it along with CUDA. Can anyone please help me? Should I provide my code? I tried converting it, but I need some starting code to proceed. I know how to do a simple square program (using CUDA and C++) for Windows (Visual Studio). Is that sufficient to do the same for my program?
The following are both good places to start. CUDA by Example is a good tutorial that gets you up and running pretty fast. Programming Massively Parallel Processors includes more background, e.g. chapters on the history of GPU architecture, and generally more depth.
CUDA by Example: An Introduction to General-Purpose GPU Programming
Programming Massively Parallel Processors: A Hands-on Approach
These both talk about CUDA 3.x so you'll want to look at the new features in CUDA 4.x at some point.
Thrust is definitely worth a look if your problem maps onto it well. It's an STL-like library of containers, iterators and algorithms that implements data-parallel algorithms on top of CUDA.
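For a taste of the style, here is a minimal Thrust program (a parallel sum over a device vector), to be compiled with nvcc:

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <iostream>

    int main() {
        // One million elements, initialized to 1.0f, living on the GPU.
        thrust::device_vector<float> v(1 << 20, 1.0f);
        float sum = thrust::reduce(v.begin(), v.end(), 0.0f);
        std::cout << "sum = " << sum << std::endl;  // prints 1048576
        return 0;
    }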
Here are two tutorials on getting started with CUDA and Visual C++ 2010:
http://www.ademiller.com/blogs/tech/2011/03/using-cuda-and-thrust-with-visual-studio-2010/
http://blog.cuvilib.com/2011/02/24/how-to-run-cuda-in-visual-studio-2010/
There's also a post on the NVIDIA forum:
http://forums.nvidia.com/index.php?showtopic=184539
Asking very general "how do I get started on ..." questions on Stack Overflow generally isn't the best approach. Typically the best reply you'll get is "go read a book or the manual". It's much better to ask specific questions here. Please don't create duplicate questions; it isn't helpful.
It's a non-trivial task to convert a program from straight C(++) to CUDA. As far as I know, it is possible to use C++-like features within CUDA (especially with the announced CUDA 4.0), but I think it's easier to start with C-only constructs (i.e. structs, pointers, elementary data types).
Start by reading the CUDA programming guide and by examining the examples that come with the CUDA SDK. I personally found the vector addition sample quite enlightening.
I cannot tell you how to write the __global__ and __shared__ parts for your specific program, but after reading the introductory material you will have at least a vague idea of how to do so.
The problem is that, as far as I know, there is no generic recipe for transforming pure C(++) into code suitable for CUDA. But here are some cornerstones for you:
The central idea of CUDA: loops can be transformed into many threads executed in parallel on the GPU (see the sketch after these points).
Therefore, the individual iterations should ideally be independent of the other iterations.
For optimal execution, the execution branches of the threads should be (almost) the same, i.e. the individual threads should do almost the same work.
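To make the loop-to-threads idea concrete, here is the classic transformation, sketched for a simple element-wise operation (the function names are made up for illustration):

    // Serial C++ version: n independent iterations.
    void scale_cpu(float* a, float f, int n) {
        for (int i = 0; i < n; ++i)
            a[i] *= f;
    }

    // CUDA version: the loop disappears; each thread handles one index.
    __global__ void scale_gpu(float* a, float f, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)  // guard threads that fall past the end of the array
            a[i] *= f;
    }

    // Launched with enough threads to cover all n elements, e.g.:
    //   scale_gpu<<<(n + 255) / 256, 256>>>(d_a, f, n);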
You can have multiple .cpp and .cu files in your project. Unless you want your .cu files to contain only device code, it should be fairly easy.
For each of your .cu files, write a header file containing its host functions. Then include that header file in other .cu or .cpp files. The linker will do the rest. It is no different from having multiple plain C++ .cpp files in your project (see the sketch below).
I assume you already have CUDA rule files for your Visual Studio.
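As a sketch of that layout (all names made up for illustration, using the standard CUDA runtime API):

    // kernels.h - plain C++ header, safe to include from .cpp files
    #ifndef KERNELS_H
    #define KERNELS_H
    void run_scale(float* host_data, float factor, int n);  // host wrapper
    #endif

    // kernels.cu - device code plus the host wrapper, compiled by nvcc
    #include "kernels.h"
    __global__ void scale(float* a, float f, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= f;
    }
    void run_scale(float* host_data, float factor, int n) {
        float* d = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, host_data, n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(d, factor, n);
        cudaMemcpy(host_data, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);
    }

    // main.cpp - ordinary C++, knows nothing about CUDA
    #include "kernels.h"
    int main() { float v[256] = {0}; run_scale(v, 2.0f, 256); return 0; }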

dynamic code compilation

I'm working on a program that renders iterated fractal systems. I wanted to add the functionality where someone could define their own iteration process, and compile that code so that it would run efficiently.
I currently don't know how to do this and would like tips on what to read to learn how to do this.
The main program is written in C++ and I'm familiar with C++. In fact, given most of the scenarios, I know how to convert the code to assembly that would accomplish the goal, but I don't know how to take the extra step of converting it to machine code. If possible, I'd like to dynamically compile the code the way I believe many game system emulators work.
If it is unclear what I'm asking, tell me so I can clarify.
Thanks!
Does the routine to be compiled dynamically need to be in any particular language? If the answer to that question is "Yes, it must be C++", you're probably out of luck. C++ is about the worst possible choice for online recompilation.
Is the dynamic portion of your application (the fractal iterator routine) a major CPU bottleneck? If you can afford to use a language that isn't compiled, you can probably save yourself an awful lot of trouble. Lua and JavaScript are both heavily optimized interpreted languages that run only a few times slower than native, compiled code.
If you really need the dynamic functionality to be compiled to machine code, your best bet is probably going to be using clang/llvm. clang is the C/Objective-C front end being developed by Apple (and a few others) to make online, dynamic recompilation perform well. llvm is the backend clang uses to translate from a portable bytecode to native machine code. Be advised that clang does not currently support much of C++, since that's such a difficult language to get right.
Some CPU emulators treat the machine code as if it was byte code and they do a JIT compile, almost as if it was Java. This is very efficient, but it means that the developers need to write a version of the compiler for each CPU their emulator runs on and for each CPU emulated.
That usually means it only works on x86 and is annoying to anyone who would like to use something different.
They could also translate it to LLVM or Java byte code or .Net CIL and then compile it, which would also work.
In your case I am not sure that sort of thing is the best way to go. I think that I would do this by using dynamic libraries. Make a directory that is supposed to contain "plugins" and let the user compile their own. Make your program scan the directory and load each DLL or .so it finds.
Doing it this way means you spend less time writing code compilers and more time actually getting stuff done.
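A minimal POSIX sketch of that scan-and-load idea (use LoadLibrary/GetProcAddress on Windows; the iterate entry point and the plugin path are made up for illustration):

    #include <dlfcn.h>
    #include <cstdio>

    // Signature every plugin is expected to export (one iteration step).
    typedef void (*iterate_fn)(double* x, double* y);

    int main() {
        void* handle = dlopen("./plugins/user_fractal.so", RTLD_NOW);
        if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

        iterate_fn iterate = (iterate_fn)dlsym(handle, "iterate");
        if (iterate) {
            double x = 0.5, y = 0.5;
            iterate(&x, &y);  // run the user-compiled iteration step
            std::printf("(%f, %f)\n", x, y);
        }
        dlclose(handle);
        return 0;
    }

On Linux, link with -ldl; a real program would loop over the plugin directory (readdir or std::filesystem) instead of hard-coding one path.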
If you can write your dynamic extensions in C (not C++), you might find the Tiny C Compiler (TCC) to be of use. It's available under the LGPL, it runs on Windows and Linux, and it's a small executable (or library) at ~100 KB for the preprocessor, compiler, linker and assembler, all of which it does very fast. The downside, of course, is that it can't compare to the optimizations you get with GCC. Another potential downside is that it's x86-only, AFAIK.
If you did decide to write assembly, TCC can handle that -- the documentation says it supports a gas-like syntax, and it does support X86 opcodes.
TCC also fully supports ANSI C, and it's nearly fully compliant with C99.
That being said, you could either include TCC as an executable with your application or use libtcc (there's not too much documentation of libtcc online, but it's available in the source package). Either way, you can use tcc to generate dynamic or shared libraries, or executables. If you went the dynamic library route, you would just put in a Render (or whatever) function in it, and dlopen or LoadLibrary on it, and call Render to finally run the user-designed rendering. Alternatively, you could make a standalone executable and popen it, and do all your communication through the standalone's stdin and stdout.
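For reference, a minimal in-memory compile-and-call sketch with libtcc, modeled on the libtcc_test example that ships with TCC (check your version's libtcc.h, since the API has changed over time):

    #include <libtcc.h>
    #include <cstdio>

    // User-supplied iteration routine, arriving as a plain string.
    static const char* user_code =
        "double iterate(double x) { return x * x - 0.75; }";

    int main() {
        TCCState* s = tcc_new();
        tcc_set_output_type(s, TCC_OUTPUT_MEMORY);      // compile to memory
        if (tcc_compile_string(s, user_code) == -1) return 1;
        if (tcc_relocate(s, TCC_RELOCATE_AUTO) < 0) return 1;

        typedef double (*fn)(double);
        fn iterate = (fn)tcc_get_symbol(s, "iterate");
        if (iterate)
            std::printf("%f\n", iterate(0.5));          // calls the JIT-ed code
        tcc_delete(s);
        return 0;
    }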
Since you're generating pixels to be displayed on a screen, have you considered using HLSL with dynamic shader compile? That will give you access to SIMD hardware designed for exactly this sort of thing, as well as the dynamic compiler built right into DirectX.
LLVM should be able to do what you want to do. It allows you to form a description of the program you'd like to compile in an object-oriented manner, and then it can compile that program description into native machine code at runtime.
Nanojit is a pretty good example of what you want. It generates machine code from an intermediate language. It's C++, and it's small and cross-platform. I haven't used it very extensively, but I enjoyed toying around with it just for demos.
Spit the code to a file and compile it as a dynamically loaded library, then load it and call it.
Is there a reason why you can't use a GPU-based solution? This seems to be screaming for one.