OpenCL or CUDA: Which way to go? - c++

I'm investigating ways of using the GPU to process streaming data. I have two choices but can't decide which way to go.
My criteria are as follows:
Ease of use (good API)
Community and Documentation
Performance
Future
I'll code in C and C++ under Linux.

OpenCL
interfaced from your production code
portable between different graphics hardware
limited operations, but pre-prepared shortcuts
CUDA
separate language (CUDA C)
nVidia hardware only
almost full control over the code (coding in a C-like language)
lot of profiling and debugging tools
Bottom line -- OpenCL is portable, CUDA is nVidia only. However, being an independent language, CUDA is much more powerful and has a bunch of really good tools.
Ease of use -- OpenCL is easier to use out of the box, but once you set up the CUDA coding environment it's almost like coding in C.
Community and Documentation -- both have extensive documentation and examples, however I think CUDA's are better.
Performance -- CUDA allows for greater control, hence can be better fine-tuned for higher performance.
Future -- hard to say really.

My personal experiences were:
API: OpenCL has a slightly more complex API. However, most of your time will be spent writing kernel code, and there the two are almost identical (see the sketch at the end of this answer).
Community: CUDA has had a much bigger community than OpenCL up to now, but this will probably even out.
Documentation: Both are very well documented.
Performance: Our experience was that OpenCL drivers are not yet fully optimized.
Future: The future lies with OpenCL as it is an open standard, not restricted to a vendor or specific hardware!
This assessment is from 2010, so probably out-dated.
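To make the "almost identical" point concrete, here is a sketch of the same vector-add kernel in both dialects (illustrative names, not from either SDK's samples); only the qualifiers and the thread-index lookup differ:

    // CUDA: compiled by nvcc together with the host code
    __global__ void vec_add(const float* a, const float* b, float* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    // OpenCL: handed to the driver as a string and compiled at runtime
    __kernel void vec_add(__global const float* a, __global const float* b,
                          __global float* c, int n)
    {
        int i = get_global_id(0);
        if (i < n) c[i] = a[i] + b[i];
    }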

OpenCL all the way unless you have a specific reason to use CUDA. OpenCL runs well on multicore CPUs like the Intel i7 in addition to running on GPUs. By using OpenCL you can run it on a much wider range of hardware, from Droid cell phones to the IBM Power7 compute nodes of the world's largest supercomputer, Blue Waters, which is supposed to come online next year.
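As a minimal sketch of that portability: the very same enumeration code lists CPU and GPU devices alike on any machine with an OpenCL runtime installed (link against the vendor's OpenCL library):

    #include <CL/cl.h>
    #include <stdio.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);

        for (cl_uint p = 0; p < num_platforms; ++p) {
            cl_device_id devices[8];
            cl_uint num_devices = 0;
            /* CL_DEVICE_TYPE_ALL picks up CPUs, GPUs and accelerators alike */
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

            for (cl_uint d = 0; d < num_devices; ++d) {
                char name[256];
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
                printf("platform %u, device %u: %s\n", p, d, name);
            }
        }
        return 0;
    }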

Related

Best way to leverage the GPU

I have a relatively small section of code that deals with huge datasets which I've already parallelized using OpenMP, and am keen to increase performance further using the GPU. The program is C++, developed under VS2015, runs exclusively on Windows and will need to support 64-bit versions from 7 upwards on as wide a variety of GPUs as is feasible. Technologies I've been looking at so far include AMP, OpenCL, HLSL, and CUDA. Questions already asked, such as this one with an informative answer by Ade Miller, make me question whether AMP is the way to go, although it looks like the easiest option. I'm dismissing CUDA as it limits me in terms of hardware supported, and am tending towards OpenCL while currently working my way through the following book. As such, I have the following questions:
Is OpenCL a good approach here, as other posts suggest it may also be on the way out?
If I go for OpenCL while wanting to support the widest range of GPUs, am I better off with a 1.x version of OpenCL? The reason I ask is that the OpenCL.DLL downloaded with the latest version of the CUDA SDK is 1.9. I had to download the Intel SDK for OpenCL to get a 2.x version.
If I go with OpenCL, what do I have to distribute with my application (assuming OpenCL.DLL as a minimum) and are there any licensing issues? Are default drivers for most cards going to support OpenCL and if so which versions?
With respect to the above, am I actually better off with AMP, as it works with anything that has DirectX 11 or better?
(Apologies if the above is slightly off topic, if anyone believes that it is perhaps they could point me to a better forum to ask these questions)
Is OpenCL a good approach here, as other posts suggest it may also be on the way out?
OpenCL seems to be the most widely supported GPU computing platform, supported by nVidia, AMD and Intel, and it works on most mobile platforms as well. There is also a large set of libraries available: ViennaCL, clBLAS, CLBlast, Boost.Compute and so on.
If I go for OpenCL while wanting to support the widest range of GPUs, am I better off with a 1.x version of OpenCL?
Yes, currently the safest is to stick with 1.2 - and actually it is more than enough.
All major desktop GPU vendors (Intel, AMD, nVidia) support at least OpenCL 1.2.
Actually, only nVidia hasn't released official 2.0 support - it is still in beta stage.
Also note that some older GPUs will support only OpenCL 1.2 as well.
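If you want to confirm at runtime what a given device actually implements, the standard device-info query returns a version string you can parse; a minimal sketch:

    #include <CL/cl.h>
    #include <stdio.h>

    /* Print the OpenCL version a device reports at runtime. The SDK you
       compile against and the driver the user has installed can support
       different versions, so always ask the device rather than assuming. */
    void print_device_version(cl_device_id device)
    {
        char version[128];
        clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(version), version, NULL);
        printf("%s\n", version);  /* e.g. "OpenCL 1.2 CUDA" */
    }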

Is it possible to emulate a GPU for CUDA/OpenCL unit testing purposes?

I would like to develop a library with an algorithm that can run on the CPU or the GPU. The GPU can be Nvidia (then the algorithm will use CUDA) or not (then the algorithm will use OpenCL).
I would like to be able to emulate a GPU in this project because:
I will use different computers to develop the software and some of them don't have a GPU.
The software will eventually be executed on servers that may or may not have a GPU, and the unit tests must be executed and passed either way.
Is there a way to emulate a GPU for unit testing purposes?
In the following link:
GPU Emulator for CUDA programming without the hardware
They show a solution, but only for CUDA, not for OpenCL, and the software they propose, "GPUOcelot", is no longer actively maintained.
It depends on what you mean by emulation. You cannot emulate the speed of GPUs.
The GPU is architecturally very different from the CPU, with a lot of working threads (1000s, 10000s, ...), that's why we use it. The CPU can have only a few threads, even when you parallelize the code. They also have different instruction sets.
You can, however, emulate the execution using special software, like NVEmulate for NVIDIA GPUs and OpenCL Emulator-Debugger for AMD.
A related question: GPU Emulator for CUDA programming without the hardware, where the accepted answer recommends gpuocelot for CUDA emulation.
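One further option on the OpenCL side: implementations can expose the CPU itself as a device, so unit tests can fall back to it when no GPU is present. A minimal sketch (assuming a CPU runtime such as Intel's OpenCL SDK or PoCL is installed):

    #include <CL/cl.h>

    /* Prefer a GPU, but fall back to a CPU device so the same kernels
       can run in unit tests on machines without any GPU at all. */
    cl_device_id pick_device(cl_platform_id platform)
    {
        cl_device_id device = NULL;
        if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS)
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
        return device;  /* NULL if the platform offers neither */
    }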
I don't know the full state of the art but I can provide a very limited set of things to look at which may be useful.
The accepted answer for this question is now out of date.
The question of compiling and running GPU code for CUDA or OpenCL on a machine that does not natively support it has come up here several times (but sadly it's often taken as off-topic). This answer is for those questions too.
Many of the answers refer to software solutions that have not been maintained. There seem to be only two answers which stand the test of time, and they treat this as a mu question.
Use a real GPU - i.e. buy a cheap CUDA card if you don't already have one.
Rent someone else's GPU in the cloud
However, emulators do exist.
Also, GPU virtualization is well covered by the Wikipedia page. There is strong support for getting virtual machines to use the host's hardware.
Docker and VirtualBox, for example, both support GPU passthrough.
Reasons to emulate
To learn and keep up to date with changes to CUDA and OpenCL
To estimate the effect of the various APIs on performance.
To test that your code works on a variety of different platforms.
As a proxy for hardware you don't have access to (as per this question)
Kinds of emulation
For testing you might accept a slow implementation as long as it is compliant and reliable.
For production running on different hardware, you would more likely accept similar, but not 100% equivalent, constructs (e.g. different warp size, different high-level libraries for FFT, ...) and much more complicated performance-optimized implementations of primitives. You would probably demand at least 80% of the CUDA speed for comparable hardware.
(Thanks to https://stackoverflow.com/users/13130048/sebastian for those two points)
For the second case you would likely need not just GPU virtualisation but additional optimisation passes.
Why are there fewer emulators and why don't they survive the test of time?
GPUs are affordable. It is only high performance that costs.
GPUs (not to mention TPUs and FPGAs) are developing rapidly.
Some hardware tricks are kept secret from competitors so emulating actual hardware is difficult.
The CUDA and OpenCL standards are changing too, but less quickly.
There is arguably a need for more programmers who understand them. Compiling your code without running and testing it would simply be unprofessional. There would seem to be an obvious need for emulation where you don't have all the possible or interesting hardware combinations physically available.
That being the case, it's surprising that so many of these emulation projects have not stood the test of time or been endorsed/provided by GPU manufacturers.
There are some active emulation projects however.
Active GPU Emulation Projects
There are at least two active emulation projects maintained as of October 2022:
gpgpu-sim
oclgrind - an OpenCL device simulator
I cannot speak to how good these are and how commonly they are used compared to using real GPUs (either your own or rented).
Honorable mentions
CUDA-to-OpenCL source-to-source transpilers.
These appear to be maintained but are not themselves emulators.
CU2CL
coriander
Why is this not a solved problem?
There are a number of challenges to overcome. My take on these would be something like:
provide a runtime emulating a particular version of the CUDA or OpenCL standard
provide a compiler targeting this runtime (ideally gcc or clang)
get the backing of a vendor (e.g. Nvidia or the Khronos Group)
get the backing of a community (i.e. a decent userbase and set of contributors)
build support into a popular emulation environment (e.g. virtualbox)
You could also argue the case that almost all people working in this area have access to real GPUs so this is not necessary at all.
The vendors of point 3 are doing well with points 1, 2 and 4.
An emulator has to both build on that and take some mindshare of its own.
This is an uphill struggle. I hope and believe there will be success in the future.
Looking at VirtualBox, the last discussion I can find is from 2011.
https://forums.virtualbox.org/viewtopic.php?f=9&t=41155
Seemingly retired projects
These have been mentioned in answers to previous attempts to ask and answer this kind of question.
gpuocelot - no longer maintained
mcuda - looks unmaintained
cuda-waste - on google code which was frozen long ago
NVEmulate - CUDA emulator from Nvidia - retired a while back
Other seemingly retired projects of interest:
openTPU - a Tensor PU emulator from 2017
gdev - 2010
Implementing Open-Source CUDA Runtime - paper from 2013
Earlier (out of date) questions:
GPU Emulator for CUDA programming without the hardware
Asked 2010 - most recent answer 2016
CUDA without CUDA enabled gpu
Asked 2010
How can I emulate a GPU for testing code written in Pytorch?
Asked 2021 - pytorch specific
CUDA code without a GPU
Asked 2014
CUDA on a system that has no GPU
Asked 2013
Using the built-in graphics cards without a NVIDIA graphics card, Can I use the CUDA and Caffe library?
Asked 2016

Confusion on CUDA/openCL and C++ AMP

I read that Microsoft is working closely with Nvidia to improve AMP performance.
But my question is: is AMP a CUDA replacement by Microsoft? Or does AMP use CUDA drivers when an NVIDIA CUDA video card is available? Is AMP an OpenCL substitute?
I'm still pretty confused...
C++ AMP is a library (and as part of it a key language extension was also introduced). Since C++ AMP is an open specification, it can be implemented on top of other low-level technologies. Microsoft's implementation builds on DirectCompute (and hence on HLSL), but that is completely hidden from you when you are using C++ AMP (which is why C++ AMP can be an open specification; it does not expose DirectX in the API surface). For more on C++ AMP, please follow the resources on the right of our blog (we'll keep adding to that):
http://blogs.msdn.com/b/nativeconcurrency/
You made a statement about Microsoft working with NVIDIA to improve C++ AMP performance - that is not true. Microsoft has worked with NVIDIA, AMD and other partners to create the C++ AMP open specification. Microsoft also works with hardware vendors to make sure that they have stable video card drivers, which are required for any GPU compute technology to work correctly.
You also expressed confusion and threw some terms out. OpenCL is an approach to GPU computing (by Khronos), as is DirectCompute (by Microsoft), as is CUDA (by NVIDIA). These are all separate technologies, each with its own path to the GPU (always via a driver of some sort), each with its own merits, strengths, and disadvantages. One does not replace the other, and one is not universally better than the other. You now also have C++ AMP in that mix, as one more choice, and the same statements apply to that. The choice is yours as to which you decide to use.
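To give a feel for what C++ AMP code actually looks like, here is a minimal vector-add sketch (Visual C++ with <amp.h>; the function name is illustrative):

    #include <amp.h>
    #include <vector>

    void vec_add(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& c)
    {
        using namespace concurrency;
        const int n = static_cast<int>(a.size());
        array_view<const float, 1> av(n, a), bv(n, b);
        array_view<float, 1> cv(n, c);
        cv.discard_data();  // no need to copy c's old contents to the accelerator

        // Runs on whatever DirectX 11 accelerator is available.
        parallel_for_each(cv.extent, [=](index<1> i) restrict(amp) {
            cv[i] = av[i] + bv[i];
        });
        cv.synchronize();   // copy the result back into c
    }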
C++ AMP is a set of language extensions and APIs to support parallel programming technology including CUDA.
Since Microsoft also has a direct competitor to CUDA (DirectCompute) and has generally preferred its own proprietary graphics standards, we will have to see what actually happens with it.
For Microsoft's view on it, see these lectures.

C/C++ Cross platform Library allowing the utilisation of GPU for floating point calculations

Does anyone know of any cross-platform C/C++ libraries which will utilise the GPU for floating-point calculations, not specifically graphics-oriented calculations? Which ones are in common use, which ones are recommended, and which ones have you had experience of? Specifically, it should be open source with a GPL license.
addendum:- Any libraries you know of that are not GPU-manufacturer specific.
addendum:- OpenCL has been brought up in a few answers as having cross-GPU compatibility. Does anyone have experience using it and can vouch for its maturity? I'm guessing that if it's Khronos it'll be pretty good.
I would very much doubt that you have a reasonable chance of finding something like this as open source, as "utilise GPU" usually implies "heftily hardware specific, top secret NDA driver stuff".
However, OpenCL is as cross platform as you can get (works with every major vendor and even has at least one software fallback implementation) and it is reasonably free insofar as there are no fees and no restrictions on how you may use it. The only non-free thing is that it's not open source and you can't modify it.
nVidia and ATI/AMD have been offering OpenCL on G80 and Radeon HD hardware, respectively, for some time, and ATI/AMD has also been offering a software implementation for a good while. As for Intel, I remember reading that they were working on OpenCL support for the Sandy Bridge generation about a year or so ago, so it should probably be finished by now as well.
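To give a feel for what using it looks like in practice, here is a minimal host-side sketch in the plain C API (OpenCL 1.x style; error checking omitted for brevity):

    #include <CL/cl.h>
    #include <stdio.h>

    static const char* src =
        "__kernel void saxpy(float a, __global const float* x, __global float* y) {"
        "    int i = get_global_id(0);"
        "    y[i] = a * x[i] + y[i];"
        "}";

    int main(void)
    {
        float x[1024], y[1024];
        for (int i = 0; i < 1024; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        cl_platform_id platform; clGetPlatformIDs(1, &platform, NULL);
        cl_device_id device; clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(prog, "saxpy", NULL);

        cl_mem bx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(x), x, NULL);
        cl_mem by = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(y), y, NULL);

        float a = 3.0f;
        clSetKernelArg(kernel, 0, sizeof(a), &a);
        clSetKernelArg(kernel, 1, sizeof(bx), &bx);
        clSetKernelArg(kernel, 2, sizeof(by), &by);

        size_t global = 1024;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, by, CL_TRUE, 0, sizeof(y), y, 0, NULL, NULL);

        printf("y[0] = %f\n", y[0]);  /* expect 5.0 */
        return 0;
    }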
How about OpenCL?
Here is the project page at the Khronos Group.
It all depends on the chip you are targeting, but NVIDIA offers an SDK in the form of CUDA for Windows, Mac, and Linux. The license is not open source, but depending on what you need that might not actually be a big hurdle.

Is it worth learning AMD-specific APIs?

I'm currently learning the APIs related to Intel's parallelization libraries such as TBB, MKL and IPP. I was wondering, though, whether it's also worth looking at AMD's part of the puzzle. Or would that just be a waste of time? (I must confess, I have no clue about AMD's library support - at all - so would appreciate any advice you might have.)
Just to clarify, the reason I'm going the Intel way is because 1) the APIs are very nice; and 2) Intel seems to be taking tool support as seriously as API support. (Once again, I have no clue how AMD is doing in this department.)
The MKL and IPP libraries will perform (nearly) as well on AMD machines. My guess is that TBB will also run just fine on AMD boxes. If I had to suggest a technology that would be beneficial and useful to both, it would be to master the OpenMP libraries. The Intel compiler with the OpenMP extensions is stunningly fast and works with AMD chips also.
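To show why OpenMP travels well across vendors, a minimal sketch: one pragma, compiled with -fopenmp (gcc) or the Intel compiler's equivalent switch, and no vendor-specific API anywhere:

    #include <vector>

    // Scales a vector across all available cores - Intel or AMD alike.
    void scale(std::vector<double>& v, double factor)
    {
        #pragma omp parallel for
        for (int i = 0; i < static_cast<int>(v.size()); ++i)
            v[i] *= factor;
    }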
It is only worth it if you are specifically interested in building something like video games, operating systems, database servers, or virtualization software. In other words: if you have a segment where you care enough about performance to take the time to do it (and do it right) in assembler. The same is true for Intel.
If your company sells packages of just Intel servers with your software, then you shouldn't bother learning the AMD approach. But if you're going to have to offer software for both (or many) different platforms, then it might be worth looking into the different technologies. It will be very difficult to create the wrappers for the hardware-specific libraries. (Especially since threading is involved.)
And you definitely don't want to write a completely separate implementation for each hardware configuration. In fact, if your software is to be consumed by a generic user, then you may want to abandon the Intel technology and use standard threading techniques. I don't mean to be discouraging, but I believe that the Intel threading libraries are a bit ahead of their time for all intents and purposes.