Is it worth learning AMD-specific APIs? - c++

I'm currently learning the APIs related to Intel's parallelization libraries such as TBB, MKL and IPP. I was wondering, though, whether it's also worth looking at AMD's part of the puzzle. Or would that just be a waste of time? (I must confess, I have no clue about AMD's library support - at all - so would appreciate any advice you might have.)
Just to clarify, the reason I'm going the Intel way is because 1) the APIs are very nice; and 2) Intel seems to be taking tool support as seriously as API support. (Once again, I have no clue how AMD is doing in this department.)

The MKL and IPP libraries will perform (nearly) as well on AMD machines. My guess is that TBB will also run just fine on AMD boxes. If I had to suggest a technology that would be beneficial and useful to both, it would be to master the OpenMP libraries. The Intel compiler with the OpenMP extensions is stunningly fast and works with AMD chips also.
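To give a feel for the OpenMP approach suggested above, here is a minimal sketch of a dot product parallelized with a single pragma. The function name `dot` is made up for illustration. The same source builds with any OpenMP-capable compiler (for example with `-fopenmp` on GCC or Clang) and is not tied to Intel or AMD hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product with an OpenMP reduction. If the compiler is run without
// OpenMP support, the pragma is ignored and the loop simply runs serially.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    // A signed loop index keeps the loop valid for older OpenMP versions.
    const long long n = static_cast<long long>(a.size());
    #pragma omp parallel for reduction(+ : sum)
    for (long long i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        sum += a[k] * b[k];
    }
    return sum;
}
```

Because the pragma degrades to a plain loop when OpenMP is switched off, the serial and parallel builds share one code path, which is a large part of the appeal.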

Only worth it if you are specifically interested in building something like video games, operating systems, database servers, or virtualization software. In other words: if you have a segment where you care enough about performance to take the time to do it (and do it right) in assembler. The same is true for Intel.

If your company sells packages of just Intel Servers with your software, then you shouldn't bother learning the AMD approach. But if you're going to have to offer software for both (or many) different platforms, then it might be worth looking into the different technologies. It will be very difficult to create the wrappers for the hardware-specific libraries. (Especially since threading is involved.)
And you definitely don't want to write a completely separate implementation for each hardware configuration. In fact, if your software is to be consumed by a generic user, then you may want to abandon the Intel technology and use standard threading techniques. I don't mean to be discouraging, but I believe that the Intel threading libraries are a bit ahead of their time.
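As a rough illustration of what "standard threading techniques" could look like in C++11 and later, the sketch below sums a vector using plain `std::thread`, with no vendor library involved. The name `parallel_sum` and its `num_threads` parameter are hypothetical and chosen only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` by splitting it into contiguous chunks, one chunk per thread.
// Each thread writes only to its own slot in `partial`, so no locking is needed.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    // Round up so that num_threads chunks always cover the whole vector.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[t] = std::accumulate(first, last, 0LL);
        });
    }
    for (auto& w : workers) {
        w.join();  // wait for every chunk before combining the results
    }
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

A caller might pass `std::thread::hardware_concurrency()` as `num_threads`, falling back to 1 when it reports 0. On some Linux toolchains the program must be linked with `-pthread`.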

Related

Is it possible to emulate a GPU for CUDA/OpenCL unit testing purposes?

I would like to develop a library with an algorithm that can run on the CPU or the GPU. The GPU can be Nvidia (then the algorithm will use CUDA) or not (then the algorithm will use OpenCL).
I would like to emulate a GPU in this project because maybe:
I will use different computers to develop the software and some of them don't have a GPU.
The software will finally be executed on servers that may or may not have a GPU, and the unit tests must run and pass either way.
Is there a way to emulate a GPU for unit testing purposes?
In the following link:
GPU Emulator for CUDA programming without the hardware
They show a solution but only for CUDA, not for OpenCL and the software they propose "GPUOcelot" is no longer actively maintained.
It depends on what you mean by emulation. You cannot emulate the speed of GPUs.
The GPU is architecturally very different from the CPU, with a lot of working threads (1000s, 10000s, ...), that's why we use it. The CPU can have only a few threads, even when you parallelize the code. They also have different instruction sets.
You can, however, emulate the execution using special software, like NVEmulate for NVIDIA GPUs and OpenCL Emulator-Debugger for AMD.
A related question: GPU Emulator for CUDA programming without the hardware, where the accepted answer recommends gpuocelot for CUDA emulation.
I don't know the full state of the art but I can provide a very limited set of things to look at which may be useful.
The accepted answer for this question is now out of date.
The question of compiling and running GPU code for CUDA or OpenCL on a machine that does not natively support it has come up on here several times (but sadly it's often taken as off-topic). This answer is for those questions too.
Many of the answers refer to software solutions that are no longer maintained. Only two answers seem to stand the test of time, and both treat this as a mu question:
Use a real GPU - i.e. buy a cheap CUDA card if you don't already have one.
Rent someone else's GPU in the cloud.
However emulators do exist.
Also, GPU virtualization is well covered by the Wikipedia page. There is strong support for getting virtual machines to use the host's hardware.
Docker and VirtualBox, for example, both support GPU passthrough.
Reasons to emulate
To learn and keep up to date with changes to CUDA and OpenCL
To estimate the effect of the various APIs on performance.
To test that your code works on a variety of different platforms.
As a proxy for hardware you don't have access to (as per this question)
Kind of emulation
For testing you might accept a slow implementation as long as it is compliant and reliable.
For production running on different hardware you would more likely accept similar, but not 100% equivalent, constructs (e.g. different warp size, different high-level libraries for FFT, ...) and much more complicated performance-optimized implementations of primitives. You would probably demand at least 80% of the CUDA speed on comparable hardware.
(Thanks to https://stackoverflow.com/users/13130048/sebastian for those two points)
For the second case you would likely need not just GPU virtualisation but additional optimisation passes.
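For the first case (a slow but reliable implementation for testing), one option is a plain CPU reference path that walks the same block/thread index space as the GPU kernel. The following is only a sketch under that assumption: `ThreadContext`, `launch_on_cpu` and `saxpy_cpu` are hypothetical names, not part of CUDA or OpenCL, and the loop does not model shared memory, barriers or warps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's blockIdx.x / blockDim.x / threadIdx.x.
struct ThreadContext {
    std::size_t block_idx;
    std::size_t block_dim;
    std::size_t thread_idx;
    std::size_t global_id() const { return block_idx * block_dim + thread_idx; }
};

// Runs `kernel` once per logical thread, serially, on the CPU.
template <typename Kernel>
void launch_on_cpu(std::size_t grid_dim, std::size_t block_dim, Kernel kernel) {
    for (std::size_t b = 0; b < grid_dim; ++b) {
        for (std::size_t t = 0; t < block_dim; ++t) {
            kernel(ThreadContext{b, block_dim, t});
        }
    }
}

// Example kernel body: y[i] = a * x[i] + y[i] (SAXPY).
std::vector<float> saxpy_cpu(float a, const std::vector<float>& x,
                             std::vector<float> y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const std::size_t block_dim = 4;  // arbitrary small block size for the sketch
    const std::size_t grid_dim = (n + block_dim - 1) / block_dim;
    launch_on_cpu(grid_dim, block_dim, [&](const ThreadContext& ctx) {
        const std::size_t i = ctx.global_id();
        if (i < n) {  // same bounds guard a GPU kernel would need
            y[i] = a * x[i] + y[i];
        }
    });
    return y;
}
```

Unit tests can then compare the GPU result against this reference on machines that have a GPU, and run the reference alone on machines that do not.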
Why are there fewer emulators and why don't they survive the test of time?
GPUs are affordable. It is only high performance that costs.
GPUs (not to mention TPUs and FPGAs) are developing rapidly.
Some hardware tricks are kept secret from competitors so emulating actual hardware is difficult.
The CUDA and OpenCL standards are changing too, but less quickly.
There is arguably a need for more programmers who understand them. Compiling your code without running and testing it would simply be unprofessional. There would seem to be an obvious need for emulation where you don't have all the possible or interesting hardware combinations physically available.
That being the case, it's surprising that so many of these emulation projects have not stood the test of time or been endorsed/provided by GPU manufacturers.
There are some active emulation projects however.
Active GPU Emulation Projects
There are at least two active emulation projects maintained as of October 2022:
gpgpu-sim
oclgrind - OpenCL device simulator
I cannot speak to how good these are and how commonly they are used compared to using real GPUs (either your own or rented).
Honorable mentions
CUDA-to-OpenCL source-to-source transpilers.
These appear to be maintained but are not themselves emulators.
CU2CL
coriander
Why is this not a solved problem?
There are a number of challenges to overcome. My take on these would be something like:
provide a runtime emulating a particular version of the CUDA or OpenCL standard
provide a compiler targeting this runtime (ideally gcc or clang)
get the backing of a vendor (e.g. Nvidia or the Khronos Group)
get the backing of a community (i.e. a decent userbase and set of contributors)
build support into a popular emulation environment (e.g. VirtualBox)
You could also argue the case that almost all people working in this area have access to real GPUs so this is not necessary at all.
The vendors of point 3 are doing well with points 1, 2 and 4.
An emulator has to both build on that and take some mindshare of its own.
This is an uphill struggle. I hope and believe there will be success in the future.
Looking at VirtualBox, the last discussion I can find is from 2011.
https://forums.virtualbox.org/viewtopic.php?f=9&t=41155
Seemingly retired projects
These have been mentioned in answers to earlier attempts to ask and answer this kind of question.
gpuocelot - no longer maintained
mcuda - looks unmaintained
cuda-waste - on Google Code, which was frozen long ago
nvemulate - CUDA emulator from Nvidia - retired a while back
Other seemingly retired projects of interest:
openTPU - a Tensor PU emulator from 2017
gdev - 2010
Implementing Open-Source CUDA Runtime - paper from 2013
Earlier (out of date) questions:
GPU Emulator for CUDA programming without the hardware
Asked 2010 - most recent answer 2016
CUDA without CUDA enabled gpu
Asked 2010
How can I emulate a GPU for testing code written in Pytorch?
Asked 2021 - pytorch specific
CUDA code without a GPU
Asked 2014
CUDA on a system that has no GPU
Asked 2013
Using the built-in graphics cards without a NVIDIA graphics card, Can I use the CUDA and Caffe library?
Asked 2016

OpenHMPP in GCC

The gist of the question is:
Do you know any projects that aim to bring OpenHMPP support to GCC? I could also possibly live with affordable commercial compilers, but it's very unlikely, because I prefer Linux, and I would like the compiler to support non-x86 architectures as well.
And the background story:
I know OpenCL and CUDA people will bash me, but here goes my experience/opinion: I've been pursuing some toy projects to get into many core processing using CUDA and OpenCL. I feel that it's such a mess to set up those development environments (especially under linux and especially if you've the slightest bit of irregularity in your system). Even when you set them up, it's still a mess to run them anywhere other than your development environment. Finally (and probably the most importantly) these languages are very verbose and tiresome. I feel like they're the assembler of many-core processing. Compare them to OpenMP, and you see how they could actually be.
At this point, OpenHMPP comes onto the scene. It uses #pragma statements like OpenMP and it seems to be a very good step in the right direction. However, it's very hard to find compilers for it. CAPS entreprise and PathScale do have OpenHMPP support, but they're very expensive (€4000 for CAPS, I couldn't find the price for PathScale). And correct me if I'm wrong, but CAPS seems to support C, not C++.
So, we return to the gist. It would be like a dream, to have OpenHMPP support in GCC. Do you know of any open-source projects or any affordable alternatives? Maybe even, do you know of alternatives to OpenHMPP that are easier to find support for.
If I understand you correctly, you are looking for ways to simplify access to accelerator devices, which may be GPUs as well as multi-core CPUs.
This is a field with a lot of academic work happening right now, resulting in many publications describing such frameworks; however, only a few are actually available.
In fact, the reasons you are stating are the basis of my research, which is also far from complete or in a state usable by anyone else...
The only thing I know, that comes close to what you seek (using #pragmas to access accelerators), would be MGP from the Virtual OpenCL package.
All other solutions are more intrusive by requiring the use of their API.
I have not yet had a closer look at AMP for C++, but it might be interesting if it picks up some pace.

C/C++ Cross platform Library allowing the utilisation of GPU for floating point calculations

Does anyone know of any cross-platform C/C++ libraries which will utilise the GPU for floating point calculations, not specifically graphics-oriented calculations? Which ones are in common use, which ones are recommended, and which ones have you had experience with? Specifically, it should be open source with a GPL license.
addendum:- Any libraries you know of that are not GPU manufacturer specific.
addendum:- OpenCL has been brought up in a few answers as having cross-GPU compatibility. Does anyone have experience using it and can vouch for its maturity? I'm guessing that if it's Khronos it'll be pretty good.
I would very much doubt that you have a reasonable chance of finding something like this as open source, as "utilise GPU" usually implies "heftily hardware specific, top secret NDA driver stuff".
However, OpenCL is as cross platform as you can get (works with every major vendor and even has at least one software fallback implementation) and it is reasonably free insofar as there are no fees and no restrictions on how you may use it. The only non-free thing is that it's not open source and you can't modify it.
nVidia and ATI/AMD have been offering OpenCL working on G80 and RHD, respectively, for some time; ATI/AMD has also been offering a software implementation for a good while. As for Intel, I remember reading about a year ago that they were working on OpenCL for the Sandy Bridge generation, so it should probably be finished by now as well.
How about OpenCL?
Here is the project page at the Khronos Group.
It all depends on the chip you are targeting but NVIDIA offers an SDK in the form of CUDA for Windows, Mac, and Linux. The license is not opensource but depending on what you need that might not actually be a big hurdle.

OpenCL or CUDA Which way to go?

I'm investigating ways of using the GPU to process streaming data. I have two choices but can't decide which way to go.
My criteria are as follows:
Ease of use (good API)
Community and Documentation
Performance
Future
I'll code in C and C++ under linux.
OpenCL
interfaced from your production code
portable between different graphics hardware
limited operations but preprepared shortcuts
CUDA
separate language (CUDA C)
nVidia hardware only
almost full control over the code (coding in a C-like language)
lot of profiling and debugging tools
Bottom line -- OpenCL is portable, CUDA is nVidia only. However, being an independent language, CUDA is much more powerful and has a bunch of really good tools.
Ease of use -- OpenCL is easier to use out of the box, but once you set up the CUDA coding environment it's almost like coding in C.
Community and Documentation -- both have extensive documentation and examples, however I think CUDA's are better.
Performance -- CUDA allows for greater control, hence can be better fine-tuned for higher performance.
Future -- hard to say really.
My personal experiences were:
API: OpenCL has a slightly more complex API. However, most of your time will be spent writing kernel code, and here both are almost identical.
Community: CUDA has had a much bigger community than OpenCL up till now, but this will probably even out.
Documentation: Both are very well documented.
Performance: In our experience, OpenCL drivers are not yet fully optimized.
Future: The future lies with OpenCL as it is an open standard, not restricted to a vendor or specific hardware!
This assessment is from 2010, so it is probably outdated.
OpenCL all the way unless you have a specific reason to use CUDA. OpenCL runs well on multicores like Intel i7 in addition to running on GPUs. By using OpenCL you can run it on a much wider range of hardware from Droid cell phones to the IBM Power7 compute nodes of the world's largest supercomputer, Blue Waters, which is supposed to come online next year.

Package for distributing calculations

Do you know of any package for distributing calculations on several computers and/or several cores on each computer? The calculation code is in c++, the package needs to be able to cope with data >2GB and work on a windows x64 machine. Shareware would be nice, but isn't a requirement.
A suitable solution would depend on the type of calculation and data you wish to process, the granularity of parallelism you wish to achieve, and how much effort you are willing to invest in it.
The simplest would be to just use a suitable solver/library that supports parallelism (e.g. ScaLAPACK). Or if you wish to roll your own solvers, you can squeeze some parallelisation out of your current code using OpenMP or compilers that provide automatic parallelisation (e.g. the Intel C/C++ compiler). All these will give you a reasonable performance boost without requiring massive restructuring of your code.
At the other end of the spectrum, you have the MPI option. It can afford you the most performance boost if your algorithm parallelises well. It will however require a fair bit of reengineering.
Another alternative would be to go down the threading route. There are libraries and tools out there that will make this less of a nightmare. These are worth a look: the Boost C++ parallel programming libraries and Threading Building Blocks.
You may want to look at OpenMP
There's an MPI library and the DVM system working on top of MPI. These are generic tools widely used for parallelizing a variety of tasks.