Is there 1 SYCL implementation to rule all platforms?

Is there 1 SYCL implementation to rule all platforms? - c++

Apologies for the slightly jokey title, but I couldn't find another way to concisely describe the question. I work in a team that use predominantly OpenCL code with a CPU fallback. For the most part this works fine, except when it comes to Nvidia and their refusal to use SPIR-V for OpenCL.
I recently found and have been looking into SYCL, but the ecosystem surrounding it is more than a little bit confusing, and in one case I found one implementation referring to using another implementation.
So my question is: is there a single SYCL implementation that can produce a single binary that has runtime support for Nvidia, AMD and Intel (preferred, but not required) and either x64 or Arm64 (we would create a second binary for the other one) without having to do what we do now which is select a bunch of GPUs from the various vendors build the kernels for each one separately and then have to ship them all.
Thanks

As of December 2022, for Linux and x86_64:
The open-source version of DPC++ can compile code for all three GPU vendors. In my experience, a single binary for all three vendors works.
hipSYCL has official support for NVIDIA and AMD devices, and experimental support for Intel GPUs (via the above-mentioned DPC++).
without having to do what we do now which is select a bunch of GPUs from the various vendors build the kernels for each one separately and then have to ship them all.
Note: Under the hood, both hipSYCL and DPC++ work this way. The kernels are compiled to PTX, GCN, and/or SPIR-V. They are bundled into a single binary, though, so, in this respect, the distribution can be simpler (or not: you will likely have to also ship the SYCL runtime libraries with your application).

Related

Best way to leverage the GPU

I have a relatively small section of code that deals with huge datasets which I've already parallelized using openmp and am keen to increase performance further using the GPU. The program is C++, developed under VS2015, runs exclusively on Windows and will need to support 64 bit versions from 7 upwards on as wide a variety of GPUs as is feasible. Technologies I've been looking at so far include AMP, OpenCL, HLSL, and CUDA. Questions already asked, such as this with an informative answer by Ade Miller, make me question whether AMP is the way to go although it looks like the easiest option. I'm dismissing CUDA as it limits me in terms of hardware supported, and am tending towards OpenCL while currently working my way through the following book. As such, I've the following questions;
Is OpenCL a good approach here, as other posts suggest it may also be on the way out?
If I go for OpenCL while wanting to support the widest range of GPUs, am I better off with a 1.x version of OpenCL? Reason I ask this is that the OpenCL.DLL downloaded with the latest version of the CUDA SDK is 1.9. I had to download the Intel SDK for OpenCL to get a 2.x version.
If I go with OpenCL, what do I have to distribute with my application (assuming OpenCL.DLL as a minimum) and are there any licensing issues? Are default drivers for most cards going to support OpenCL and if so which versions?
With respect to the above, am I actually better of with AMP, as it works with anything that has DirectX 11 or better?
(Apologies if the above is slightly off topic, if anyone believes that it is perhaps they could point me to a better forum to ask these questions)

Is OpenCL a good approach here, as other posts suggest it may also be on the way out?
OpenCL seems to be most widely supported GPU computing platform. Supported by nVidia, AMD and Intel. Works on most mobile platforms as well. It is also large set of libraries available: ViennaCL, clBLast, clBlast, Boost-Compute and so on.
If I go for OpenCL while wanting to support the widest range of GPUs, am I better off with a 1.x version of OpenCL?
Yes, currently the safest is to stick with 1.2 - and actually it is more then enough.
All major desktop GPU vendors (Intel, AMD, nVidia) support at least OpenCL 1.2.
Actually only nVidia didn't released official 2.0 support - it is still in beta stage.
Also note that some older GPUs will support OpenCL 1.2 only as well.

Why do we need SPIR-V?

I have been reading about heterogeneous computing and came across SPIR-V. There I found the following:
SPIR-V is the first open standard, cross-API intermediate language for natively representing parallel compute and graphics..
From this image I can see that all the high-level languages such as GLSL, HLSL, OpenCL C, etc., are compiled into SPIR-V and in this way passed to the correct physical device to be executed.
My question is why do we need to compile our shader/kernel code to SPIR-V and not compile it directly into machine instructions that will be executed by the chosen physical device? In case this question is not correct, can you please explain why do we need SPIR-V at all?

In general you can split a compiler into two parts: a front-end for a specific language (or family of languages), and a back-end, which is language-agnostic and can generate machine code for one or more specific architectures (you can break this down further, but that's enough for now). Optimizations can happen in both parts; some are more appropriate in either place. This is the relationship between clang and LLVM for example: clang is a front-end for C-family languages, and LLVM is the backend.
Because different GPUs have significantly different machine code (often much more different than, say, arm64 vs. x86_64), the backend compiler needs to be in the GPU driver. But there's no reason to have the front-end be there too, even though that's how it worked in OpenGL. By separating the two, and using SPIR-V as the language they use to communicate, we get:
One parsing and syntax checking implementation, instead of one per vendor. This means developers get to target just one variant of the language, instead of a bunch of vendor-specific variants (due to implementing different versions, bugs, differences in interpretation, etc.)
Support for multiple languages. You can use ESSL (OpenGL ES's variant of GLSL), GLSL, HLSL, and OpenCL-C to write Vulkan shaders, making it easier for developers to support multiple APIs. All emit SPIR-V, so drivers don't have to support each of these languages. In theory someone could design their own language, or support MetalSL, etc.
Since SPIR-V is meant to be machine-written / machine-read instead of human-friendly, it is a simpler and more regular language than GLSL. So it should be easier to get all vendors to implement it with high-quality. (At the moment, implementations are a lot less mature than in GL drivers, so we're not quite there yet.)
Some expensive optimizations can be done offline, e.g. as part of the app build process, instead of at runtime when you're trying to finish a frame in 16 or 33 milliseconds.

Is it possible to emulate a GPU for CUDA/OpenCL unit testing purposes?

I would like to develop a library with an algorithm that can run on the CPU or the GPU. The GPU can be Nvidia (then the algorithm will use CUDA) or not (then the algorithm will use OpenCL).
I would like to emulate a GPU in this project because maybe:
I will use different computer to develop the software and some of them don't have a GPU.
The software will be finally executed in servers that can have a GPU or not and the unit test must be executed and passed.
Is there a way to emulate a GPU for unit testing purposes?
In the following link:
GPU Emulator for CUDA programming without the hardware
They show a solution but only for CUDA, not for OpenCL and the software they propose "GPUOcelot" is no longer actively maintained.

It depends on what you mean on emulation. You cannot emulate the speed of GPUs.
The GPU is architecturally very different from the CPU, with a lot of working threads (1000s, 10000s, ...), that's why we use it. The CPU can have only a few threads, even when you parallelize the code. They also have different instruction sets.
You can however emulate the execution using special softwares, like NVEmulate for NVIDIA GPUs and OpenCL Emulator-Debugger for AMD.
A related question: GPU Emulator for CUDA programming without the hardware, where the accepted answer recommends gpuocelot for CUDA emulation.

I don't know the full state of the art but I can provide a very limited set of things to look at which may be useful.
The accepted answer for this question is now out of date.
The question of compiling and runnning GPU code for CUDA or OpenCL on a machine that does not natively support it has come up on here several times (but sadly its often taken as off-topic). This answer is for those questions too.
Many of the answers refer to software solutions that have not been maintained. There seem to be only two answers which stand the test of time which treat this as a mu question.
Use a real GPU - i.e. buy a cheap cuda card if you don't already have one.
Rent someone elses GPU in the cloud
However emulators do exist.
Also GPU virtualization is well covered by the wikipedia page. There is strong support for getting virtual machines to use the hosts hardware.
Docker and virtualbox both for example support GPU passthough.
Reasons to emulate
To learn and keep up to date with changes to CUDA and OpenCL
To estimate the effect of the various APIs on performance.
To test that your code works on a variety of different platforms.
As a proxy for hardware you don't have access to (as per this question)
Kind of emulation
For testing you might accept a slow implementation as long as it is compliant and reliable.
For production running on different hardware you would more likely accept similar, but not 100% equivalent constructs but (e.g. different warp size, different high-level libraries for FFT, ...) and much more complicated performance-optimized implementations of primitives. You would probaly demand at least 80% of the Cuda speed for comparable hardware.
(Thanks to https://stackoverflow.com/users/13130048/sebastian for those two points)
For the second case you would likely need not just GPU virtualisation but additional optimisation passes.
Why are there less emulators and why don't they survive the test of time?
GPUs are affordable. It is only high performance that costs.
GPUs (not to mention TPUs and FPGAs) are developing rapidly.
Some hardware tricks are kept secret from competitors so emulating actual hardware is difficult.
The CUDA and openCL standards are changing too but less quickly.
There is arguably a need for more programmers that understand them. Compiling your code without running and testing it would simply be unprofessional. There would seem to be an obvious need for emulation where you don't have all the possible or interesting hardware combinations physical available.
That being the case its surprising that so many of these emulation projects have not stood the test of time or been endorsed/provided by GPU manufacturers.
There are some active emulation projects however.
Active GPU EMulation Projects
There are at least two active emulation projects maintained as of October 2022:
gpgpu-sim
oclgrind - openCL device simulator
I cannot speak to how good these are and how commonly they are used compared to using real GPUs (either your own or rented).
Honorable mentions
Cuda to OpenCL source to source transpilers.
These appear to be maintained but are not themselves emulators.
CU2CL
coriander
Why is this not a solved problem?
There are a number of challengs to overcome. My take on these would be something like:
provide a runtime emulating a particular version of the CUDA or openCL standard
provide a compiler targeting this runtime (ideally gcc or clang)
get the backing of a vendor (e.g. Nvidia or the kronos group)
get the backing of a community (i.e. a decent userbase and set of contributors)
build support into a popular emulation environment (e.g. virtualbox)
You could also argue the case that almost all people working in this area have access to real GPUs so this is not necessary at all.
The vendors of point 3 are doing well with points 1 and 2 and 4.
An emulator has to both build on that and take some mindshare of its own.
This is an uphill struggle. I hope and believe there will be success in the future.
Looking at virtualbox the last discussion I can find is from 2011.
https://forums.virtualbox.org/viewtopic.php?f=9&t=41155
Seemingly retired projects
These have been mentioned in answers to previous other attempts to ask and answer this kind of question.
gpuocelot - no longer maintained
mcuda - looks unmaintained
cuda-waste - on google code which was frozen long ago
nvemulate - cude emulator Nvidia - retired a while back
Other seemingly retired projects of interest:
openTPU - a Tensor PU emulator from 2017
gdev - 2010
Implementing Open-Source CUDA Runtime - paper from 2013
Earlier (out of date) questions:
GPU Emulator for CUDA programming without the hardware
Asked 2010 - most recent answer 2016
CUDA without CUDA enabled gpu
Asked 2010
How can I emulate a GPU for testing code written in Pytorch?
Asked 2021 - pytorch specific
CUDA code without a GPU
Asked 2014
CUDA on a system that has no GPU
Asked 2013
Using the built-in graphics cards without a NVIDIA graphics card, Can I use the CUDA and Caffe library?
Asked 2016

C/C++ Cross platform Library allowing the utilisation of GPU for floating point calculations

Does any one know of any cross platform c/c++ libraries which will utilise the GPU for the purposes of floating point calculations, not specifically graphics oriented calcs. Which ones are in common use, which ones recommended , which ones have you had experience of. Specifically it should be open source with a GPL license.
addendum:- Any libraries you know of that are not GPU manufacturer specific.
addendum:- OpenCL has been brought up in a few answers as having cross GPU compatability. Does anyone have experience using it and can vouch for it's maturity? I'm guessing that if it's Kronos it'll be pretty good.

I would very much doubt that you have a reasonable chance of finding something like this as open source, as "utilise GPU" usually implies "heftily hardware specific, top secret NDA driver stuff".
However, OpenCL is as cross platform as you can get (works with every major vendor and even has at least one software fallback implementation) and it is reasonably free insofar as there are no fees and no restrictions on how you may use it. The only non-free thing is that it's not open source and you can't modify it.
ATI/AMD and nVidia have been offering OpenCL working on G80 and RHD, respectively, for some time, also ATI/AMD has been offering a software implementaion for a good time. As for Intel, I remember reading that they were working for OpenCL for Sandy Bridge generation about a year or so ago, so it should probably be finished by now as well.

How about OpenCL?
Here is the project page at the Kronos Group.

It all depends on the chip you are targeting but NVIDIA offers an SDK in the form of CUDA for Windows, Mac, and Linux. The license is not opensource but depending on what you need that might not actually be a big hurdle.

Intel C++ compiler as an alternative to Microsoft's?

Is anyone here using the Intel C++ compiler instead of Microsoft's Visual c++ compiler?
I would be very interested to hear your experience about integration, performance and build times.

The Intel compiler is one of the most advanced C++ compiler available, it has a number of advantages over for instance the Microsoft Visual C++ compiler, and one major drawback. The advantages include:
Very good SIMD support, as far as I've been able to find out, it is the compiler that has the best support for SIMD instructions.
Supports both automatic parallelization (multi core optimzations), as well as manual (through OpenMP), and does both very well.
Support CPU dispatching, this is really important, since it allows the compiler to target the processor for optimized instructions when the program runs. As far as I can tell this is the only C++ compiler available that does this, unless G++ has introduced this in their yet.
It is often shipped with optimized libraries, such as math and image libraries.
However it has one major drawback, the dispatcher as mentioned above, only works on Intel CPU's, this means that advanced optimizations will be left out on AMD cpu's. There is a workaround for this, but it is still a major problem with the compiler.
To work around the dispatcher problem, it is possible to replace the dispatcher code produced with a version working on AMD processors, one can for instance use Agner Fog's asmlib library which replaces the compiler generated dispatcher function. Much more information about the dispatching problem, and more detailed technical explanations of some of the topics can be found in the Optimizing software in C++ paper - also from Anger (which is really worth reading).
On a personal note I have used the Intel c++ Compiler with Visual Studio 2005 where it worked flawlessly, I didn't experience any problems with microsoft specific language extensions, it seemed to understand those I used, but perhaps the ones mentioned by John Knoeller were different from the ones I had in my projects.
While I like the Intel compiler, I'm currently working with the microsoft C++ compiler, simply because of the financial extra investment the Intel compiler requires. I would only use the Intel compiler as an alternative to Microsofts or the GNU compiler, if performance were critical to my project and I had a the financial part in order ;)

I'm not using Intel C++ compiler at work / personal (I wish I would).
I would use it because it has:
Excellent inline assembler support. Intel C++ supports both Intel and AT&T (GCC) assembler syntaxes, for x86 and x64 platforms. Visual C++ can handle only Intel assembly syntax and only for x86.
Support for SSE3, SSSE3, and SSE4 instruction sets. Visual C++ has support for SSE and SSE2.
Is based on EDG C++, which has a complete ISO/IEC 14882:2003 standard implementation. That means you can use / learn every C++ feature.

I've had only one experience with this compiler, compiling STLPort. It took MSVC around 5 minutes to compile it and ICC was compiling for more than an hour. It seems that their template compilation is very slow. Other than this I've heard only good things about it.
Here's something interesting:
Intel's compiler can produce different
versions of pieces of code, with each
version being optimised for a specific
processor and/or instruction set
(SSE2, SSE3, etc.). The system detects
which CPU it's running on and chooses
the optimal code path accordingly; the
CPU dispatcher, as it's called.
"However, the Intel CPU dispatcher
does not only check which instruction
set is supported by the CPU, it also
checks the vendor ID string," Fog
details, "If the vendor string says
'GenuineIntel' then it uses the
optimal code path. If the CPU is not
from Intel then, in most cases, it
will run the slowest possible version
of the code, even if the CPU is fully
compatible with a better version."
OSnews article here

I tried using Intel C++ at my previous job. IIRC, it did indeed generate more efficient code at the expense of compilation time. We didn't put it to production use though, for reasons I can't remember.
One important difference compared to MSVC is that the Intel compiler supports C99.

Anecdotally, I've found that the Intel compiler crashes more frequently than Visual C++. Its diagnostics are a bit more thorough and clearly written than VC's. Thus, it's possible that the compiler will give diagnostics that weren't given with VC, or will crash where VC didn't, making your conversion more expensive.
However, I do believe that Intel's compiler allows you to link with Microsoft runtimes like the CRT, easing the transition cost.
If you are interoperating with managed code you should probably stick with Microsoft's compiler.
Recent Intel compilers achieve significantly better performance on floating-point heavy benchmarks, and are similar to Visual C++ on integer heavy benchmarks. However, it varies dramatically based on the program and whether or not you are using link-time code generation or profile-guided optimization. If performance is critical for you, you'll need to benchmark your application before making a choice. I'd only say that if you are doing scientific computing, it's probably worth the time to investigate.
Intel allows you a month-long free trial of its compiler, so you can try these things out for yourself.

I've been using the Intel C++ compiler since the first Release of Intel Parallel Studio, and so far I haven't felt the temptation to go back. Here's an outline of dis/advantages as well as (some obvious) observations.
Advantages
Parallelization (vectorization, OpenMP, SSE) is unmatched in other compilers.
Toolset is simply awesome. I'm talking about the profiling, of course.
Inclusion of optimized libraries such as Threading Building Blocks (okay, so Microsoft replicated TBB with PPL), Math Kernel Library (standard routines, and some implementations have MPI (!!!) support), Integrated Performance Primitives, etc. What's great also is that these libraries are constantly evolving.
Disadvantages
Speed-up is Intel-only. Well duh! It doesn't worry me, however, because on the server side all I have to do is choose Intel machines. I have no problem with that, some people might.
You can't really do OSS or anything like that on this, because the project file format is different. Yes, you can have both VS and IPS file formats, but that's just weird. You'll get lost in synchronising project options whenever you make a change. Intel's compiler has twice the number of options, by the way.
The compiler is a lot more finicky. It is far too easy to set incompatible project settings that will give you a cryptic compilation error instead of a nice meaningful explanation.
It costs additional money on top of Visual Studio.
Neutrals
I think that the performance argument is not a strong one anymore, because plenty of libraries such as Thrust or Microsoft AMP let you use GPGPU which will outgun your cpu anyway.
I recommend anyone interested to get a trial and try out some code, including the libraries. (And yes, the libraries are nice, but C-style interfaces can drive you insane.)

The last time the company I work for compared the two was about a year ago, (maybe 2). The Intel compiler generated faster code, usually only a bit faster, but in some cases quite a bit.
But it couldn't handle some of the MS language extensions that we depended on, so we ended up sticking with MS. It was VS 2005 that we were comparing it to. And I'm wracking my brain to remember exactly what MS extension the Intel compiler couldn't handle. I'll come back and edit this post if I can remember.

Intel C++ Compiler has AMAZING (human) support. Talking to Microsoft can literally take days. My non-trivial issue was solved through chat in under 10 minutes (including membership verification time).
EDIT: I have talked to Microsoft about problems in their products such as Office 2007, even got a bug reported. While I eventually succeeded, the overall size and complexity of their products and organization hierarchy is daunting.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js