I'm wondering if I can use SIMD intrinsics in GPU code, such as a CUDA or OpenCL kernel. Is that possible?
No. SIMD intrinsics are just thin wrappers around CPU-specific assembly instructions, so they cannot be used in GPU code.
Generally speaking, why would you want to do that? CUDA and OpenCL already contain many "functions" which are actually "GPU intrinsics" (all of these, for example, are single-precision math intrinsics for the GPU).
You can use the vector data types built into the OpenCL C language, for example float4 or float8. If you run with the Intel or AMD device drivers, these should get converted to SSE/AVX instructions by the vendor's OpenCL device driver. OpenCL also includes several functions, such as dot(v1, v2), which should use the SSE/AVX dot product instructions. Is there a particular intrinsic you are interested in that you don't think you can get from the OpenCL C language?
Mostly no, because GPU programming languages use a different programming model (SIMT). However, AMD GPUs do have an extension to OpenCL which provides intrinsics for some byte-granularity operations (thus allowing four 8-bit values to be packed into a 32-bit GPU register). These operations are intended for video processing.
Yes, you can use SIMD intrinsics in kernel code on a CPU or GPU, provided the compiler supports them.
Usually the better way to use SIMD is to use the vector data types in the kernels, so that the compiler decides whether to use SIMD based on availability; this also keeps the kernel code portable.
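To illustrate the vector-data-type approach from the answers above, here is a minimal OpenCL C kernel sketch. This is device code, not a standalone program; the runtime's compiler is free to lower the float4 arithmetic to SSE/AVX on a CPU device or to the GPU's native instructions:

```c
// OpenCL C device code (compiled by the vendor's OpenCL driver, not
// runnable on its own): element-wise add using the built-in float4
// vector type.  Each work-item processes four floats at once; the
// compiler decides how (or whether) to map this onto hardware SIMD.
__kernel void add4(__global const float4 *a,
                   __global const float4 *b,
                   __global float4 *out)
{
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];
}
```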
I'm working on a cross-platform parallel math library and I've made great progress implementing SSE, AVX, AVX2 and AVX-512 for x86/amd64 including runtime detection of ISA availability.
However, I've run into a major problem. There is no documentation for detecting NEON or Helium support at runtime on MSVC. It appears that there is no cpuid instruction on ARM or ARM64. It isn't clear whether there is a cross-platform way to accomplish this for Linux either.
Do you even need to detect it manually or can you just use preprocessor definitions (such as _M_ARM64) to check for runtime support? It is my understanding that preprocessor macros are ONLY evaluated at compile-time.
Are we just supposed to assume that every ARM CPU has NEON? What about Helium?
I'm hoping that someone here knows how to do it. Thank you in advance.
If building with MSVC, targeting modern Windows on ARM or ARM64 (i.e. not Windows CE), then the baseline feature set does support NEON (on both 32- and 64-bit), so you don't need to check for it at all; you can use it unconditionally. (If the code base is portable, you might want to avoid compiling that code for other architectures, of course, using e.g. regular preprocessor defines.) So for this case, checking the _M_ARM or _M_ARM64 defines is enough.
Helium is only for the M profile of ARM processors, i.e. for microcontrollers and such; it is not relevant for the A profile (for "application use").
NEON as well as VFP is mandatory on armv8-a.
Hence there is no need to check the availability at runtime on aarch64.
And I'd ditch aarch32 support altogether.
I’m evaluating Intel IPP to speed up certain parts of our code, e.g.,
adding
absolute value
sorting
among others. I note this page in the manual:
While the rest of Intel IPP functions support only signals or images of 32-bit integer size, Intel IPP platform-aware functions work with 64-bit object sizes if it is supported by the target platform. … You can distinguish Intel IPP platform-aware functions by the L suffix in the function name, for example, ippiAdd_8u_C1RSfs_L. With Intel IPP platform-aware functions you can overcome 32-bit size limitations.
Of the three I mentioned above, it appears only sorting has 64-bit-aware functionality.
So, questions: can this be right? Can IPP not accelerate addition/abs on arrays beyond 32-bit indexing? Is there a master list of functions that have “platform-aware” (64-bit) alternatives in IPP? Do people hand-roll workarounds to the 32-bit limit, like calling the add/abs functions in a loop over 2^30-sized chunks?
LLVM has back ends for both AMD and NVIDIA GPUs. Is it currently possible to compile C++ (or a subset) to GPU code with clang and run it? Obviously things like the standard library would be unavailable, as well as operator new and delete. I'm not looking for OpenCL or CUDA; I'm thinking of a fully ahead-of-time compiled program, even a trivial one.
No, you need some language like OpenCL or CUDA, because a GPGPU is not an ordinary computer and has a different programming model (roughly speaking, SIMD-like). GPGPU compute kernels have specific constraints.
You might want to consider using OpenACC pragmas in your C++ code (and use a recent GCC compiler).
I have seen that OpenCL is widely supported by CPU implementations as well as some GPU implementations.
For the cases where there is a GPU but no GPU implementation available, would it be feasible to implement OpenCL using OpenGL?
Maybe most operations would map quite well to GLSL fragment shaders or even compute shaders.
If this is feasible, where should one begin? Is there any 'minimal' CPU OpenCL implementation that one could start from?
For the cases where there is a GPU but no GPU implementation available, would it be feasible to implement OpenCL using OpenGL?
Possible: Yes, certainly. Every Turing-complete machine can emulate any other Turing-complete machine.
Feasible: OpenGL implementations' GLSL compilers are already prima donnas, each implementation's compiler behaving a little differently. OpenGL itself has tons of heuristics in it for selecting code paths. Shoehorning a makeshift OpenCL on top of OpenGL + GLSL would be an exercise in a lot of pain.
Required: Absolutely not. For every GPU that has the capabilities required to actually support OpenCL, the drivers support OpenCL anyway, so this is a moot non-issue.
OpenCL has certain very explicit requirements on the precision of floating-point operations. GLSL's requirements are far more lax. So even if you implemented the OpenCL API on top of OpenGL, you would never be able to get the same behavior.
Oh and in case you start thinking otherwise, Vulkan is no better in this regard.
I wrote this small subroutine that compares simple vector mathematical functions, performed either with a loop:
f(i) = a(i) + b(i)
or direct:
f = a + b
or using Intel MKL VML:
vdAdd(n,a,b,f)
The timing results for n=50000000 are:
VML: 0.9 s
direct: 0.4 s
loop: 0.4 s
And I don't understand why VML takes twice as long as the other methods!
(The loop is sometimes even faster than the direct expression.)
The subroutine can be found at http://paste.ideaslabs.com/show/L6dVLdAOIf
and called via
program test
use vmltests
implicit none
call vmlTest()
end program
Your sample code has a potential L2 cache issue; it can be overcome with a blocking optimization. See this Intel® Software Network forum answer for details: http://software.intel.com/en-us/forums/showthread.php?t=80041
Intel® Optimization Notice:
Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the "Intel® Compiler User and Reference Guides" under "Compiler Options." Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.
Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.