Intel VML add is slow

Intel VML add is slow - fortran

I wrote this small subroutine that compares simple vector mathematical functions, performed either with a loop:
f(i) = a(i) + b(i)
or direct:
f = a + b
or using Intel MKL VML:
vdAdd(n,a,b,f)
The timing results for n=50000000 are:
VML 0.9 sec
direct 0.4
loop 0.4
And I dont understand, why VML takes twice as long as the other methods!
(Loop is sometimes faster than direct)
Subroutine can be found under http://paste.ideaslabs.com/show/L6dVLdAOIf
and called via
program test
use vmltests
implicit none
call vmlTest()
end program

Your sample code have potential L2 cache issue, one can overcome it with blocking optimization. See Intel® Software Networks Forum answer for details: http://software.intel.com/en-us/forums/showthread.php?t=80041
Intel® Optimization Notice:
Intel® compilers, associated libraries
and associated development tools may
include or utilize options that
optimize for instruction sets that are
available in both Intel® and non-Intel
microprocessors (for example SIMD
instruction sets), but do not optimize
equally for non-Intel microprocessors.
In addition, certain compiler options
for Intel compilers, including some
that are not specific to Intel
micro-architecture, are reserved for
Intel microprocessors. For a detailed
description of Intel compiler options,
including the instruction sets and
specific microprocessors they
implicate, please refer to the “Intel®
Compiler User and Reference Guides”
under “Compiler Options." Many
library routines that are part of
Intel® compiler products are more
highly optimized for Intel
microprocessors than for other
microprocessors. While the compilers
and libraries in Intel® compiler
products offer optimizations for both
Intel and Intel-compatible
microprocessors, depending on the
options you select, your code and
other factors, you likely will get
extra performance on Intel
microprocessors.
Intel® compilers, associated libraries
and associated development tools may
or may not optimize to the same degree
for non-Intel microprocessors for
optimizations that are not unique to
Intel microprocessors. These
optimizations include Intel® Streaming
SIMD Extensions 2 (Intel® SSE2),
Intel® Streaming SIMD Extensions 3
(Intel® SSE3), and Supplemental
Streaming SIMD Extensions 3 (Intel®
SSSE3) instruction sets and other
optimizations. Intel does not
guarantee the availability,
functionality, or effectiveness of any
optimization on microprocessors not
manufactured by Intel.
Microprocessor-dependent optimizations
in this product are intended for use
with Intel microprocessors.
While Intel believes our compilers and
libraries are excellent choices to
assist in obtaining the best
performance on Intel® and non-Intel
microprocessors, Intel recommends that
you evaluate other compilers and
libraries to determine which best meet
your requirements. We hope to win
your business by striving to offer the
best performance of any compiler or
library; please let us know if you
find we do not

Related

How do I Detect Availability of NEON and Helium Instruction Sets at Runtime

I'm working on a cross-platform parallel math library and I've made great progress implementing SSE, AVX, AVX2 and AVX-512 for x86/amd64 including runtime detection of ISA availability.
However, I've run into a major problem. There is no documentation for detecting NEON or Helium support at runtime on MSVC. It appears that there is no cpuid instruction on ARM or ARM64. It isn't clear whether there is a cross-platform way to accomplish this for Linux either.
Do you even need to detect it manually or can you just use preprocessor definitions (such as _M_ARM64) to check for runtime support? It is my understanding that preprocessor macros are ONLY evaluated at compile-time.
Are we just supposed to assume that every ARM CPU has NEON? What about Helium?
I'm hoping that someone here knows how to do it. Thank You in advance.

If building with MSVC, targeting modern Windows on ARM or ARM64 (i.e. not Windows CE), then the baseline feature set does support NEON (on both 32 and 64 bit), so you don't need to check for them at all, you can use them unconditionally. (If the code base is portable you might want to avoid compiling that code for other architectures of course, using e.g. regular preprocessor defines.) So for this case, checking the _M_ARM or _M_ARM64 defines is enough.
Helium is only for the M profile of ARM processors, i.e. for microcontrollers and such, not relevant for the the A profile (for "application use").

NEON as well as VFP is mandatory on armv8-a.
Hence there is no need to check the availability at runtime on aarch64.
And I'd ditch aarch32 support altogether.

AVX support for remainder in G++ 5.4.0 [duplicate]

I can't seem to find the intrinsics for either _mm_pow_ps or _mm256_pow_ps, both of which are supposed to be included with 'immintrin.h'.
Does Clang not define these or are they in a header I'm not including?

That's not an intrinsic; it's an Intel SVML library function name that confusingly uses the same naming scheme as actual intrinsics. There's no vpowps instruction. (AVX512ER on Xeon Phi does have the semi-related vexp2ps instruction...)
IDK if this naming scheme is to trick people into depending on Intel tools when writing SIMD code with their compiler (which comes with SVML), or because their compiler does treat it like an intrinsic/builtin for doing constant propagation if inputs are known, or some other reason.
For functions like that and _mm_sin_ps to be usable, you need Intel's Short Vector Math Library (SVML). Most people just avoid using them. If it has an implementation of something you want, though, it's worth looking into. IDK what other vector pow implementations exist.
In the intrinsics finder, you can avoid seeing these non-portable functions in your search results if you leave the SVML box unchecked.
There are some "composite" intrinsics like _mm_set_epi8() that typically compile to multiple loads and shuffles which are portable across compilers, and do inline instead of being calls to library functions.
Also note that sqrtps is a native machine instruction, so _mm_sqrt_ps() is a real intrinsic. IEEE 754 specifies mul, div, add, sub, and sqrt as "basic" operations that are requires to produce correctly-rounded results (error <= 0.5ulp), so sqrt() is special and does have direct hardware support, unlike most other "math library" functions.
There are various libraries of SIMD math functions. Some of them come with C++ wrapper libraries that allow a+b instead of _mm_add_ps(a,b).
glibc libmvec - since glibc 2.22, to support OpenMP 4.0 vector math functions. GCC knows how to auto-vectorize some functions like cos(), sin(), and probably pow() using it. This answer shows one inconvenient way of using it explicitly for manual vectorization. (Hopefully better ways are possible that don't have mangled names in the source code).
Agner Fog's VCL has some math functions like exp and log. (Formerly GPL licensed, now Apache).
https://github.com/microsoft/DirectXMath (MIT license) - I think portable to non-Windows, and doesn't require DirectX.
https://sleef.org/ - apparently great performance, with variable accuracy you can choose. Formerly only supported on MSVC on Windows, the support matrix on its web site now includes GCC and Clang for x86-64 GNU/Linux and AArch64.
Intel's own SVML (comes with ICC; ICC auto-vectorizes with SVML by default). Confusingly has its prototypes in immintrin.h along with actual intrinsics. Maybe they want to trick people into writing code that's dependent on Intel tools/libraries. Or maybe they think fewer includes are better and that everyone should use their compiler...
Also related: Intel MKL (Math Kernel Library), with matrix BLAS functions.
AMD ACML - end-of-life closed-source freeware. I think it just has functions that loop over arrays/matrices (like Intel MKL), not functions for single SIMD vectors.
sse_mathfun (zlib license) SSE2 and ARM NEON. Hasn't been updated since about 2011 it seems. But does have implementations of single-vector math / trig functions.

Does gcc use Intel's SSE 4.2 instructions for text processing if available?

I read here that Intel introduced SSE 4.2 instructions for accelerating string processing.
Quote from the article:
The SSE 4.2 instruction set, first implemented in Intel's Core i7,
provides string and text processing instructions (STTNI) that utilize
SIMD operations for processing character data. Though originally
conceived for accelerating string, text, and XML processing, the
powerful new capabilities of these instructions are useful outside of
these domains, and it is worth revisiting the search and recognition
stages of numerous applications to utilize STTNI to improve
performance
Does gcc make use of these instructions if they are available?
If so, which version?
If it doesn't, are there any open source libraries
which offer this?

In regards to software libraries I would look at Agner Fog's asmlib. It has a collection of many routines, including several string manipulation ones which use SSE4.2, optimized in assembly. Some other useful functions it provides which I use return information on the CPU such as the cache size for each level and which extensions (e.g. SSE4.2) are supported.
http://www.agner.org/optimize/asmlib.zip
To enable SSE4.2 in GCC compile with -msse4.2 or if you have a processor with AVX use -mavx

I'm not sure about whether gcc uses that, but it shouldn't matter as text processing is generally done through glibc. If you use the standard string functions from string.h (probably cstring will do the same), and have a reasonable glibc you should be using them automatically.
I have searched for it and it seems glibc 2.15 (possibly even older ones have it) already has SSE4.2 strcasecmp optimizations:
http://upstream.rosalinux.ru/changelogs/glibc/2.15/changelog.html

SIMD intrinsics - are they usable on gpus?

I'm wondering if I can use SIMD intrinsics in a GPU code like a CUDA's kernel or openCL one. Is that possible?

No, SIMD intrinsics are just tiny wrappers for ASM code. They are CPU specific. More about them here.
Generally speking, why whould you do that? CUDA and OpenCL already contain many "functions" which are actually "GPU intrinsics" (all of these, for example, are single-point-math intrinsics for the GPU)

You use the vector data types built into the OpenCL C language. For example float4 or float8. If you run with the Intel or AMD device drivers these should get converted to SSE/AVX instructions of the vendor's OpenCL device driver. OpenCL includes several functions such as dot(v1, v2) which should use the SSE/AVX dot production instructions. Is there a particular intrinsic you are interested in that you don't think you can get from the OpenCL C language?

Mostly no, because GPU programming languages use different programming model (SIMT). However, AMD GPU do have an extension to OpenCL which provides intrinsics for some byte-granularity operations (thus allowing to pack 4 values into 32-bit GPU registers). These operations are intended for video processing.

Yes you can use SIMD intrinsics in the kernel code on CPU or GPU provided the compiler supports usage of these intrinsics.
Usually the better way to use SIMD will be using the Vector datatypes in the kernels so that the compiler decides to use SIMD based on the availablility, this make the kernel code portable as well.

Intel C++ compiler as an alternative to Microsoft's?

Is anyone here using the Intel C++ compiler instead of Microsoft's Visual c++ compiler?
I would be very interested to hear your experience about integration, performance and build times.

The Intel compiler is one of the most advanced C++ compiler available, it has a number of advantages over for instance the Microsoft Visual C++ compiler, and one major drawback. The advantages include:
Very good SIMD support, as far as I've been able to find out, it is the compiler that has the best support for SIMD instructions.
Supports both automatic parallelization (multi core optimzations), as well as manual (through OpenMP), and does both very well.
Support CPU dispatching, this is really important, since it allows the compiler to target the processor for optimized instructions when the program runs. As far as I can tell this is the only C++ compiler available that does this, unless G++ has introduced this in their yet.
It is often shipped with optimized libraries, such as math and image libraries.
However it has one major drawback, the dispatcher as mentioned above, only works on Intel CPU's, this means that advanced optimizations will be left out on AMD cpu's. There is a workaround for this, but it is still a major problem with the compiler.
To work around the dispatcher problem, it is possible to replace the dispatcher code produced with a version working on AMD processors, one can for instance use Agner Fog's asmlib library which replaces the compiler generated dispatcher function. Much more information about the dispatching problem, and more detailed technical explanations of some of the topics can be found in the Optimizing software in C++ paper - also from Anger (which is really worth reading).
On a personal note I have used the Intel c++ Compiler with Visual Studio 2005 where it worked flawlessly, I didn't experience any problems with microsoft specific language extensions, it seemed to understand those I used, but perhaps the ones mentioned by John Knoeller were different from the ones I had in my projects.
While I like the Intel compiler, I'm currently working with the microsoft C++ compiler, simply because of the financial extra investment the Intel compiler requires. I would only use the Intel compiler as an alternative to Microsofts or the GNU compiler, if performance were critical to my project and I had a the financial part in order ;)

I'm not using Intel C++ compiler at work / personal (I wish I would).
I would use it because it has:
Excellent inline assembler support. Intel C++ supports both Intel and AT&T (GCC) assembler syntaxes, for x86 and x64 platforms. Visual C++ can handle only Intel assembly syntax and only for x86.
Support for SSE3, SSSE3, and SSE4 instruction sets. Visual C++ has support for SSE and SSE2.
Is based on EDG C++, which has a complete ISO/IEC 14882:2003 standard implementation. That means you can use / learn every C++ feature.

I've had only one experience with this compiler, compiling STLPort. It took MSVC around 5 minutes to compile it and ICC was compiling for more than an hour. It seems that their template compilation is very slow. Other than this I've heard only good things about it.
Here's something interesting:
Intel's compiler can produce different
versions of pieces of code, with each
version being optimised for a specific
processor and/or instruction set
(SSE2, SSE3, etc.). The system detects
which CPU it's running on and chooses
the optimal code path accordingly; the
CPU dispatcher, as it's called.
"However, the Intel CPU dispatcher
does not only check which instruction
set is supported by the CPU, it also
checks the vendor ID string," Fog
details, "If the vendor string says
'GenuineIntel' then it uses the
optimal code path. If the CPU is not
from Intel then, in most cases, it
will run the slowest possible version
of the code, even if the CPU is fully
compatible with a better version."
OSnews article here

I tried using Intel C++ at my previous job. IIRC, it did indeed generate more efficient code at the expense of compilation time. We didn't put it to production use though, for reasons I can't remember.
One important difference compared to MSVC is that the Intel compiler supports C99.

Anecdotally, I've found that the Intel compiler crashes more frequently than Visual C++. Its diagnostics are a bit more thorough and clearly written than VC's. Thus, it's possible that the compiler will give diagnostics that weren't given with VC, or will crash where VC didn't, making your conversion more expensive.
However, I do believe that Intel's compiler allows you to link with Microsoft runtimes like the CRT, easing the transition cost.
If you are interoperating with managed code you should probably stick with Microsoft's compiler.
Recent Intel compilers achieve significantly better performance on floating-point heavy benchmarks, and are similar to Visual C++ on integer heavy benchmarks. However, it varies dramatically based on the program and whether or not you are using link-time code generation or profile-guided optimization. If performance is critical for you, you'll need to benchmark your application before making a choice. I'd only say that if you are doing scientific computing, it's probably worth the time to investigate.
Intel allows you a month-long free trial of its compiler, so you can try these things out for yourself.

I've been using the Intel C++ compiler since the first Release of Intel Parallel Studio, and so far I haven't felt the temptation to go back. Here's an outline of dis/advantages as well as (some obvious) observations.
Advantages
Parallelization (vectorization, OpenMP, SSE) is unmatched in other compilers.
Toolset is simply awesome. I'm talking about the profiling, of course.
Inclusion of optimized libraries such as Threading Building Blocks (okay, so Microsoft replicated TBB with PPL), Math Kernel Library (standard routines, and some implementations have MPI (!!!) support), Integrated Performance Primitives, etc. What's great also is that these libraries are constantly evolving.
Disadvantages
Speed-up is Intel-only. Well duh! It doesn't worry me, however, because on the server side all I have to do is choose Intel machines. I have no problem with that, some people might.
You can't really do OSS or anything like that on this, because the project file format is different. Yes, you can have both VS and IPS file formats, but that's just weird. You'll get lost in synchronising project options whenever you make a change. Intel's compiler has twice the number of options, by the way.
The compiler is a lot more finicky. It is far too easy to set incompatible project settings that will give you a cryptic compilation error instead of a nice meaningful explanation.
It costs additional money on top of Visual Studio.
Neutrals
I think that the performance argument is not a strong one anymore, because plenty of libraries such as Thrust or Microsoft AMP let you use GPGPU which will outgun your cpu anyway.
I recommend anyone interested to get a trial and try out some code, including the libraries. (And yes, the libraries are nice, but C-style interfaces can drive you insane.)

The last time the company I work for compared the two was about a year ago, (maybe 2). The Intel compiler generated faster code, usually only a bit faster, but in some cases quite a bit.
But it couldn't handle some of the MS language extensions that we depended on, so we ended up sticking with MS. It was VS 2005 that we were comparing it to. And I'm wracking my brain to remember exactly what MS extension the Intel compiler couldn't handle. I'll come back and edit this post if I can remember.

Intel C++ Compiler has AMAZING (human) support. Talking to Microsoft can literally take days. My non-trivial issue was solved through chat in under 10 minutes (including membership verification time).
EDIT: I have talked to Microsoft about problems in their products such as Office 2007, even got a bug reported. While I eventually succeeded, the overall size and complexity of their products and organization hierarchy is daunting.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js