AVX support for remainder in G++ 5.4.0 [duplicate] - c++

I can't seem to find the intrinsics for either _mm_pow_ps or _mm256_pow_ps, both of which are supposed to be included with 'immintrin.h'.
Does Clang not define these or are they in a header I'm not including?

That's not an intrinsic; it's an Intel SVML library function name that confusingly uses the same naming scheme as actual intrinsics. There's no vpowps instruction. (AVX512ER on Xeon Phi does have the semi-related vexp2ps instruction...)
IDK if this naming scheme is to trick people into depending on Intel tools when writing SIMD code with their compiler (which comes with SVML), or because their compiler does treat it like an intrinsic/builtin for doing constant propagation if inputs are known, or some other reason.
For functions like that and _mm_sin_ps to be usable, you need Intel's Short Vector Math Library (SVML). Most people just avoid using them. If it has an implementation of something you want, though, it's worth looking into. IDK what other vector pow implementations exist.
In the intrinsics finder, you can avoid seeing these non-portable functions in your search results if you leave the SVML box unchecked.
There are some "composite" intrinsics like _mm_set_epi8() that typically compile to multiple loads and shuffles which are portable across compilers, and do inline instead of being calls to library functions.
Also note that sqrtps is a native machine instruction, so _mm_sqrt_ps() is a real intrinsic. IEEE 754 specifies mul, div, add, sub, and sqrt as "basic" operations that are required to produce correctly-rounded results (error <= 0.5 ulp), so sqrt() is special and does have direct hardware support, unlike most other "math library" functions.
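To make the contrast concrete, a trivial sketch: unlike the SVML "intrinsics" above, this maps one-to-one onto a machine instruction.

#include <immintrin.h>

// Compiles to a single sqrtps instruction (plus the return).
__m128 sqrt4(__m128 x) {
    return _mm_sqrt_ps(x);
}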
There are various libraries of SIMD math functions. Some of them come with C++ wrapper libraries that allow a+b instead of _mm_add_ps(a,b).
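As a minimal sketch of what such a wrapper looks like (the type name vec4f is made up for illustration; real libraries add many more operators, loads/stores, and width-generic templates):

#include <immintrin.h>

struct vec4f {
    __m128 v;
};

// Lets callers write a + b instead of _mm_add_ps(a, b).
inline vec4f operator+(vec4f a, vec4f b) {
    return { _mm_add_ps(a.v, b.v) };
}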
glibc libmvec - since glibc 2.22, to support the OpenMP 4.0 vector math functions. GCC knows how to auto-vectorize some functions like cos(), sin(), and probably pow() using it. This answer shows one inconvenient way of using it explicitly for manual vectorization (see the sketch after this list). Hopefully better ways are possible that don't put mangled names in the source code.
Agner Fog's VCL has some math functions like exp and log. (Formerly GPL licensed, now Apache).
https://github.com/microsoft/DirectXMath (MIT license) - I think portable to non-Windows, and doesn't require DirectX.
https://sleef.org/ - apparently great performance, with variable accuracy you can choose. Formerly it was only supported with MSVC on Windows; the support matrix on its web site now includes GCC and Clang for x86-64 GNU/Linux and AArch64.
Intel's own SVML (comes with ICC; ICC auto-vectorizes with SVML by default). Confusingly has its prototypes in immintrin.h along with actual intrinsics. Maybe they want to trick people into writing code that's dependent on Intel tools/libraries. Or maybe they think fewer includes are better and that everyone should use their compiler...
Also related: Intel MKL (Math Kernel Library), with matrix BLAS functions.
AMD ACML - end-of-life closed-source freeware. I think it just has functions that loop over arrays/matrices (like Intel MKL), not functions for single SIMD vectors.
sse_mathfun (zlib license) SSE2 and ARM NEON. Hasn't been updated since about 2011 it seems. But does have implementations of single-vector math / trig functions.
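Here is the kind of explicit libmvec call mentioned in the list above, as a hedged sketch. The mangled names follow the x86-64 vector function ABI (_ZGV, an ISA letter where 'b' is SSE and 'd' is AVX2, 'N' for unmasked, the lane count, then the argument kinds); link with -lmvec on glibc >= 2.22.

#include <immintrin.h>

// 4-lane SSE variant of cosf from libmvec; the name comes from the vector
// ABI rather than from any public header.
extern "C" __m128 _ZGVbN4v_cosf(__m128 x);

__m128 cos4(__m128 x) {
    return _ZGVbN4v_cosf(x);
}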

Related

C++ parallel std::sort for floating point values

I have a large file consisting of millions of floating point values. For now I can easily sort them using std::sort by reading the file into a vector, e.g.:
std::vector<float> v;
std::sort(v.begin(), v.end());
but is there any version of std::sort, or a similar algorithm, that takes advantage of the multiple cores available on my system? Since this is the only task that takes significant time, I'm looking for the performance improvement a multi-core CPU can provide.
I can use any of the latest compiler releases on an x64 Linux server, and can compile the binary with -std=c++1z too.
You're in luck. The C++ Extensions for Parallelism Technical Specification added parallelized versions of many of the standard algorithms, including std::sort. They are available in C++17. GCC has support for this, and you can see their page about it here. It looks as though they are utilizing OpenMP for multi-threading.
GCC Prerequisite Compiler Flags
Any use of parallel functionality requires additional compiler and runtime support, in particular support for OpenMP. Adding this support is not difficult: just compile your application with the compiler flag -fopenmp. This will link in libgomp, the GNU Offloading and Multi Processing Runtime Library, whose presence is mandatory.
In addition, hardware that supports atomic operations and a compiler capable of producing atomic operations is mandatory: GCC defaults to no support for atomic operations on some common hardware architectures. Activating atomic operations may require explicit compiler flags on some targets (like sparc and x86), such as -march=i686, -march=native or -mcpu=v9. See the GCC manual for more information.
I know you said you are using Linux, but I also want to include that MSVS, starting with version 2013 RTM, appears to have support for the Parallelism Technical Specification as well.
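For completeness, a minimal sketch of the C++17 call (this assumes a standard library with <execution> support; with recent libstdc++ you typically also need to link TBB with -ltbb):

#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<float> v = {3.0f, 1.0f, 2.0f};  // in practice, read from the file
    std::sort(std::execution::par, v.begin(), v.end());
}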

Why do libraries need hard-coded vectorization instead of compiler auto-vectorization

The C++ Eigen library does vectorization for different architectures, like SSE, NEON, etc. In their documentation they mention that Eigen's vectorization is not compiler dependent. But most modern compilers, like GCC, vectorize automatically when optimization is enabled with the -O3 flag.
So my question is: why do Eigen or other libraries hard-code vectorization when compilers do this automatically for us?
It is true that compilers are getting better and better at auto-vectorization, and for basic coefficient-wise operations like 2*A-4*B a library like Eigen cannot do much better than recent compilers. However, for slightly more complicated expressions like matrix products, reductions, transposition, powers, etc., the compiler cannot do much. On the other hand, Eigen can take advantage of higher-level knowledge of the expression's semantics to explicitly vectorize them. Moreover, complex scalar types are not vectorized by compilers. You can check for yourself by disabling Eigen's explicit vectorization (-DEIGEN_DONT_VECTORIZE).
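To make the distinction concrete, here is a small illustration using Eigen's public API; the first function is the kind of coefficient-wise expression a compiler can auto-vectorize on its own, while the second is where Eigen's hand-written kernels matter.

#include <Eigen/Dense>

// Coefficient-wise: after Eigen fuses the expression into one loop, a modern
// compiler can auto-vectorize it about as well as Eigen's explicit code.
Eigen::MatrixXf elementwise(const Eigen::MatrixXf& A, const Eigen::MatrixXf& B) {
    return 2.0f * A - 4.0f * B;
}

// Matrix product: dispatched to a blocked, explicitly vectorized kernel;
// an auto-vectorizer starting from the naive triple loop cannot match it.
Eigen::MatrixXf product(const Eigen::MatrixXf& A, const Eigen::MatrixXf& B) {
    return A * B;
}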

Generalizing to multiple BLAS/LAPACK Libraries

I am developing a linear algebra tool in C++ which relies heavily on matrix multiplication and decompositions (like LU and SVD) and is meant to be applied to large matrices. I developed it using Intel MKL for peak performance, but I don't want to release an Intel MKL-only version, as I assume it will not work for people without Intel hardware or who don't want to install MKL. Instead, I should release more general code that is not Intel MKL-specific, but rather allows the user to specify which implementation of BLAS and LAPACK they would like to use (e.g. OpenBLAS or ATLAS).
Although the function prototypes seem to be the same across implementations, there are several (helper?) functions and types that are specific to Intel MKL. For example, there is the MKL_INT type that I use, and also mkl_malloc. This article suggests using macros to redefine the types, which was also my first thought. I assume I would then have macros for the headers as well.
I believe it is standard for code to be written such that it is agnostic to the BLAS/LAPACK implementation, and I wanted to know if there was a cleaner way than relying on macros--particularly since the latter would require recompiling the code to switch, which does not seem to be necessary for other tools I have used.
Most scientific codes that rely on BLAS/LAPACK calls are implementation-agnostic. They usually just require that an appropriate library is linked.
You've commented that the function prototypes are the same across implementations. This allows you to just have the prototypes in some myblas.h and mylapack.h headers then link whichever library you'd like to use.
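For example, a sketch of what such a header might contain (myblas.h is a hypothetical name; the Fortran BLAS ABI passes every argument by pointer and, on most platforms, appends a trailing underscore to the routine name):

// myblas.h (hypothetical): this one prototype works for MKL, OpenBLAS, ATLAS, ...
extern "C" void dgemm_(const char* transa, const char* transb,
                       const int* m, const int* n, const int* k,
                       const double* alpha, const double* a, const int* lda,
                       const double* b, const int* ldb,
                       const double* beta, double* c, const int* ldc);

Which implementation you get is then decided purely at link time (e.g. -lopenblas or -lmkl_rt).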
It sounds like your primary concern is the implementation-specific stuff that you've used for MKL. The solution is to just not use this stuff. For example, MKL types like MKL_INT are not special. They are C datatypes that have been defined so code can generalize across the LP64/ILP64 interface libraries which MKL provides. See this table.
Also, stuff like mkl_malloc isn't special. It was introduced before the C standard had a thread-safe aligned allocator; in fact, that is all mkl_malloc is. So instead, just use aligned_alloc, or if you don't want to commit to C11, use _mm_malloc, memalign, etc.
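A minimal sketch of that replacement (note that C11/C++17 aligned_alloc requires the size to be a multiple of the alignment):

#include <cstdlib>  // std::aligned_alloc, std::free (C++17)

int main() {
    // Roughly what mkl_malloc(1024 * sizeof(double), 64) gives you:
    double* a = static_cast<double*>(std::aligned_alloc(64, 1024 * sizeof(double)));
    // ... use the buffer as before ...
    std::free(a);
}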
On the other hand, MKL does provide some useful extensions to BLAS/LAPACK which aren't standardized (transpositions, for example). However, this type of thing is usually easy to achieve with a special-case BLAS/LAPACK call, or easy enough to implement yourself. MKL also has internal threading if you choose to use it, but many BLAS/LAPACK libraries offer that.

Does gcc use Intel's SSE 4.2 instructions for text processing if available?

I read here that Intel introduced SSE 4.2 instructions for accelerating string processing.
Quote from the article:
The SSE 4.2 instruction set, first implemented in Intel's Core i7, provides string and text processing instructions (STTNI) that utilize SIMD operations for processing character data. Though originally conceived for accelerating string, text, and XML processing, the powerful new capabilities of these instructions are useful outside of these domains, and it is worth revisiting the search and recognition stages of numerous applications to utilize STTNI to improve performance.
Does gcc make use of these instructions if they are available?
If so, which version?
If it doesn't, are there any open source libraries which offer this?
Regarding software libraries, I would look at Agner Fog's asmlib. It has a collection of many routines, including several string manipulation ones that use SSE4.2, optimized in assembly. It also provides other useful functions I use that return information about the CPU, such as the cache size at each level and which extensions (e.g. SSE4.2) are supported.
http://www.agner.org/optimize/asmlib.zip
To enable SSE4.2 in GCC, compile with -msse4.2, or if you have a processor with AVX, use -mavx.
I'm not sure whether gcc itself uses them, but it shouldn't matter, as text processing is generally done through glibc. If you use the standard string functions from string.h (cstring will probably do the same) and have a reasonably recent glibc, you should be using them automatically.
I have searched for it and it seems glibc 2.15 (possibly even older ones have it) already has SSE4.2 strcasecmp optimizations:
http://upstream.rosalinux.ru/changelogs/glibc/2.15/changelog.html

Are BLAS Level 1 procedures still relevant for modern Fortran compilers?

Most of the BLAS Level 1 API can be written trivially and straightforwardly using Fortran 9x+ array assignments and intrinsic procedures.
Assuming you are using a modern optimizing compiler, like Intel Fortran, and correct target-specific compiler optimization options, are there any performance benefits from using BLAS Level 1 procedures instead, say from Intel MKL or other fast BLAS implementations?
If there are, what is a typical vector size when these benefits appear?
It depends. We've tested this before with the Intel compiler and ran into surprising results. For example, DOT_PRODUCT from Fortran vs. the BLAS implementation gave different trends based on the problem size. As the number of elements in the arrays got larger, BLAS became better than the intrinsic. But for small problem sizes, the intrinsic was much faster.
For our use cases we actually measured the cut-off size at which one becomes better than the other, and we use if-statements to decide which to call. I can't share those results, but I encourage you to test it out yourself. There is still benefit from using BLAS.
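A hedged C++ sketch of that kind of cut-off dispatch (the threshold is made up; measure on your own hardware, and the same pattern carries over to Fortran):

// Standard Fortran BLAS dot product, by-pointer ABI with trailing underscore.
extern "C" double ddot_(const int* n, const double* x, const int* incx,
                        const double* y, const int* incy);

double dot(const double* x, const double* y, int n) {
    constexpr int kCutoff = 2048;        // hypothetical crossover point
    if (n < kCutoff) {
        double s = 0.0;                  // plain loop tends to win for small n
        for (int i = 0; i < n; ++i) s += x[i] * y[i];
        return s;
    }
    const int inc = 1;
    return ddot_(&n, x, y, &inc, &inc);  // BLAS tends to win for large n
}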