I am trying to utilise some AVX intrinsics in my code and have run into a brick wall with the logarithm intrinsics.
Using the Intel Intrinsics Guide v3.0.1 for Linux, I see the intrinsic _mm256_log_ps(__m256) listed as being part of "immintrin.h" and also supported on my current arch.
However trying to compile this simple test case fails with "error: ‘_mm256_log_ps’ was not declared in this scope"
The example was compiled with g++-4.8 -march=native -mavx test.cpp
#include <immintrin.h>
int main()
{
__m256 i;
_mm256_log_ps(i);
}
Am I missing something fundamental here? Are certain intrinsics not supported by g++ and only available in icc?
SOLVED: This instruction is not a true intrinsic but instead implemented as part of the Intel SVML for ICC.
As indicated in the comments to your question, that intrinsic doesn't map to an actual AVX instruction; it is an Intel extension to the intrinsic set. The implementation likely uses many underlying instructions, as a logarithm isn't a trivial operation.
If you'd like to use a non-Intel compiler but want a fast logarithm implementation, you might check out this open-source implementation of sin(), cos(), exp(), and log() functions using AVX. They are based on an earlier SSE2 version of the same functions.
I've posted my implementation of _mm256_log_pd(__m256d) here: https://stackoverflow.com/a/45898937/1915854 . With some effort you should be able to extend it to 8 packed floats instead of 4 doubles, though you need to revise the bit manipulations. And some parts are easies because you don't need to repack odd-/even-numbered 32-bit components of __m256i into __m128i.
Related
I keep reading opinions on which header file is better to include to access Intel's intrinsics : x86intrin.h or immintrin.h .
Both seem to achieve an identical outcome, but I'm sure there must some subtle differences, with regards to code portability. Maybe one is more common, or more complete, than the other ?
I couldn't find an explanation on any of them. If anyone knows why there are 2 files, and what differences they have, this would be a welcomed SO answer.
Speaking of portability, for older compilers (like gcc < v4.4.0), of course things become more complex, and neither is available. One has to consider including another intrinsic header (likely emmintrin.h for SSE support).
(posting an answer here because Header files for x86 SIMD intrinsics has out of date answers that suggest including individual header files).
immintrin.h is portable across all compilers, and includes all Intel SIMD intrinsics, and some scalar extensions like _pdep_u32 that are available with -mbmi2 or a -march= that includes it. (For AMD SSE4a and XOP (Bulldozer-family only, dropped for Zen), you need to include a different header as well.)
The only reason I can think of for including <emmintrin.h> specifically would be if you're using MSVC and want to leave intrinsics undefined for ISA extensions you don't want to depend on.
GCC's model of requiring you to enable extensions before you can use intrinsics for them means the compiler does this checking for you, so you can just #include <immintrin.h> but still get an error if you try to use _mm_shuffle_epi8 (pshufb) without -mssse3.
Don't use compilers older than gcc4.4. They're obsolete and will typically generate slower code, especially for modern CPUs that didn't exist when their tuning settings were being decided.
gcc/clang's x86intrin.h vs. MSVC intrin.h are only useful if you need some extra non-SIMD intrinsics like MSVC's _BitScanReverse() that aren't always portable across compilers. Stuff like integer rotate / bit-scan intrinsics that are baseline (unlike BMI1 lzcnt/tzcnt or BMI2 rorx) but hard or impossible to express in C in a way that compilers will recognize and turn a loop back into a single instruction.
Intel documents some of those as being available in immintrin.h in their intrinsics guide, but gcc/clang and MSVC actually have them in their x86intrin.h or intrin.h headers, respectively.
See How to get the CPU cycle count in x86_64 from C++? for an example of using #ifdef _MSC_VER to choose the right header to define uint64_t __rdtsc(void) and __rdtscp().
I compiled my C++ project(using Eigen 3.2.8) with the EIGEN_USE_BLAS option and link against MKL-BLAS, every thing works fine and that indeed speeds up my program substantially(perhaps due to a lot of complex-valued matrix-vector multiplication)
Then I also tried the EIGEN_USE_MKL_ALL, however, some similar errors prompt up:
/eigen3/Eigen/src/QR/ColPivHouseholderQR_MKL.h:94:1 error:
Cannot convert "Eigen::PlainObjectBase<Eigen::matrix<int,-1,1>>::Scalar*
{aka int*}" to "long long int*" in initialization
EIGEN_MKL_OR_COLPIV(...)
Two questions here:
EIGEN_USE_BLAS enables a 4x speed up though I didn't expect that much, possible reason?
EIGEN_USE_MKL_ALL seems to have some type conflict with LAPACK stuff, how to fix the compiling error?
MKL utilizes new AVX/AVX2 instruction set (8 32-bit float operations per clock with FMA and 3-operand instructions), while Eigen 3.2.8 only supports up to SSE4 (4 32-bit float operations per clock). As indicated by ggael, you could update to 3.3beta1 to achieve better performance.
You could try Eigen 3.3-beta1. Currently I cannot reproduce your problem. You may want to provide your code sample and compile option. But based on your error message, I guess you are using ILP64 interface, which is not supported by Eigen. You could use LP64 instead.
To complete kangshiyin answer, Eigen 3.3 supports AVX/FMA and can thus achieve similar performance. You need to compile with AVX and FMA instructions enabled. For instance with GCC, clang, or ICC: -mavx -mfma.
Is it possible to do it now in D out of the box ? I'm using LDC2 compiler if it can help.
I'm interested using AVX intrinsics.
At the moment DMD has no AVX intrinsics. Considering that all D compilers use the DMD frontend, and the druntime and phobos, I would say that the only way to do what you want is to use the in-line assembly as suggested by BCS.
I would advise you to check from time to time the core.simd module and see if AVX intrinsics are added.
There is inline ASM. I think DMD supports the SIMD instructions. Not sure what the story for LDC is.
With LDC, module ldc.gccbuiltins_x86 contains GCC-style builtins like __builtin_ia32_vfnmaddps256.
(there is also ldc.gccbuiltins_arm, and ldc.gccbuiltins_ppc, ...)
Are the following functions executed in a single clock cycle?
__builtin_popcount
__builtin_ctz
__builtin_clz
also what is the no of clock cycles for the ll(64 bit) version of the same.
are they portable. why or why not?
Do these functions execute in a single clock-cycle?
Not necessarily. On architectures where they can be implemented with a single instruction, they will typically be the fastest way to compute that function (but still not necessarily a single clock cycle). On architectures where they cannot be implemented as a single instruction, their performance is less certain.
On my processor (a Core 2 Duo), __builtin_ctz and __builtin_clz can be implemented with a single instruction (Bit Scan Forward and Bit Scan Reverse). However, __builtin_popcount cannot be implemented with a single instruction on my processor. For __builtin_popcount, gcc 4.7.2 calls a library function, while clang 3.1 generates an inline instruction sequence (implementing this bit twiddling hack). Clearly, the performance of those two implementations will not be the same.
Are they portable?
They are not portable across compilers. They originated with GCC (as far as I know), and are also implemented in some other compilers such as Clang.
Compilers that do support these functions may provide them for multiple architectures, but implementation quality (performance) is likely to vary.
__builtin functions like this are used to access specific machine instructions in a somewhat easier way than using inline assembly. If you need to achieve the highest performance and are willing to sacrifice portability to do so or to provide an alternate implementation for compilers or platforms where these functions are not provided, then it makes sense to use them. If optimal low level performance is your goal you should also check the assembly output of the compiler, to determine whether it really is generating the instruction that you expect it to use.
You can get a first idea of what your compiler does with it by compiling it with -O3 -march=native -S into assembler code. There you can check if this resolves to just one assembler statement. If so, this is not a guarantee that this is done in one cycle. To know the real cost, you'd have to measure.
I have some problem with SSE on ubuntu linux system.
example source code on msdn(sse4)
use sse4.1 operation on linux
gcc -o test test.c -msse4.1
then error message:
error: request for member 'm128i_u16' in something not a structure or union
How can I use this example code?
Or any example code can use?
The title of the code sample is "Microsoft Specific". This means that those functions are specific to the microsoft implementation of c++, and aren't cross-platform. Here are some Intel-specific guides to SSE instructions. Here is gcc documentation concerning command-line flags for specific assembly optimizations, including SSE. Good luck, SSE can get a bit hairy.
This is not so much about Microsoft-specific intrinsic functions, it is about the datatype. The actual intrinsics are 100% identical in both compilers, and are de facto standard (stemming from Intel).
The problem you are facing is that the __m128i type is -- as a convenience feature -- a union under MSVC, which includes fields such as m128i_u16. The code sample you link to assumes this.
Under gcc, __m128i is not a union and therefore, unsurprisingly, does not have these fields. This is not really a downside, because accessing fields in an union like this anihilates any gains you might have from using SSE in the first place, so other than in demo snippets like the above, you will (almost) never want to use such a thing.