Is it possible to do this in D out of the box now? I'm using the LDC2 compiler, if that helps.
I'm interested in using AVX intrinsics.
At the moment DMD has no AVX intrinsics. Considering that all D compilers use the DMD frontend, and the druntime and phobos, I would say that the only way to do what you want is to use the in-line assembly as suggested by BCS.
I would advise you to check from time to time the core.simd module and see if AVX intrinsics are added.
There is inline ASM. I think DMD supports the SIMD instructions. Not sure what the story for LDC is.
With LDC, module ldc.gccbuiltins_x86 contains GCC-style builtins like __builtin_ia32_vfnmaddps256.
(there is also ldc.gccbuiltins_arm, and ldc.gccbuiltins_ppc, ...)
I keep reading opinions on which header file is better to include to access Intel's intrinsics: x86intrin.h or immintrin.h.
Both seem to achieve an identical outcome, but I'm sure there must be some subtle differences with regard to code portability. Maybe one is more common, or more complete, than the other?
I couldn't find an explanation for either of them. If anyone knows why there are two files, and what the differences are, that would be a welcome SO answer.
Speaking of portability, for older compilers (like gcc < v4.4.0), things of course become more complex, since neither is available. One has to consider including another intrinsic header (likely emmintrin.h for SSE support).
(Posting an answer here because "Header files for x86 SIMD intrinsics" has out-of-date answers that suggest including individual header files.)
immintrin.h is portable across all compilers, and includes all Intel SIMD intrinsics, and some scalar extensions like _pdep_u32 that are available with -mbmi2 or a -march= that includes it. (For AMD SSE4a and XOP (Bulldozer-family only, dropped for Zen), you need to include a different header as well.)
The only reason I can think of for including <emmintrin.h> specifically would be if you're using MSVC and want to leave intrinsics undefined for ISA extensions you don't want to depend on.
GCC's model of requiring you to enable extensions before you can use intrinsics for them means the compiler does this checking for you, so you can just #include <immintrin.h> but still get an error if you try to use _mm_shuffle_epi8 (pshufb) without -mssse3.
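To make the point about GCC's model concrete, here is a minimal sketch. The helper name reverse16 is made up for illustration; the only assumption about the toolchain is that GCC and clang define the __SSSE3__ macro when SSSE3 is enabled (for example with -mssse3). The pshufb path is compiled only in that case, and a plain loop is used otherwise, so the same source builds either way.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSSE3__)
#include <immintrin.h>  // one header for all Intel SIMD intrinsics
#endif

// Hypothetical helper: write the 16 bytes of `in` to `out` in reverse order.
void reverse16(const std::uint8_t* in, std::uint8_t* out) {
    assert(in != nullptr && out != nullptr && in != out);
#if defined(__SSSE3__)
    // Only compiled when SSSE3 is enabled (e.g. -mssse3 or a suitable -march=).
    // Output byte i takes input byte idx[i], so 15..0 reverses the vector.
    const __m128i idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                      7, 6, 5, 4, 3, 2, 1, 0);
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_shuffle_epi8(v, idx));
#else
    // Scalar fallback when SSSE3 is not enabled at compile time.
    for (std::size_t i = 0; i < 16; ++i) {
        out[i] = in[15 - i];
    }
#endif
}
```

Without the #if guard, GCC would reject the _mm_shuffle_epi8 call unless -mssse3 (or equivalent) is on the command line, which is exactly the checking described above.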
Don't use compilers older than gcc4.4. They're obsolete and will typically generate slower code, especially for modern CPUs that didn't exist when their tuning settings were being decided.
gcc/clang's x86intrin.h and MSVC's intrin.h are only needed if you want some extra non-SIMD intrinsics, like MSVC's _BitScanReverse(), that aren't always portable across compilers. That means things like integer rotate and bit-scan intrinsics, which are baseline (unlike BMI1 lzcnt/tzcnt or BMI2 rorx) but hard or impossible to express in C in a way that compilers will recognize and turn back into a single instruction.
Intel documents some of those as being available in immintrin.h in their intrinsics guide, but gcc/clang and MSVC actually have them in their x86intrin.h or intrin.h headers, respectively.
See How to get the CPU cycle count in x86_64 from C++? for an example of using #ifdef _MSC_VER to choose the right header to define uint64_t __rdtsc(void) and __rdtscp().
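As a rough sketch of that header selection (not the code from the linked answer): the function name read_timestamp and the HAVE_RDTSC macro are invented here, and the std::chrono branch is only a placeholder so the file still compiles on non-x86 targets. It does not return a cycle count.

```cpp
#include <chrono>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>     // MSVC declares __rdtsc here
#define HAVE_RDTSC 1    // hypothetical macro, local to this sketch
#elif defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // gcc/clang declare __rdtsc here
#define HAVE_RDTSC 1
#endif

// Hypothetical helper: read the time-stamp counter where it is available.
std::uint64_t read_timestamp() {
#if defined(HAVE_RDTSC)
    return __rdtsc();
#else
    // Placeholder for non-x86 targets: steady-clock ticks, NOT CPU cycles.
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}
```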
I just found out that _mm_broadcastsd_pd, which is listed in the Intel intrinsics guide (link), is not implemented in GCC's avx2intrin.h. I tested a small example on Godbolt with the latest GCC version and it won't compile (Example GCC). Clang does compile it (Example Clang). It's the same on my computer (GCC 8.3).
Should I file a bug report or is there any particular reason why it is not included? I mean, sure, _mm_movedup_pd does exactly the same thing and clang actually generates the same assembly for both intrinsics, but I think that shouldn't be a reason to exclude it.
Greetings
Edit
Created a bug report: link
Not all compilers have all aliases for an intrinsic (different names for the same thing). Other than trying them on Godbolt, IDK how to find out which ones are portable across current versions of the major 4 compilers.
But yes, GCC/clang do accept bugs about missing _mm intrinsics, especially ones that Intel documents.
_mm_broadcastsd_pd is documented by Intel as being an intrinsic for movddup so you're not missing out on anything. More importantly, it's a bit misleading because there is no vbroadcastsd xmm, xmm, only with a YMM or ZMM destination. (_mm256_broadcast_sd(double *a); and _mm256_broadcastsd_pd(__m128d a);)
The asm reference manual doesn't even document _mm_broadcastsd_pd in the vbroadcast or the movddup entry; it's only in the intrinsics guide.
GCC would probably want to add this, especially since clang has it. Having _mm_broadcastsd_pd as an alias would be useful for people that are looking for it and don't know the asm well enough to know that they need a movddup. (Or with AVX 3-operand instructions, movlhps or unpcklpd same,same)
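Until the alias exists everywhere, a wrapper along these lines should be portable. This is a sketch under assumptions: broadcast_low and HAVE_SSE2 are made-up names, and the scalar branch is there only so it builds on non-x86 targets. It uses _mm_movedup_pd when SSE3 is enabled and falls back to the unpcklpd same,same form otherwise.

```cpp
#include <cassert>

#if defined(__SSE2__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_SSE2 1  // hypothetical macro, local to this sketch
#endif

// Hypothetical helper: out[0] = out[1] = in[0], i.e. broadcast the low double.
void broadcast_low(const double* in, double* out) {
    assert(in != nullptr && out != nullptr);
#if defined(HAVE_SSE2)
    const __m128d a = _mm_loadu_pd(in);
#if defined(__SSE3__)
    const __m128d r = _mm_movedup_pd(a);      // movddup
#else
    const __m128d r = _mm_unpacklo_pd(a, a);  // unpcklpd same,same
#endif
    _mm_storeu_pd(out, r);
#else
    // Scalar fallback for targets without SSE2.
    const double low = in[0];
    out[0] = low;
    out[1] = low;
#endif
}
```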
I am trying to utilise some AVX intrinsics in my code and have run into a brick wall with the logarithm intrinsics.
Using the Intel Intrinsics Guide v3.0.1 for Linux, I see the intrinsic _mm256_log_ps(__m256) listed as being part of "immintrin.h" and also supported on my current arch.
However, trying to compile this simple test case fails with "error: ‘_mm256_log_ps’ was not declared in this scope".
The example was compiled with g++-4.8 -march=native -mavx test.cpp
#include <immintrin.h>
int main()
{
    __m256 i;
    _mm256_log_ps(i);
}
Am I missing something fundamental here? Are certain intrinsics not supported by g++ and only available in icc?
SOLVED: This instruction is not a true intrinsic but instead implemented as part of the Intel SVML for ICC.
As indicated in the comments to your question, that intrinsic doesn't map to an actual AVX instruction; it is an Intel extension to the intrinsic set. The implementation likely uses many underlying instructions, as a logarithm isn't a trivial operation.
If you'd like to use a non-Intel compiler but want a fast logarithm implementation, you might check out this open-source implementation of sin(), cos(), exp(), and log() functions using AVX. They are based on an earlier SSE2 version of the same functions.
I've posted my implementation of _mm256_log_pd(__m256d) here: https://stackoverflow.com/a/45898937/1915854 . With some effort you should be able to extend it to 8 packed floats instead of 4 doubles, though you need to revise the bit manipulations. And some parts are easier because you don't need to repack odd-/even-numbered 32-bit components of __m256i into __m128i.
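If speed matters less than getting something that compiles with g++, the simplest stand-in is to spill the vector to memory and call the scalar log on each lane. The sketch below is only that: log8 and log_ps_fallback are hypothetical names, and this is neither SVML nor the implementations linked above, and it will be much slower than a real vectorized log.

```cpp
#include <cassert>
#include <cmath>

#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: natural log of 8 floats, one lane at a time.
// `in` and `out` may point to the same array.
void log8(const float* in, float* out) {
    assert(in != nullptr && out != nullptr);
    for (int i = 0; i < 8; ++i) {
        out[i] = std::log(in[i]);
    }
}

#if defined(__AVX__)
// Hypothetical stand-in for _mm256_log_ps when SVML is not available.
// Only compiled when AVX is enabled (e.g. -mavx).
inline __m256 log_ps_fallback(__m256 x) {
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, x);     // spill the 8 lanes to memory
    log8(lanes, lanes);            // scalar log per lane, in place
    return _mm256_load_ps(lanes);  // reload the results as a vector
}
#endif
```

In the test case above, the call would become log_ps_fallback(i) once i has been given a value.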
I have a problem with SSE on an Ubuntu Linux system.
I took the example source code from MSDN (SSE4)
and tried to use the SSE4.1 operation on Linux:
gcc -o test test.c -msse4.1
Then I get this error message:
error: request for member 'm128i_u16' in something not a structure or union
How can I use this example code?
Or is there any other example code I can use?
The title of the code sample is "Microsoft Specific". This means that those functions are specific to the Microsoft implementation of C++, and aren't cross-platform. Here are some Intel-specific guides to SSE instructions. Here is gcc documentation concerning command-line flags for specific assembly optimizations, including SSE. Good luck, SSE can get a bit hairy.
This is not so much about Microsoft-specific intrinsic functions, it is about the datatype. The actual intrinsics are 100% identical in both compilers, and are de facto standard (stemming from Intel).
The problem you are facing is that the __m128i type is -- as a convenience feature -- a union under MSVC, which includes fields such as m128i_u16. The code sample you link to assumes this.
Under gcc, __m128i is not a union and therefore, unsurprisingly, does not have these fields. This is not really a downside, because accessing fields in a union like this annihilates any gains you might have from using SSE in the first place, so other than in demo snippets like the above, you will (almost) never want to use such a thing.
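For demo code that does want to look at individual lanes, a way that works in both compilers is to store the register to an ordinary array and index that. A minimal sketch, assuming SSE2: the names U16x8, add_u16 and HAVE_SSE2 are invented for the example, and the scalar branch exists only so the snippet compiles on non-x86 targets.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_SSE2 1  // hypothetical macro, local to this sketch
#endif

using U16x8 = std::array<std::uint16_t, 8>;  // hypothetical alias

// Hypothetical helper: lane-wise 16-bit add, returned as a plain array.
U16x8 add_u16(const U16x8& a, const U16x8& b) {
    U16x8 out{};
#if defined(HAVE_SSE2)
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.data()));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.data()));
    const __m128i sum = _mm_add_epi16(va, vb);
    // Instead of MSVC's sum.m128i_u16[i], store the register to memory
    // and read the lanes from the array.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), sum);
#else
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<std::uint16_t>(a[i] + b[i]);  // wraps like paddw
    }
#endif
    return out;
}
```

After the call, out[3] plays the role that result.m128i_u16[3] plays in the MSDN sample.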
I'm compiling an iOS project that uses an OpenCV framework, so I'd like to know what the best compiler flags for my project are.
The project processes a lot of pixel matrices, so I need the compiler to emit SIMD instructions to process them as efficiently as possible.
I'm using these flags: -mfpu=neon, -mfloat-abi=softfp and -O3.
I also found these other flags:
-mno-thumb
-mfpu=maverick
-ftree-vectorize
-DNS_BLOCK_ASSERTIONS=1
I don't really know whether they will save me much CPU time. I searched Google but didn't find anything that gives good reasons for choosing the best compiler flags.
Thanks
I am also using the same flags that you use for NEON. The optimization level (-O3 or any other) does not optimize NEON intrinsic code; it only optimizes the ARM code.
As Vasile said, the best performance can be gained by writing the NEON code in assembly.
The easiest way is to write a program that uses NEON intrinsics and compile it with the flags you mentioned. Then use the generated assembly as the starting point for further optimization.
A lot of optimization can be done by parallelizing or by making use of NEON's dual-issue capabilities.
The problem is that compilers are not very good at generating vectorized code. So just enabling NEON won't give you much improvement (maybe 10%?).
What you can do is profile your app and hand-write the parts that eat your time using NEON. And if you do, why not contribute them to the public OpenCV source?
For now, OpenCV has little to no code optimized for NEON (it is much better optimized for x86 SSE2).
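To give an idea of what hand-writing a hot loop with NEON looks like, here is a small sketch of a saturating add over two rows of 8-bit pixels. The function name add_saturate_u8 is made up and this is not OpenCV code. The NEON path is compiled only when the compiler defines __ARM_NEON (for example with -mfpu=neon on 32-bit ARM); the scalar loop handles the tail and any other target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: out[i] = min(255, a[i] + b[i]) for n pixels.
void add_saturate_u8(const std::uint8_t* a, const std::uint8_t* b,
                     std::uint8_t* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // Process 16 pixels per iteration with one saturating vector add.
    for (; i + 16 <= n; i += 16) {
        const uint8x16_t va = vld1q_u8(a + i);
        const uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vqaddq_u8(va, vb));
    }
#endif
    // Scalar tail (and the whole loop on targets without NEON).
    for (; i < n; ++i) {
        const unsigned sum = static_cast<unsigned>(a[i]) + b[i];
        out[i] = static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
}
```

Profiling first, as suggested above, tells you which loops are worth this treatment.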