I just found out that _mm_broadcastsd_pd, which is listed in the Intel Intrinsics Guide (link), is not implemented in GCC's avx2intrin.h. I tested a small example on Godbolt with the latest GCC version and it won't compile (Example GCC). Clang compiles it (Example Clang). It's the same on my computer (GCC 8.3).
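For reference, my test case is essentially this (a minimal sketch; the function name is mine, not from the Godbolt link):

#include <immintrin.h>

// Build with e.g. -mavx2 -O2: Clang emits a single (v)movddup,
// while the GCC versions I tried reject _mm_broadcastsd_pd outright.
__m128d broadcast_low(__m128d v) {
    return _mm_broadcastsd_pd(v);
}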
Should I file a bug report or is there any particular reason why it is not included? I mean, sure, _mm_movedup_pd does exactly the same thing and clang actually generates the same assembly for both intrinsics, but I think that shouldn't be a reason to exclude it.
Greetings
Edit
Created a bug report: link
Not all compilers have all aliases for an intrinsic (different names for the same thing). Other than trying them on Godbolt, I don't know how to find out which ones are portable across current versions of the four major compilers.
But yes, GCC/clang do accept bugs about missing _mm intrinsics, especially ones that Intel documents.
_mm_broadcastsd_pd is documented by Intel as being an intrinsic for movddup, so you're not missing out on anything. More importantly, it's a bit misleading, because there is no vbroadcastsd xmm, xmm form; the instruction only exists with a YMM or ZMM destination. (_mm256_broadcast_sd(double *a); and _mm256_broadcastsd_pd(__m128d a);)
The asm reference manual doesn't even document _mm_broadcastsd_pd in the vbroadcast or the movddup entry; it's only in the intrinsics guide.
GCC would probably want to add this, especially since Clang has it. Having _mm_broadcastsd_pd as an alias would be useful for people who are looking for it and don't know the asm well enough to know that they need a movddup. (Or, with AVX 3-operand instructions, a movlhps or unpcklpd same,same.)
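In the meantime, a tiny wrapper over _mm_movedup_pd works in every mainstream compiler; a minimal sketch (the wrapper name is just an illustration):

#include <immintrin.h>

// _mm_movedup_pd is the documented intrinsic for movddup and is accepted
// by GCC, Clang, MSVC and ICC; this just gives it the name you were looking for.
static inline __m128d broadcastsd_pd(__m128d a) {
    return _mm_movedup_pd(a);   // duplicate the low double into both lanes
}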
Related
I have a program that makes heavy use of the intrinsics _BitScanForward / _BitScanForward64 (aka count trailing zeros, TZCNT, CTZ).
I would like the compiler to emit the corresponding CPU instruction (tzcnt, available on Haswell and later) rather than anything slower.
When using gcc or clang (where the intrinsic is called __builtin_ctz), I can achieve this by specifying either -march=haswell or -mbmi2 as compiler flags.
The documentation of _BitScanForward only specifies that the intrinsic is available on all architectures ("x86, ARM, x64, ARM64" or "x64, ARM64"), but I don't just want it to be available, I want to make sure the compiled code actually uses the CPU instruction. I also checked /Oi, but that doesn't answer the question either.
I also searched the web but there are curiously few matches for my question, most just explain how to use intrinsics, e.g. this question and this question.
Am I overthinking this and MSVC will create code that magically uses the CPU instruction if the CPU supports it? Are there any flags required? How can I ensure that the CPU instructions are used when available?
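To make it concrete, the gcc/clang side of what I compile is essentially this sketch (the function name is mine):

// With -march=haswell (or the BMI flags mentioned above) this becomes a
// single tzcnt; note that __builtin_ctz is undefined for x == 0.
unsigned ctz(unsigned x) {
    return (unsigned)__builtin_ctz(x);
}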
UPDATE
Here is what it looks like with Godbolt.
Please be nice, my assembly reading skills are pretty basic.
GCC uses tzcnt with haswell/bmi2, otherwise resorts to rep bsf.
MSVC uses bsf without rep.
I also found this useful answer, which states that:
"Using a redundant rep prefix for bsr was generally defined to be ignored [...]". I wonder whether the same is true for bsf?
It explains (as I knew) that bsf is not the same as tzcnt; however, MSVC doesn't appear to check for input == 0.
This adds the question: why does bsf work for MSVC?
UPDATE
Okay, this was easy: I actually call _BitScanForward for MSVC. Doh!
UPDATE
So I added a bit of unnecessary confusion here. Ideally I would like to use a __tzcnt intrinsic, but that doesn't exist in MSVC, so I resorted to _BitScanForward plus an extra check to account for zero input.
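For clarity, the "_BitScanForward plus an extra check" I ended up with looks roughly like this (helper name is mine):

#include <intrin.h>

static inline unsigned ctz32(unsigned long mask) {
    unsigned long index;
    if (_BitScanForward(&index, mask))   // returns 0 when mask == 0
        return (unsigned)index;
    return 32;                           // mimic tzcnt's result for a zero input
}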
However, MSVC supports LZCNT, where I have a similar issue (but it is used less in my code).
Slightly updated question would be: How does MSVC deal with LZCNT (instead of TZCNT)?
Answer: see here. Specifically: "On Intel processors that don't support the lzcnt instruction, the instruction byte encoding is executed as bsr (bit scan reverse). If code portability is a concern, consider use of the _BitScanReverse intrinsic instead."
The article suggests falling back to bsr if older CPUs are a concern. To me, this implies that there is no compiler flag to control this; instead you are expected to identify the CPU yourself (via __cpuid) and then call either bsr or lzcnt.
In short, MSVC has no support for targeting different CPU microarchitectures (beyond x86/x64/ARM).
As I posted above, MSVC doesn't appear to support targeting different CPU microarchitectures (beyond x86/x64/ARM).
This article says: "On Intel processors that don't support the lzcnt instruction, the instruction byte encoding is executed as bsr (bit scan reverse). If code portability is a concern, consider use of the _BitScanReverse intrinsic instead."
The article suggests falling back to bsr if older CPUs are a concern. To me, this implies that there is no compiler flag to control this; instead they suggest identifying the CPU manually with __cpuid and then calling either bsr or lzcnt depending on the result.
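A sketch of the kind of dispatch this implies (my reading of the docs is that LZCNT support is reported in CPUID leaf 0x80000001, ECX bit 5; verify that before relying on it):

#include <intrin.h>

static int cpu_has_lzcnt(void) {
    int regs[4];
    __cpuid(regs, 0x80000001);   // extended feature bits
    return (regs[2] >> 5) & 1;   // ECX bit 5: LZCNT/ABM
}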
UPDATE
As #dewaffled pointed out, there are indeed _tzcnt_u32 / _tzcnt_u64 in the x64 intrinsics list.
I was misled by looking at the Alphabetical listing of intrinsic functions on the left side of the pane. I wonder whether there is a distinction between "intrinsics" and "intrinsic functions", i.e. _tzcnt_u64 is an intrinsic but not an intrinsic function.
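Either way, they are straightforward to use; a minimal sketch (with gcc/clang you still need -mbmi or an appropriate -march=):

#include <immintrin.h>

unsigned long long trailing_zeros(unsigned long long x) {
    // Compiles to a single tzcnt and, unlike bsf, is defined for x == 0
    // (it returns the operand width, 64 here).
    return _tzcnt_u64(x);
}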
I keep reading opinions on which header file is better to include to access Intel's intrinsics: x86intrin.h or immintrin.h.
Both seem to achieve an identical outcome, but I'm sure there must be some subtle differences with regard to code portability. Maybe one is more common, or more complete, than the other?
I couldn't find an explanation for either of them. If anyone knows why there are two files, and what differences they have, that would be a welcome SO answer.
Speaking of portability, for older compilers (like gcc < 4.4.0) things of course become more complex, since neither header is available. One has to consider including another intrinsics header (likely emmintrin.h for SSE2 support).
(posting an answer here because Header files for x86 SIMD intrinsics has out of date answers that suggest including individual header files).
immintrin.h is portable across all compilers, and includes all Intel SIMD intrinsics, and some scalar extensions like _pdep_u32 that are available with -mbmi2 or a -march= that includes it. (For AMD SSE4a and XOP (Bulldozer-family only, dropped for Zen), you need to include a different header as well.)
The only reason I can think of for including <emmintrin.h> specifically would be if you're using MSVC and want to leave intrinsics undefined for ISA extensions you don't want to depend on.
GCC's model of requiring you to enable extensions before you can use intrinsics for them means the compiler does this checking for you, so you can just #include <immintrin.h> but still get an error if you try to use _mm_shuffle_epi8 (pshufb) without -mssse3.
Don't use compilers older than gcc 4.4. They're obsolete and will typically generate slower code, especially for modern CPUs that didn't exist when their tuning settings were decided.
gcc/clang's x86intrin.h and MSVC's intrin.h are only useful if you need some extra non-SIMD intrinsics like MSVC's _BitScanReverse() that aren't always portable across compilers: stuff like integer rotate / bit-scan intrinsics that are baseline (unlike BMI1 lzcnt/tzcnt or BMI2 rorx) but hard or impossible to express in C in a way that compilers will reliably recognize and turn back into a single instruction.
Intel documents some of those as being available in immintrin.h in their intrinsics guide, but gcc/clang and MSVC actually have them in their x86intrin.h or intrin.h headers, respectively.
See How to get the CPU cycle count in x86_64 from C++? for an example of using #ifdef _MSC_VER to choose the right header to define uint64_t __rdtsc(void) and __rdtscp().
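The whole pattern boils down to something like this sketch:

// Portable SIMD intrinsics:
#include <immintrin.h>
// Compiler-specific scalar extras (bit-scan, rotates, __rdtsc, ...):
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif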
Is it possible to use AVX intrinsics in D out of the box now? I'm using the LDC2 compiler, if that helps.
At the moment DMD has no AVX intrinsics. Considering that all D compilers use the DMD frontend, druntime, and Phobos, I would say that the only way to do what you want is to use inline assembly, as suggested by BCS.
I would advise you to check from time to time the core.simd module and see if AVX intrinsics are added.
There is inline ASM. I think DMD supports the SIMD instructions. Not sure what the story is for LDC.
With LDC, module ldc.gccbuiltins_x86 contains GCC-style builtins like __builtin_ia32_vfnmaddps256.
(there is also ldc.gccbuiltins_arm, and ldc.gccbuiltins_ppc, ...)
I have a problem with SSE on an Ubuntu Linux system.
I'm using the example source code from MSDN (SSE4)
and compiling it on Linux with SSE4.1 enabled:
gcc -o test test.c -msse4.1
Then I get this error message:
error: request for member 'm128i_u16' in something not a structure or union
How can I use this example code? Or is there other example code that I can use?
The title of the code sample is "Microsoft Specific". This means that those functions are specific to the Microsoft implementation of C++ and aren't cross-platform. Here are some Intel-specific guides to SSE instructions. Here is gcc documentation concerning command-line flags for specific assembly optimizations, including SSE. Good luck, SSE can get a bit hairy.
This is not so much about Microsoft-specific intrinsic functions; it is about the datatype. The actual intrinsics are 100% identical in both compilers and are a de facto standard (stemming from Intel).
The problem you are facing is that the __m128i type is -- as a convenience feature -- a union under MSVC, which includes fields such as m128i_u16. The code sample you link to assumes this.
Under gcc, __m128i is not a union and therefore, unsurprisingly, does not have these fields. This is not really a downside, because accessing fields in a union like this annihilates any gains you might get from using SSE in the first place, so other than in demo snippets like the one above, you will (almost) never want to do such a thing.
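If you really do need to look at individual elements in portable code, here is a sketch of the usual alternatives (function names are mine):

#include <emmintrin.h>   // SSE2

unsigned short lane3(__m128i v) {
    return (unsigned short)_mm_extract_epi16(v, 3);   // pextrw: one 16-bit lane
}

unsigned short lane_via_store(__m128i v, int i) {
    unsigned short tmp[8];
    _mm_storeu_si128((__m128i *)tmp, v);   // spill the vector to memory
    return tmp[i];                         // then index it like a normal array
}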
When I compile an application with Intel's compiler it is slower than when I compile it with GCC. The Intel compiler's output is more than 2x slower. The application contains several nested loops. Are there any differences between GCC and the Intel compiler that I am missing? Do I need to turn on some other flags to improve the Intel compiler's performance? I expected the Intel compiler to be at least as fast as GCC.
Compiler Versions:
Intel version 12.0.0 20101006
GCC version 4.4.4 20100630
The compiler flags are the same with both compilers:
-O3 -openmp -parallel -mSSE4.2 -Wall -pthread
I have no experience with the Intel compiler, so I can't answer whether you are missing some flags or not.
However, from what I recall, recent versions of gcc are generally as good at optimizing code as icc (sometimes better, sometimes worse, though most sources seem to indicate gcc is generally better), so you might have run into a situation where icc is particularly bad. Examples of what optimizations each compiler can do can be found here and here. Even if gcc is not generally better, you could simply have a case which gcc recognizes for optimization and icc doesn't. Compilers can be very picky about what they optimize and what they don't, especially regarding things like autovectorization.
If your loop is small enough, it might be worth comparing the generated assembly between gcc and icc. Also, if you show some code, or at least tell us what you are doing in your loop, we might be able to give you better guesses about what leads to this behaviour. If it's a relatively small loop, it is likely a case of icc missing one optimization (or a few, but probably not many) which either has inherently good potential (prefetching, autovectorization, unrolling, loop-invariant motion, ...) or which enables other optimizations (primarily inlining).
Note that I'm only talking about optimization potential when I compare gcc to icc. In the end icc might typically generate faster code than gcc, not so much because it does more optimizations, but because it has a faster standard library implementation and because it is smarter about where to optimize (at high optimization levels gcc gets a little bit overeager, or at least it used to, about trading code size for theoretical runtime improvements; this can actually hurt performance, e.g. when the carefully unrolled and vectorized loop is only ever executed with 3 iterations).
I normally use -inline-level=1 -inline-forceinline to make sure that functions I have explicitly declared inline actually do get inlined. Other than that, I would expect ICC performance to be at least as good as gcc's. You will need to profile your code to see where the performance difference is coming from. If this is Linux, then I recommend using Zoom, which you can get on a free 30-day evaluation.