I have a program that makes heavy use of the intrinsic _BitScanForward / _BitScanForward64 (aka count trailing zeros, TZCNT, CTZ).
I would like to not use the intrinsic but instead use the corresponding CPU instruction (available on Haswell and later).
When using gcc or clang (where the intrinsic is called __builtin_ctz), I can achieve this by specifying either -march=haswell or -mbmi2 as compiler flags.
The documentation of _BitScanForward only specifies which architectures the intrinsic is available on ("x86, ARM, x64, ARM64" or "x64, ARM64"), but I don't just want it to be available, I want to ensure it compiles down to the CPU instruction rather than a call to the intrinsic function. I also checked /Oi, but that doesn't explain it either.
I also searched the web, but there are curiously few matches for my question; most just explain how to use intrinsics, e.g. this question and this question.
Am I overthinking this and MSVC will create code that magically uses the CPU instruction if the CPU supports it? Are there any flags required? How can I ensure that the CPU instructions are used when available?
UPDATE
Here is what it looks like with Godbolt.
Please be nice, my assembly reading skills are pretty basic.
GCC uses tzcnt with haswell/bmi2, otherwise resorts to rep bsf.
MSVC uses bsf without rep.
I also found this useful answer, which states that:
"Using a redundant rep prefix for bsr was generally defined to be ignored [...]". I wonder whether the same is true for bsf?
It explains (as I knew) that bsf is not the same as tzcnt; however, MSVC doesn't appear to check for input == 0.
This adds the question: why does bsf work for MSVC?
UPDATE
Okay, this was easy, I actually call _BitScanForward for MSVC. Doh!
UPDATE
So I added a bit of unnecessary confusion here. Ideally I would like to use a __tzcnt intrinsic, but that doesn't exist in MSVC, so I resorted to _BitScanForward plus an extra check to account for an input of 0.
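For illustration, here is a minimal sketch of that workaround on MSVC (the wrapper name is mine; returning 64 for a zero input mirrors what tzcnt itself would do):

    #include <intrin.h>
    #include <cstdint>

    // Emulate a tzcnt-style intrinsic with _BitScanForward64 plus an explicit
    // zero check, since _BitScanForward64 leaves the index undefined for 0.
    static inline unsigned my_tzcnt64(uint64_t value) {
        unsigned long index;
        if (_BitScanForward64(&index, value))
            return static_cast<unsigned>(index);
        return 64;  // tzcnt would return the operand width for an input of 0
    }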
However, MSVC supports LZCNT, where I have a similar issue (but it is used less in my code).
A slightly updated question would be: how does MSVC deal with LZCNT (instead of TZCNT)?
Answer: see here. Specifically: "On Intel processors that don't support the lzcnt instruction, the instruction byte encoding is executed as bsr (bit scan reverse). If code portability is a concern, consider use of the _BitScanReverse intrinsic instead."
The article suggests resorting to bsr if older CPUs are a concern. To me, this implies that there is no compiler flag to control this; instead, they suggest manually identifying the CPU via __cpuid and then calling either bsr or lzcnt.
In short, MSVC has no support for different CPU architectures (beyond x86/64/ARM).
As I posted above, MSVC doesn't appear to have support for different CPU architectures (beyond x86/64/ARM).
This article says: "On Intel processors that don't support the lzcnt instruction, the instruction byte encoding is executed as bsr (bit scan reverse). If code portability is a concern, consider use of the _BitScanReverse intrinsic instead."
The article suggests resorting to bsr if older CPUs are a concern. To me, this implies that there is no compiler flag to control this; instead, they suggest manually identifying the CPU via __cpuid and then calling either bsr or lzcnt depending on the result.
UPDATE
As @dewaffled pointed out, there are indeed _tzcnt_u32 / _tzcnt_u64 in the x64 intrinsics list.
I got misled by looking at the Alphabetical listing of intrinsic functions on the left side of the pane. I wonder whether there is a distinction between "intrinsics" and "intrinsic functions", i.e. _tzcnt_u64 is an intrinsic but not an intrinsic function.
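For reference, a minimal usage sketch of those intrinsics (assuming the executable only runs on BMI1-capable CPUs, since on older CPUs the tzcnt encoding executes as bsf and the result for 0 is then undefined):

    #include <immintrin.h>
    #include <cstdint>

    // _tzcnt_u64 maps directly to the TZCNT instruction.
    uint64_t trailing_zeros(uint64_t x) {
        return _tzcnt_u64(x);  // returns 64 for x == 0 on BMI1 hardware
    }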
Related
I keep reading opinions on which header file is better to include to access Intel's intrinsics: x86intrin.h or immintrin.h.
Both seem to achieve an identical outcome, but I'm sure there must be some subtle differences with regards to code portability. Maybe one is more common, or more complete, than the other?
I couldn't find an explanation on either of them. If anyone knows why there are 2 files, and what differences they have, this would be a welcome SO answer.
Speaking of portability, for older compilers (like gcc < v4.4.0), of course things become more complex, and neither is available. One has to consider including another intrinsic header (likely emmintrin.h for SSE support).
(posting an answer here because Header files for x86 SIMD intrinsics has out of date answers that suggest including individual header files).
immintrin.h is portable across all compilers, and includes all Intel SIMD intrinsics, and some scalar extensions like _pdep_u32 that are available with -mbmi2 or a -march= that includes it. (For AMD SSE4a and XOP (Bulldozer-family only, dropped for Zen), you need to include a different header as well.)
The only reason I can think of for including <emmintrin.h> specifically would be if you're using MSVC and want to leave intrinsics undefined for ISA extensions you don't want to depend on.
GCC's model of requiring you to enable extensions before you can use intrinsics for them means the compiler does this checking for you, so you can just #include <immintrin.h> but still get an error if you try to use _mm_shuffle_epi8 (pshufb) without -mssse3.
Don't use compilers older than gcc4.4. They're obsolete and will typically generate slower code, especially for modern CPUs that didn't exist when their tuning settings were being decided.
gcc/clang's x86intrin.h vs. MSVC intrin.h are only useful if you need some extra non-SIMD intrinsics like MSVC's _BitScanReverse() that aren't always portable across compilers. Stuff like integer rotate / bit-scan intrinsics that are baseline (unlike BMI1 lzcnt/tzcnt or BMI2 rorx) but hard or impossible to express in C in a way that compilers will recognize and turn a loop back into a single instruction.
Intel documents some of those as being available in immintrin.h in their intrinsics guide, but gcc/clang and MSVC actually have them in their x86intrin.h or intrin.h headers, respectively.
See How to get the CPU cycle count in x86_64 from C++? for an example of using #ifdef _MSC_VER to choose the right header to define uint64_t __rdtsc(void) and __rdtscp().
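A rough sketch of that per-compiler header pattern, applied to the bit-scan case from this question (the wrapper name is mine; x must be non-zero):

    #ifdef _MSC_VER
      #include <intrin.h>      // MSVC: _BitScanReverse, __rdtsc, ...
    #else
      #include <x86intrin.h>   // GCC/clang: SIMD plus the non-SIMD intrinsics
    #endif
    #include <cstdint>

    // Index of the highest set bit, like the bsr instruction.
    static inline unsigned bit_scan_reverse(uint32_t x) {
    #ifdef _MSC_VER
        unsigned long idx;
        _BitScanReverse(&idx, x);
        return static_cast<unsigned>(idx);
    #else
        return 31u - static_cast<unsigned>(__builtin_clz(x));
    #endif
    }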
I'm just curious how this works in games and other software.
More precisely, I'm asking for a solution in C++.
Something like:
if AMX available -> Use AMX version of the math library
else if AVX-512 available -> Use AVX-512 version of the math library
else if AVX-256 available -> Use AVX-256 version of the math library
etc.
The basic idea I have is to compile the library into different DLLs and swap them at runtime, but that doesn't seem like the best solution to me.
For the detection part
See Are the xgetbv and CPUID checks sufficient to guarantee AVX2 support? which shows how to detect CPU and OS support for new extensions: cpuid and xgetbv, respectively.
ISA extensions that add new/wider registers that need to be saved/restored on context switch also need to be supported and enabled by the OS, not just the CPU. New instructions like AVX-512 will still fault on a CPU that supports them if the OS hasn't set a control-register bit. (Effectively promising that it knows about them and will save/restore them.) Intel designed things so the failure mode is faulting, not silent corruption of registers on CPU migration, or context switch between two programs using the extension.
Extensions that added new or wider registers are AVX, AVX-512F, and AMX. OSes need to know about them. (AMX is very new, and adds a large amount of state: 8 tile registers T0-T7 of 1KiB each. Apparently OSes need to know about AMX for power-management to work properly.)
OSes don't need to know about AVX2/FMA3 (still YMM0-15), or any of the various AVX-512 extensions which still use k0-k7 and ZMM0-31.
There's no OS-independent way to detect OS support of SSE, but fortunately it's old enough that these days you don't have to. It and SSE2 are baseline for x86-64. Everything up to SSE4.2 uses the same register state (XMM0-15) so OS support for SSE1 is sufficient for user-space to use SSE4.2. SSE1 was new in 1999, with Pentium 3.
Different compilers have different ways of doing CPUID and xgetbv detection. See does gcc's __builtin_cpu_supports check for OS support? - unfortunately no, only CPUID, at least when that was asked. I'd consider that a GCC bug, but IDK if it ever got reported or fixed.
For the optional-use part
Typically this means setting function pointers to the selected versions of some important functions. Inlining through function pointers isn't generally possible, so make sure you choose the boundaries appropriately, like an AVX-512 version of a function that includes a loop, not just a single vector.
GCC's function multi-versioning can automate that for you, transparently compiling multiple versions and hooking some function-pointer setup.
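As a rough illustration of the function-pointer approach (the function names are placeholders; __builtin_cpu_supports is GCC/clang-only and, as noted above, only checks CPUID, not OS support):

    #include <cstddef>

    // Two versions of a hot loop. Real code would put the fast path in a
    // separate translation unit compiled with -mavx2, or use intrinsics.
    static void add_arrays_scalar(float* d, const float* a, const float* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) d[i] = a[i] + b[i];
    }
    static void add_arrays_avx2(float* d, const float* a, const float* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) d[i] = a[i] + b[i];  // placeholder body
    }

    using add_fn = void (*)(float*, const float*, const float*, std::size_t);

    // Pick an implementation once, based on what the CPU reports.
    static add_fn resolve_add() {
    #if defined(__GNUC__)
        __builtin_cpu_init();                 // safe to call before the check
        if (__builtin_cpu_supports("avx2"))   // CPUID only; see the caveat above
            return add_arrays_avx2;
    #endif
        return add_arrays_scalar;             // SSE2/scalar is the x86-64 baseline
    }

    // Set once at startup, then call through the pointer.
    static const add_fn add_arrays = resolve_add();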
There have been some previous Q&As about this with different compilers, search for "CPU dispatch avx" or something like that, along with other search terms.
See The Effect of Architecture When Using SSE / AVX Intrinisics to understand the difference between GCC/clang's model for intrinsics, where you have to enable -march=skylake or whatever (or manually -mavx2) before you can use an intrinsic, vs. MSVC and classic ICC, where you can use any intrinsic anywhere, even to emit instructions the compiler wouldn't be able to auto-vectorize with. (Those compilers can't or don't optimize intrinsics much at all, perhaps because that could lead to them getting hoisted out of if(cpu) statements.)
Windows provides IsProcessorFeaturePresent but AVX support is not on the list.
For more detailed detection you need to ask the CPU directly. On x86 this means the CPUID instruction. Visual C++ provides the __cpuidex intrinsic for this. In your case, function/leaf 1 and check bit 28 in ECX. Wikipedia has a decent article but you really should download the Intel instruction set manual to use as a reference.
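A sketch of that check, combined with the xgetbv test mentioned in the answer above (MSVC intrinsics; the function name is mine):

    #include <intrin.h>
    #include <immintrin.h>

    // CPUID leaf 1: ECX bit 28 = AVX, bit 27 = OSXSAVE. XGETBV(0) then tells
    // us whether the OS actually saves the XMM (bit 1) and YMM (bit 2) state.
    static bool os_and_cpu_support_avx() {
        int regs[4];                              // EAX, EBX, ECX, EDX
        __cpuidex(regs, 1, 0);
        const bool osxsave = (regs[2] & (1 << 27)) != 0;
        const bool avx     = (regs[2] & (1 << 28)) != 0;
        if (!osxsave || !avx)
            return false;
        const unsigned long long xcr0 = _xgetbv(0);
        return (xcr0 & 0x6) == 0x6;
    }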
Just found out that _mm_broadcastsd_pd, which is listed in the Intel intrinsics guide (link), is not implemented in GCC's avx2intrin.h. I tested a small example on Godbolt with the latest GCC version and it won't compile (Example GCC). Clang does compile it (Example Clang). It's the same on my computer (GCC 8.3).
Should I file a bug report or is there any particular reason why it is not included? I mean, sure, _mm_movedup_pd does exactly the same thing and clang actually generates the same assembly for both intrinsics, but I think that shouldn't be a reason to exclude it.
Greetings
Edit
Created a bug report: link
Not all compilers have all aliases for an intrinsic (different names for the same thing). Other than trying them on Godbolt, IDK how to find out which ones are portable across current versions of the major 4 compilers.
But yes, GCC/clang do accept bugs about missing _mm intrinsics, especially ones that Intel documents.
_mm_broadcastsd_pd is documented by Intel as being an intrinsic for movddup so you're not missing out on anything. More importantly, it's a bit misleading because there is no vbroadcastsd xmm, xmm, only with a YMM or ZMM destination. (_mm256_broadcast_sd(double *a); and _mm256_broadcastsd_pd(__m128d a);)
The asm reference manual doesn't even document _mm_broadcastsd_pd in the vbroadcast or the movddup entry; it's only in the intrinsics guide.
GCC would probably want to add this, especially since clang has it. Having _mm_broadcastsd_pd as an alias would be useful for people that are looking for it and don't know the asm well enough to know that they need a movddup. (Or with AVX 3-operand instructions, movlhps or unpcklpd same,same)
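If you just need the operation today, here is a sketch using the alias that all three mainstream compilers do provide (requires SSE3; compiles to movddup):

    #include <immintrin.h>

    // Duplicate the low double of v into both lanes, the same operation
    // that _mm_broadcastsd_pd documents.
    __m128d broadcast_low_double(__m128d v) {
        return _mm_movedup_pd(v);
    }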
I want know the inner workings of "__builtin_popcount".
As far as I understand, it works differently on different CPUs.
Similar to many other built-ins, it translates into a specific CPU instruction if one is available on the target CPU, thus considerably speeding up the application.
For example, on x86_64 it translates to the popcnt instruction (popcntl in AT&T syntax).
Additional information can be found on GCC page: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
It is also worth noting that the actual speedup will only be seen if gcc is run with a -march flag targeting an architecture that supports this instruction, or with the flag that specifically enables it, -mpopcnt. Without either of those, gcc falls back to generic bit counting via bit operations.
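A minimal way to see that difference yourself (the compile commands in the comment are just one way to inspect the generated assembly):

    // g++ -O2 -S popcnt.cpp           -> typically a libgcc call or an inline
    //                                    bit-twiddling sequence
    // g++ -O2 -mpopcnt -S popcnt.cpp  -> a single popcnt instruction
    int bits_set(unsigned x) {
        return __builtin_popcount(x);
    }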
Are the following functions executed in a single clock cycle?
__builtin_popcount
__builtin_ctz
__builtin_clz
Also, what is the number of clock cycles for the ll (64-bit) versions of the same?
Are they portable? Why or why not?
Do these functions execute in a single clock-cycle?
Not necessarily. On architectures where they can be implemented with a single instruction, they will typically be the fastest way to compute that function (but still not necessarily a single clock cycle). On architectures where they cannot be implemented as a single instruction, their performance is less certain.
On my processor (a Core 2 Duo), __builtin_ctz and __builtin_clz can be implemented with a single instruction (Bit Scan Forward and Bit Scan Reverse). However, __builtin_popcount cannot be implemented with a single instruction on my processor. For __builtin_popcount, gcc 4.7.2 calls a library function, while clang 3.1 generates an inline instruction sequence (implementing this bit twiddling hack). Clearly, the performance of those two implementations will not be the same.
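For reference, a sketch of that kind of bit-twiddling fallback (the classic SWAR popcount, not necessarily the exact sequence either compiler emits):

    #include <cstdint>

    // Sum bits in parallel: 2-bit sums, then 4-bit, then 8-bit, then add the
    // four byte sums together with one multiply.
    unsigned popcount32(uint32_t x) {
        x = x - ((x >> 1) & 0x55555555u);
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
        x = (x + (x >> 4)) & 0x0F0F0F0Fu;
        return (x * 0x01010101u) >> 24;
    }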
Are they portable?
They are not portable across compilers. They originated with GCC (as far as I know), and are also implemented in some other compilers such as Clang.
Compilers that do support these functions may provide them for multiple architectures, but implementation quality (performance) is likely to vary.
__builtin functions like this are used to access specific machine instructions in a somewhat easier way than using inline assembly. If you need to achieve the highest performance and are willing to sacrifice portability to do so or to provide an alternate implementation for compilers or platforms where these functions are not provided, then it makes sense to use them. If optimal low level performance is your goal you should also check the assembly output of the compiler, to determine whether it really is generating the instruction that you expect it to use.
You can get a first idea of what your compiler does with it by compiling with -O3 -march=native -S to get assembler output. There you can check whether it resolves to just one assembler instruction. Even if it does, that is no guarantee that it executes in one cycle. To know the real cost, you'd have to measure.