MSVC's intrinsics __emulu and _umul128 in GCC/Clang - c++

In MSVC there exist the intrinsics __emulu() and _umul128(). The first does a u32*u32->u64 multiplication and the second a u64*u64->u128 multiplication.
Do the same intrinsics exist for Clang/GCC?
The closest I found are _mulx_u32() and _mulx_u64(), mentioned in Intel's Intrinsics Guide. But they produce the mulx instruction, which needs BMI2 support, while MSVC's intrinsics produce a regular mul instruction. Also, _mulx_u32() is not available in -m64 mode, while __emulu() and _umul128() both exist in 32-bit and 64-bit mode of MSVC.
You may try online 32-bit code and 64-bit code.
Of course, for 32-bit one may write return uint64_t(a) * uint64_t(b); (see it online), hoping that the compiler will figure it out and optimize to a u32*u32->u64 multiplication instead of u64*u64->u64. But is there a way to be sure about this, and not rely on the compiler guessing that both arguments are really 32-bit (i.e. that the high half of each uint64_t is zero)? Is there some intrinsic like __emulu() that makes this explicit?
There is __int128 in GCC/Clang (see code online), but again we have to rely on the compiler recognizing that we are actually multiplying 64-bit numbers (i.e. that the high half of each __int128 is zero). Is there a way to be sure without the compiler guessing? Do intrinsics exist for that?
BTW, both uint64_t (for 32-bit) and __int128 (for 64-bit) produce the regular mul instruction instead of mulx in GCC/Clang. But again we have to rely on the compiler recognizing that the high halves of the uint64_t or __int128 values are zero.
Of course I can inspect the assembly GCC/Clang produced and check that they optimized correctly, but looking at the assembly once doesn't guarantee the same will happen in all circumstances, and I don't know of a way in C++ to statically assert that the compiler emitted the expected instructions.

You already have the answer: use uint64_t and __uint128_t. No intrinsics needed. __uint128_t is available with modern GCC and Clang for all 64-bit targets. See Is there a 128 bit integer in gcc?
#include <stdint.h>

typedef __uint128_t uint128_t;

// 32*32 = 64 multiplication
uint64_t f(uint32_t a, uint32_t b) {
    uint64_t ab = (uint64_t)a * b;
    return ab;
}

// 64*64 = 128 multiplication
uint128_t f(uint64_t a, uint64_t b) {
    uint128_t ab = (uint128_t)a * b;
    return ab;
}
Note that the cast must be on the operands, or at least on one operand. Casting the result would not work since it would do the multiplication with the shorter type and extend the result.
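For example, with values chosen so the difference shows (a quick illustration, not from the original answer):

uint32_t a = 0x80000000u, b = 2;
uint64_t wrong = (uint64_t)(a * b); // multiply done in 32 bits, wraps to 0, then widened
uint64_t right = (uint64_t)a * b;   // one operand widened first: 0x100000000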
But is there a way to be sure about this? Not to rely on the compiler's guess?
You get exactly the same guarantee as with compiler intrinsics: that the value of the result is correct. There are never any guarantees about optimization. Just because you used intrinsics doesn't guarantee that the compiler will emit the “obvious” assembly instruction. The only way to have this guarantee is to use inline assembly, and for a simple operation like this it's likely to hurt performance because it would restrict the ways in which the compiler optimizes register usage.
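If you want one spelling that compiles under both MSVC and GCC/Clang, a thin wrapper is enough. A minimal sketch for 64-bit targets (the name mul_u64x64 is made up here; MSVC's _umul128 comes from <intrin.h>):

#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// 64*64 -> 128 multiply: returns the low 64 bits, stores the high 64 bits in *hi.
static inline uint64_t mul_u64x64(uint64_t a, uint64_t b, uint64_t* hi) {
#ifdef _MSC_VER
    return _umul128(a, b, hi);            // MSVC intrinsic, emits a plain mul
#else
    __uint128_t ab = (__uint128_t)a * b;  // GCC/Clang: widen one operand and multiply
    *hi = (uint64_t)(ab >> 64);
    return (uint64_t)ab;
#endif
}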

Related

Compiler optimizations may cause integer overflow. Is that okay?

I have an int x. For simplicity, say ints occupy the range -2^31 to 2^31-1. I want to compute 2*x-1. I allow x to be any value 0 <= x <= 2^30. If I compute 2*(2^30), I get 2^31, which is an integer overflow.
One solution is to compute 2*(x-1)+1. There's one more subtraction than I want, but this shouldn't overflow. However, the compiler will optimize this to 2*x-1. Is this a problem for the source code? Is this a problem for the executable?
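The functions being compared were presumably along these lines (reconstructed here for context):

int func(int x) {
    return 2*x - 1;
    // second version, compiled separately: return 2*(x - 1) + 1;
}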
Here is the godbolt output for 2*x-1:
func(int): # #func(int)
lea eax, [rdi + rdi]
dec eax
ret
Here is the godbolt output for 2*(x-1)+1:
func(int): # #func(int)
lea eax, [rdi + rdi]
dec eax
ret
As Miles hinted: the C++ source is bound by the rules of the C++ language (integer overflow = bad), but the compiler is only bound by the rules of the CPU (overflow = ok). It is allowed to make optimizations that the source code isn't allowed to make.
But don't take this as an excuse to get lazy. If you write undefined behavior, the compiler will exploit that and perform other optimizations that can leave your program doing the wrong thing.
Just because signed integer overflow isn't well-defined at the C++ language level doesn't mean that's the case at the assembly level. It's up to the compiler to emit assembly code that is well-defined on the CPU architecture you're targeting.
I'm pretty sure every CPU made in this century has used two's complement signed integers, and overflow is perfectly well defined for those. That means there is no problem simply calculating 2*x, letting the result overflow, then subtracting 1 and letting the result underflow back around.
Many such C++ language-level rules exist to paper over different CPU architectures. In this case, signed integer overflow was made undefined so that compilers targeting CPUs that use e.g. one's complement or sign/magnitude representations of signed integers aren't forced to add extra instructions to conform to the overflow behavior of two's complement.
Don't assume, however, that you can use a construct that is well-defined on your target CPU but undefined in C++ and get the answer you expect. C++ compilers assume undefined behavior cannot happen when performing optimization, and so they can and will emit different code from what you were expecting if your code isn't well-defined C++.
The ISO C++ rules apply to your source code (always, regardless of the target machine). Not to the asm the compiler chooses to make, especially for targets where signed integer wrapping just works.
The "as if" rules requires that the asm implementation of the function produce the same result as the C++ abstract machine, for every input value where the abstract machine doesn't encounter signed integer overflow (or other undefined behaviour). It doesn't matter how the asm produces those results, that's the entire point of the as-if rule. In some cases, like yours, the most efficient implementation would wrap and unwrap for some values that the abstract machine wouldn't. (Or in general, not wrap where the abstract machine does for unsigned or gcc -fwrapv.)
One effect of signed integer overflow being UB in the C++ abstract machine is that it lets the compiler optimize an int loop counter to pointer width, not redoing sign-extension every time through the loop or things like that. Also, compilers can infer value-range restrictions. But that's totally separate from how they implement the logic into asm for some target machine. UB doesn't mean "required to fail", in fact just the opposite, unless you compile with -fsanitize=undefined. It's extra freedom for the optimizer to make asm that doesn't match the source if you interpreted the source with more guarantees than ISO C++ actually gives (plus any guarantees the implementation makes beyond that, like if you use gcc -fwrapv.)
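A quick illustration of that extra freedom (my example, not from the original answer): because the optimizer may assume signed overflow never happens, GCC and Clang typically fold the following comparison to a constant.

bool plus_one_is_bigger(int x) {
    return x + 1 > x;   // false after wraparound, but that case is UB,
}                       // so this is typically compiled to "return true"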
For an expression like x/2, every possible int x has well-defined behaviour. For 2*x, the compiler can assume that x >= INT_MIN/2 and x <= INT_MAX/2, because larger magnitudes would involve UB.
2*(x-1)+1 implies a legal value-range for x from (INT_MIN+1)/2 to (INT_MAX+1)/2, e.g. on a 32-bit 2's complement target, -1073741823 (0xc0000001) to 1073741824 (0x40000000). On the positive side, x = 0x40000000 is fine: 2*(x-1) = 2*0x3fffffff doesn't overflow, and adding 1 doesn't overflow either because the doubled value is even (so at most INT_MAX - 1).
2*x - 1 implies a legal value-range for x from INT_MIN/2 + 1 to INT_MAX/2, e.g. on a 32-bit 2's complement target, -1073741823 (0xc0000001) to 1073741823 (0x3fffffff). So the largest value this expression can produce is INT_MAX - 2, because INT_MAX is odd.
In this case, the more complicated expression's legal value-range is a superset of the simpler expression, but in general that's not always the case.
They produce the same result for every x that's a well-defined input for both of them. And x86 asm (where wrapping is well-defined) that works like one or the other can implement either, producing correct results for all non-UB cases. So the compiler would be doing a bad job if it didn't make the same efficient asm for both.
In general, 2's complement and unsigned binary integer math is commutative and associative (for operations where that's mathematically true, like + and *), and compilers can and should take full advantage. e.g. rearranging a+b+c+d to (a+b)+(c+d) to shorten dependency chains. (See an answer on Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)? for an example of GCC doing it with integer, but not FP.)
Unfortunately, GCC has sometimes been reluctant to do signed-int optimizations like that because its internals were treating signed integer math as non-associative, perhaps because of a misguided application of C++ UB rules to optimizing asm for the target machine. That's a GCC missed optimization; Clang didn't have that problem.
Further reading:
Is there some meaningful statistical data to justify keeping signed integer arithmetic overflow undefined? re: some useful loop optimizations it allows.
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
Does undefined behavior apply to asm code? (no)
Is integer overflow undefined in inline x86 assembly?
The whole situation is basically a mess, and the designers of C didn't anticipate the current sophistication of optimizing compilers. Languages like Rust are better suited to it: if you want wrapping, you can (and must) tell the compiler about it on a per-operation basis, for both signed and unsigned types. Like x.wrapping_add(1).
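For comparison, a sketch of the closest C++ idiom for opting into wrapping per operation: do the math in an unsigned type, where wraparound is well defined, and convert back. (Since C++20 the conversion back to int is defined as modulo 2^N; before that it was implementation-defined, but two's complement in practice.)

int mul2_minus1_wrapping(int x) {
    return (int)(2u * (unsigned)x - 1u);   // unsigned arithmetic wraps; no UB for any x
}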
Re: why does clang split up the 2*x and the -1 with lea/dec
Clang is optimizing for latency on Intel CPUs before Ice Lake, saving one cycle of latency at the cost of an extra uop of throughput. (Compilers often favour latency since modern CPUs are often wide enough to chew through the throughput costs, although it does eat up space in the out-of-order exec window for hiding cache miss latency.)
lea eax, [rdi + rdi - 1] has 3 cycle latency on Skylake, vs. 1 for the LEA it used. (See Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? for some details.) On AMD Zen family, it's break-even for latency (a complex LEA only has 2c latency) while still costing an extra uop. On Ice Lake and later Intel, even a 3-component LEA is still only 1 cycle, so splitting it is pure downside there. See https://uops.info/, the entry for LEA_B_I_D8 (R32) (Base, Index, 8-bit displacement, with scale-factor = 1).
This tuning decision is unrelated to integer overflow.
Signed integer overflow/underflow is undefined behavior precisely so that compilers may make optimizations such as this. Because the compiler is allowed to do anything in the case of overflow/underflow, it can do this, or whatever else is more optimal for the use cases it is required to care about.
If the behavior on signed overflow had been specified as “What the DEC PDP-8 did back in 1973,” compilers for other targets would need to insert instructions to check for overflow and, if it occurs, produce that result instead of whatever the CPU does natively.

Is using C++20's std::popcount with vector optimization equivalent to the popcnt intrinsic?

C++20 introduces many new functions such as std::popcount. I currently get the same functionality using an Intel intrinsic.
I compiled both options (can be seen in the Compiler Explorer code):
Using Intel's AVX2 intrinsic
Using std::popcount and the GCC compiler flag "-mavx2"
It looks like the generated assembly code is the same, aside from the type checks used in the std template.
In terms of OS-agnostic code with the same optimizations:
Is it right to assume that using std::popcount with the appropriate compiler optimization flags is better than directly using intrinsics?
Thanks.
Technically, no. (But practically, yes.) The C++ standard only specifies the behavior of popcount, not the implementation (refer to [bit.count]).
Implementors are allowed to do whatever they want to achieve this behavior, including using the popcnt intrinsic, but they could also write a while loop:
// A conforming (if slow) implementation, shown here as a complete function:
int popcount_fallback(unsigned x)
{
    int set_bits = 0;
    while (x)
    {
        if (x & 1)
            ++set_bits;
        x >>= 1;
    }
    return set_bits;
}
This is the entire wording in the standard at [bit.count]:
template<class T>
constexpr int popcount(T x) noexcept;
Constraints: T is an unsigned integer type ([basic.fundamental]).
Returns: The number of 1 bits in the value of x.
Realistically? Compiler writers are very smart and will optimize this to use intrinsics as much as possible. For example, gcc's implementation appears to be fairly heavily optimized.
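For reference, a side-by-side along the lines of the linked Compiler Explorer example might look like this (function names are mine). Built with e.g. -std=c++20 -march=haswell, both typically compile to a single popcnt instruction:

#include <bit>
#include <cstdint>
#include <immintrin.h>   // _mm_popcnt_u64

int count_std(std::uint64_t x)       { return std::popcount(x); }        // portable C++20
int count_intrinsic(std::uint64_t x) { return (int)_mm_popcnt_u64(x); }  // x86-specific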

How to emulate _mm256_loadu_epi32 with gcc or clang?

Intel's intrinsic guide lists the intrinsic _mm256_loadu_epi32:
__m256i _mm256_loadu_epi32 (void const* mem_addr);
/*
Instruction: vmovdqu32 ymm, m256
CPUID Flags: AVX512VL + AVX512F
Description
Load 256-bits (composed of 8 packed 32-bit integers) from memory into dst.
mem_addr does not need to be aligned on any particular boundary.
Operation
a[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
*/
But clang and gcc do not provide this intrinsic. Instead they provide (in file avx512vlintrin.h) only the masked versions
_mm256_mask_loadu_epi32 (__m256i, __mmask8, void const *);
_mm256_maskz_loadu_epi32 (__mmask8, void const *);
which boil down to the same instruction vmovdqu32. My question: how can I emulate _mm256_loadu_epi32:
inline __m256i _mm256_loadu_epi32(void const* mem_addr)
{
/* code using vmovdqu32 and compiles with gcc */
}
without writing assembly, i.e. using only intrinsics available?
Just use _mm256_loadu_si256 like a normal person. The only thing the AVX512 intrinsic gives you is a nicer prototype (const void* instead of const __m256i*) so you don't have to write ugly casts.
@chtz points out that you might still want to write a wrapper function yourself to get the void* prototype. But don't call it _mm256_loadu_epi32; some future GCC version will probably add that for compat with Intel's docs and break your code.
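For example, a wrapper like the following (the name loadu_256 is arbitrary) gives you the convenient void* prototype without stepping on the reserved name:

#include <immintrin.h>

// Unaligned 256-bit load with a void* prototype; just a rename of _mm256_loadu_si256.
// Compile with AVX enabled (e.g. -mavx or a suitable -march=).
static inline __m256i loadu_256(void const* mem_addr) {
    return _mm256_loadu_si256((const __m256i*)mem_addr);
}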
From another perspective, it's unfortunate that compilers don't treat it as an AVX1 intrinsic, but I guess compilers which don't optimize intrinsics, and which let you use intrinsics from ISA extensions you haven't enabled, need this kind of clue to know when they can use ymm16-31.
You don't even want the compiler to emit vmovdqu32 ymm when you're not masking; vmovdqu ymm is shorter and does exactly the same thing, with no penalty for mixing with EVEX-encoded instructions. The compiler can always use a vmovdqu32 or 64 if it wants to load into ymm16..31, otherwise you want it to use a shorter VEX-coded AVX1 vmovdqu.
I'm pretty sure that GCC treats _mm256_maskz_loadu_epi32(0xffu, ptr) exactly the same as _mm256_loadu_si256((const __m256i*)ptr) and makes the same asm regardless of which one you use. It can optimize away the 0xffu mask and simply use an unmasked load, but there's no need for that extra complication in your source.
But unfortunately GCC9 and earlier will pessimize to vmovdqu32 ymm0, [mem] when AVX512VL is enabled (e.g. -march=skylake-avx512) even when you write _mm256_loadu_si256. This was a missed-optimization, GCC Bug 89346.
It doesn't matter which 256-bit load intrinsic you use (except for aligned vs. unaligned) as long as there's no masking.
Related:
error: '_mm512_loadu_epi64' was not declared in this scope
What is the difference between _mm512_load_epi32 and _mm512_load_si512?

Find position of first (lowest) set bit in 32-bit number

I need the position of the single set bit in a 32-bit number, in which there is always exactly one 1-bit. What is the fastest way in C++ or asm?
For example
input: 0x00000001, 0x10000000
output: 0, 28
#ifdef __GNUC__, use __builtin_ctz(unsigned) to Count Trailing Zeros (GCC manual). GCC, clang, and ICC all support it on all target ISAs. (On ISAs where there's no native instruction, it will call a GCC helper function.)
Leading vs. Trailing is when written in printing order, MSB-first, like 8-bit binary 00000010 has 6 leading zeros and one trailing zero. (And when cast to 32-bit binary, will have 24+6 = 30 leading zeros.)
For 64-bit integers, use __builtin_ctzll(unsigned long long). It's unfortunate that GNU C bitscan builtins don't take fixed-width types (especially the leading zeros versions), but unsigned is always 32-bit on GNU C for x86 (although not for AVR or MSP430). unsigned long long is always uint64_t on all GNU C targets I'm aware of.
On x86, it will compile to bsf or tzcnt depending on tuning + target options. tzcnt is a single uop with 3 cycle latency on modern Intel, and only 2 uops with 2 cycle latency on AMD (perhaps a bit-reverse to feed an lzcnt uop?) https://agner.org/optimize/ / https://uops.info/. Either way it's directly supported by fast hardware, and is much faster than anything you can do in pure C++. About the same cost as x * 1234567 (on Intel CPUs, bsf/tzcnt has the same cost as imul r, r, imm, in front-end uops, back-end port, and latency.)
The builtin has undefined behaviour for inputs with no bits set, allowing it to avoid any extra checks if it might run as bsf.
In other compilers (specifically MSVC), you might want an intrinsic for TZCNT, like _mm_tzcnt_32 from immintrin.h. (Intel intrinsics guide). Or you might need to include intrin.h (MSVC) or x86intrin.h for non-SIMD intrinsics.
Unlike GCC/clang, MSVC doesn't stop you from using intrinsics for ISA extensions you haven't enabled for the compiler to use on its own.
MSVC also has _BitScanForward / _BitScanReverse for actual BSF/BSR, but the leave-destination-unmodified behaviour that AMD guarantees (and Intel also implements) is still not exposed by these intrinsics, despite their pointer-output API.
VS: unexpected optimization behavior with _BitScanReverse64 intrinsic - pointer-output is assumed to always be written :/
_BitScanForward _BitScanForward64 missing (VS2017) Snappy - correct headers
How to use MSVC intrinsics to get the equivalent of this GCC code?
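Putting the GNU builtin and the MSVC intrinsic together, a portable wrapper might look like the sketch below (the function name is mine; it assumes a non-zero input, matching the question's precondition):

#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// Index of the lowest set bit; x must be non-zero.
static inline int lowest_set_bit(uint32_t x) {
#ifdef _MSC_VER
    unsigned long idx;
    _BitScanForward(&idx, x);   // compiles to bsf
    return (int)idx;
#else
    return __builtin_ctz(x);    // GCC/Clang/ICC: count trailing zeros
#endif
}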
TZCNT decodes as BSF on CPUs without BMI1 because its machine-code encoding is rep bsf. They give identical results for non-zero inputs, so compilers can and do always just use tzcnt because that's much faster on AMD. (They're the same speed on Intel so no downside. And on Skylake and later, tzcnt has no false output dependency. BSF does because it leaves its output unmodified for input=0).
(The situation is less convenient for bsr vs. lzcnt: bsr returns the bit-index, lzcnt returns the leading-zero count. So for best performance on AMD, you need to know that your code will only run on CPUs supporting BMI1 / TBM so the compiler can use lzcnt)
Note that with exactly 1 bit set, scanning from either direction will find the same bit. So 31 - lzcnt = bsr is the same in this case as bsf = tzcnt. Possibly useful if porting to another ISA that only has leading-zero count and no bit-reverse instruction.
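A tiny sketch of that equivalence using the GNU builtins, valid only because exactly one bit is set:

#include <stdint.h>

unsigned bit_index_via_clz(uint32_t x) {        // precondition: x is a power of 2
    return 31u - (unsigned)__builtin_clz(x);    // equals __builtin_ctz(x) for such x
}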
Related:
Why does breaking the "output dependency" of LZCNT matter? modern compilers generally know to break the false dependency for lzcnt/tzcnt/popcnt. bsf/bsr have one too, and I think GCC is also smart about that, though ironically it might not be.
How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows? - the pseudocode is not the hardware implementation.
https://en.wikipedia.org/wiki/Find_first_set has more about bitscan functions across ISAs. Including POSIX ffs() which returns a 1-based index and has to do extra work to account for the possibility of the input being 0.
Compilers do recognize ffs() and inline it like a builtin (like they do for memcpy or sqrt), but don't always manage to optimize away all the work their canned sequence does to implement it when you actually want a 0-based index. It's especially hard to tell the compiler there's only 1 bit set.

Compiling legacy GCC code with AVX vector warnings

I've been trying to search on google but couldn't find anything useful.
typedef int64_t v4si __attribute__ ((vector_size(32)));
//warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
// so isn't AVX already automatically enabled?
// What does it mean "without AVX enabled"?
// What does it mean "changes the ABI"?
inline v4si v4si_gt0(v4si x_);
//warning: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
// So why is there a warning, and what does it mean?
// Why did only this parameter get a warning,
// while all the other v4si parameters/arguments got none?
void set_quota(v4si quota);
That's not legacy code. __attribute__ ((vector_size(32))) means a 32 byte vector, i.e. 256 bit, which (on x86) means AVX. (GNU C Vector Extensions)
AVX isn't enabled unless you use -mavx (or a -march setting that includes it). Without that, the compiler isn't allowed to generate code that uses AVX instructions, because those would trigger an illegal-instruction fault on older CPUs that don't support AVX.
So the compiler can't pass or return 256b vectors in registers, like the normal calling convention specifies. Probably it treats them the same as structs of that size passed by value.
See the ABI links in the x86 tag wiki, or the x86 Calling Conventions page on Wikipedia (mostly doesn't mention vector registers).
Since the GNU C Vector Extensions syntax isn't tied to any particular hardware, using a 32 byte vector will still compile to correct code. It will perform badly, but it will still work even if the compiler can only use SSE instructions. (Last I saw, gcc was known to do a very bad job of generating code to deal with vectors wider than the target machine supports. You'd get significantly better code for a machine with 16B vectors from using vector_size(16) manually.)
Anyway, the point is that you get a warning instead of a compiler error because __attribute__ ((vector_size(32))) doesn't imply AVX specifically, but AVX or some other 256b vector instruction set is required for it to compile to good code.
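If you do want 256-bit vectors, the fix is to actually enable AVX everywhere these vectors cross a function boundary. A minimal sketch, assuming every translation unit that passes v4si by value is built with the same flags:

// build with e.g.:  g++ -O2 -mavx2 foo.cpp
// (-mavx is enough to make the ABI warning go away; AVX2 adds the 256-bit integer
//  instructions that these int64_t lanes need to compile to single instructions.)
#include <stdint.h>

typedef int64_t v4si __attribute__ ((vector_size(32)));

v4si v4si_add(v4si a, v4si b) {
    return a + b;   // with AVX enabled, a and b are passed and returned in ymm registers
}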