x86 MUL Instruction from VS 2008/2010 - c++

Will modern (2008/2010) incantations of Visual Studio or Visual C++ Express produce x86 MUL instructions (unsigned multiply) in the compiled code? I cannot seem to find or contrive an example where they appear in compiled code, even when using unsigned types.
If VS does not compile using MUL, is there a rationale why?

imul (signed) and mul (unsigned) both have a one-operand form that does edx:eax = eax * src. i.e. a 32x32b => 64b full multiply (or 64x64b => 128b).
The 186 added an imul dest(reg), src(reg/mem), immediate form, and the 386 added an imul r32, r/m32 form, both of which only compute the lower half of the result. (According to NASM's appendix B; see also the x86 tag wiki.)
When multiplying two 32-bit values, the least significant 32 bits of the result are the same, whether you consider the values to be signed or unsigned. In other words, the difference between a signed and an unsigned multiply becomes apparent only if you look at the "upper" half of the result, which one-operand imul/mul puts in edx and two or three operand imul puts nowhere. Thus, the multi-operand forms of imul can be used on signed and unsigned values, and there was no need for Intel to add new forms of mul as well. (They could have made multi-operand mul a synonym for imul, but that would make disassembly output not match the source.)
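To see that identity concretely, here is a minimal check (my own illustration, not part of the original answer):
#include <cassert>
#include <cstdint>

int main() {
    int32_t  as = -2;  uint32_t au = 0xFFFFFFFEu;   // same bit pattern, signed vs. unsigned view
    int32_t  bs = 7;   uint32_t bu = 7;
    uint32_t lo_signed   = (uint32_t)(as * bs);     // -14, truncated to 32 bits
    uint32_t lo_unsigned = au * bu;                 // wraps modulo 2^32
    assert(lo_signed == lo_unsigned);               // both are 0xFFFFFFF2
}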
In C, results of arithmetic operations have the same type as the operands (after integer promotion for narrow integer types). If you multiply two int together, you get an int, not a long long: the "upper half" is not retained. Hence, the C compiler only needs what imul provides, and since imul is easier to use than mul, the C compiler uses imul to avoid needing mov instructions to get data into / out of eax.
As a second step, since C compilers use the multiple-operand form of imul a lot, Intel and AMD have invested effort in making it as fast as possible. It only writes one output register, not e/rdx:e/rax, so it was possible for CPUs to optimize it more easily than the one-operand form. This makes imul even more attractive.
The one-operand form of mul/imul is useful when implementing big-number arithmetic. In C, in 32-bit mode, you should get some mul invocations by multiplying unsigned long long values together. But, depending on the compiler and OS, those mul opcodes may be hidden in some dedicated function, so you will not necessarily see them. In 64-bit mode, a long long multiply is just a 64x64 -> 64-bit operation (not 128), so the compiler will simply use the multi-operand imul.
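For example, here is a sketch of the kind of source that tends to produce a one-operand mul in a 32-bit build (helper name mine; the exact output depends on the compiler and options):
#include <cstdint>

// 32x32 -> 64 full multiply: a natural fit for one-operand mul (edx:eax) in
// 32-bit mode; a 64-bit build will just use a single imul r64 instead.
uint64_t full_mul32(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}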

There are three different types of multiply instructions on x86. The first is MUL reg, which does an unsigned multiply of EAX by reg and puts the (64-bit) result into EDX:EAX. The second is IMUL reg, which does the same with a signed multiply. The third type is either IMUL reg1, reg2 (multiplies reg1 with reg2 and stores the 32-bit result into reg1) or IMUL reg1, reg2, imm (multiplies reg2 by imm and stores the 32-bit result into reg1).
Since in C, multiplies of two 32-bit values produce 32-bit results, compilers normally use the third type (signedness doesn't matter, the low 32 bits agree between signed and unsigned 32x32 multiplies). VC++ will generate the "long multiply" versions of MUL/IMUL if you actually use the full 64-bit results, e.g. here:
unsigned long long prod(unsigned int a, unsigned int b)
{
    return (unsigned long long) a * b;
}
The 2-operand (and 3-operand) versions of IMUL are faster than the one-operand versions simply because they don't produce a full 64-bit result. Wide multipliers are large and slow; it's much easier to build a smaller multiplier and synthesize long multiplies using microcode if necessary. Also, MUL/IMUL writes two registers, which again is usually resolved by breaking it into multiple instructions internally - it's much easier for the instruction reordering hardware to keep track of two dependent instructions that each write one register (most x86 instructions look like that internally) than it is to keep track of one instruction that writes two.

According to http://gmplib.org/~tege/x86-timing.pdf, the IMUL instruction has a lower latency and higher throughput (if I'm reading the table correctly). Perhaps VS is simply using the faster instruction (that's assuming that IMUL and MUL always produce the same output).
I don't have Visual Studio handy, so I tried to get something else with GCC. I also always get some variation of IMUL.
This:
unsigned int func(unsigned int a, unsigned int b)
{
    return a * b;
}
Assembles to this (with -O2):
_func:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl %esi, %eax
imull %edi, %eax
movzbl %al, %eax
leave
ret

My intuition tells me that the compiler chose IMUL arbitrarily (or picked whichever of the two was faster) since the low bits will be the same whether it uses unsigned MUL or signed IMUL. With the one-operand forms, a 32-bit multiply produces a 64-bit result spanning two registers, EDX:EAX. The upper half goes into EDX and is essentially ignored, since we only care about the 32-bit result in EAX. IMUL puts the signed upper half into EDX rather than the unsigned one, but again we don't care, since we're only interested in the 32-bit result.

Right after I looked at this question I found MULQ in my generated code when dividing.
The full code is turning a large binary number into chunks of a billion so that it can be easily converted to a string.
C++ code:
for_each(TempVec.rbegin(), TempVec.rend(), [&](Short & Num){
    Remainder <<= 32;
    Remainder += Num;
    Num = Remainder / 1000000000;
    Remainder %= 1000000000; // equivalent to Remainder %= DecimalConvert
});
Optimized Generated Assembly
00007FF7715B18E8 lea r9,[rsi-4]
00007FF7715B18EC mov r13,12E0BE826D694B2Fh
00007FF7715B18F6 nop word ptr [rax+rax]
00007FF7715B1900 shl r8,20h
00007FF7715B1904 mov eax,dword ptr [r9]
00007FF7715B1907 add r8,rax
00007FF7715B190A mov rax,r13
00007FF7715B190D mul rax,r8
00007FF7715B1910 mov rcx,r8
00007FF7715B1913 sub rcx,rdx
00007FF7715B1916 shr rcx,1
00007FF7715B1919 add rcx,rdx
00007FF7715B191C shr rcx,1Dh
00007FF7715B1920 imul rax,rcx,3B9ACA00h
00007FF7715B1927 sub r8,rax
00007FF7715B192A mov dword ptr [r9],ecx
00007FF7715B192D lea r9,[r9-4]
00007FF7715B1931 lea rax,[r9+4]
00007FF7715B1935 cmp rax,r14
00007FF7715B1938 jne NumToString+0D0h (07FF7715B1900h)
Notice the MUL instruction (mul rax,r8) a few lines down.
This generated code is extremely unintuitive, I know; in fact it looks nothing like the source code. But DIV is extremely slow: roughly 25 cycles for a 32-bit div, and around 75 according to this chart, on modern PCs, compared with MUL or IMUL (around 3 or 4 cycles). So it makes sense to try to get rid of DIV even if you have to add all sorts of extra instructions.
I don't fully understand the optimization here, but if you would like to see a rationale and a mathematical explanation of using compile-time computation and multiplication to divide by constants, see this paper.
This is an example of the compiler making use of the performance and capability of the full 64-by-64-bit untruncated multiply without showing the C++ coder any sign of it.
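As a rough C++ reconstruction of what that assembly is doing (my own sketch, not from the answer; it uses GCC/Clang's unsigned __int128 for the 64x64-to-128 multiply, and the constants are read off the disassembly above):
#include <cstdint>

// Sketch: divide by 1000000000 the way the compiler does above, with a full
// 64x64->128 multiply by a precomputed reciprocal, then a fix-up and a shift.
uint64_t div_by_billion(uint64_t n) {
    const uint64_t M = 0x12E0BE826D694B2Full;                    // reciprocal constant from the asm
    uint64_t hi = (uint64_t)(((unsigned __int128)n * M) >> 64);  // the MUL: high half of the product
    uint64_t q  = ((n - hi) >> 1) + hi;                          // the sub/shr/add fix-up seen in the asm
    return q >> 29;                                              // shr rcx, 1Dh
}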

As already explained, C/C++ does not do word*word to double-word operations, which is what the mul instruction is best for. But there are cases where you want word*word to double-word, so you need an extension to C/C++.
GCC, Clang, and ICC provide a builtin type __int128 which you can use to indirectly get the mul instruction.
MSVC provides the _umul128 intrinsic (since at least VS 2010), which generates the mul instruction. With this intrinsic, along with the _addcarry_u64 intrinsic, you can build your own efficient __int128 type with MSVC.
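A sketch of both routes (my own helper, assuming _umul128 from <intrin.h> on x64 MSVC and the unsigned __int128 builtin on GCC/Clang; exact codegen may vary):
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
// MSVC, x64 targets: full 64x64 -> 128 multiply via _umul128 (one mul instruction)
uint64_t mul_full(uint64_t a, uint64_t b, uint64_t *hi) {
    return _umul128(a, b, hi);   // returns the low half, stores the high half in *hi
}
#else
// GCC/Clang: the same thing through the unsigned __int128 builtin type
uint64_t mul_full(uint64_t a, uint64_t b, uint64_t *hi) {
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    return (uint64_t)p;
}
#endif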


Why does _mm_extract_ps return an int instead of a float?
What's the proper way to read a single float from an XMM register in C?
Or rather, a different way to ask it is: what's the opposite of the _mm_set_ps intrinsic?
None of the answers appear to actually answer the question: why does it return an int?
The reason is that the extractps instruction actually copies a component of the vector to a general register. It does seem pretty silly for it to return an int, but that's what's actually happening: the raw floating-point value ends up in a general register (which holds integers).
If your compiler is configured to generate SSE for all floating point operations, then the closest thing to "extracting" a value to a register would be to shuffle the value into the low component of the vector, then cast it to a scalar float. This should cause that component of the vector to remain in an SSE register:
/* returns the element at index 2 (the third component) of the vector */
float foo(__m128 b)
{
    return _mm_cvtss_f32(_mm_shuffle_ps(b, b, _MM_SHUFFLE(0, 0, 0, 2)));
}
The _mm_cvtss_f32 intrinsic is free: it does not generate any instructions, it only makes the compiler reinterpret the xmm register as a float so it can be returned as such.
The _mm_shuffle_ps gets the desired value into the lowest component. The _MM_SHUFFLE macro generates an immediate operand for the resulting shufps instruction.
The 2 in the example gets the float from bits 95:64 of the 127:0 register (the third 32-bit component from the beginning, in memory order) and places it in the 31:0 component of the register (the beginning, in memory order).
The resulting generated code will most likely return the value naturally in a register, like any other floating point value return, with no inefficient writing out to memory and reading it back.
If you're generating code that uses the x87 FPU for floating point (for normal C code that isn't SSE optimized), this would probably result in inefficient code being generated - the compiler would probably store out the component of the SSE vector then use fld to read it back into the x87 register stack. In general 64-bit platforms don't use x87 (they use SSE for all floating point, mostly scalar instructions unless the compiler is vectorizing).
I should add that I always use C++, so I'm not sure whether it is more efficient to pass __m128 by value or by pointer in C. In C++ I would use a const __m128 & and this kind of code would be in a header, so the compiler can inline.
Confusingly, int _mm_extract_ps() is not for getting a scalar float element from a vector. The intrinsic doesn't expose the memory-destination form of the instruction (which can be useful for that purpose). This is not the only case where the intrinsics can't directly express everything an instruction is useful for. :(
gcc and clang know how the asm instruction works and will use it that way for you when compiling other shuffles; type-punning the _mm_extract_ps result to float usually results in horrible asm from gcc (extractps eax, xmm0, 2 / mov [mem], eax).
The name makes sense if you think of _mm_extract_ps as extracting an IEEE 754 binary32 float bit pattern out of the FP domain of the CPU into the integer domain (as a C scalar int), instead of manipulating FP bit patterns with integer vector ops. According to my testing with gcc, clang, and icc (see below), this is the only "portable" use-case where _mm_extract_ps compiles into good asm across all compilers. Anything else is just a compiler-specific hack to get the asm you want.
The corresponding asm instruction is EXTRACTPS r/m32, xmm, imm8. Notice that the destination can be memory or an integer register, but not another XMM register. It's the FP equivalent of PEXTRD r/m32, xmm, imm8 (also in SSE4.1), where the integer-register-destination form is more obviously useful. EXTRACTPS is not the reverse of INSERTPS xmm1, xmm2/m32, imm8.
Perhaps this similarity with PEXTRD makes the internal implementation simpler without hurting the extract-to-memory use-case (for asm, not intrinsics), or maybe the SSE4.1 designers at Intel thought it was actually more useful this way than as a non-destructive FP-domain copy-and-shuffle (which x86 seriously lacks without AVX). There are FP-vector instructions that have an XMM source and a memory-or-xmm destination, like MOVSS xmm2/m32, xmm, so this kind of instruction would not be new. Fun fact: the opcodes for PEXTRD and EXTRACTPS differ only in the last bit.
In assembly, a scalar float is just the low element of an XMM register (or 4 bytes in memory). The upper elements of the XMM don't even have to be zeroed for instructions like ADDSS to work without raising any extra FP exceptions. In calling conventions that pass/return FP args in XMM registers (e.g. all the usual x86-64 ABIs), float foo(float a) must assume that the upper elements of XMM0 hold garbage on entry, but can leave garbage in the high elements of XMM0 on return. (More info).
As #doug points out, other shuffle instructions can be used to get a float element of a vector into the bottom of an xmm register. This was already a mostly-solved problem in SSE1/SSE2, and it seems EXTRACTPS and INSERTPS weren't trying to solve it for register operands.
SSE4.1 INSERTPS xmm1, xmm2/m32, imm8 is one of the best ways for compilers to implement _mm_set_ss(function_arg) when the scalar float is already in a register and they can't/don't optimize away zeroing the upper elements. (Which is most of the time for compilers other than clang). That linked question also further discusses the failure of intrinsics to expose the load or store versions of instructions like EXTRACTPS, INSERTPS, and PMOVZX that have a memory operand narrower than 128b (thus not requiring alignment even without AVX). It can be impossible to write safe code that compiles as efficiently as what you can do in asm.
Without AVX 3-operand SHUFPS, x86 doesn't provide a fully efficient and general-purpose way to copy-and-shuffle an FP vector the way integer PSHUFD can. SHUFPS is a different beast unless used in-place with src=dst. Preserving the original requires a MOVAPS, which costs a uop and latency on CPUs before IvyBridge, and always costs code-size. Using PSHUFD between FP instructions costs latency (bypass delays). (See this horizontal-sum answer for some tricks, like using SSE3 MOVSHDUP).
SSE4.1 INSERTPS can extract one element into a separate register, but AFAIK it still has a dependency on the previous value of the destination even when all the original values are replaced. False dependencies like that are bad for out-of-order execution. xor-zeroing a register as a destination for INSERTPS would still be 2 uops, and have lower latency than MOVAPS+SHUFPS on SSE4.1 CPUs without mov-elimination for zero-latency MOVAPS (only Penryn, Nehalem, Sandybridge. Also Silvermont if you include low-power CPUs). The code-size is slightly worse, though.
Using _mm_extract_ps and then type-punning the result back to float (as suggested in the currently-accepted answer and its comments) is a bad idea. It's easy for your code to compile to something horrible (like EXTRACTPS to memory and then load back into an XMM register) on either gcc or icc. Clang seems to be immune to braindead behaviour and does its usual shuffle-compiling with its own choice of shuffle instructions (including appropriate use of EXTRACTPS).
I tried these examples with gcc5.4 -O3 -msse4.1 -mtune=haswell, clang3.8.1, and icc17, on the Godbolt compiler explorer. I used C mode, not C++, but union-based type punning is allowed in GNU C++ as an extension to ISO C++. Pointer-casting for type-punning violates strict aliasing in C99 and C++, even with GNU extensions.
#include <immintrin.h>
// gcc:bad clang:good icc:good
void extr_unsafe_ptrcast(__m128 v, float *p) {
    // violates strict aliasing
    *(int*)p = _mm_extract_ps(v, 2);
}
gcc: # others extractps with a memory dest
extractps eax, xmm0, 2
mov DWORD PTR [rdi], eax
ret
// gcc:good clang:good icc:bad
void extr_pun(__m128 v, float *p) {
    // union type punning is safe in C99 (and GNU C and GNU C++)
    union floatpun { int i; float f; } fp;
    fp.i = _mm_extract_ps(v, 2);
    *p = fp.f;  // compiles to an extractps straight to memory
}
icc:
vextractps eax, xmm0, 2
mov DWORD PTR [rdi], eax
ret
// gcc:good clang:good icc:horrible
void extr_gnu(__m128 v, float *p) {
    // gcc uses extractps with a memory dest, icc does extr_store
    *p = v[2];
}
gcc/clang:
extractps DWORD PTR [rdi], xmm0, 2
icc:
vmovups XMMWORD PTR [-24+rsp], xmm0
mov eax, DWORD PTR [-16+rsp] # reload from red-zone tmp buffer
mov DWORD PTR [rdi], eax
// gcc:good clang:good icc:poor
void extr_shuf(__m128 v, float *p) {
    __m128 e2 = _mm_shuffle_ps(v,v, 2);
    *p = _mm_cvtss_f32(e2);  // gcc uses extractps
}
icc: (others: extractps right to memory)
vshufps xmm1, xmm0, xmm0, 2
vmovss DWORD PTR [rdi], xmm1
When you want the final result in an xmm register, it's up to the compiler to optimize away your extractps and do something completely different. Gcc and clang both succeed, but ICC doesn't.
// gcc:good clang:good icc:bad
float ret_pun(__m128 v) {
    union floatpun { int i; float f; } fp;
    fp.i = _mm_extract_ps(v, 2);
    return fp.f;
}
gcc:
unpckhps xmm0, xmm0
clang:
shufpd xmm0, xmm0, 1
icc17:
vextractps DWORD PTR [-8+rsp], xmm0, 2
vmovss xmm0, DWORD PTR [-8+rsp]
Note that icc did poorly for extr_pun, too, so it doesn't like union-based type-punning for this.
The clear winner here is doing the shuffle "manually" with _mm_shuffle_ps(v,v, 2), and using _mm_cvtss_f32. We got optimal code from every compiler for both register and memory destinations, except for ICC which failed to use EXTRACTPS for the memory-dest case. With AVX, SHUFPS + separate store is still only 2 uops on Intel CPUs, just larger code size and needs a tmp register. Without AVX, though, it would cost a MOVAPS to not destroy the original vector :/
According to Agner Fog's instruction tables, all Intel CPUs except Nehalem implement the register-destination versions of both PEXTRD and EXTRACTPS with multiple uops: Usually just a shuffle uop + a MOVD uop to move data from the vector domain to gp-integer. Nehalem register-destination EXTRACTPS is 1 uop for port 5, with 1+2 cycle latency (1 + bypass delay).
I have no idea why they managed to implement EXTRACTPS as a single uop but not PEXTRD (which is 2 uops, and runs in 2+1 cycle latency). Nehalem MOVD is 1 uop (and runs on any ALU port), with 1+1 cycle latency. (The +1 is for the bypass delay between vec-int and general-purpose integer regs, I think).
Nehalem cares a lot about vector FP vs. integer domains; SnB-family CPUs have smaller (sometimes zero) bypass delay latencies between domains.
The memory-dest versions of PEXTRD and EXTRACTPS are both 2 uops on Nehalem.
On Broadwell and later, memory-destination EXTRACTPS and PEXTRD are 2 uops, but on Sandybridge through Haswell, memory-destination EXTRACTPS is 3 uops. Memory-destination PEXTRD is 2 uops on everything except Sandybridge, where it's 3. This seems odd, and Agner Fog's tables do sometimes have errors, but it's possible. Micro-fusion doesn't work with some instructions on some microarchitectures.
If either instruction had turned out to be extremely useful for anything important (e.g. inside inner loops), CPU designers would build execution units that could do the whole thing as one uop (or maybe 2 for the memory-dest). But that potentially requires more bits in the internal uop format (which Sandybridge simplified).
Fun fact: _mm_extract_epi32(vec, 0) compiles (on most compilers) to movd eax, xmm0 which is shorter and faster than pextrd eax, xmm0, 0.
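A sketch of that pattern (my own example using standard SSE2 intrinsics; the exact instruction choice is up to the compiler):
#include <immintrin.h>

// Element 0 of an integer vector: most compilers emit a plain movd eax, xmm0
int first_lane(__m128i v) {
    return _mm_cvtsi128_si32(v);   // same result as _mm_extract_epi32(v, 0)
}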
Interestingly, they perform differently on Nehalem (which cares a lot about vector FP vs. integer domains, and came out soon after SSE4.1 was introduced in Penryn (45nm Core2)). EXTRACTPS with a register destination is 1 uop, with 1+2 cycle latency (the +2 from a bypass delay between the FP and integer domains). PEXTRD is 2 uops, and runs in 2+1 cycle latency.
From the MSDN docs, I believe you can cast the result to a float.
Note from their example, the 0xc0a40000 value is equivalent to -5.125 (a.m128_f32[1]).
Update: I strongly recommend the answers from #doug65536 and #PeterCordes (below) in lieu of mine, which apparently generates poorly performing code on many compilers.
Try _mm_storeu_ps, or any of the variations of SSE store operations.

Instruction/intrinsic for taking higher half of uint64_t in C++?

Imagine the following code:
Try it online!
uint64_t x = 0x81C6E3292A71F955ULL;
uint32_t y = (uint32_t) (x >> 32);
y receives the high 32-bit part of the 64-bit integer. My question is whether there exists any intrinsic function or any CPU instruction that does this in a single operation, without doing a move and a shift?
At least Clang (linked in the Try-it-online above) creates two instructions, mov rax, rdi and shr rax, 32, for this, so either Clang doesn't do such an optimization or no such special instruction exists.
It would be great if there were an imaginary single instruction like movhi dst_reg, src_reg.
If there was a better way to do this bitfield-extraction for an arbitrary uint64_t, compilers would already use it. (At least in theory; compilers do have missed optimizations, and their choices sometimes favour latency even if it costs more uops.)
You only need intrinsics for things that you can't express efficiently in pure C, in ways the compiler can already easily understand. (Or if your compiler is dumb and can't spot the obvious.)
You could maybe imagine cases where the input value comes from the multiply of two 32-bit values, then it might be worthwhile on some CPUs for the compiler to use widening mul r32 to already generate the result in two separate 32-bit registers, instead of imul r64, r64 + shr reg,32, if it can easily use EAX/EDX. But other than gcc -mtune=silvermont or other tuning options, you can't make the compiler do it that way.
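For example (a sketch of my own; whether this becomes a one-operand mul r32 or imul r64 + shr depends on the compiler and tuning options):
#include <cstdint>

// High 32 bits of a 32x32 product, written as a widening multiply so the
// compiler knows the full 64-bit value is only an intermediate.
uint32_t mulhi_u32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}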
shr reg, 32 has 1 cycle latency, and can run on more than 1 execution port on most modern x86 microarchitectures (https://uops.info/). The only thing one might wish for is that it could put the result in a different register, without overwriting the input.
Most modern non-x86 ISAs are RISC-like with 3-operand instructions, so a shift instruction can copy-and-shift, unlike x86 shifts where the compiler needs a mov in addition to shr if it also needs the original 64-bit value later, or (in the case of a tiny function) needs the return value in a different register.
And some ISAs have bitfield-extract instructions. PowerPC even has a fun rotate-and-mask instruction (rlwinm) (with the mask being a bit-range specified by immediates), and it's a different instruction from a normal shift. Compilers will use it as appropriate - no need for an intrinsic. https://devblogs.microsoft.com/oldnewthing/20180810-00/?p=99465
x86 with BMI2 has rorx rax, rdi, 32 to copy-and-rotate, instead of being stuck shifting within the same register. A function returning uint32_t could/should use that instead of mov+shr, in the stand-alone version that doesn't inline because the caller already has to ignore high garbage in RAX. (Both x86-64 System V and Windows x64 define the return value as only the register width matching the C type of the arg; e.g. returning uint32_t means that the high 32 bits of RAX are not part of the return value, and can hold anything. Usually they're zero because writing a 32-bit register implicitly zero-extends to 64, but something like return bar() where bar returns uint64_t can just leave RAX untouched without having to truncate it; in fact an optimized tailcall is possible.)
There's no intrinsic for rorx; compilers are just supposed to know when to use it. (But gcc/clang -O3 -march=haswell miss this optimization.) https://godbolt.org/z/ozjhcc8Te
If a compiler was doing this in a loop, it could have 32 in a register for shrx reg,reg,reg as a copy-and-shift. Or, more silly, it could use pext with 0xffffffffULL << 32 as the mask. But that's strictly worse than shrx because of the higher latency.
AMD TBM (Bulldozer-family only, not Zen) had an immediate form of bextr (bitfield-extract), and it ran efficiently as 1 uop (https://agner.org/optimize/). https://godbolt.org/z/bn3rfxzch shows gcc11 -O3 -march=bdver4 (Excavator) uses bextr rax, rdi, 0x2020, while clang misses that optimization. gcc -march=znver1 uses mov + shr because Zen dropped Trailing Bit Manipulation along with the XOP extension.
Standard BMI1 bextr needs position/len in a register, and on Intel CPUs is 2 uops so it's garbage for this. It does have an intrinsic, but I recommend not using it. mov+shr is faster on Intel CPUs.

Order of assignment produces different assembly

This experiment was done using GCC 6.3. There are two functions where the only difference is in the order we assign the i32 and i16 in the struct. We assumed that both functions should produce the same assembly. However this is not the case. The "bad" function produces more instructions. Can anyone explain why this happens?
#include <inttypes.h>

union pack {
    struct {
        int32_t i32;
        int16_t i16;
    };
    void *ptr;
};
static_assert(sizeof(pack) == 8, "what?");

void *bad(const int32_t i32, const int16_t i16) {
    pack p;
    p.i32 = i32;
    p.i16 = i16;
    return p.ptr;
}

void *good(const int32_t i32, const int16_t i16) {
    pack p;
    p.i16 = i16;
    p.i32 = i32;
    return p.ptr;
}
...
bad(int, short):
movzx eax, si
sal rax, 32
mov rsi, rax
mov eax, edi
or rax, rsi
ret
good(int, short):
movzx eax, si
mov edi, edi
sal rax, 32
or rax, rdi
ret
The compiler flags were -O3 -fno-rtti -std=c++14
This is/was a missed optimization in GCC10.2 and earlier. It seems to already be fixed in current nightly builds of GCC, so no need to report a missed-optimization bug on GCC's bugzilla. (https://gcc.gnu.org/bugzilla/). It looks like it first appeared as a regression from GCC4.8 to GCC4.9. (Godbolt)
# GCC11-dev nightly build
# actually *better* than "good", avoiding a mov-elimination missed opt.
bad(int, short):
movzx esi, si # mov-elimination never works for 16->32 movzx
mov eax, edi # mov-elimination works between different regs
sal rsi, 32
or rax, rsi
ret
Yes, you'd generally expect C++ that implements the same logic basically the same way to compile to the same asm, as long as optimization is enabled, or at least hope so (see footnote 1). And generally you can hope that there are no pointless missed optimizations that waste instructions for no apparent reason (rather than simply picking a different implementation strategy), but unfortunately that's not always true either.
Writing different parts of the same object and then reading the whole object is tricky for compilers in general so it's not a shock to see different asm when you write different parts of the full object in a different order.
Note that there's nothing "smart" about the bad asm, it's just doing a redundant mov instruction. Having to take input in fixed registers and produce output in another specific hard register to satisfy the calling convention is something GCC's register allocator isn't amazing at: wasted mov missed optimizations like this are more common in tiny functions than when part of a larger function.
If you're really curious, you could dig into the GIMPLE and RTL internal representations that GCC transformed through to get here. (Godbolt has a GCC tree-dump pane to help with this.)
Footnote 1: Or at least hope that, but missed-optimization bugs do happen in real life. Report them when you spot them, in case it's something that GCC or LLVM devs can easily teach the optimizer to avoid. Compilers are complex pieces of machinery with multiple passes; often a corner case for one part of the optimizer just didn't used to happen until some other optimization pass changed to doing something else, exposing a poor end result for a case the author of that code wasn't thinking about when writing / tweaking it to improve some other case.
Note that there's no Undefined Behaviour here despite the complaints in comments: The GNU dialect of C and C++ defines the behaviour of union type-punning in C89 and C++, not just in C99 and later like ISO C does. Implementations are free to define the behaviour of anything that ISO C++ leaves undefined.
Well technically there is a read-uninitialized because the upper 2 bytes of the void* object haven't been written yet in pack p. But fixing it with pack p = {.ptr=0}; doesn't help. (And doesn't change the asm; GCC happened to already zero the padding because that's convenient).
Also note, both versions in the question are less efficient than possible:
(The bad output from GCC4.8 or GCC11-trunk avoiding the wasted mov looks optimal for that choice of strategy.)
mov edi,edi defeats mov-elimination on both Intel and AMD, so that instruction has 1 cycle latency instead of 0, and costs a back-end µop. Picking a different register to zero-extend into would be cheaper. We could even pick RSI after reading SI, but any call-clobbered register would work.
hand_written:
movzx eax, si # 16->32 can't be eliminated, only 8->32 and 32->32 mov
shl rax, 32
mov ecx, edi # zero-extend into a different reg with 0 latency
or rax, rcx
ret
Or if optimizing for code-size or throughput on Intel (low µop count, not low latency), shld is an option: 1 µop / 3c latency on Intel, but 6 µops on Zen (also 3c latency, though). (https://uops.info/ and https://agner.org/optimize/)
minimal_uops_worse_latency: # also more uops on AMD.
movzx eax, si
shl rdi, 32 # int32 bits to the top of RDI
shld rax, rdi, 32 # shift the high 32 bits of RDI into RAX.
ret
If your struct was ordered the other way, with the padding in the middle, you could do something involving mov ax, si to merge into RAX. That could be efficient on non-Intel, and on Haswell and later which don't do partial-register renaming except for high-8 regs like AH.
Given the read-uninitialized UB, you could just compile it to literally anything, including ret or ud2. Or slightly less aggressive, you could compile it to just leave garbage for the padding part of the struct, the last 2 bytes.
high_garbage:
shl rsi, 32 # leaving high garbage = incoming high half of ESI
mov eax, edi # zero-extend into RAX
or rax, rsi
ret
Note that an unofficial extension to the x86-64 System V ABI (which clang actually depends on) is that narrow args are sign- or zero-extended to 32 bits. So instead of zeros, the high 2 bytes of the pointer would be copies of the sign bit. (Which would actually guarantee that it's a canonical 48-bit virtual address on x86-64!)

Optimizing XOR of `char` with the highest byte of `int`

Say we have int i and char c.
When we use i ^= c, the compiler will XOR c with the lowest byte of i and translate the code to a single processor instruction.
When we need to XOR c with the highest byte of i, we can do something like this:
i ^= c << ((sizeof(i) - sizeof(c)) * 8)
but the compiler will generate two instructions: XOR and BIT-SHIFT.
Is there a way to XOR a char with the highest byte of an int that will be translated to single processor instruction in C++?
If you are confident about the byte-order of the system, for example by checking __BYTE_ORDER__ or equivalent macro on your system, you can do something like this:
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    // Little endian: the most significant byte is at the end
    *(&reinterpret_cast<char&>(i) + sizeof i - 1) ^= c;
#else
    // Big endian: the most significant byte is at the beginning
    reinterpret_cast<char&>(i) ^= c;
#endif
Compilers are really smart about simple arithmetic and bitwise operations like this. When they don't emit a single instruction, it's because they can't: there is no such instruction on those architectures. It isn't worth wasting valuable opcode space on rarely used operations like that. Most operations are done on the whole register anyway, and working on just a part of a register is very inefficient for the CPU, because the out-of-order execution and register-renaming units need to work a lot harder. That's the reason why x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register, why modifying the low part of a register in x86 (like AL or AX) can be slower than modifying the whole RAX, and why INC can be slower than ADD 1 because of the partial flag update.
That said, there are architectures that can do a combined SHIFT and XOR in a single instruction, like ARM (eor rX, rX, rY, lsl #24), because ARM's designers spent a big part of the instruction encoding on predication and shifting, trading that off against a smaller number of registers. But again the premise is wrong: the fact that something can be executed in a single instruction doesn't mean it'll be faster. Modern CPUs are very complex, and every instruction has its own latency, throughput, and set of execution ports. For example, if a CPU can execute 4 pairs of SHIFT-then-XOR in parallel, it will obviously be faster than another CPU that runs 4 single SHIFT-XOR instructions sequentially, provided the clock speed is the same.
This is a very typical XY problem, because what you thought of is simply the wrong way to do it. For operations that need to be done thousands or millions of times or more, it's the job of the GPU or the SIMD unit.
For example, this is what the Clang compiler emits for a loop XORing the top byte of i with c on an x86 CPU with AVX-512:
vpslld zmm0, zmm0, 24 # shift
vpslld zmm1, zmm1, 24
vpslld zmm2, zmm2, 24
vpslld zmm3, zmm3, 24
vpxord zmm0, zmm0, zmmword ptr [rdi + 4*rdx] # xor
vpxord zmm1, zmm1, zmmword ptr [rdi + 4*rdx + 64]
vpxord zmm2, zmm2, zmmword ptr [rdi + 4*rdx + 128]
vpxord zmm3, zmm3, zmmword ptr [rdi + 4*rdx + 192]
By doing that, it achieves 16 SHIFT-and-XOR operations with just 2 instructions. Imagine how fast that is. You can unroll more, across the 32 zmm registers, to achieve even higher performance, until you saturate the RAM bandwidth. That's why all high-performance architectures have some kind of SIMD, which makes fast parallel operations much easier than a dedicated SHIFT-XOR instruction would. Even on ARM, which does have a single-instruction SHIFT-XOR, the compiler is smart enough to know that SIMD is faster than a series of eor rX, rX, rY, lsl #24. The output looks like this:
shl v3.4s, v3.4s, 24 # shift
shl v2.4s, v2.4s, 24
shl v1.4s, v1.4s, 24
shl v0.4s, v0.4s, 24
eor v3.16b, v3.16b, v7.16b # xor
eor v2.16b, v2.16b, v6.16b
eor v1.16b, v1.16b, v4.16b
eor v0.16b, v0.16b, v5.16b
Here's a demo for the above snippets
That'll be even faster when running in parallel on multiple cores. The GPU is also able to achieve very high levels of parallelism, hence modern cryptography and compute-intensive mathematical problems are often done on the GPU. It can break a password or encrypt a file faster than a general-purpose CPU with SIMD.
Don't assume that the compiler will generate a shift with the above code. Most modern compilers are smarter than that:
int i = 0;   // global in memory
char c = 0;

void foo ()
{
    c ^= (i >> 24);
}
Compiles (https://godbolt.org/z/b6l8qk) with GCC (trunk) -O3 to this asm for x86-64:
foo():
movsx eax, BYTE PTR i[rip+3] # sign-extending byte load
xor BYTE PTR c[rip], al # memory-destination byte xor
ret
clang, ICC, and MSVC all emit equivalent code (some using a movzx load which is more efficient on some CPUs than movsx).
This only really helps if the int is in memory, not a register, which is hopefully not the case in a tight loop unless it's different integers (like in an array).

Fastest way to compare a double to exact 0 while both +0.0 or -0.0 are accepted

So far I have the following:
bool IsZero(const double x) {
    return fabs(x) == +0.0;
}
Is this the fastest of the correct ways to compare to exactly 0, while accepting both +0.0 and -0.0?
If CPU-specific, let's consider x86-64. If compiler-specific, let's consider the MSVC++ 2017 toolset v141.
Since you said you want the fastest possible code, I'm going to make some important simplifying assumptions throughout this answer. These are legal, per the question. In particular, I'm assuming x86 and IEEE-754 representations of floating-point values. I'll also mention MSVC-specific quirks, where applicable, although the general discussion would apply to any compiler targeting this architecture.
The way you test whether a floating-point value is equal to zero is by testing all of its bits. If all of the bits are 0, then the value is zero. Actually, the value is +0.0. The sign bit can be either 0 or 1, since the representation allows such a thing as positive and negative 0.0, as you mention in the question. But this difference doesn't matter here (+0.0 and −0.0 compare equal), so what you really need is to test all bits except the sign bit.
This can be done quickly and efficiently with some bit-twiddling. The sign bit is the most significant bit of the representation, so you simply shift it out and then test the remaining bits.
This trick is described by Agner Fog in his Optimizing Subroutines in Assembly Language. Specifically, example 17.4b (on page 156 in the current version).
For a single-precision floating-point value (i.e., float), which is 32-bits wide:
mov eax, DWORD PTR [floatingPointValue]
add eax, eax ; shift out the sign bit to ignore -0.0
sete al ; set AL if the remaining bits were 0
Translating this into C code, you'd do something like:
const uint32_t bits = *(reinterpret_cast<uint32_t*>(&value));
return ((bits + bits) == 0);
Of course, this is formally unsafe because of the type punning. MSVC lets you get away with it, no problem. In fact, if you try to actually conform to the standard and play it safe, MSVC will tend to generate less efficient code, decreasing the effectiveness of this trick. If you want to do this safely, you'll need to verify the output of your compiler and make sure it's doing what you want. Some assertions are also recommended.
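If you want a version without the formal type-punning problem, a memcpy-based variant of the same trick (my sketch, not from the original answer) usually compiles to the same load/add/sete sequence, but you should still check the generated code:
#include <cstdint>
#include <cstring>

bool IsZeroFloat(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));   // well-defined way to read the bit pattern
    return (bits + bits) == 0;                  // shift out the sign bit, test the rest
}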
If you're okay with the unsafe nature of this approach, you will find that it is faster than a poorly-predicted conditional branch, so when you're dealing with random input values, it might be a performance win. For comparison purposes, here is what you'll see from MSVC if you just do a naive test for equality against 0.0:
;; assuming /arch:IA32, which is *not* the default in modern versions of MSVC
;; but necessary if you cannot assume SSE2 support
fld DWORD PTR [floatingPointValue]
fldz
fucompp
fnstsw ax
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
;; assuming /arch:SSE2, which *is* the default in modern versions of MSVC
movss xmm0, DWORD PTR [floatingPointValue]
ucomiss xmm0, DWORD PTR [constantZero]
lahf
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
Ugly, and potentially slow. There are branchless ways of doing this, but MSVC won't use them.
An obvious drawback to the "optimized" implementation described above is that it requires the floating-point value be loaded from memory in order to access its bits. There are no x87 instructions that can access the bits directly, and there's no way to go directly from an x87 register to a GP register without going through memory. Since memory access is slow, this does incur a performance penalty, but in my tests, it's still faster than a mispredicted branch.
If you're using any of the standard calling conventions on 32-bit x86 (__cdecl, __stdcall, etc.), then all floating-point values are passed and returned in the x87 registers, so there's no difference in moving from an x87 register to a GP register versus moving from an x87 register to an SSE register.
The story is a bit different if you're targeting x86-64 or if you are using __vectorcall on x86-32. Then, you actually have floating-point values stored and passed in SSE registers, so you can take advantage of branchless SSE instructions. At least, theoretically. MSVC won't, unless you hold its hand. It would normally do the same branching comparison shown above, just without the extra memory load:
;; MSVC output for a __vectorcall function, targeting x86-32 with /arch:SSE2
;; and/or for x86-64 (which always uses a vector calling convention and SSE2)
;; The floating point value being compared is passed directly in XMM0
ucomiss xmm0, DWORD PTR [constantZero]
lahf
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
I've demonstrated the compiler output for a very simple bool IsZero(float val) function, but in my observations, MSVC always emits a UCOMISS+JP sequence for this type of comparison, no matter how the comparison is incorporated into the input code. Again, fine if the zero-ness of the input is predictable, but relatively lousy if branch prediction fails.
If you want to ensure you get branchless code, avoiding the possibility of branch-misprediction stalls, then you need to use intrinsics to do the comparison. These intrinsics will force MSVC to emit code closer to what you would expect:
return (_mm_ucomieq_ss(_mm_set_ss(floatingPointValue), _mm_setzero_ps()) != 0);
Unfortunately, the output is still not perfect. You suffer from general optimization deficiencies surrounding the use of intrinsics—namely, some redundant shuffling of the input value between various SSE registers—but that is (A) unavoidable, and (B) not a measurable performance problem.
I'll note here that other compilers, like Clang and GCC, don't need their hands held. You can just do value == 0.0. The exact sequence of code that they emit varies, depending on your optimization settings, but you'll see either COMISS+SETE, UCOMISS+SETNP+CMOVNE or CMPEQSS+MOVD+NEG (the latter is used exclusively by ICC). Your attempting to hold their hands with intrinsics would almost certainly result in less efficient output, so this probably needs to be #ifdef'ed to limit it to MSVC.
That's single-precision values, which have a width of 32 bits. What about double-precision values, which are twice as long? You'd think these would have 63 bits to test (since the sign bit is still ignored), but there's a twist. If you can rule out the possibility of denormal numbers, then you can get away with testing only the upper bits (again, assuming little-endian).
Agner Fog discusses this as well (example 17.4d). If you exclude the possibility of denormal numbers, then a value of 0 corresponds to the case where the exponent bits are all 0. The upper bits are the sign bit and the exponent bits, so you can just test these exactly as you did for single-precision values:
mov eax, DWORD PTR [floatingPointValue+4] ; load upper bits only
add eax, eax ; shift out sign bit to ignore -0.0
sete al ; set AL if the remaining bits were 0
In unsafe C:
const uint64_t bits = *(reinterpret_cast<uint64_t*>(&value));
const uint32_t upperBits = (bits & 0xFFFFFFFF00000000ull) >> 32;
return ((upperBits + upperBits) == 0);
If you do need to account for denormal values, then you aren't saving yourself anything. I haven't tested this, but you're probably no worse letting the compiler generate the code for a naive comparison. At least, not for x86-32. You might still gain on x86-64, where you have 64-bit-wide GP registers.
If you can assume SSE2 support (which would be all x86-64 systems, and all modern x86-32 builds as well), then you just use intrinsics, and you get denormal support for free (well, not really free; there are internal penalties in the CPU, I believe, but we'll ignore those):
return (_mm_ucomieq_sd(_mm_set_sd(floatingPointValue), _mm_setzero_pd()) != 0);
Again, as with single-precision values, the use of intrinsics is not necessary on compilers other than MSVC to get optimal code, and indeed may result in sub-optimal code, so should be avoided.
In plain and simple words, if you want to accept exactly +0.0 and -0.0, you just have to use:
x == 0.0
OR
From the cmath library you can use:
int fpclassify( double arg ), which will return FP_ZERO for both -0.0 and +0.0.
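A minimal sketch of that approach:
#include <cmath>

bool IsZero(const double x) {
    return std::fpclassify(x) == FP_ZERO;   // true for both +0.0 and -0.0
}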
If you look at the assembly generated for the different versions of your code, you can see which instructions are used and estimate which version is better.
With GCC you can keep the intermediate files (including the assembly) like this:
gcc -save-temps main.cpp