Instruction/intrinsic for taking higher half of uint64_t in C++? - c++

Imagine the following code:
Try it online!
uint64_t x = 0x81C6E3292A71F955ULL;
uint32_t y = (uint32_t) (x >> 32);
y receives the higher 32-bit part of the 64-bit integer. My question is whether there exists any intrinsic function or CPU instruction that does this in a single operation, without doing a move and a shift.
At least Clang (linked in Try-it-online above) creates two instructions, mov rax, rdi and shr rax, 32, for this, so either Clang doesn't do such an optimization, or no such special instruction exists.
It would be great if there were a single instruction like an imaginary movhi dst_reg, src_reg.

If there was a better way to do this bitfield-extraction for an arbitrary uint64_t, compilers would already use it. (At least in theory; compilers do have missed optimizations, and their choices sometimes favour latency even if it costs more uops.)
You only need intrinsics for things that you can't express efficiently in pure C, in ways the compiler can already easily understand. (Or if your compiler is dumb and can't spot the obvious.)
You could maybe imagine cases where the input value comes from the multiply of two 32-bit values, then it might be worthwhile on some CPUs for the compiler to use widening mul r32 to already generate the result in two separate 32-bit registers, instead of imul r64, r64 + shr reg,32, if it can easily use EAX/EDX. But other than gcc -mtune=silvermont or other tuning options, you can't make the compiler do it that way.
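As a minimal sketch of that point, plain C++ already expresses both the high-half extraction and the "high half of a 32x32 multiply" case, and the compiler is free to pick the instructions. The names high32 and mulhi32 are made up for illustration; they are not from any library.

```cpp
#include <cstdint>

// Hypothetical helper: upper 32 bits of a 64-bit value.
// Compilers typically emit a single shift for this (plus a mov if the
// result has to end up in a different register than the input).
constexpr std::uint32_t high32(std::uint64_t x) {
    return static_cast<std::uint32_t>(x >> 32);
}

// Hypothetical helper: upper half of a widening 32x32 -> 64-bit product.
// Whether this becomes a widening mul or imul + shr is the compiler's choice.
constexpr std::uint32_t mulhi32(std::uint32_t a, std::uint32_t b) {
    return high32(static_cast<std::uint64_t>(a) * b);
}

// Compile-time check using the value from the question.
static_assert(high32(0x81C6E3292A71F955ULL) == 0x81C6E329u,
              "high half of the question's constant");
```

Looking at the generated asm for these on your own compiler and -march setting is the only reliable way to see which instructions it actually picks.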
shr reg, 32 has 1 cycle latency, and can run on more than 1 execution port on most modern x86 microarchitectures (https://uops.info/). The only thing one might wish for is that it could put the result in a different register, without overwriting the input.
Most modern non-x86 ISAs are RISC-like with 3-operand instructions, so a shift instruction can copy-and-shift, unlike x86 shifts where the compiler needs a mov in addition to shr if it also needs the original 64-bit value later, or (in the case of a tiny function) needs the return value in a different register.
And some ISAs have bitfield-extract instructions. PowerPC even has a fun rotate-and-mask instruction (rlwinm) (with the mask being a bit-range specified by immediates), and it's a different instruction from a normal shift. Compilers will use it as appropriate - no need for an intrinsic. https://devblogs.microsoft.com/oldnewthing/20180810-00/?p=99465
x86 with BMI2 has rorx rax, rdi, 32 to copy-and-rotate, instead of being stuck shifting within the same register. A function returning uint32_t could/should use that instead of mov+shr, in the stand-alone version that doesn't inline because the caller already has to ignore high garbage in RAX. (Both x86-64 System V and Windows x64 define the return value as only the register width matching the C type of the arg; e.g. returning uint32_t means that the high 32 bits of RAX are not part of the return value, and can hold anything. Usually they're zero because writing a 32-bit register implicitly zero-extends to 64, but something like return bar() where bar returns uint64_t can just leave RAX untouched without having to truncate it; in fact an optimized tailcall is possible.)
There's no intrinsic for rorx; compilers are just supposed to know when to use it. (But gcc/clang -O3 -march=haswell miss this optimization.) https://godbolt.org/z/ozjhcc8Te
If a compiler was doing this in a loop, it could have 32 in a register for shrx reg,reg,reg as a copy-and-shift. Or more silly, it could use pext with 0xffffffffULL << 32 as the mask. But that's strictly worse than shrx because of the higher latency.
AMD TBM (Bulldozer-family only, not Zen) had an immediate form of bextr (bitfield-extract), and it ran efficiently as 1 uop (https://agner.org/optimize/). https://godbolt.org/z/bn3rfxzch shows gcc11 -O3 -march=bdver4 (Excavator) uses bextr rax, rdi, 0x2020, while clang misses that optimization. gcc -march=znver1 uses mov + shr because Zen dropped Trailing Bit Manipulation along with the XOP extension.
Standard BMI1 bextr needs position/len in a register, and on Intel CPUs is 2 uops so it's garbage for this. It does have an intrinsic, but I recommend not using it. mov+shr is faster on Intel CPUs.

Related

Reinterpret casting from __m256i to __m256 [duplicate]

Why does _mm_extract_ps return an int instead of a float?
What's the proper way to read a single float from an XMM register in C?
Or rather, a different way to ask it is: What's the opposite of the _mm_set_ps instruction?
None of the answers appear to actually answer the question, why does it return int.
The reason is, the extractps instruction actually copies a component of the vector to a general register. It does seem pretty silly for it to return an int but that's what's actually happening - the raw floating point value ends up in a general register (which hold integers).
If your compiler is configured to generate SSE for all floating point operations, then the closest thing to "extracting" a value to a register would be to shuffle the value into the low component of the vector, then cast it to a scalar float. This should cause that component of the vector to remain in an SSE register:
/* returns the third component of the vector (element index 2) */
float foo(__m128 b)
{
return _mm_cvtss_f32(_mm_shuffle_ps(b, b, _MM_SHUFFLE(0, 0, 0, 2)));
}
The _mm_cvtss_f32 intrinsic is free, it does not generate instructions, it only makes the compiler reinterpret the xmm register as a float so it can be returned as such.
The _mm_shuffle_ps gets the desired value into the lowest component. The _MM_SHUFFLE macro generates an immediate operand for the resulting shufps instruction.
The 2 in the example gets the float from bit 95:64 of the 127:0 register (the 3rd 32 bit component from the beginning, in memory order) and places it in the 31:0 component of the register (the beginning, in memory order).
The resulting generated code will most likely return the value naturally in a register, like any other floating point value return, with no inefficient writing out to memory and reading it back.
If you're generating code that uses the x87 FPU for floating point (for normal C code that isn't SSE optimized), this would probably result in inefficient code being generated - the compiler would probably store out the component of the SSE vector then use fld to read it back into the x87 register stack. In general 64-bit platforms don't use x87 (they use SSE for all floating point, mostly scalar instructions unless the compiler is vectorizing).
I should add that I always use C++, so I'm not sure whether it is more efficient to pass __m128 by value or by pointer in C. In C++ I would use a const __m128 & and this kind of code would be in a header, so the compiler can inline.
Confusingly, int _mm_extract_ps() is not for getting a scalar float element from a vector. The intrinsic doesn't expose the memory-destination form of the instruction (which can be useful for that purpose). This is not the only case where the intrinsics can't directly express everything an instruction is useful for. :(
gcc and clang know how the asm instruction works and will use it that way for you when compiling other shuffles; type-punning the _mm_extract_ps result to float usually results in horrible asm from gcc (extractps eax, xmm0, 2 / mov [mem], eax).
The name makes sense if you think of _mm_extract_ps as extracting an IEEE 754 binary32 float bit pattern out of the FP domain of the CPU into the integer domain (as a C scalar int), instead of manipulating FP bit patterns with integer vector ops. According to my testing with gcc, clang, and icc (see below), this is the only "portable" use-case where _mm_extract_ps compiles into good asm across all compilers. Anything else is just a compiler-specific hack to get the asm you want.
The corresponding asm instruction is EXTRACTPS r/m32, xmm, imm8. Notice that the destination can be memory or an integer register, but not another XMM register. It's the FP equivalent of PEXTRD r/m32, xmm, imm8 (also in SSE4.1), where the integer-register-destination form is more obviously useful. EXTRACTPS is not the reverse of INSERTPS xmm1, xmm2/m32, imm8.
Perhaps this similarity with PEXTRD makes the internal implementation simpler without hurting the extract-to-memory use-case (for asm, not intrinsics), or maybe the SSE4.1 designers at Intel thought it was actually more useful this way than as a non-destructive FP-domain copy-and-shuffle (which x86 seriously lacks without AVX). There are FP-vector instructions that have an XMM source and a memory-or-xmm destination, like MOVSS xmm2/m32, xmm, so this kind of instruction would not be new. Fun fact: the opcodes for PEXTRD and EXTRACTPS differ only in the last bit.
In assembly, a scalar float is just the low element of an XMM register (or 4 bytes in memory). The upper elements of the XMM don't even have to be zeroed for instructions like ADDSS to work without raising any extra FP exceptions. In calling conventions that pass/return FP args in XMM registers (e.g. all the usual x86-64 ABIs), float foo(float a) must assume that the upper elements of XMM0 hold garbage on entry, but can leave garbage in the high elements of XMM0 on return. (More info).
As @doug points out, other shuffle instructions can be used to get a float element of a vector into the bottom of an xmm register. This was already a mostly-solved problem in SSE1/SSE2, and it seems EXTRACTPS and INSERTPS weren't trying to solve it for register operands.
SSE4.1 INSERTPS xmm1, xmm2/m32, imm8 is one of the best ways for compilers to implement _mm_set_ss(function_arg) when the scalar float is already in a register and they can't/don't optimize away zeroing the upper elements. (Which is most of the time for compilers other than clang). That linked question also further discusses the failure of intrinsics to expose the load or store versions of instructions like EXTRACTPS, INSERTPS, and PMOVZX that have a memory operand narrower than 128b (thus not requiring alignment even without AVX). It can be impossible to write safe code that compiles as efficiently as what you can do in asm.
Without AVX 3-operand SHUFPS, x86 doesn't provide a fully efficient and general-purpose way to copy-and-shuffle an FP vector the way integer PSHUFD can. SHUFPS is a different beast unless used in-place with src=dst. Preserving the original requires a MOVAPS, which costs a uop and latency on CPUs before IvyBridge, and always costs code-size. Using PSHUFD between FP instructions costs latency (bypass delays). (See this horizontal-sum answer for some tricks, like using SSE3 MOVSHDUP).
SSE4.1 INSERTPS can extract one element into a separate register, but AFAIK it still has a dependency on the previous value of the destination even when all the original values are replaced. False dependencies like that are bad for out-of-order execution. xor-zeroing a register as a destination for INSERTPS would still be 2 uops, and have lower latency than MOVAPS+SHUFPS on SSE4.1 CPUs without mov-elimination for zero-latency MOVAPS (only Penryn, Nehalem, Sandybridge. Also Silvermont if you include low-power CPUs). The code-size is slightly worse, though.
Using _mm_extract_ps and then type-punning the result back to float (as suggested in the currently-accepted answer and its comments) is a bad idea. It's easy for your code to compile to something horrible (like EXTRACTPS to memory and then load back into an XMM register) on either gcc or icc. Clang seems to be immune to braindead behaviour and does its usual shuffle-compiling with its own choice of shuffle instructions (including appropriate use of EXTRACTPS).
I tried these examples with gcc5.4 -O3 -msse4.1 -mtune=haswell, clang3.8.1, and icc17, on the Godbolt compiler explorer. I used C mode, not C++, but union-based type punning is allowed in GNU C++ as an extension to ISO C++. Pointer-casting for type-punning violates strict aliasing in C99 and C++, even with GNU extensions.
#include <immintrin.h>
// gcc:bad clang:good icc:good
void extr_unsafe_ptrcast(__m128 v, float *p) {
// violates strict aliasing
*(int*)p = _mm_extract_ps(v, 2);
}
gcc: # others extractps with a memory dest
extractps eax, xmm0, 2
mov DWORD PTR [rdi], eax
ret
// gcc:good clang:good icc:bad
void extr_pun(__m128 v, float *p) {
// union type punning is safe in C99 (and GNU C and GNU C++)
union floatpun { int i; float f; } fp;
fp.i = _mm_extract_ps(v, 2);
*p = fp.f; // compiles to an extractps straight to memory
}
icc:
vextractps eax, xmm0, 2
mov DWORD PTR [rdi], eax
ret
// gcc:good clang:good icc:horrible
void extr_gnu(__m128 v, float *p) {
// gcc uses extractps with a memory dest, icc does extr_store
*p = v[2];
}
gcc/clang:
extractps DWORD PTR [rdi], xmm0, 2
icc:
vmovups XMMWORD PTR [-24+rsp], xmm0
mov eax, DWORD PTR [-16+rsp] # reload from red-zone tmp buffer
mov DWORD PTR [rdi], eax
// gcc:good clang:good icc:poor
void extr_shuf(__m128 v, float *p) {
__m128 e2 = _mm_shuffle_ps(v,v, 2);
*p = _mm_cvtss_f32(e2); // gcc uses extractps
}
icc: (others: extractps right to memory)
vshufps xmm1, xmm0, xmm0, 2
vmovss DWORD PTR [rdi], xmm1
When you want the final result in an xmm register, it's up to the compiler to optimize away your extractps and do something completely different. Gcc and clang both succeed, but ICC doesn't.
// gcc:good clang:good icc:bad
float ret_pun(__m128 v) {
union floatpun { int i; float f; } fp;
fp.i = _mm_extract_ps(v, 2);
return fp.f;
}
gcc:
unpckhps xmm0, xmm0
clang:
shufpd xmm0, xmm0, 1
icc17:
vextractps DWORD PTR [-8+rsp], xmm0, 2
vmovss xmm0, DWORD PTR [-8+rsp]
Note that icc did poorly for extr_pun, too, so it doesn't like union-based type-punning for this.
The clear winner here is doing the shuffle "manually" with _mm_shuffle_ps(v,v, 2), and using _mm_cvtss_f32. We got optimal code from every compiler for both register and memory destinations, except for ICC which failed to use EXTRACTPS for the memory-dest case. With AVX, SHUFPS + separate store is still only 2 uops on Intel CPUs, just larger code size and needs a tmp register. Without AVX, though, it would cost a MOVAPS to not destroy the original vector :/
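A minimal sketch of that shuffle + _mm_cvtss_f32 approach, generalized to any element index. The helper name extract_float is hypothetical, and unlike the _mm_shuffle_ps(v,v, 2) form above it broadcasts the selected element to all lanes, which is an arbitrary choice since only lane 0 is read.

```cpp
#include <xmmintrin.h>  // SSE1: _mm_shuffle_ps, _mm_cvtss_f32, _MM_SHUFFLE

// Hypothetical helper: return element Index (0..3) of v as a scalar float,
// staying in the FP/vector domain instead of going through _mm_extract_ps.
template <int Index>
inline float extract_float(__m128 v) {
    static_assert(Index >= 0 && Index < 4, "Index must be in 0..3");
    // Move the wanted element into lane 0 (the other lanes are don't-care),
    // then reinterpret lane 0 as a scalar float.
    return _mm_cvtss_f32(
        _mm_shuffle_ps(v, v, _MM_SHUFFLE(Index, Index, Index, Index)));
}
```

With a compile-time index the compiler remains free to replace the shuffle with something cheaper (or nothing at all for index 0), as the gcc and clang output above suggests it will.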
According to Agner Fog's instruction tables, all Intel CPUs except Nehalem implement the register-destination versions of both PEXTRD and EXTRACTPS with multiple uops: Usually just a shuffle uop + a MOVD uop to move data from the vector domain to gp-integer. Nehalem register-destination EXTRACTPS is 1 uop for port 5, with 1+2 cycle latency (1 + bypass delay).
I have no idea why they managed to implement EXTRACTPS as a single uop but not PEXTRD (which is 2 uops, and runs in 2+1 cycle latency). Nehalem MOVD is 1 uop (and runs on any ALU port), with 1+1 cycle latency. (The +1 is for the bypass delay between vec-int and general-purpose integer regs, I think).
Nehalem cares a lot about vector FP vs. integer domains; SnB-family CPUs have smaller (sometimes zero) bypass delay latencies between domains.
The memory-dest versions of PEXTRD and EXTRACTPS are both 2 uops on Nehalem.
On Broadwell and later, memory-destination EXTRACTPS and PEXTRD are 2 uops, but on Sandybridge through Haswell, memory-destination EXTRACTPS is 3 uops. Memory-destination PEXTRD is 2 uops on everything except Sandybridge, where it's 3. This seems odd, and Agner Fog's tables do sometimes have errors, but it's possible. Micro-fusion doesn't work with some instructions on some microarchitectures.
If either instruction had turned out to be extremely useful for anything important (e.g. inside inner loops), CPU designers would build execution units that could do the whole thing as one uop (or maybe 2 for the memory-dest). But that potentially requires more bits in the internal uop format (which Sandybridge simplified).
Fun fact: _mm_extract_epi32(vec, 0) compiles (on most compilers) to movd eax, xmm0 which is shorter and faster than pextrd eax, xmm0, 0.
Interestingly, they perform differently on Nehalem (which cares a lot about vector FP vs. integer domains, and came out soon after SSE4.1 was introduced in Penryn (45nm Core2)). EXTRACTPS with a register destination is 1 uop, with 1+2 cycle latency (the +2 from a bypass delay between FP and integer domain). PEXTRD is 2 uops, and runs in 2+1 cycle latency.
From the MSDN docs, I believe you can cast the result to a float.
Note from their example, the 0xc0a40000 value is equivalent to -5.125 (a.m128_f32[1]).
Update: I strongly recommend the answers from @doug65536 and @PeterCordes (below) in lieu of mine, which apparently generates poorly performing code on many compilers.
Try _mm_storeu_ps, or any of the variations of SSE store operations.

GCC w/ inline assembly & -Ofast generating extra code for memory operand

I am inputting the address of an index into a table into an extended inline assembly operation, but GCC is producing an extra lea instruction when it is not necessary, even when using -Ofast -fomit-frame-pointer or -Os -f.... GCC is using RIP-relative addresses.
I was creating a function for converting two consecutive bits into a two-part XMM mask (1 quadword mask per bit). To do this, I am using _mm_cvtepi8_epi64 (internally vpmovsxbq) with a memory operand from a 8-byte table with the bits as index.
When I use the intrinsic, GCC produces the exactly same code as using the extended inline assembly.
I can directly embed the memory operation into the ASM template, but that would force RIP-relative addressing always (and I don't like forcing myself into workarounds).
typedef uint64_t xmm2q __attribute__ ((vector_size (16)));
// Used for converting 2 consecutive bits (as index) into a 2-elem XMM mask (pmovsxbq)
static const uint16_t MASK_TABLE[4] = { 0x0000, 0x0080, 0x8000, 0x8080 };
xmm2q mask2b(uint64_t mask) {
assert(mask < 4);
#ifdef USE_ASM
xmm2q result;
asm("vpmovsxbq %1, %0" : "=x" (result) : "m" (MASK_TABLE[mask]));
return result;
#else
// bad cast (UB?), but input should be `uint16_t*` anyways
return (xmm2q) _mm_cvtepi8_epi64(*((__m128i*) &MASK_TABLE[mask]));
#endif
}
Output assembly with -S (with USE_ASM and without):
__Z6mask2by: ## @_Z6mask2by
.cfi_startproc
## %bb.0:
leaq __ZL10MASK_TABLE(%rip), %rax
vpmovsxbq (%rax,%rdi,2), %xmm0
retq
.cfi_endproc
What I was expecting (I've removed all the extra stuff):
__Z6mask2by:
vpmovsxbq __ZL10MASK_TABLE(%rip,%rdi,2), %xmm0
retq
The only RIP-relative addressing mode is RIP + rel32. RIP + reg is not available.
(In machine code, 32-bit code used to have 2 redundant ways to encode [disp32]. x86-64 uses the shorter (no SIB) form as RIP relative, the longer SIB form as [sign_extended_disp32]).
If you compile for Linux with -fno-pie -no-pie, GCC will be able to access static data with a 32-bit absolute address, so it can use a mode like __ZL10MASK_TABLE(,%rdi,2). This isn't possible for MacOS, where the base address is always above 2^32; 32-bit absolute addressing is completely unsupported on x86-64 MacOS.
In a PIE executable (or PIC code in general like a library), you need a RIP-relative LEA to set up for indexing a static array. Or any other case where the static address won't fit in 32 bits and/or isn't a link-time constant.
Intrinsics
Yes, intrinsics make it very inconvenient to express a pmovzx/sx load from a narrow source because pointer-source versions of the intrinsics are missing.
*((__m128i*) &MASK_TABLE[mask]) isn't safe: if you disable optimization, you might well get a movdqa 16-byte load but the address will be misaligned. It's only safe when the compiler folds the load into a memory operand for pmovzxbq which has a 2-byte memory operand therefore not requiring alignment.
In fact current GCC does compile your code with a movdqa 16-byte load like movdqa xmm0, XMMWORD PTR [rax+rdi*2] before a reg-reg pmovzx. This is obviously a missed optimization. :( clang/LLVM (which MacOS installs as gcc) does fold the load into pmovzx.
The safe way is _mm_cvtepi8_epi64( _mm_cvtsi32_si128(MASK_TABLE[mask]) ) or something, and then hoping the compiler optimizes away the zero-extend from 2 to 4 bytes and folds the movd into a load when you enable optimization. Or maybe try _mm_loadu_si32 for a 32-bit load even though you really want 16. But last time I tried, compilers sucked at folding a 64-bit load intrinsic into a memory operand for pmovzxbw for example. GCC and clang still fail at it, but ICC19 succeeds. https://godbolt.org/z/IdgoKV
I've written about this before:
Loading 8 chars from memory into an __m256 variable as packed single precision floats
How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?
Your integer -> vector strategy
Your choice of pmovsx seems odd. You don't need sign-extension, so I would have picked pmovzx (_mm_cvtepu8_epi64). It's not actually more efficient on any CPUs, though.
A lookup table does work here with only a small amount of static data needed. If your mask range was any bigger, you'd maybe want to look into
is there an inverse instruction to the movemask instruction in intel avx2? for alternative strategies like broadcast + AND + (shift or compare).
If you do this often, using a whole cache line of 4x 16-byte vector constants might be best so you don't need a pmovzx instruction: just index into an aligned table of xmm2q or __m128i vectors, which can be a memory source for any other SSE instruction. Use alignas(64) to get all the constants in the same cache line.
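A rough sketch of that full-vector table, assuming the entries should be exactly what vpmovsxbq would produce from the question's MASK_TABLE (a 0x80 byte sign-extends to 0xFFFFFFFFFFFFFF80). The names MASK_VECTORS and mask2b_lut are hypothetical, and the `& 3` stands in for the question's assert(mask < 4).

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2: __m128i, _mm_load_si128

// Assumed element value: byte 0x80 sign-extended to 64 bits, matching what
// vpmovsxbq yields for the question's table entries.
constexpr std::uint64_t kSet = 0xFFFFFFFFFFFFFF80ULL;

// Four 16-byte constants = 64 bytes, aligned so they share one cache line.
// Index bit 0 selects the low quadword, bit 1 the high quadword.
alignas(64) static const std::uint64_t MASK_VECTORS[4][2] = {
    {0, 0}, {kSet, 0}, {0, kSet}, {kSet, kSet},
};

// Hypothetical replacement for mask2b: a single aligned 16-byte load,
// which a compiler may also fold into a memory operand of another instruction.
inline __m128i mask2b_lut(std::uint64_t mask) {
    return _mm_load_si128(
        reinterpret_cast<const __m128i*>(MASK_VECTORS[mask & 3]));
}
```

This trades 64 bytes of static data for not needing the pmovsx/pmovzx at all; in PIE/PIC code it still needs the RIP-relative LEA described above to form the table address.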
You could also consider (intrinsics for) pdep + movd xmm0, eax + pmovzxbq reg-reg if you're targeting Intel CPUs with BMI2. (pdep is slow on AMD, though).

How to set MMX registers in a Windows exception handler to emulate unsupported 3DNow! instructions

I'm trying to revive an old Win32 game that uses 3DNow! instruction set to make 3D rendering.
On modern OSs like Win7 - Win10 instructions like FPADD or FPMUL are not allowed and the program throws an exception.
Since the number of 3DNow! instructions used by the game is very limited, in my VS2008 MFC program I tried to use vectored exception handling to get the values of the MMX registers, emulate the 3DNow! instructions in C code, and push the values back to the processor's 3DNow! registers.
So far I succeeded in first two steps (I get mmx register values from ExceptionInfo->ExtendedRegisters byte array at offset 32 and use float type C instructions to make calculations), but my problem is that, no matter how I try to update the MMX register values the register values seem to stay unchanged.
Assuming that my _asm statements might be wrong, I did also some minimal test using simple statements like this:
_asm movq mm0, mm7
This statement is executed without further exceptions, but when retrieving the MMX register values I still find that the original values were unchanged.
How can I make the assignment effective?
On modern OSs like Win7 - Win10 instructions like FPADD or FPMUL are not allowed
More likely your CPU doesn't support 3DNow! AMD dropped it for Bulldozer-family, and Intel never supported it. So unless you're running modern Windows on an Athlon64 / Phenom (or a Via C3), your CPU doesn't support it.
(Fun fact: PREFETCHW was originally a 3DNow! instruction, and is still supported (with its own CPUID feature bit). For a long time Intel CPUs ran it as a NOP, but Broadwell and later (IIRC) do actually prefetch a cache line into Exclusive state with a Read-For-Ownership.)
Unless this game only ever ran on AMD hardware, it must have a code path that avoids 3DNow. Fix its CPU detection to stop detecting your CPU as having 3DNow. (Maybe you have a recent AMD, and it assumes any AMD has 3DNow?)
(update on that: OP's comments say that the other code paths don't work for some reason. That's a problem.)
Returning from an exception handler probably restores registers from saved state, so it's not surprising that changing register values in the exception handler has no effect on the main program.
Apparently updating ExtendedRegisters in memory doesn't do the trick, though, so that's only a copy of the saved state.
The answer to modifying MMX registers from an exception handler is probably the same as for integer or XMM registers, so look up MS's documentation for that.
Alternative suggestion:
Rewrite the 3DNow code to use SSE2. (You said there's only a tiny amount of it?). SSE2 is baseline for x86-64, and generally safe to assume for 32-bit x86.
Without source, you could still modify the asm for the few functions that use 3DNow. You can literally just change the instructions to use 64-bit loads/stores into XMM registers instead of 3DNow! 64-bit loads/stores, and replace PFMUL with mulps, etc. (This could get slightly hairy if you run out of registers and the 3DNow code used a memory source operand. addps xmm0, [mem] requires 16B-aligned memory, and does a 16 byte load. So you may have to add a spill/reload to borrow another register as a temporary).
If you don't have room to rewrite the functions in-place, put in a jmp to somewhere you do have room to add new code.
Most of the 3DNow instructions have equivalents in SSE, but you may need some extra movaps instructions to copy registers around to implement PFCMPGE. If you can ignore the possibility of NaN, you can use cmpps with a not-less-than predicate. (Without AVX, SSE only has compare predicates based on less-than or not-less-than).
PFSUBR is easy to emulate with a spare register, just copy and subps to reverse. (Or SUBPS and invert the sign with XORPS). PFRCPIT1 and PFRSQIT1 (the first refinement iterations for reciprocal and reciprocal-sqrt) and so on don't have a single-instruction implementation, but you can probably just use sqrtps and divps if you don't want to implement Newton-Raphson iterations with mulps and addps (or with FMA's vfmadd). Modern CPUs are much faster than what this game was designed for.
You can load / store a pair of single-precision floats from/to memory into the bottom 64 bits of an XMM register using movsd (the SSE2 double-precision load/store instruction). You can also store a pair with movlps, but still use movsd for loading because it zeros the upper half instead of merging, so it doesn't have a dependency on the old value of the register.
Use movdq2q mm0, xmm0 and movq2dq xmm0, mm0 to move data between XMM and MMX.
Use movaps xmm1, xmm0 to copy registers, even if your data is only in the low half. (movsd xmm1, xmm0 merges the low half into the original high half. movq xmm1, xmm0 zeros the high half.)
addps and mulps work fine with zeros in the upper half. (They can slow down if any garbage (in the upper half) produces a denormal result, so prefer keeping the upper half zeroed). See http://felixcloutier.com/x86/ for an instruction-set reference (and other links in the x86 tag wiki).
Any shuffling of FP data can be done in XMM registers with shufps or pshufd instead of copying back to MMX registers to use whatever MMX shuffles.
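To make the mapping concrete, here is a sketch in C++ intrinsics of what the SSE replacements compute for a pair of floats. It is not a patch for the game's binary; the function names are hypothetical, and the load uses the movq-style _mm_loadl_epi64 (which also zeroes the upper half) rather than the movsd load mentioned above.

```cpp
#include <emmintrin.h>  // SSE2 (pulls in the SSE1 intrinsics as well)

// Load two floats into the low 64 bits of an XMM register, zeroing the
// upper half so there is no dependency on the register's old contents.
inline __m128 load_pair(const float* p) {
    return _mm_castsi128_ps(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p)));
}

// Store the low two floats (movlps-style store).
inline void store_pair(float* p, __m128 v) {
    _mm_storel_pi(reinterpret_cast<__m64*>(p), v);
}

// PFADD dst, src  ->  dst = dst + src   (addps on the low pair)
inline void pfadd_emulated(float* dst, const float* src) {
    store_pair(dst, _mm_add_ps(load_pair(dst), load_pair(src)));
}

// PFMUL dst, src  ->  dst = dst * src   (mulps on the low pair)
inline void pfmul_emulated(float* dst, const float* src) {
    store_pair(dst, _mm_mul_ps(load_pair(dst), load_pair(src)));
}

// PFSUBR dst, src ->  dst = src - dst   (subps with the operands reversed)
inline void pfsubr_emulated(float* dst, const float* src) {
    store_pair(dst, _mm_sub_ps(load_pair(src), load_pair(dst)));
}
```

Because both loads zero the upper half, the unused upper lanes only ever compute 0+0, 0*0 or 0-0, so they cannot produce the denormal slowdowns mentioned above.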

Fastest way to compare a double to exact 0 while both +0.0 or -0.0 are accepted

So far I have the following:
bool IsZero(const double x) {
return fabs(x) == +0.0;
}
Is this the fastest of correct ways to compare to exact 0, while both +0.0 and -0.0 are accepted?
If CPU-specific, let's consider x86-64. If compiler specific, let's consider MSVC++2017 toolset v141.
Since you said you want the fastest possible code, I'm going to make some important simplifying assumptions throughout this answer. These are legal, per the question. In particular, I'm assuming x86 and IEEE-754 representations of floating-point values. I'll also mention MSVC-specific quirks, where applicable, although the general discussion would apply to any compiler targeting this architecture.
The way you test whether a floating-point value is equal to zero is by testing all of its bits. If all of the bits are 0, then the value is zero. Actually, that value is +0.0. The sign bit can be either 0 or 1, since the representation allows both positive and negative 0.0, as you mention in the question. But the two compare equal and you want to accept both, so what you really need is to test all bits except the sign bit.
This can be done quickly and efficiently with some bit-twiddling. On little-endian architectures like x86, the sign bit is the leading bit, so you simply shift it out and then test the remaining bits.
This trick is described by Agner Fog in his Optimizing Subroutines in Assembly Language. Specifically, example 17.4b (on page 156 in the current version).
For a single-precision floating-point value (i.e., float), which is 32-bits wide:
mov eax, DWORD PTR [floatingPointValue]
add eax, eax ; shift out the sign bit to ignore -0.0
sete al ; set AL if the remaining bits were 0
Translating this into C code, you'd do something like:
const uint32_t bits = *(reinterpret_cast<uint32_t*>(&value));
return ((bits + bits) == 0);
Of course, this is formally unsafe because of the type punning. MSVC lets you get away with it, no problem. In fact, if you try to actually conform to the standard and play it safe, MSVC will tend to generate less efficient code, decreasing the effectiveness of this trick. If you want to do this safely, you'll need to verify the output of your compiler and make sure it's doing what you want. Some assertions are also recommended.
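For reference, a standard-conforming variant of the same bit test can be sketched with std::memcpy instead of the pointer cast. The function name is_zero_bits is made up here, and as noted above, the conforming form may compile to worse code on MSVC, so check the output before relying on it for speed.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: true for +0.0f and -0.0f, false for everything else
// (including denormals, infinities and NaNs). Assumes IEEE-754 binary32.
inline bool is_zero_bits(float value) {
    static_assert(sizeof(std::uint32_t) == sizeof(float),
                  "this sketch expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined type pun
    // Shift out the sign bit, then test whether the remaining 31 bits are 0.
    return static_cast<std::uint32_t>(bits << 1) == 0;
}
```

Optimizing compilers usually turn a fixed-size memcpy like this into a plain register move or load, but that is worth verifying in the generated asm rather than assuming.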
If you're okay with the unsafe nature of this approach, you will find that it is faster than a poorly-predicted conditional branch, so when you're dealing with random input values, it might be a performance win. For comparison purposes, here is what you'll see from MSVC if you just do a naive test for equality against 0.0:
;; assuming /arch:IA32, which is *not* the default in modern versions of MSVC
;; but necessary if you cannot assume SSE2 support
fld DWORD PTR [floatingPointValue]
fldz
fucompp
fnstsw ax
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
;; assuming /arch:SSE2, which *is* the default in modern versions of MSVC
movss xmm0, DWORD PTR [floatingPointValue]
ucomiss xmm0, DWORD PTR [constantZero]
lahf
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
Ugly, and potentially slow. There are branchless ways of doing this, but MSVC won't use them.
An obvious drawback to the "optimized" implementation described above is that it requires the floating-point value be loaded from memory in order to access its bits. There are no x87 instructions that can access the bits directly, and there's no way to go directly from an x87 register to a GP register without going through memory. Since memory access is slow, this does incur a performance penalty, but in my tests, it's still faster than a mispredicted branch.
If you're using any of the standard calling conventions on 32-bit x86 (__cdecl, __stdcall, etc.), then all floating-point values are passed and returned in the x87 registers, so there's no difference in moving from an x87 register to a GP register versus moving from an x87 register to an SSE register.
The story is a bit different if you're targeting x86-64 or if you are using __vectorcall on x86-32. Then, you actually have floating-point values stored and passed in SSE registers, so you can take advantage of branchless SSE instructions. At least, theoretically. MSVC won't, unless you hold its hand. It would normally do the same branching comparison shown above, just without the extra memory load:
;; MSVC output for a __vectorcall function, targeting x86-32 with /arch:SSE2
;; and/or for x86-64 (which always uses a vector calling convention and SSE2)
;; The floating point value being compared is passed directly in XMM0
ucomiss xmm0, DWORD PTR [constantZero]
lahf
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
I've demonstrated the compiler output for a very simple bool IsZero(float val) function, but in my observations, MSVC always emits a UCOMISS+JP sequence for this type of comparison, no matter how the comparison is incorporated into the input code. Again, fine if the zero-ness of the input is predictable, but relatively lousy if branch prediction fails.
If you want to ensure you get branchless code, avoiding the possibility of branch-misprediction stalls, then you need to use intrinsics to do the comparison. These intrinsics will force MSVC to emit code closer to what you would expect:
return (_mm_ucomieq_ss(_mm_set_ss(floatingPointValue), _mm_setzero_ps()) != 0);
Unfortunately, the output is still not perfect. You suffer from general optimization deficiencies surrounding the use of intrinsics—namely, some redundant shuffling of the input value between various SSE registers—but that is (A) unavoidable, and (B) not a measurable performance problem.
I'll note here that other compilers, like Clang and GCC, don't need their hands held. You can just do value == 0.0. The exact sequence of code that they emit varies, depending on your optimization settings, but you'll see either COMISS+SETE, UCOMISS+SETNP+CMOVNE or CMPEQSS+MOVD+NEG (the last is used exclusively by ICC). Attempting to hold their hands with intrinsics would almost certainly result in less efficient output, so this probably needs to be #ifdef'ed to limit it to MSVC.
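A minimal sketch of that #ifdef split, assuming single-precision input. The names IsZero, CheckIsZero and ISZERO_USE_INTRINSICS are hypothetical, and the preprocessor test is only one plausible way to detect "MSVC targeting x86":

```cpp
#include <cassert>

#if defined(_MSC_VER) && !defined(__clang__) && (defined(_M_IX86) || defined(_M_X64))
#define ISZERO_USE_INTRINSICS 1   // hypothetical macro name
#include <xmmintrin.h>            // _mm_ucomieq_ss, _mm_set_ss, _mm_setzero_ps
#endif

// Hypothetical helper: true for +0.0f and -0.0f.
inline bool IsZero(float value)
{
#ifdef ISZERO_USE_INTRINSICS
    // MSVC on x86: the intrinsic form shown above, to avoid the UCOMISS+JP branch.
    return _mm_ucomieq_ss(_mm_set_ss(value), _mm_setzero_ps()) != 0;
#else
    // Other compilers: the plain comparison is left for the optimizer to handle.
    return value == 0.0f;
#endif
}

// Small usage check.
inline void CheckIsZero()
{
    assert(IsZero(0.0f));
    assert(IsZero(-0.0f));   // negative zero compares equal to zero
    assert(!IsZero(1.5f));
}
```

Keeping the intrinsic path behind one macro means the call sites stay identical on every compiler.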
That's single-precision values, which have a width of 32 bits. What about double-precision values, which are twice as long? You'd think these would have 63 bits to test (since the sign bit is still ignored), but there's a twist. If you can rule out the possibility of denormal numbers, then you can get away with testing only the upper bits (again, assuming little-endian).
Agner Fog discusses this as well (example 17.4d). If you exclude the possibility of denormal numbers, then a value of 0 corresponds to the case where the exponent bits are all 0. The upper bits are the sign bit and the exponent bits, so you can just test these exactly as you did for single-precision values:
mov eax, DWORD PTR [floatingPointValue+4] ; load upper bits only
add eax, eax ; shift out sign bit to ignore -0.0
sete al ; set AL if the remaining bits were 0
In unsafe C:
const uint64_t bits = *(reinterpret_cast<uint64_t*>(&value));
const uint32_t upperBits = (bits & 0xFFFFFFFF00000000) >> 32;
return ((upperBits + upperBits) == 0);
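For comparison, here is a sketch of the same upper-bits test that copies the bytes with std::memcpy instead of casting the pointer, which avoids the aliasing problem. It assumes a 64-bit IEEE-754 double; IsZeroNoDenormals and its check function are hypothetical names:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: true for +0.0 and -0.0, valid only if denormals cannot occur.
inline bool IsZeroNoDenormals(double value)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // copy the object representation
    // Upper 32 bits: sign, exponent, and the top of the mantissa.
    const std::uint32_t upperBits = static_cast<std::uint32_t>(bits >> 32);
    // Doubling shifts the sign bit out, so -0.0 is treated like +0.0.
    return static_cast<std::uint32_t>(upperBits + upperBits) == 0;
}

// Small usage check, including the known false positive for denormals.
inline void CheckIsZeroNoDenormals()
{
    assert(IsZeroNoDenormals(0.0));
    assert(IsZeroNoDenormals(-0.0));
    assert(!IsZeroNoDenormals(1.0));
    // A denormal has all-zero upper bits, so it is misreported as zero.
    assert(IsZeroNoDenormals(std::numeric_limits<double>::denorm_min()));
}
```

The last assertion shows why the shortcut is only safe when denormal inputs have been ruled out.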
If you do need to account for denormal values, then you aren't saving yourself anything. I haven't tested this, but you're probably no worse off letting the compiler generate the code for a naive comparison. At least, not for x86-32. You might still gain on x86-64, where you have 64-bit-wide GP registers.
If you can assume SSE2 support (which would be all x86-64 systems, and all modern x86-32 builds as well), then you just use intrinsics, and you get denormal support for free (well, not really free; there are internal penalties in the CPU, I believe, but we'll ignore those):
return (_mm_ucomieq_sd(_mm_set_sd(floatingPointValue), _mm_setzero_pd()) != 0);
Again, as with single-precision values, the use of intrinsics is not necessary on compilers other than MSVC to get optimal code, and indeed may result in sub-optimal code, so should be avoided.
In plain and simple words, if you want to accept exactly +0.0 and -0.0, you just have to use:
x == 0.0
OR
From the cmath library you can use:
int fpclassify( double arg ), which returns FP_ZERO for both -0.0 and +0.0
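A short sketch of the fpclassify variant, using std::fpclassify and the FP_ZERO macro from <cmath>. The wrapper names are hypothetical:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical wrapper: true only for +0.0 and -0.0.
inline bool IsZeroClassify(double x)
{
    return std::fpclassify(x) == FP_ZERO;
}

// Small usage check.
inline void CheckIsZeroClassify()
{
    assert(IsZeroClassify(0.0));
    assert(IsZeroClassify(-0.0));
    assert(!IsZeroClassify(1.0));
    // Denormals are classified as FP_SUBNORMAL, not FP_ZERO.
    assert(!IsZeroClassify(std::numeric_limits<double>::denorm_min()));
}
```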
If you look at the assembly generated from your code, you can see which instructions are used for the different versions and estimate which one is better.
With GCC, you can keep the intermediate files (including the assembly listing) like this:
gcc -save-temps main.cpp

Micro optimize pointer + unsigned + 1

Hard as it may be to believe, the construct p[u+1] occurs in several places in the innermost loops of code I maintain, such that getting the micro optimization of it right makes hours of difference in an operation that runs for days.
Typically *((p+u)+1) is most efficient. Sometimes *(p+(u+1)) is most efficient. Rarely *((p+1)+u) is best. (But usually an optimizer can convert *((p+1)+u) to *((p+u)+1) when the latter is better, and can't convert *(p+(u+1)) to either of the others).
p is a pointer and u is an unsigned. In the actual code at least one of them (more likely both) will already be in register(s) at the point the expression is evaluated. Those facts are critical to the point of my question.
In 32-bit (before my project dropped support for that) all three have exactly the same semantics and any half decent compiler simply picks the best of the three and the programmer never needs to care.
In these 64-bit uses, the programmer knows all three have the same semantics, but the compiler doesn't know. So far as the compiler knows, the decision of when to extend u from 32-bit to 64-bit can affect the result.
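To make the difference concrete, here is a small sketch of the two index computations, assuming a 32-bit unsigned u. The function names are hypothetical. The forms only disagree when u + 1 wraps around, which is exactly the case the programmer knows cannot happen but the compiler must allow for:

```cpp
#include <cstdint>

// Index as computed by p[u+1]: add in 32 bits (may wrap), then widen.
constexpr std::uint64_t IndexAddThenWiden(std::uint32_t u)
{
    return static_cast<std::uint32_t>(u + 1u);
}

// Index as computed by (p+u)+1 or (p+1)+u: widen first, then add in 64 bits.
constexpr std::uint64_t IndexWidenThenAdd(std::uint32_t u)
{
    return std::uint64_t{u} + 1u;
}

// Identical for every value except the wrap-around case.
static_assert(IndexAddThenWiden(7u) == IndexWidenThenAdd(7u), "same for ordinary values");
static_assert(IndexAddThenWiden(0xFFFFFFFFu) == 0u, "32-bit add wraps to 0");
static_assert(IndexWidenThenAdd(0xFFFFFFFFu) == 0x100000000ULL, "64-bit add does not wrap");
```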
What is the cleanest way to tell the compiler that the semantics of all three are the same and the compiler should select the fastest of them?
In one Linux 64-bit compiler, I got nearly there with p[u+1L] which causes the compiler to select intelligently between the usually best *((p+u)+1) and the sometimes better *(p+( (long)(u) + 1) ). In the rare case *(p+(u+1)) was still better than the second of those, a little is lost.
Obviously, that does no good in 64-bit Windows. Now that we dropped 32-bit support, maybe p[u+1LL] is portable enough and good enough. But can I do better?
Note that using std::size_t instead of unsigned for u would eliminate this entire problem, but create a larger performance problem nearby. Casting u to std::size_t right there is almost good enough, and maybe the best I can do. But that is pretty verbose for an imperfect solution.
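One way to keep that cast from cluttering every use would be a tiny helper. This is only a sketch; NextElement and kSample are hypothetical names, and whether it inlines as well as the hand-written expression would have to be checked per compiler:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: reads p[u + 1] with u widened to std::size_t before the add,
// so the compiler no longer has to preserve 32-bit wrap-around of u + 1.
template <typename T>
constexpr const T& NextElement(const T* p, std::uint32_t u)
{
    return p[static_cast<std::size_t>(u) + 1];
}

// Sample data for a compile-time usage check.
constexpr int kSample[] = {10, 20, 30};
static_assert(NextElement(kSample, 0u) == 20, "element after index 0");
static_assert(NextElement(kSample, 1u) == 30, "element after index 1");
```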
Simply coding (p+1)[u] makes a selection more likely to be optimal than p[u+1]. If the code were less templated and more stable, I could set them all to (p+1)[u] then profile then switch a few back to p[u+1]. But the templating tends to destroy that approach (A single source line appears in many places in the profile adding up to serious time, but not individually serious time).
Compilers that should be efficient for this are GCC, ICC and MSVC.
The answer is inevitably compiler- and target-specific, but even if 1ULL is wider than a pointer on the target architecture, a good compiler should optimize it away. The question "Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?" explains why a wider computation truncated to pointer width gives the same result as doing the computation at pointer width in the first place. This is why compilers can optimize it away even on 32bit machines (or x86-64 with the x32 ABI) when 1ULL leads to promotion of the + operands to a 64bit type. (Or on some 64bit ABI for some architecture where long long is 128b).
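The truncation argument can be checked with a small sketch that models a 32-bit address space with 4-byte elements. The function names are hypothetical, and the addresses are plain integers rather than real pointers:

```cpp
#include <cstdint>

// Address of p[u + 1] computed entirely in 32-bit arithmetic.
constexpr std::uint32_t AddrNarrow(std::uint32_t base, std::uint32_t u)
{
    return static_cast<std::uint32_t>(base + (u + 1u) * 4u);
}

// Same address computed in 64 bits (as with u + 1ULL), then truncated to 32 bits.
constexpr std::uint32_t AddrWideThenTruncate(std::uint32_t base, std::uint32_t u)
{
    return static_cast<std::uint32_t>(std::uint64_t{base} + (u + std::uint64_t{1}) * 4u);
}

// The low 32 bits agree even when the narrow computation wraps.
static_assert(AddrNarrow(0x1000u, 3u) == AddrWideThenTruncate(0x1000u, 3u), "no wrap");
static_assert(AddrNarrow(0x1000u, 0xFFFFFFFFu) == AddrWideThenTruncate(0x1000u, 0xFFFFFFFFu),
              "index wraps");
static_assert(AddrNarrow(0xFFFFFFF0u, 10u) == AddrWideThenTruncate(0xFFFFFFF0u, 10u),
              "address wraps");
```

Add and multiply only propagate carries upward, so the high bits of the wide computation never influence the low 32 bits of the result.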
1ULL looks optimal for 64bit, and for 32bit with clang. You don't care about 32bit anyway, but gcc wastes an instruction in the return p[u + 1ULL];. All the other cases are compiled to a single load with scaled-index+4+p addressing mode. So other than one compiler's optimization failure, 1ULL looks fine for 32bit as well. (I think it's unlikely that it's a clang bug and that optimization is illegal).
int v1ULL(std::uint32_t u) { return p[u + 1ULL]; }
// ... load u from the stack
// add eax, 1
// mov eax, DWORD PTR p[0+eax*4]
instead of
mov eax, DWORD PTR p[4+eax*4]
Interestingly, gcc 5.3 doesn't make this mistake when targeting the x32 ABI (long mode with 32bit pointers and a register-call ABI similar to SysV AMD64). It uses a 32bit address-size prefix to avoid using the upper 32b of rdi.
Annoyingly, it still uses an address-size prefix when it could save a byte of machine code by using a 64bit effective address (when there's no chance of overflow/carry into the upper32 generating an address outside the low 4GiB). Passing the pointer by reference is a good example:
int x2 (char *&c) { return *c; }
// mov eax, DWORD PTR [edi] ; upper32 of rax is zero
// movsx eax, BYTE PTR [eax] ; could be byte [rax], saving one byte of machine code
Err, actually I forget. 32bit addresses might sign-extend to 64b, not zero-extend. If that's the case, it could have used movsx for the first instruction, too, but that would have cost a byte because movsx has a longer opcode than mov.
Anyway, x32 is still an interesting choice for pointer-heavy code that wants more registers and a nicer ABI, without the cache-miss hit of 8B pointers.
The 64bit asm has to zero the upper32 of the register holding the parameter (with mov edi,edi), but that goes away when inlining. Looking at godbolt output for tiny functions is a valid way to test this.
If we want to make doubly sure that the compiler isn't shooting itself in the foot and zeroing the upper32 when it should know it's already zero, we could make test functions with an arg passed by reference.
int v1ULL(const std::uint32_t &u) { return p[u + 1ULL]; }
// mov eax, DWORD PTR [rdi]
// mov eax, DWORD PTR p[4+rax*4]