GCC w/ inline assembly & -Ofast generating extra code for memory operand - c++

I am inputting the address of an index into a table into an extended inline assembly operation, but GCC is producing an extra lea instruction when it is not necessary, even when using -Ofast -fomit-frame-pointer or -Os -f.... GCC is using RIP-relative addresses.
I was creating a function for converting two consecutive bits into a two-part XMM mask (1 quadword mask per bit). To do this, I am using _mm_cvtepi8_epi64 (internally vpmovsxbq) with a memory operand from an 8-byte table, with the bits as the index.
When I use the intrinsic, GCC produces exactly the same code as the extended inline assembly.
I can directly embed the memory operand into the asm template, but that would always force RIP-relative addressing (and I don't like forcing myself into workarounds).
typedef uint64_t xmm2q __attribute__ ((vector_size (16)));
// Used for converting 2 consecutive bits (as index) into a 2-elem XMM mask (pmovsxbq)
static const uint16_t MASK_TABLE[4] = { 0x0000, 0x0080, 0x8000, 0x8080 };
xmm2q mask2b(uint64_t mask) {
assert(mask < 4);
#ifdef USE_ASM
xmm2q result;
asm("vpmovsxbq %1, %0" : "=x" (result) : "m" (MASK_TABLE[mask]));
return result;
#else
// bad cast (UB?), but input should be `uint16_t*` anyways
return (xmm2q) _mm_cvtepi8_epi64(*((__m128i*) &MASK_TABLE[mask]));
#endif
}
Output assembly with -S (with USE_ASM and without):
__Z6mask2by: ## #_Z6mask2by
.cfi_startproc
## %bb.0:
leaq __ZL10MASK_TABLE(%rip), %rax
vpmovsxbq (%rax,%rdi,2), %xmm0
retq
.cfi_endproc
What I was expecting (I've removed all the extra stuff):
__Z6mask2by:
vpmovsxbq __ZL10MASK_TABLE(%rip,%rdi,2), %xmm0
retq

The only RIP-relative addressing mode is RIP + rel32. RIP + reg is not available.
(In machine code, 32-bit code used to have 2 redundant ways to encode [disp32]. x86-64 uses the shorter (no SIB) form as RIP relative, the longer SIB form as [sign_extended_disp32]).
If you compile for Linux with -fno-pie -no-pie, GCC will be able to access static data with a 32-bit absolute address, so it can use a mode like __ZL10MASK_TABLE(,%rdi,2). This isn't possible for MacOS, where the base address is always above 2^32; 32-bit absolute addressing is completely unsupported on x86-64 MacOS.
In a PIE executable (or PIC code in general like a library), you need a RIP-relative LEA to set up for indexing a static array. Or any other case where the static address won't fit in 32 bits and/or isn't a link-time constant.
Intrinsics
Yes, intrinsics make it very inconvenient to express a pmovzx/sx load from a narrow source because pointer-source versions of the intrinsics are missing.
*((__m128i*) &MASK_TABLE[mask]) isn't safe: if you disable optimization, you might well get a movdqa 16-byte load, but the address will be misaligned. It's only safe when the compiler folds the load into a memory operand for pmovzxbq, which has a 2-byte memory operand and therefore doesn't require alignment.
In fact current GCC does compile your code with a movdqa 16-byte load like movdqa xmm0, XMMWORD PTR [rax+rdi*2] before a reg-reg pmovzx. This is obviously a missed optimization. :( clang/LLVM (which MacOS installs as gcc) does fold the load into pmovzx.
The safe way is _mm_cvtepi8_epi64( _mm_cvtsi32_si128(MASK_TABLE[mask]) ) or something, and then hoping the compiler optimizes away the zero-extend from 2 to 4 bytes and folds the movd into a load when you enable optimization. Or maybe try _mm_loadu_si32 for a 32-bit load even though you really want 16. But last time I tried, compilers sucked at folding a 64-bit load intrinsic into a memory operand for pmovzxbw for example. GCC and clang still fail at it, but ICC19 succeeds. https://godbolt.org/z/IdgoKV
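For reference, a minimal sketch of that safe version (the function name mask2b_safe is mine, and it uses the same MASK_TABLE as the question); whether the compiler folds the movd into a single vpmovsxbq load is exactly the gamble described above:
#include <immintrin.h>
#include <stdint.h>
static const uint16_t MASK_TABLE[4] = { 0x0000, 0x0080, 0x8000, 0x8080 };
// Only a 2-byte scalar load is expressed; no 16-byte deref of a 2-byte object.
__m128i mask2b_safe(uint64_t mask) {
    return _mm_cvtepi8_epi64(_mm_cvtsi32_si128(MASK_TABLE[mask]));
}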
I've written about this before:
Loading 8 chars from memory into an __m256 variable as packed single precision floats
How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?
Your integer -> vector strategy
Your choice of pmovsx seems odd. You don't need sign-extension, so I would have picked pmovzx (_mm_cvtepu8_epi64). It's not actually more efficient on any CPUs, though.
A lookup table does work here with only a small amount of static data needed. If your mask range was any bigger, you'd maybe want to look into
is there an inverse instruction to the movemask instruction in intel avx2? for alternative strategies like broadcast + AND + (shift or compare).
If you do this often, using a whole cache line of 4x 16-byte vector constants might be best so you don't need a pmovzx instruction, just an index into an aligned table of xmm2q or __m128i vectors which can be a memory source for any other SSE instruction. Use alignas(64) to get all the constants in the same cache line.
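A rough sketch of that table-of-vectors idea (MASK_LUT and mask2b_lut are names I made up; the all-ones constants here are illustrative - use whatever bit patterns the consumers of your mask actually need):
#include <immintrin.h>
#include <stdint.h>
// One 64-byte cache line holding all 4 possible 2-element masks.
alignas(64) static const uint64_t MASK_LUT[4][2] = {
    { 0, 0 },
    { ~0ULL, 0 },
    { 0, ~0ULL },
    { ~0ULL, ~0ULL },
};
__m128i mask2b_lut(uint64_t mask) {
    // Plain aligned load; a table entry can also be used directly as a
    // memory source operand for pand/por/pblendvb etc.
    return _mm_load_si128((const __m128i *)MASK_LUT[mask]);
}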
You could also consider (intrinsics for) pdep + movd xmm0, eax + pmovzxbq reg-reg if you're targeting Intel CPUs with BMI2. (pdep is slow on AMD, though).
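And a sketch of the pdep idea (my own names, untested; compile with -msse4.1 -mbmi2), using the same 0x80-byte encoding as the question's table so the sign-extending pmovsxbq produces the same result:
#include <immintrin.h>
#include <stdint.h>
__m128i mask2b_pdep(uint64_t mask) {
    // Deposit bit 0 into bit 7 of byte 0 and bit 1 into bit 7 of byte 1,
    // i.e. the same two bytes MASK_TABLE[mask] would give.
    uint32_t bytes = _pdep_u32((uint32_t)mask, 0x8080u);
    // movd + pmovsxbq reg-reg (sign-extends the 0x80 bytes like the original).
    return _mm_cvtepi8_epi64(_mm_cvtsi32_si128((int)bytes));
}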

Related


Why does _mm_extract_ps return an int instead of a float?
What's the proper way to read a single float from an XMM register in C?
Or rather, a different way to ask it is: What's the opposite of the _mm_set_ps instruction?
None of the answers appear to actually answer the question: why does it return an int?
The reason is, the extractps instruction actually copies a component of the vector to a general-purpose register. It does seem pretty silly for it to return an int, but that's what's actually happening - the raw floating-point value ends up in a general-purpose register (which holds integers).
If your compiler is configured to generate SSE for all floating point operations, then the closest thing to "extracting" a value to a register would be to shuffle the value into the low component of the vector, then cast it to a scalar float. This should cause that component of the vector to remain in an SSE register:
/* returns the second component of the vector */
float foo(__m128 b)
{
return _mm_cvtss_f32(_mm_shuffle_ps(b, b, _MM_SHUFFLE(0, 0, 0, 2)));
}
The _mm_cvtss_f32 intrinsic is free, it does not generate instructions, it only makes the compiler reinterpret the xmm register as a float so it can be returned as such.
The _mm_shuffle_ps gets the desired value into the lowest component. The _MM_SHUFFLE macro generates an immediate operand for the resulting shufps instruction.
The 2 in the example gets the float from bits 95:64 of the 127:0 register (the 3rd 32-bit component from the beginning, in memory order) and places it in the 31:0 component of the register (the beginning, in memory order).
The resulting generated code will most likely return the value naturally in a register, like any other floating point value return, with no inefficient writing out to memory and reading it back.
If you're generating code that uses the x87 FPU for floating point (for normal C code that isn't SSE optimized), this would probably result in inefficient code being generated - the compiler would probably store out the component of the SSE vector then use fld to read it back into the x87 register stack. In general 64-bit platforms don't use x87 (they use SSE for all floating point, mostly scalar instructions unless the compiler is vectorizing).
I should add that I always use C++, so I'm not sure whether it is more efficient to pass __m128 by value or by pointer in C. In C++ I would use a const __m128 & and this kind of code would be in a header, so the compiler can inline.
Confusingly, int _mm_extract_ps() is not for getting a scalar float element from a vector. The intrinsic doesn't expose the memory-destination form of the instruction (which can be useful for that purpose). This is not the only case where the intrinsics can't directly express everything an instruction is useful for. :(
gcc and clang know how the asm instruction works and will use it that way for you when compiling other shuffles; type-punning the _mm_extract_ps result to float usually results in horrible asm from gcc (extractps eax, xmm0, 2 / mov [mem], eax).
The name makes sense if you think of _mm_extract_ps as extracting an IEEE 754 binary32 float bit pattern out of the FP domain of the CPU into the integer domain (as a C scalar int), instead of manipulating FP bit patterns with integer vector ops. According to my testing with gcc, clang, and icc (see below), this is the only "portable" use-case where _mm_extract_ps compiles into good asm across all compilers. Anything else is just a compiler-specific hack to get the asm you want.
The corresponding asm instruction is EXTRACTPS r/m32, xmm, imm8. Notice that the destination can be memory or an integer register, but not another XMM register. It's the FP equivalent of PEXTRD r/m32, xmm, imm8 (also in SSE4.1), where the integer-register-destination form is more obviously useful. EXTRACTPS is not the reverse of INSERTPS xmm1, xmm2/m32, imm8.
Perhaps this similarity with PEXTRD makes the internal implementation simpler without hurting the extract-to-memory use-case (for asm, not intrinsics), or maybe the SSE4.1 designers at Intel thought it was actually more useful this way than as a non-destructive FP-domain copy-and-shuffle (which x86 seriously lacks without AVX). There are FP-vector instructions that have an XMM source and a memory-or-xmm destination, like MOVSS xmm2/m32, xmm, so this kind of instruction would not be new. Fun fact: the opcodes for PEXTRD and EXTRACTPS differ only in the last bit.
In assembly, a scalar float is just the low element of an XMM register (or 4 bytes in memory). The upper elements of the XMM don't even have to be zeroed for instructions like ADDSS to work without raising any extra FP exceptions. In calling conventions that pass/return FP args in XMM registers (e.g. all the usual x86-64 ABIs), float foo(float a) must assume that the upper elements of XMM0 hold garbage on entry, but can leave garbage in the high elements of XMM0 on return. (More info).
As #doug points out, other shuffle instructions can be used to get a float element of a vector into the bottom of an xmm register. This was already a mostly-solved problem in SSE1/SSE2, and it seems EXTRACTPS and INSERTPS weren't trying to solve it for register operands.
SSE4.1 INSERTPS xmm1, xmm2/m32, imm8 is one of the best ways for compilers to implement _mm_set_ss(function_arg) when the scalar float is already in a register and they can't/don't optimize away zeroing the upper elements. (Which is most of the time for compilers other than clang). That linked question also further discusses the failure of intrinsics to expose the load or store versions of instructions like EXTRACTPS, INSERTPS, and PMOVZX that have a memory operand narrower than 128b (thus not requiring alignment even without AVX). It can be impossible to write safe code that compiles as efficiently as what you can do in asm.
Without AVX 3-operand SHUFPS, x86 doesn't provide a fully efficient and general-purpose way to copy-and-shuffle an FP vector the way integer PSHUFD can. SHUFPS is a different beast unless used in-place with src=dst. Preserving the original requires a MOVAPS, which costs a uop and latency on CPUs before IvyBridge, and always costs code-size. Using PSHUFD between FP instructions costs latency (bypass delays). (See this horizontal-sum answer for some tricks, like using SSE3 MOVSHDUP).
SSE4.1 INSERTPS can extract one element into a separate register, but AFAIK it still has a dependency on the previous value of the destination even when all the original values are replaced. False dependencies like that are bad for out-of-order execution. xor-zeroing a register as a destination for INSERTPS would still be 2 uops, and have lower latency than MOVAPS+SHUFPS on SSE4.1 CPUs without mov-elimination for zero-latency MOVAPS (only Penryn, Nehalem, Sandybridge. Also Silvermont if you include low-power CPUs). The code-size is slightly worse, though.
Using _mm_extract_ps and then type-punning the result back to float (as suggested in the currently-accepted answer and its comments) is a bad idea. It's easy for your code to compile to something horrible (like EXTRACTPS to memory and then load back into an XMM register) on either gcc or icc. Clang seems to be immune to braindead behaviour and does its usual shuffle-compiling with its own choice of shuffle instructions (including appropriate use of EXTRACTPS).
I tried these examples with gcc5.4 -O3 -msse4.1 -mtune=haswell, clang3.8.1, and icc17, on the Godbolt compiler explorer. I used C mode, not C++, but union-based type punning is allowed in GNU C++ as an extension to ISO C++. Pointer-casting for type-punning violates strict aliasing in C99 and C++, even with GNU extensions.
#include <immintrin.h>
// gcc:bad clang:good icc:good
void extr_unsafe_ptrcast(__m128 v, float *p) {
// violates strict aliasing
*(int*)p = _mm_extract_ps(v, 2);
}
gcc: # others extractps with a memory dest
extractps eax, xmm0, 2
mov DWORD PTR [rdi], eax
ret
// gcc:good clang:good icc:bad
void extr_pun(__m128 v, float *p) {
// union type punning is safe in C99 (and GNU C and GNU C++)
union floatpun { int i; float f; } fp;
fp.i = _mm_extract_ps(v, 2);
*p = fp.f; // compiles to an extractps straight to memory
}
icc:
vextractps eax, xmm0, 2
mov DWORD PTR [rdi], eax
ret
// gcc:good clang:good icc:horrible
void extr_gnu(__m128 v, float *p) {
// gcc uses extractps with a memory dest, icc does extr_store
*p = v[2];
}
gcc/clang:
extractps DWORD PTR [rdi], xmm0, 2
icc:
vmovups XMMWORD PTR [-24+rsp], xmm0
mov eax, DWORD PTR [-16+rsp] # reload from red-zone tmp buffer
mov DWORD PTR [rdi], eax
// gcc:good clang:good icc:poor
void extr_shuf(__m128 v, float *p) {
__m128 e2 = _mm_shuffle_ps(v,v, 2);
*p = _mm_cvtss_f32(e2); // gcc uses extractps
}
icc: (others: extractps right to memory)
vshufps xmm1, xmm0, xmm0, 2
vmovss DWORD PTR [rdi], xmm1
When you want the final result in an xmm register, it's up to the compiler to optimize away your extractps and do something completely different. Gcc and clang both succeed, but ICC doesn't.
// gcc:good clang:good icc:bad
float ret_pun(__m128 v) {
union floatpun { int i; float f; } fp;
fp.i = _mm_extract_ps(v, 2);
return fp.f;
}
gcc:
unpckhps xmm0, xmm0
clang:
shufpd xmm0, xmm0, 1
icc17:
vextractps DWORD PTR [-8+rsp], xmm0, 2
vmovss xmm0, DWORD PTR [-8+rsp]
Note that icc did poorly for extr_pun, too, so it doesn't like union-based type-punning for this.
The clear winner here is doing the shuffle "manually" with _mm_shuffle_ps(v,v, 2), and using _mm_cvtss_f32. We got optimal code from every compiler for both register and memory destinations, except for ICC which failed to use EXTRACTPS for the memory-dest case. With AVX, SHUFPS + separate store is still only 2 uops on Intel CPUs, just larger code size and needs a tmp register. Without AVX, though, it would cost a MOVAPS to not destroy the original vector :/
According to Agner Fog's instruction tables, all Intel CPUs except Nehalem implement the register-destination versions of both PEXTRD and EXTRACTPS with multiple uops: Usually just a shuffle uop + a MOVD uop to move data from the vector domain to gp-integer. Nehalem register-destination EXTRACTPS is 1 uop for port 5, with 1+2 cycle latency (1 + bypass delay).
I have no idea why they managed to implement EXTRACTPS as a single uop but not PEXTRD (which is 2 uops, and runs in 2+1 cycle latency). Nehalem MOVD is 1 uop (and runs on any ALU port), with 1+1 cycle latency. (The +1 is for the bypass delay between vec-int and general-purpose integer regs, I think).
Nehalem cares a lot about vector FP vs. integer domains; SnB-family CPUs have smaller (sometimes zero) bypass-delay latencies between domains.
The memory-dest versions of PEXTRD and EXTRACTPS are both 2 uops on Nehalem.
On Broadwell and later, memory-destination EXTRACTPS and PEXTRD are 2 uops, but on Sandybridge through Haswell, memory-destination EXTRACTPS is 3 uops. Memory-destination PEXTRD is 2 uops on everything except Sandybridge, where it's 3. This seems odd, and Agner Fog's tables do sometimes have errors, but it's possible. Micro-fusion doesn't work with some instructions on some microarchitectures.
If either instruction had turned out to be extremely useful for anything important (e.g. inside inner loops), CPU designers would build execution units that could do the whole thing as one uop (or maybe 2 for the memory-dest). But that potentially requires more bits in the internal uop format (which Sandybridge simplified).
Fun fact: _mm_extract_epi32(vec, 0) compiles (on most compilers) to movd eax, xmm0 which is shorter and faster than pextrd eax, xmm0, 0.
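To illustrate that fun fact (the function name is mine; most compilers emit a single movd eax, xmm0 here rather than pextrd eax, xmm0, 0):
#include <immintrin.h>
int low_i32(__m128i v) {
    return _mm_extract_epi32(v, 0);   // usually compiles to: movd eax, xmm0
}
// _mm_cvtsi128_si32(v) asks for the same thing more directly.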
Interestingly, they perform differently on Nehalem (which cares a lot about vector FP vs. integer domains, and came out soon after SSE4.1 was introduced in Penryn (45nm Core2)). EXTRACTPS with a register destination is 1 uop, with 1+2 cycle latency (the +2 from a bypass delay between the FP and integer domains). PEXTRD is 2 uops, and runs in 2+1 cycle latency.
From the MSDN docs, I believe you can cast the result to a float.
Note from their example, the 0xc0a40000 value is equivalent to -5.125 (a.m128_f32[1]).
Update: I strongly recommend the answers from #doug65536 and #PeterCordes (below) in lieu of mine, which apparently generates poorly performing code on many compilers.
Try _mm_storeu_ps, or any of the variations of SSE store operations.

Instruction/intrinsic for taking higher half of uint64_t in C++?

Imagine the following code:
Try it online!
uint64_t x = 0x81C6E3292A71F955ULL;
uint32_t y = (uint32_t) (x >> 32);
y receives the high 32-bit part of the 64-bit integer. My question is whether there exists any intrinsic function or any CPU instruction that does this in a single operation, without doing a move and a shift?
At least Clang (linked in the Try-it-online above) creates two instructions, mov rax, rdi and shr rax, 32, for this, so either Clang doesn't do such an optimization, or there exists no such special instruction.
It would be great if there were a single instruction like an imaginary movhi dst_reg, src_reg.
If there was a better way to do this bitfield-extraction for an arbitrary uint64_t, compilers would already use it. (At least in theory; compilers do have missed optimizations, and their choices sometimes favour latency even if it costs more uops.)
You only need intrinsics for things that you can't express efficiently in pure C, in ways the compiler can already easily understand. (Or if your compiler is dumb and can't spot the obvious.)
You could maybe imagine cases where the input value comes from the multiply of two 32-bit values, then it might be worthwhile on some CPUs for the compiler to use widening mul r32 to already generate the result in two separate 32-bit registers, instead of imul r64, r64 + shr reg,32, if it can easily use EAX/EDX. But other than gcc -mtune=silvermont or other tuning options, you can't make the compiler do it that way.
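For instance, a function like the following (hypothetical name) could in principle compile to a one-operand mul with the high half already in EDX, though in practice you usually get imul + shr unless tuning options say otherwise:
#include <stdint.h>
uint32_t mulhi_u32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);   // high half of the 32x32 product
}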
shr reg, 32 has 1 cycle latency, and can run on more than 1 execution port on most modern x86 microarchitectures (https://uops.info/). The only thing one might wish for is that it could put the result in a different register, without overwriting the input.
Most modern non-x86 ISAs are RISC-like with 3-operand instructions, so a shift instruction can copy-and-shift, unlike x86 shifts where the compiler needs a mov in addition to shr if it also needs the original 64-bit value later, or (in the case of a tiny function) needs the return value in a different register.
And some ISAs have bitfield-extract instructions. PowerPC even has a fun rotate-and-mask instruction (rlwinm) (with the mask being a bit-range specified by immediates), and it's a different instruction from a normal shift. Compilers will use it as appropriate - no need for an intrinsic. https://devblogs.microsoft.com/oldnewthing/20180810-00/?p=99465
x86 with BMI2 has rorx rax, rdi, 32 to copy-and-rotate, instead of being stuck shifting within the same register. A function returning uint32_t could/should use that instead of mov+shr, in the stand-alone version that doesn't inline because the caller already has to ignore high garbage in RAX. (Both x86-64 System V and Windows x64 define the return value as only the register width matching the C type of the arg; e.g. returning uint32_t means that the high 32 bits of RAX are not part of the return value, and can hold anything. Usually they're zero because writing a 32-bit register implicitly zero-extends to 64, but something like return bar() where bar returns uint64_t can just leave RAX untouched without having to truncate it; in fact an optimized tailcall is possible.)
There's no intrinsic for rorx; compilers are just supposed to know when to use it. (But gcc/clang -O3 -march=haswell miss this optimization.) https://godbolt.org/z/ozjhcc8Te
If a compiler was doing this in a loop, it could have 32 in a register for shrx reg,reg,reg as a copy-and-shift. Or more silly, it could use pext with 0xffffffffULL << 32 as the mask. But that's strictly worse than shrx because of the higher latency.
AMD TBM (Bulldozer-family only, not Zen) had an immediate form of bextr (bitfield-extract), and it ran efficiently as 1 uop (https://agner.org/optimize/). https://godbolt.org/z/bn3rfxzch shows gcc11 -O3 -march=bdver4 (Excavator) uses bextr rax, rdi, 0x2020, while clang misses that optimization. gcc -march=znver1 uses mov + shr because Zen dropped Trailing Bit Manipulation along with the XOP extension.
Standard BMI1 bextr needs position/len in a register, and on Intel CPUs is 2 uops so it's garbage for this. It does have an intrinsic, but I recommend not using it. mov+shr is faster on Intel CPUs.

How to get efficient asm for zeroing a tiny struct with MSVC++ for x86-32?

My project is compiled for 32-bit in both Windows and Linux. I have an 8-byte struct that's used just about everywhere:
struct Value {
unsigned char type;
union { // 4 bytes
unsigned long ref;
float num;
};
};
In a lot of places I need to zero out the struct, which is done like so:
#define NULL_VALUE_LITERAL {0, {0L}};
static const Value NULL_VALUE = NULL_VALUE_LITERAL;
// example of clearing a value
var = NULL_VALUE;
This however does not compile to the most efficient code in Visual Studio 2013, even with all optimizations on. What I see in the assembly is that the memory location for NULL_VALUE is being read, then written to the var. This results in two reads from memory and two writes to memory. This clearing however happens a lot, even in routines that are time-sensitive, and I'm looking to optimize.
If I set the value to NULL_VALUE_LITERAL, it's worse. The literal data, which again is all zeroes, is copied into a temporary stack value and THEN copied to the variable--even if the variable is also on the stack. So that's absurd.
There's also a common situation like this:
*pd->v1 = NULL_VALUE;
It has similar assembly code to the var=NULL_VALUE above, but it's something I can't optimize with inline assembly should I choose to go that route.
From my research the very, very fastest way to clear the memory would be something like this:
xor eax, eax
mov byte ptr [var], al
mov dword ptr [var+4], eax
Or better still, since the struct alignment means there's just junk for 3 bytes after the data type:
xor eax, eax
mov dword ptr [var], eax
mov dword ptr [var+4], eax
Can you think of any way I can get code similar to that, optimized to avoid the memory reads that are totally unnecessary?
I tried some other methods, which end up creating what I feel is overly bloated code writing a 32-bit 0 literal to the two addresses, but IIRC writing a literal to memory still isn't as fast as writing a register to memory. I'm looking to eke out any extra performance I can get.
Ideally I would also like the result to be highly readable. Your help is appreciated.
I'd recommend uint32_t or unsigned int for the union with float. long on Linux x86-64 is a 64-bit type, which is probably not what you want.
I can reproduce the missed-optimization with MSVC CL19 -Ox on the Godbolt compiler explorer for x86-32 and x86-64. Workarounds that work with CL19:
make type an unsigned int instead of char, so there's no padding in the struct, then assign from a literal {0, {0L}} instead of a static const Value object. (Then you get two mov-immediate stores: mov DWORD PTR [eax], 0 / mov DWORD PTR [eax+4], 0). A sketch of this is shown below, after the memset example.
gcc also has struct-zeroing missed-optimizations with padding in structs, but not as bad as MSVC (Bug 82142). It just defeats merging into wider stores; it doesn't get gcc to create an object on the stack and copy from that.
std::memset: probably the best option, MSVC compiles it to a single 64-bit store using SSE2. xorps xmm0, xmm0 / movq QWORD PTR [mem], xmm0. (gcc -m32 -O3 compiles this memset to two mov-immediate stores.)
void arg_memset(Value *vp) {
memset(vp, 0, sizeof(*vp));   // i.e. sizeof(Value), 8 bytes
}
;; x86 (32-bit) MSVC -Ox
mov eax, DWORD PTR _vp$[esp-4]
xorps xmm0, xmm0
movq QWORD PTR [eax], xmm0
ret 0
This is what I'd choose for modern CPUs (Intel and AMD). The penalty for crossing a cache-line is low enough that it's worth saving an instruction if it doesn't happen all the time. xor-zeroing is extremely cheap (especially on Intel SnB-family).
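Here's a sketch of the first workaround listed above (the type name Value2 is mine, and I've used uint32_t in the union as suggested at the top of this answer):
#include <cstdint>
struct Value2 {               // no padding: 'type' widened to 32 bits
    unsigned int type;
    union {
        std::uint32_t ref;    // instead of unsigned long
        float num;
    };
};
inline void clear(Value2 &v) {
    v = {0, {0u}};            // per the workaround above: two mov DWORD PTR [...], 0 stores
}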
IIRC writing a literal to memory still isn't as fast as writing a register to memory
In asm, constants embedded in the instruction are called immediate data. mov-immediate to memory is mostly fine on x86, but it's a bit bloated for code-size.
(x86-64 only): A store with a RIP-relative addressing mode and an immediate can't micro-fuse on Intel CPUs, so it's 2 fused-domain uops. (See Agner Fog's microarch pdf, and other links in the x86 tag wiki.) This means it's worth it (for front-end bandwidth) to zero a register if you're doing more than one store to a RIP-relative address. Other addressing modes do fuse, though, so it's just a code-size issue.
Related: Micro fusion and addressing modes (indexed addressing modes un-laminate on Sandybridge/Ivybridge, but Haswell and later can keep indexed stores micro-fused.) This isn't dependent on immediate vs. register source.
I think memset would be a very poor fit since this is just an 8-byte struct.
Modern compilers know what some heavily-used / important standard library functions do (memset, memcpy, etc.), and treat them like intrinsics. There's very little difference as far as optimization is concerned between a = b and memcpy(&a, &b, sizeof(a)) if they have the same type.
You might get a function call to the actual library implementation in debug mode, but debug mode is very slow anyway. If you have debug-mode perf requirements, that's unusual. (But does happen for code that needs to keep up with something else...)

How to set MMX registers in a Windows exception handler to emulate unsupported 3DNow! instructions

I'm trying to revive an old Win32 game that uses 3DNow! instruction set to make 3D rendering.
On modern OSs like Win7 - Win10, instructions like PFADD or PFMUL are not allowed and the program throws an exception.
Since the number of 3DNow! instructions used by the game is very limited, in my VS2008 MFC program I tried to use vectored exception handling to get the value of the MMX registers, emulate the 3DNow! instructions in C code, and push the values back to the processor's 3DNow! registers.
So far I have succeeded in the first two steps (I get the MMX register values from the ExceptionInfo->ExtendedRegisters byte array at offset 32 and use float-type C code to do the calculations), but my problem is that, no matter how I try to update the MMX register values, the registers seem to stay unchanged.
Assuming that my _asm statements might be wrong, I also did some minimal tests using simple statements like this:
_asm movq mm0, mm7
This statement is executed without further exceptions, but when retrieving the MMX register values I still find that the original values were unchanged.
How can I make the assignment effective?
On modern OSs like Win7 - Win10 instructions like PFADD or PFMUL are not allowed
More likely your CPU doesn't support 3DNow! AMD dropped it for Bulldozer-family, and Intel never supported it. So unless you're running modern Windows on an Athlon64 / Phenom (or a Via C3), your CPU doesn't support it.
(Fun fact: PREFETCHW was originally a 3DNow! instruction, and is still supported (with its own CPUID feature bit). For a long time Intel CPUs ran it as a NOP, but Broadwell and later (IIRC) do actually prefetch a cache line into Exclusive state with a Read-For-Ownership.)
Unless this game only ever ran on AMD hardware, it must have a code path that avoids 3DNow. Fix its CPU detection to stop detecting your CPU as having 3DNow. (Maybe you have a recent AMD, and it assumes any AMD has 3DNow?)
(update on that: OP's comments say that the other code paths don't work for some reason. That's a problem.)
Returning from an exception handler probably restores registers from saved state, so it's not surprising that changing register values in the exception handler has no effect on the main program.
Apparently updating ExtendedRegisters in memory doesn't do the trick, though, so that's only a copy of the saved state.
The answer to modifying MMX registers from an exception handler is probably the same as for integer or XMM registers, so look up MS's documentation for that.
Alternative suggestion:
Rewrite the 3DNow code to use SSE2. (You said there's only a tiny amount of it?). SSE2 is baseline for x86-64, and generally safe to assume for 32-bit x86.
Without source, you could still modify the asm for the few functions that use 3DNow. You can literally just change the instructions to use 64-bit loads/stores into XMM registers instead of 3DNow! 64-bit loads/stores, and replace PFMUL with mulps, etc. (This could get slightly hairy if you run out of registers and the 3DNow code used a memory source operand. addps xmm0, [mem] requires 16-byte-aligned memory, and does a 16-byte load. So you may have to add a spill/reload to borrow another register as a temporary.)
If you don't have room to rewrite the functions in-place, put in a jmp to somewhere you do have room to add new code.
Most of the 3DNow instructions have equivalents in SSE, but you may need some extra movaps instructions to copy registers around to implement PFCMPGE. If you can ignore the possibility of NaN, you can use cmpps with a not-less-than predicate. (Without AVX, SSE only has compare predicates based on less-than or not-less-than).
PFSUBR is easy to emulate with a spare register, just copy and subps to reverse. (Or SUBPS and invert the sign with XORPS). PFRCPIT1 (reciprocal-sqrt first iteration of refinement) and so on don't have a single-instruction implementation, but you can probably just use sqrtps and divps if you don't want to implement Newton-Raphson iterations with mulps and addps (or with AVX vfmadd). Modern CPUs are much faster than what this game was designed for.
You can load / store a pair of single-precision floats from/to memory into the bottom 64 bits of an XMM register using movsd (the SSE2 double-precision load/store instruction). You can also store a pair with movlps, but still use movsd for loading because it zeros the upper half instead of merging, so it doesn't have a dependency on the old value of the register.
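As a hedged illustration of those building blocks in intrinsics form (the function names are mine, not from the game; if you're patching the binary you'd just use the corresponding instructions directly):
#include <emmintrin.h>   // SSE2
// Load 2 packed floats into the low half of an XMM register, upper half zeroed
// (compiles to a movsd load).
static __m128 load_pair(const float *p) {
    return _mm_castpd_ps(_mm_load_sd((const double *)p));
}
// Store the low 2 floats back to memory (an 8-byte movsd/movlpd store).
static void store_pair(float *p, __m128 v) {
    _mm_store_sd((double *)p, _mm_castps_pd(v));
}
// PFMUL equivalent: mulps on the low pair (zeroed upper halves are harmless).
static __m128 pfmul_pair(__m128 a, __m128 b) {
    return _mm_mul_ps(a, b);
}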
Use movdq2q mm0, xmm0 and movq2dq xmm0, mm0 to move data between XMM and MMX.
Use movaps xmm1, xmm0 to copy registers, even if your data is only in the low half. (movsd xmm1, xmm0 merges the low half into the original high half. movq xmm1, xmm0 zeros the high half.)
addps and mulps work fine with zeros in the upper half. (They can slow down if any garbage in the upper half produces a denormal result, so prefer keeping the upper half zeroed.) See http://felixcloutier.com/x86/ for an instruction-set reference (and other links in the x86 tag wiki).
Any shuffling of FP data can be done in XMM registers with shufps or pshufd instead of copying back to MMX registers to use whatever MMX shuffles.

When does data move around between SSE registers and the stack?

I'm not exactly sure what happens when I call _mm_load_ps. I mean, I know I load an array of 4 floats into a __m128, which I can use to do SIMD-accelerated arithmetic and then store them back, but isn't this __m128 data type still on the stack? Obviously there aren't enough registers for an arbitrary number of vectors to be loaded in. So are these 128 bits of data moved back and forth each time you use some SIMD instruction to do computations? If so, then what is the point of _mm_load_ps?
Maybe I have it all wrong?
In just the same way that an int variable may reside in a register or in memory (or even both, at different times), the same is true of an SSE variable such as __m128. If there are sufficient free XMM registers then the compiler will normally try to keep the variable in a register (unless you do something unhelpful, like take the address of the variable), but if there is too much register pressure then some variables may spill to memory.
An Intel processor with SSE, AVX, or AVX-512 can have from 8 to 32 SIMD registers (see below). The number of registers also depends on whether it's 32-bit or 64-bit code. So when you call _mm_load_ps the values are loaded into a SIMD register. If all the registers are in use then some will have to be spilled onto the stack.
Exactly like if you have a lot of int or scalar float variables and the compiler can't keep all the currently "live" ones in registers - load/store intrinsics mostly just exist to tell the compiler about alignment, and as an alternative to pointer-casting to other C data types. Not because they have to compile to actual loads or stores, or because those are the only ways for compilers to emit vector load or store instructions.
Processor with SSE
8 128-bit registers labeled XMM0 - XMM7 //32-bit operating mode
16 128-bit registers labeled XMM0 - XMM15 //64-bit operating mode
Processor with AVX/AVX2
8 256-bit registers labeled YMM0 - YMM7 //32-bit operating mode
16 256-bit registers labeled YMM0 - YMM15 //64-bit operating mode
Processor with AVX-512 (2015/2016 servers, Ice Lake laptop, ?? desktop)
8 512-bit registers labeled ZMM0 - ZMM7 //32-bit operating mode
32 512-bit registers labeled ZMM0 - ZMM31 //64-bit operating mode
Wikipedia has a good summary of this in its AVX-512 article.
(Of course, the compiler can only use x/y/zmm16..31 if you tell it it's allowed to use AVX-512 instructions. Having an AVX-512-capable CPU does you no good when running machine code compiled to work on CPUs with only AVX2.)
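Tying back to the point about load intrinsics mostly conveying alignment, here's a minimal sketch (illustrative names) where the _mm_load_ps never has to become a stand-alone load: with optimization it typically folds into the mulps memory operand, and the __m128 value lives only in a register.
#include <xmmintrin.h>
__m128 scale(const float *p, __m128 factor) {
    __m128 v = _mm_load_ps(p);       // promises 16-byte alignment
    return _mm_mul_ps(v, factor);    // often compiles to: mulps xmm0, [rdi]
}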