Use intrinsics to load bytes as 32-bit SIMD elements [duplicate] - c++

I am optimizing an algorithm for Gaussian blur on an image and I want to replace the usage of a float buffer[8] in the code below with an __m256 intrinsic variable. What series of instructions is best suited for this task?
// unsigned char *new_image is loaded with data
...
float buffer[8];
buffer[x ] = new_image[x];
buffer[x + 1] = new_image[x + 1];
buffer[x + 2] = new_image[x + 2];
buffer[x + 3] = new_image[x + 3];
buffer[x + 4] = new_image[x + 4];
buffer[x + 5] = new_image[x + 5];
buffer[x + 6] = new_image[x + 6];
buffer[x + 7] = new_image[x + 7];
// buffer is then used for further operations
...
//What I want instead in pseudocode:
__m256 b = [float(new_image[x+7]), float(new_image[x+6]), ... , float(new_image[x])];

If you're using AVX2, you can use PMOVZX to zero-extend your chars into 32-bit integers in a 256b register. From there, conversion to float can happen in-place.
; rsi = new_image
VPMOVZXBD ymm0, [rsi] ; or SX to sign-extend (Byte to DWord)
VCVTDQ2PS ymm0, ymm0 ; convert to packed float
This is a good strategy even if you want to do this for multiple vectors, but even better might be a 128-bit broadcast load to feed vpmovzxbd ymm,xmm and vpshufb ymm (_mm256_shuffle_epi8) for the high 64 bits, because Intel SnB-family CPUs don't micro-fuse a vpmovzx ymm,mem, only vpmovzx xmm,mem. (https://agner.org/optimize/). Broadcast loads are single uop with no ALU port required, running purely in a load port. So this is 3 total uops for bcast-load + vpmovzx + vpshufb.
(TODO: write an intrinsics version of that. It also sidesteps the problem of missed optimizations for _mm_loadl_epi64 -> _mm256_cvtepu8_epi32.)
Of course this requires a shuffle control vector in another register, so it's only worth it if you can use that multiple times.
vpshufb is usable because the data needed for each lane is there from the broadcast, and the high bit of the shuffle-control will zero the corresponding element.
This broadcast + shuffle strategy might be good on Ryzen; Agner Fog doesn't list uop counts for vpmovsx/zx ymm on it.
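Here's a rough intrinsics sketch of that idea (untested; the function name and the two-output-pointer interface are just for illustration, and it assumes AVX2). One 16-byte broadcast-load feeds the bytes for two __m256 results:
#include <immintrin.h>
#include <stdint.h>
// Sketch only: one vbroadcastf128 load feeds vpmovzxbd (low 8 bytes) and vpshufb (high 8 bytes).
static inline void bytes16_to_2x_m256(const uint8_t *p, __m256 *lo, __m256 *hi)
{
    // vbroadcastf128: the same 16 bytes in both 128-bit lanes; pure load-port uop on Haswell+
    __m256i v = _mm256_castps_si256(_mm256_broadcast_ps((const __m128 *)p));
    // bytes 0..7 -> 8 dwords (vpmovzxbd ymm, xmm)
    __m256i lo_dwords = _mm256_cvtepu8_epi32(_mm256_castsi256_si128(v));
    // bytes 8..15 -> 8 dwords via in-lane vpshufb: the low lane picks bytes 8..11,
    // the high lane picks bytes 12..15; -1 (high bit set) zeroes the other byte positions
    const __m256i shuf = _mm256_setr_epi8(
         8,-1,-1,-1,  9,-1,-1,-1, 10,-1,-1,-1, 11,-1,-1,-1,
        12,-1,-1,-1, 13,-1,-1,-1, 14,-1,-1,-1, 15,-1,-1,-1);
    __m256i hi_dwords = _mm256_shuffle_epi8(v, shuf);
    *lo = _mm256_cvtepi32_ps(lo_dwords);
    *hi = _mm256_cvtepi32_ps(hi_dwords);
}
The shuf constant is what you'd want to hoist into a register and reuse across iterations.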
Do not do something like a 128-bit or 256-bit load and then shuffle that to feed further vpmovzx instructions. Total shuffle throughput will probably already be a bottleneck because vpmovzx is a shuffle. Intel Haswell/Skylake (the most common AVX2 uarches) have 1-per-clock shuffles but 2-per-clock loads. Using extra shuffle instructions instead of folding separate memory operands into vpmovzxbd is terrible. Only if you can reduce total uop count like I suggested with broadcast-load + vpmovzxbd + vpshufb is it a win.
My answer on Scaling byte pixel values (y=ax+b) with SSE2 (as floats)? may be relevant for converting back to uint8_t. The pack-back-to-bytes afterward part is semi-tricky if doing it with AVX2 packssdw/packuswb, because they work in-lane, unlike vpmovzx.
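For a single vector, one simple way to dodge the in-lane problem is to narrow to 128-bit first and use the 128-bit pack instructions (my own sketch, assuming AVX2; not the code from that linked answer):
// Sketch only: 8 floats in a __m256 back to 8 uint8_t, with unsigned saturation.
static inline void store8_floats_as_bytes(uint8_t *dst, __m256 v)
{
    __m256i i32 = _mm256_cvtps_epi32(v);              // round to nearest int
    __m128i lo  = _mm256_castsi256_si128(i32);
    __m128i hi  = _mm256_extracti128_si256(i32, 1);
    __m128i w16 = _mm_packs_epi32(lo, hi);            // 8 signed words, in order
    __m128i b8  = _mm_packus_epi16(w16, w16);         // 8 unsigned bytes (duplicated in the high half)
    _mm_storel_epi64((__m128i *)dst, b8);             // store only the low 8 bytes
}
Packing two vectors per store (feeding both pack inputs with real data) would amortize better, at the cost of the in-lane fixup mentioned above.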
With only AVX1, not AVX2, you should do:
VPMOVZXBD xmm0, [rsi]
VPMOVZXBD xmm1, [rsi+4]
VINSERTF128 ymm0, ymm0, xmm1, 1 ; put the 2nd load of data into the high128 of ymm0
VCVTDQ2PS ymm0, ymm0 ; convert to packed float. Yes, works without AVX2
You of course never need an array of float, just __m256 vectors.
GCC / MSVC missed optimizations for VPMOVZXBD ymm,[mem] with intrinsics
GCC and MSVC are bad at folding a _mm_loadl_epi64 into a memory operand for vpmovzx*. (But at least there is a load intrinsic of the right width, unlike for pmovzxbq xmm, word [mem].)
We get a vmovq load and then a separate vpmovzx with an XMM input. (With ICC and clang3.6+ we get safe + optimal code from using _mm_loadl_epi64, like from gcc9+)
But gcc8.3 and earlier can fold a _mm_loadu_si128 16-byte load intrinsic into an 8-byte memory operand. This gives optimal asm at -O3 on GCC, but is unsafe at -O0 where it compiles to an actual vmovdqu load that touches more data than we actually load, and could go off the end of a page.
Two gcc bugs submitted because of this answer:
SSE/AVX movq load (_mm_cvtsi64_si128) not being folded into pmovzx (fixed for gcc9, but the fix breaks load folding for a 128-bit load so the workaround hack for old GCC makes gcc9 do worse.)
No intrinsic for x86 MOVQ m64, %xmm in 32bit mode. (TODO: report this for clang/LLVM as well?)
There's no intrinsic to use SSE4.1 pmovsx / pmovzx as a load, only with a __m128i source operand. But the asm instructions only read the amount of data they actually use, not a 16-byte __m128i memory source operand. Unlike punpck*, you can use this on the last 8B of a page without faulting. (And on unaligned addresses even with the non-AVX version).
So here's the evil solution I've come up with. Don't use this, #ifdef __OPTIMIZE__ is Bad, making it possible to create bugs that only happen in the debug build or only in the optimized build!
#if !defined(__OPTIMIZE__)
// Making your code compile differently with/without optimization is a TERRIBLE idea
// great way to create Heisenbugs that disappear when you try to debug them.
// Even if you *plan* to always use -Og for debugging, instead of -O0, this is still evil
#define USE_MOVQ
#endif
__m256 load_bytes_to_m256(uint8_t *p)
{
#ifdef USE_MOVQ // compiles to an actual vmovq then vpmovzxbd ymm, xmm with gcc8.3 -O3
__m128i small_load = _mm_loadl_epi64( (const __m128i*)p);
#else // USE_LOADU // compiles to a 128b load with gcc -O0, potentially segfaulting
__m128i small_load = _mm_loadu_si128( (const __m128i*)p );
#endif
__m256i intvec = _mm256_cvtepu8_epi32( small_load );
//__m256i intvec = _mm256_cvtepu8_epi32( *(__m128i*)p ); // compiles to an aligned load with -O0
return _mm256_cvtepi32_ps(intvec);
}
With USE_MOVQ enabled, gcc -O3 (v5.3.0) emits this (and so does MSVC):
load_bytes_to_m256(unsigned char*):
vmovq xmm0, QWORD PTR [rdi]
vpmovzxbd ymm0, xmm0
vcvtdq2ps ymm0, ymm0
ret
The stupid vmovq is what we want to avoid. If you let it use the unsafe loadu_si128 version, it will make good optimized code.
GCC9, clang, and ICC emit:
load_bytes_to_m256(unsigned char*):
vpmovzxbd ymm0, qword ptr [rdi] # ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
vcvtdq2ps ymm0, ymm0
ret
Writing the AVX1-only version with intrinsics is left as an un-fun exercise for the reader. You asked for "instructions", not "intrinsics", and this is one place where there's a gap in the intrinsics. Having to use _mm_cvtsi64_si128 to avoid potentially loading from out-of-bounds addresses is stupid, IMO. I want to be able to think of intrinsics in terms of the instructions they map to, with the load/store intrinsics as informing the compiler about alignment guarantees or lack thereof. Having to use the intrinsic for an instruction I don't want is pretty dumb.
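For what it's worth, here's one hedged attempt at that exercise (mine, untested). To stay within the loads the intrinsics actually give us, it does a single 8-byte _mm_loadl_epi64 and shifts the upper half down, costing one extra shuffle compared to the two 4-byte memory-operand loads in the asm version above:
// Sketch only: AVX1 + SSE4.1, no AVX2. Reads exactly the 8 bytes it converts.
static inline __m256 load_bytes_to_m256_avx1(const uint8_t *p)
{
    __m128i eight = _mm_loadl_epi64((const __m128i *)p);         // movq: bytes 0..7
    __m128i lo    = _mm_cvtepu8_epi32(eight);                     // pmovzxbd: bytes 0..3
    __m128i hi    = _mm_cvtepu8_epi32(_mm_srli_si128(eight, 4));  // bytes 4..7
    __m256i v     = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1); // vinsertf128
    return _mm256_cvtepi32_ps(v);                                 // vcvtdq2ps ymm needs only AVX1
}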
Also note that if you're looking in the Intel insn ref manual, there are two separate entries for movq:
movd/movq, the version that can have an integer register as a src/dest operand (66 REX.W 0F 6E (or VEX.128.66.0F.W1 6E) for (V)MOVQ xmm, r/m64). That's where you'll find the intrinsic that can accept a 64-bit integer, _mm_cvtsi64_si128. (Some compilers don't define it in 32-bit mode.)
movq: the version that can have two xmm registers as operands. This one is an extension of the MMXreg -> MMXreg instruction, which can also load/store, like MOVDQU. Its opcode is F3 0F 7E (VEX.128.F3.0F.WIG 7E) for MOVQ xmm, xmm/m64.
The asm ISA ref manual only lists the __m128i _mm_mov_epi64(__m128i a) intrinsic for zeroing the high 64b of a vector while copying it. But the intrinsics guide does list _mm_loadl_epi64(__m128i const* mem_addr) which has a stupid prototype (a pointer to a 16-byte __m128i type when it really only loads 8 bytes). It is available on all 4 of the major x86 compilers, and should actually be safe. Note that the __m128i* is just passed to this opaque intrinsic, not actually dereferenced.
The more sane _mm_loadu_si64 (void const* mem_addr) is also listed, but gcc is missing that one.

Related

Reinterpret casting from __m256i to __m256 [duplicate]

Why does _mm_extract_ps return an int instead of a float?
What's the proper way to read a single float from an XMM register in C?
Or rather, a different way to ask it is: What's the opposite of the _mm_set_ps instruction?
None of the answers appear to actually answer the question: why does it return an int?
The reason is that the extractps instruction actually copies a component of the vector to a general-purpose register. It does seem pretty silly for it to return an int, but that's what's actually happening - the raw floating point value ends up in a general-purpose register (which holds integers).
If your compiler is configured to generate SSE for all floating point operations, then the closest thing to "extracting" a value to a register would be to shuffle the value into the low component of the vector, then cast it to a scalar float. This should cause that component of the vector to remain in an SSE register:
/* returns the second component of the vector */
float foo(__m128 b)
{
return _mm_cvtss_f32(_mm_shuffle_ps(b, b, _MM_SHUFFLE(0, 0, 0, 2)));
}
The _mm_cvtss_f32 intrinsic is free, it does not generate instructions, it only makes the compiler reinterpret the xmm register as a float so it can be returned as such.
The _mm_shuffle_ps gets the desired value into the lowest component. The _MM_SHUFFLE macro generates an immediate operand for the resulting shufps instruction.
The 2 in the example gets the float from bits 95:64 of the 127:0 register (the 3rd 32-bit component from the beginning, in memory order) and places it in the 31:0 component of the register (the beginning, in memory order).
The resulting generated code will most likely return the value naturally in a register, like any other floating point value return, with no inefficient writing out to memory and reading it back.
If you're generating code that uses the x87 FPU for floating point (for normal C code that isn't SSE optimized), this would probably result in inefficient code being generated - the compiler would probably store out the component of the SSE vector then use fld to read it back into the x87 register stack. In general 64-bit platforms don't use x87 (they use SSE for all floating point, mostly scalar instructions unless the compiler is vectorizing).
I should add that I always use C++, so I'm not sure whether it is more efficient to pass __m128 by value or by pointer in C. In C++ I would use a const __m128 & and this kind of code would be in a header, so the compiler can inline.
Confusingly, int _mm_extract_ps() is not for getting a scalar float element from a vector. The intrinsic doesn't expose the memory-destination form of the instruction (which can be useful for that purpose). This is not the only case where the intrinsics can't directly express everything an instruction is useful for. :(
gcc and clang know how the asm instruction works and will use it that way for you when compiling other shuffles; type-punning the _mm_extract_ps result to float usually results in horrible asm from gcc (extractps eax, xmm0, 2 / mov [mem], eax).
The name makes sense if you think of _mm_extract_ps as extracting an IEEE 754 binary32 float bit pattern out of the FP domain of the CPU into the integer domain (as a C scalar int), instead of manipulating FP bit patterns with integer vector ops. According to my testing with gcc, clang, and icc (see below), this is the only "portable" use-case where _mm_extract_ps compiles into good asm across all compilers. Anything else is just a compiler-specific hack to get the asm you want.
The corresponding asm instruction is EXTRACTPS r/m32, xmm, imm8. Notice that the destination can be memory or an integer register, but not another XMM register. It's the FP equivalent of PEXTRD r/m32, xmm, imm8 (also in SSE4.1), where the integer-register-destination form is more obviously useful. EXTRACTPS is not the reverse of INSERTPS xmm1, xmm2/m32, imm8.
Perhaps this similarity with PEXTRD makes the internal implementation simpler without hurting the extract-to-memory use-case (for asm, not intrinsics), or maybe the SSE4.1 designers at Intel thought it was actually more useful this way than as a non-destructive FP-domain copy-and-shuffle (which x86 seriously lacks without AVX). There are FP-vector instructions that have an XMM source and a memory-or-xmm destination, like MOVSS xmm2/m32, xmm, so this kind of instruction would not be new. Fun fact: the opcodes for PEXTRD and EXTRACTPS differ only in the last bit.
In assembly, a scalar float is just the low element of an XMM register (or 4 bytes in memory). The upper elements of the XMM don't even have to be zeroed for instructions like ADDSS to work without raising any extra FP exceptions. In calling conventions that pass/return FP args in XMM registers (e.g. all the usual x86-64 ABIs), float foo(float a) must assume that the upper elements of XMM0 hold garbage on entry, but can leave garbage in the high elements of XMM0 on return. (More info).
As @doug points out, other shuffle instructions can be used to get a float element of a vector into the bottom of an xmm register. This was already a mostly-solved problem in SSE1/SSE2, and it seems EXTRACTPS and INSERTPS weren't trying to solve it for register operands.
SSE4.1 INSERTPS xmm1, xmm2/m32, imm8 is one of the best ways for compilers to implement _mm_set_ss(function_arg) when the scalar float is already in a register and they can't/don't optimize away zeroing the upper elements. (Which is most of the time for compilers other than clang). That linked question also further discusses the failure of intrinsics to expose the load or store versions of instructions like EXTRACTPS, INSERTPS, and PMOVZX that have a memory operand narrower than 128b (thus not requiring alignment even without AVX). It can be impossible to write safe code that compiles as efficiently as what you can do in asm.
Without AVX 3-operand SHUFPS, x86 doesn't provide a fully efficient and general-purpose way to copy-and-shuffle an FP vector the way integer PSHUFD can. SHUFPS is a different beast unless used in-place with src=dst. Preserving the original requires a MOVAPS, which costs a uop and latency on CPUs before IvyBridge, and always costs code-size. Using PSHUFD between FP instructions costs latency (bypass delays). (See this horizontal-sum answer for some tricks, like using SSE3 MOVSHDUP).
SSE4.1 INSERTPS can extract one element into a separate register, but AFAIK it still has a dependency on the previous value of the destination even when all the original values are replaced. False dependencies like that are bad for out-of-order execution. xor-zeroing a register as a destination for INSERTPS would still be 2 uops, and have lower latency than MOVAPS+SHUFPS on SSE4.1 CPUs without mov-elimination for zero-latency MOVAPS (only Penryn, Nehalem, Sandybridge. Also Silvermont if you include low-power CPUs). The code-size is slightly worse, though.
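If you want to see what that looks like with intrinsics, something like this should map to a single insertps with a zeroing mask (a sketch, not a recommendation; the shuffle + _mm_cvtss_f32 approach below is still what I'd use):
// Sketch only: SSE4.1. Element 2 of v into element 0, other elements zeroed.
// imm8 0x8E = (source element 2) << 6 | (dest element 0) << 4 | zero-mask 0b1110
float extract_elem2_via_insertps(__m128 v)
{
    return _mm_cvtss_f32(_mm_insert_ps(v, v, 0x8E));
}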
Using _mm_extract_ps and then type-punning the result back to float (as suggested in the currently-accepted answer and its comments) is a bad idea. It's easy for your code to compile to something horrible (like EXTRACTPS to memory and then load back into an XMM register) on either gcc or icc. Clang seems to be immune to braindead behaviour and does its usual shuffle-compiling with its own choice of shuffle instructions (including appropriate use of EXTRACTPS).
I tried these examples with gcc5.4 -O3 -msse4.1 -mtune=haswell, clang3.8.1, and icc17, on the Godbolt compiler explorer. I used C mode, not C++, but union-based type punning is allowed in GNU C++ as an extension to ISO C++. Pointer-casting for type-punning violates strict aliasing in C99 and C++, even with GNU extensions.
#include <immintrin.h>
// gcc:bad clang:good icc:good
void extr_unsafe_ptrcast(__m128 v, float *p) {
// violates strict aliasing
*(int*)p = _mm_extract_ps(v, 2);
}
gcc: # others extractps with a memory dest
extractps eax, xmm0, 2
mov DWORD PTR [rdi], eax
ret
// gcc:good clang:good icc:bad
void extr_pun(__m128 v, float *p) {
// union type punning is safe in C99 (and GNU C and GNU C++)
union floatpun { int i; float f; } fp;
fp.i = _mm_extract_ps(v, 2);
*p = fp.f; // compiles to an extractps straight to memory
}
icc:
vextractps eax, xmm0, 2
mov DWORD PTR [rdi], eax
ret
// gcc:good clang:good icc:horrible
void extr_gnu(__m128 v, float *p) {
// gcc uses extractps with a memory dest, icc does extr_store
*p = v[2];
}
gcc/clang:
extractps DWORD PTR [rdi], xmm0, 2
icc:
vmovups XMMWORD PTR [-24+rsp], xmm0
mov eax, DWORD PTR [-16+rsp] # reload from red-zone tmp buffer
mov DWORD PTR [rdi], eax
// gcc:good clang:good icc:poor
void extr_shuf(__m128 v, float *p) {
__m128 e2 = _mm_shuffle_ps(v,v, 2);
*p = _mm_cvtss_f32(e2); // gcc uses extractps
}
icc: (others: extractps right to memory)
vshufps xmm1, xmm0, xmm0, 2
vmovss DWORD PTR [rdi], xmm1
When you want the final result in an xmm register, it's up to the compiler to optimize away your extractps and do something completely different. Gcc and clang both succeed, but ICC doesn't.
// gcc:good clang:good icc:bad
float ret_pun(__m128 v) {
union floatpun { int i; float f; } fp;
fp.i = _mm_extract_ps(v, 2);
return fp.f;
}
gcc:
unpckhps xmm0, xmm0
clang:
shufpd xmm0, xmm0, 1
icc17:
vextractps DWORD PTR [-8+rsp], xmm0, 2
vmovss xmm0, DWORD PTR [-8+rsp]
Note that icc did poorly for extr_pun, too, so it doesn't like union-based type-punning for this.
The clear winner here is doing the shuffle "manually" with _mm_shuffle_ps(v,v, 2), and using _mm_cvtss_f32. We got optimal code from every compiler for both register and memory destinations, except for ICC which failed to use EXTRACTPS for the memory-dest case. With AVX, SHUFPS + separate store is still only 2 uops on Intel CPUs, just larger code size and needs a tmp register. Without AVX, though, it would cost a MOVAPS to not destroy the original vector :/
According to Agner Fog's instruction tables, all Intel CPUs except Nehalem implement the register-destination versions of both PEXTRD and EXTRACTPS with multiple uops: Usually just a shuffle uop + a MOVD uop to move data from the vector domain to gp-integer. Nehalem register-destination EXTRACTPS is 1 uop for port 5, with 1+2 cycle latency (1 + bypass delay).
I have no idea why they managed to implement EXTRACTPS as a single uop but not PEXTRD (which is 2 uops, and runs in 2+1 cycle latency). Nehalem MOVD is 1 uop (and runs on any ALU port), with 1+1 cycle latency. (The +1 is for the bypass delay between vec-int and general-purpose integer regs, I think).
Nehalem cares a lot about vector-FP vs. integer domains; SnB-family CPUs have smaller (sometimes zero) bypass-delay latencies between domains.
The memory-dest versions of PEXTRD and EXTRACTPS are both 2 uops on Nehalem.
On Broadwell and later, memory-destination EXTRACTPS and PEXTRD are 2 uops, but on Sandybridge through Haswell, memory-destination EXTRACTPS is 3 uops. Memory-destination PEXTRD is 2 uops on everything except Sandybridge, where it's 3. This seems odd, and Agner Fog's tables do sometimes have errors, but it's possible. Micro-fusion doesn't work with some instructions on some microarchitectures.
If either instruction had turned out to be extremely useful for anything important (e.g. inside inner loops), CPU designers would build execution units that could do the whole thing as one uop (or maybe 2 for the memory-dest). But that potentially requires more bits in the internal uop format (which Sandybridge simplified).
Fun fact: _mm_extract_epi32(vec, 0) compiles (on most compilers) to movd eax, xmm0 which is shorter and faster than pextrd eax, xmm0, 0.
Interestingly, they perform differently on Nehalem (which cares a lot about vector-FP vs. integer domains, and came out soon after SSE4.1 was introduced in Penryn (45nm Core2)). EXTRACTPS with a register destination is 1 uop, with 1+2 cycle latency (the +2 from a bypass delay between the FP and integer domains). PEXTRD is 2 uops, and runs in 2+1 cycle latency.
From the MSDN docs, I believe you can cast the result to a float.
Note from their example, the 0xc0a40000 value is equivalent to -5.125 (a.m128_f32[1]).
Update: I strongly recommend the answers from @doug65536 and @PeterCordes (below) in lieu of mine, which apparently generates poorly performing code on many compilers.
Try _mm_storeu_ps, or any of the variations of SSE store operations.

Load and duplicate 4 single precision float numbers into a packed __m256 variable with fewest instructions

I have a float array containing A,B,C,D 4 float numbers and I wish to load them into a __m256 variable like AABBCCDD. What's the best way to do this?
I know using _mm256_set_ps() is always an option but it seems slow with 8 CPU instructions. Thanks.
If your data was the result of another vector calculation (and in a __m128), you'd want AVX2 vpermps (_mm256_permutevar8x32_ps) with a control vector of _mm256_set_epi32(3,3, 2,2, 1,1, 0,0).
vpermps ymm is 1 uop on Intel, but 2 uops on Zen2 (with 2 cycle throughput). And 3 uops on Zen1 with one per 4 clock throughput. (https://uops.info/)
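A minimal sketch of that case (my naming):
// Sketch only: AVX2, input already in a __m128 from a previous calculation.
__m256 dup4floats_from_vec(__m128 v)
{
    __m256 v256 = _mm256_castps128_ps256(v);  // upper 128 bits undefined, but we only index elements 0..3
    return _mm256_permutevar8x32_ps(v256, _mm256_set_epi32(3,3, 2,2, 1,1, 0,0)); // vpermps
}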
If it was the result of separate scalar calculations, you might want to shuffle them together with _mm_set_ps(d,d, c,c) (1x vshufps) to set up for a vinsertf128.
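And a sketch of the separate-scalars case (again my naming; whether each _mm_set_ps really compiles to a single vunpcklps/vshufps depends on the compiler):
// Sketch only: AVX1. a,b,c,d come from separate scalar calculations.
__m256 dup4floats_from_scalars(float a, float b, float c, float d)
{
    __m128 lo = _mm_set_ps(b, b, a, a);  // memory order: a, a, b, b
    __m128 hi = _mm_set_ps(d, d, c, c);  // memory order: c, c, d, d
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1); // vinsertf128
}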
But with data in memory, I think your best bet is a 128-bit broadcast-load, then an in-lane shuffle. It only requires AVX1, and on modern CPUs it's 1 load + 1 shuffle uop on Zen2 and Haswell and later. It's also efficient on Zen1: the only lane-crossing shuffle being the 128-bit broadcast-load.
Using an in-lane shuffle is lower-latency than lane-crossing on both Intel and Zen2 (256-bit shuffle execution units). This still requires a 32-byte shuffle control vector constant, but if you need to do this frequently it will typically / hopefully stay hot in cache.
__m256 duplicate4floats(void *p) {
__m256 v = _mm256_broadcast_ps((const __m128 *) p); // vbroadcastf128
v = _mm256_permutevar_ps(v, _mm256_set_epi32(3,3, 2,2, 1,1, 0,0)); // vpermilps
return v;
}
Modern CPUs handle broadcast-loads right in the load port, no shuffle uop needed. (Sandybridge does need a port 5 shuffle uop for vbroadcastf128, unlike narrower broadcasts, but Haswell and later are purely port 2/3. But SnB doesn't support AVX2 so a lane-crossing shuffle with granularity less than 128-bit wasn't an option.)
So even if AVX2 is available, I think AVX1 instructions are more efficient here. On Zen1, vbroadcastf128 is 2 uops, vs. 1 for a 128-bit vmovups, but vpermps (lane-crossing) is 3 uops vs. 2 for vpermilps.
Unfortunately, clang pessimizes this into a vmovups load and a vpermps ymm, but GCC compiles it as written. (Godbolt)
If you wanted to avoid using a shuffle-control vector constant, vpmovzxdq ymm, [mem] (2 uops on Intel) could get the elements set up for vmovsldup (a 1-uop in-lane shuffle). Or broadcast-load and vunpckl/hps, then blend?
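A sketch of that vpmovzxdq + vmovsldup idea (mine; no shuffle-control constant needed, though the compiler may or may not fold the load into vpmovzxdq):
// Sketch only: AVX2. AABBCCDD without a shuffle-control vector constant.
__m256 duplicate4floats_no_constant(const float *p)
{
    __m128i four   = _mm_loadu_si128((const __m128i *)p);      // the 4 floats, as raw 32-bit lanes
    __m256i spread = _mm256_cvtepu32_epi64(four);              // vpmovzxdq: A 0 B 0 C 0 D 0
    return _mm256_moveldup_ps(_mm256_castsi256_ps(spread));    // vmovsldup: A A B B C C D D
}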
I know using _mm256_set_ps() is always an option but it seems slow with 8 CPU instructions.
Get a better compiler, then! (Or remember to enable optimization.)
__m256 duplicate4floats_naive(const float *p) {
return _mm256_set_ps(p[3],p[3], p[2], p[2], p[1],p[1], p[0],p[0]);
}
compiles with gcc (https://godbolt.org/z/dMzh3fezE) into
duplicate4floats_naive(float const*):
vmovups xmm1, XMMWORD PTR [rdi]
vpermilps xmm0, xmm1, 80
vpermilps xmm1, xmm1, 250
vinsertf128 ymm0, ymm0, xmm1, 0x1
ret
So 3 shuffle uops, not great. And it could have used vshufps instead of vpermilps to save code-size and let it run on more ports on Ice Lake. But still vastly better than 8 instructions.
clang's shuffle optimizer makes the same asm as with my optimized intrinsics, because that's how clang is. It's pretty decent optimization, just not quite optimal.
duplicate4floats_naive(float const*):
vmovups xmm0, xmmword ptr [rdi]
vmovaps ymm1, ymmword ptr [rip + .LCPI1_0] # ymm1 = [0,0,1,1,2,2,3,3]
vpermps ymm0, ymm1, ymm0
ret
_mm_load_ps -> _mm256_castps128_ps256 -> _mm256_permute_ps

Optimizing XOR of `char` with the highest byte of `int`

Let's say we have int i and char c.
When we use i ^= c, the compiler will XOR c with the lowest byte of i, and will translate the code to a single processor instruction.
When we need to XOR c with the highest byte of i, we can do something like this:
i ^= c << ((sizeof(i) - sizeof(c)) * 8)
but the compiler will generate two instructions: XOR and BIT-SHIFT.
Is there a way to XOR a char with the highest byte of an int that will be translated to a single processor instruction in C++?
If you are confident about the byte order of the system, for example by checking __BYTE_ORDER__ or an equivalent macro on your system, you can do something like this:
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ // little endian, so the highest byte is at the end
*(&reinterpret_cast<char&>(i) + sizeof i - 1) ^= c;
#else
// big endian, highest byte at the beginning
reinterpret_cast<char&>(i) ^= c;
#endif
Compilers are really smart about such simple arithmetic and bitwise operations. They don't emit a single instruction here simply because they can't: there is no such instruction on x86. It isn't worth wasting valuable opcode space on rarely used operations like that. Most operations are done on the whole register anyway, and working on just part of a register is inefficient for the CPU, because the out-of-order execution and register-renaming machinery has to work a lot harder. That's why x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register, why modifying the low part of a register in x86 (like AL or AX) can be slower than modifying the whole RAX, and why INC can be slower than ADD 1 because of the partial-flag update.
That said, there are architectures that can do a combined SHIFT and XOR in a single instruction, like ARM (eor rX, rX, rY, lsl #24): ARM's designers spent a big part of the instruction encoding on the predication and shift fields, trading that off against a smaller number of registers. But again the premise is wrong, because the fact that something can be executed in a single instruction doesn't mean it'll be faster. Modern CPUs are very complex, because every instruction has its own latency, throughput and set of execution ports. For example, if a CPU can execute 4 pairs of SHIFT-then-XOR in parallel, then obviously it'll be faster than another CPU that can only run 4 single SHIFT-XOR instructions sequentially, provided the clock speed is the same.
This is a very typical XY problem: what you're asking for is simply the wrong way to do it. Operations that need to be done thousands or millions of times (or more) are a job for the GPU or the SIMD unit.
For example, this is what the Clang compiler emits for a loop XORing the top byte of i with c on an x86 CPU with AVX-512:
vpslld zmm0, zmm0, 24 # shift
vpslld zmm1, zmm1, 24
vpslld zmm2, zmm2, 24
vpslld zmm3, zmm3, 24
vpxord zmm0, zmm0, zmmword ptr [rdi + 4*rdx] # xor
vpxord zmm1, zmm1, zmmword ptr [rdi + 4*rdx + 64]
vpxord zmm2, zmm2, zmmword ptr [rdi + 4*rdx + 128]
vpxord zmm3, zmm3, zmmword ptr [rdi + 4*rdx + 192]
By doing that it achieves 16 SHIFT-and-XOR operations with just 2 instructions. Imagine how fast that is. You can unroll further, up to 32 zmm registers, for even higher performance until you saturate RAM bandwidth. That's why all high-performance architectures have some kind of SIMD, which makes fast parallel operations much easier, rather than a rarely-useful scalar SHIFT-XOR instruction. Even on ARM, which has a single-instruction SHIFT-XOR, the compiler is smart enough to know that SIMD is faster than a series of eor rX, rX, rY, lsl #24. The output looks like this:
shl v3.4s, v3.4s, 24 # shift
shl v2.4s, v2.4s, 24
shl v1.4s, v1.4s, 24
shl v0.4s, v0.4s, 24
eor v3.16b, v3.16b, v7.16b # xor
eor v2.16b, v2.16b, v6.16b
eor v1.16b, v1.16b, v4.16b
eor v0.16b, v0.16b, v5.16b
Here's a demo for the above snippets
That'll be even faster when running in parallel on multiple cores. The GPU is also able to exploit a very high level of parallelism, hence modern cryptography and compute-intensive mathematical problems are often done on the GPU. It can break a password or encrypt a file faster than a general-purpose CPU with SIMD.
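Coming back to the AVX-512 example: a loop of roughly this shape (my reconstruction, not necessarily the linked demo's exact source) is what auto-vectorizes into that vpslld + vpxord sequence:
#include <cstddef>
#include <cstdint>
// Sketch only: XOR each c into the top byte of the corresponding int.
void xor_top_bytes(uint32_t *ints, const uint8_t *chars, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        ints[i] ^= uint32_t(chars[i]) << 24;  // widen, shift, XOR: 16 elements per vpslld/vpxord pair
}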
Don't assume that the compiler will generate a shift for the code in the question. Most modern compilers are smarter than that:
int i = 0; // global in memory
char c = 0;
void foo ()
{
c ^= (i >> 24);
}
Compiles (https://godbolt.org/z/b6l8qk) with GCC (trunk) -O3 to this asm for x86-64:
foo():
movsx eax, BYTE PTR i[rip+3] # sign-extending byte load
xor BYTE PTR c[rip], al # memory-destination byte xor
ret
clang, ICC, and MSVC all emit equivalent code (some using a movzx load which is more efficient on some CPUs than movsx).
This only really helps if the int is in memory, not a register, which is hopefully not the case in a tight loop unless it's different integers (like in an array).

Get sum of values stored in __m256d with SSE/AVX

Is there a way to get the sum of the values stored in a __m256d variable? I have this code:
acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec));
//acc at this point contains {2.0, 8.0, 18.0, 32.0}
acc = _mm256_hadd_pd(acc, acc);
result[i] = ((double*)&acc)[0] + ((double*)&acc)[2];
This code works, but I want to replace it with SSE/AVX instructions.
It appears that you're doing a horizontal sum for every element of an output array. (Perhaps as part of a matmul?) This is usually sub-optimal; try to vectorize over the 2nd-from-inner loop so you can produce result[i + 0..3] in a vector and not need a horizontal sum at all.
For a dot-product of an array larger than one vector, sum vertically (into multiple accumulators), only hsumming once at the end.
For horizontal reductions in general, see Fastest way to do horizontal SSE vector sum (or other reduction) - extract the high half and add to the low half. Repeat until you're down to 1 element.
If you're using this inside an inner loop, you definitely don't want to be using hadd(same,same). That costs 2 shuffle uops instead of 1, unless your compiler saves you from yourself. (And gcc/clang don't.) hadd is good for code-size but pretty much nothing else when you only have 1 vector. It can be useful and efficient with two different inputs.
For AVX, this means the only 256-bit operation we need is an extract, which is fast on AMD and Intel. Then the rest is all 128-bit:
#include <immintrin.h>
inline
double hsum_double_avx(__m256d v) {
__m128d vlow = _mm256_castpd256_pd128(v);
__m128d vhigh = _mm256_extractf128_pd(v, 1); // high 128
vlow = _mm_add_pd(vlow, vhigh); // reduce down to 128
__m128d high64 = _mm_unpackhi_pd(vlow, vlow);
return _mm_cvtsd_f64(_mm_add_sd(vlow, high64)); // reduce to scalar
}
If you wanted the result broadcast to every element of a __m256d, you'd use vshufpd and vperm2f128 to swap high/low halves (if tuning for Intel). And use 256-bit FP add the whole time. If you cared about early Ryzen at all, you might reduce to 128, use _mm_shuffle_pd to swap, then vinsertf128 to get a 256-bit vector. Or with AVX2, vbroadcastsd on the final result of this. But that would be slower on Intel than staying 256-bit the whole time while still avoiding vhaddpd.
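Here's a sketch of that broadcast-to-every-element variant, staying 256-bit the whole time (tuned per the Intel suggestion above; untested):
// Sketch only: AVX1. Sum of all 4 doubles, broadcast to every element of the result.
__m256d hsum256d_broadcast(__m256d v)
{
    __m256d swapped = _mm256_shuffle_pd(v, v, 0x5);              // swap the pair within each 128-bit lane
    __m256d sums    = _mm256_add_pd(v, swapped);                 // [a+b, a+b, c+d, c+d]
    __m256d lanes   = _mm256_permute2f128_pd(sums, sums, 0x01);  // swap the two 128-bit lanes
    return _mm256_add_pd(sums, lanes);                           // a+b+c+d in all 4 elements
}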
hsum_double_avx compiled with gcc7.3 -O3 -march=haswell on the Godbolt compiler explorer:
vmovapd xmm1, xmm0 # silly compiler, vextract to xmm1 instead
vextractf128 xmm0, ymm0, 0x1
vaddpd xmm0, xmm1, xmm0
vunpckhpd xmm1, xmm0, xmm0 # no wasted code bytes on an immediate for vpermilpd or vshufpd or anything
vaddsd xmm0, xmm0, xmm1 # scalar means we never raise FP exceptions for results we don't use
vzeroupper
ret
After inlining (which you definitely want it to), vzeroupper sinks to the bottom of the whole function, and hopefully the vmovapd optimizes away, with vextractf128 into a different register instead of destroying xmm0 which holds the _mm256_castpd256_pd128 result.
On first-gen Ryzen (Zen 1 / 1+), according to Agner Fog's instruction tables, vextractf128 is 1 uop with 1c latency, and 0.33c throughput.
@PaulR's version is unfortunately terrible on AMD before Zen 2; it's like something you might find in an Intel library or compiler output as a "cripple AMD" function. (I don't think Paul did that on purpose, I'm just pointing out how ignoring AMD CPUs can lead to code that runs slower on them.)
On Zen 1, vperm2f128 is 8 uops, 3c latency, and one per 3c throughput. vhaddpd ymm is 8 uops (vs. the 6 you might expect), 7c latency, one per 3c throughput. Agner says it's a "mixed domain" instruction. And 256-bit ops always take at least 2 uops.
# Paul's version # Ryzen # Skylake
vhaddpd ymm0, ymm0, ymm0 # 8 uops # 3 uops
vperm2f128 ymm1, ymm0, ymm0, 49 # 8 uops # 1 uop
vaddpd ymm0, ymm0, ymm1 # 2 uops # 1 uop
# total uops: # 18 # 5
vs.
# my version with vmovapd optimized out: extract to a different reg
vextractf128 xmm1, ymm0, 0x1 # 1 uop # 1 uop
vaddpd xmm0, xmm1, xmm0 # 1 uop # 1 uop
vunpckhpd xmm1, xmm0, xmm0 # 1 uop # 1 uop
vaddsd xmm0, xmm0, xmm1 # 1 uop # 1 uop
# total uops: # 4 # 4
Total uop throughput is often the bottleneck in code with a mix of loads, stores, and ALU, so I expect the 4-uop version is likely to be at least a little better on Intel, as well as much better on AMD. It should also make slightly less heat, and thus allow slightly higher turbo / use less battery power. (But hopefully this hsum is a small enough part of your total loop that this is negligible!)
The latency is no worse, either, so there's really no reason to use an inefficient vhaddpd / vperm2f128 version.
Zen 2 and later have 256-bit wide vector registers and execution units (including shuffle). They don't have to split lane-crossing shuffles into many uops, but conversely vextractf128 is no longer about as cheap as vmovdqa xmm. Zen 2 is a lot closer to Intel's cost model for 256-bit vectors.
You can do it like this:
acc = _mm256_hadd_pd(acc, acc); // horizontal add top lane and bottom lane
acc = _mm256_add_pd(acc, _mm256_permute2f128_pd(acc, acc, 0x31)); // add lanes
result[i] = _mm256_cvtsd_f64(acc); // extract double
Note: if this is in a "hot" (i.e. performance-critical) part of your code (especially if running on an AMD CPU) then you might instead want to look at Peter Cordes's answer regarding more efficient implementations.
In gcc and clang SIMD types are built-in vector types. E.g.:
// avxintrin.h
typedef double __m256d __attribute__((__vector_size__(32), __aligned__(32)));
These built-in vectors support indexing, so you can write it conveniently and leave it up to the compiler to make good code:
double hsum_double_avx2(__m256d v) {
return v[0] + v[1] + v[2] + v[3];
}
clang-14 -O3 -march=znver3 -ffast-math generates the same assembly as it does for Peter Cordes's intrinsics:
# clang -O3 -ffast-math
hsum_double_avx2:
vextractf128 xmm1, ymm0, 1
vaddpd xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
vaddsd xmm0, xmm0, xmm1
vzeroupper
ret
Unfortunately gcc does much worse, generating sub-optimal instructions: it doesn't take advantage of the freedom to re-associate the 3 + operations, and it uses vhaddpd xmm to do the v[0] + v[1] part, which costs 4 uops on Zen 3 (or 3 uops on Intel CPUs: 2 shuffles + an add).
-ffast-math is of course necessary for the compiler to be able to do a good job, unless you write it as (v[0]+v[2]) + (v[1]+v[3]). With that, clang still makes the same asm with -O3 -march=icelake-server without -ffast-math.
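Spelled out, that re-association-free version is just (sketch):
// Sketch: pairs elements the way vextractf128 + vaddpd would, so no -ffast-math is needed.
double hsum_double_avx2_pairs(__m256d v)
{
    return (v[0] + v[2]) + (v[1] + v[3]);
}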
Ideally, I want to write plain code as I did above and let the compiler use a CPU-specific cost model to emit optimal instructions in right order for that specific CPU.
One reason is that a labour-intensive hand-coded version that's optimal for Haswell may well be suboptimal for Zen 3. For this problem specifically, that's not really the case: starting by narrowing to 128-bit with vextractf128 + vaddpd is optimal everywhere. There are minor variations in shuffle throughput on different CPUs; for example Ice Lake and later Intel can run vshufps on port 1 or 5, but some shuffles like vpermilps/pd or vunpckhpd still run only on port 5. Zen 3 (like Zen 2 and 4) has good throughput for either of those shuffles, so clang's asm happens to be good there. But it's unfortunate that clang -march=icelake-server still uses vpermilpd.
A frequent use-case nowadays is computing in the cloud with diverse CPU models and generations, compiling the code on that host with -march=native -mtune=native for best performance.
In theory, if compilers were smarter, this would optimize short sequences like this to ideal asm, as well as making generally good choices for heuristics like inlining and unrolling. It's usually the best choice for a binary that will run on only one machine, but as GCC demonstrates here, the results are often far from optimal. Fortunately modern AMD and Intel aren't too different most of the time, having different throughputs for some instructions but usually being single-uop for the same instructions.

Loading 8 chars from memory into an __m256 variable as packed single precision floats
