Optimizing XOR of `char` with the highest byte of `int`

Optimizing XOR of `char` with the highest byte of `int` - c++

Let we have int i and char c.
When use i ^= c the compiler will XOR c with the lowest byte of i, and will translate the code to the single processor instruction.
When we need to XOR c with highest byte of i we can do something like this:
i ^= c << ((sizeof(i) - sizeof(c)) * 8)
but the compiler will generate two instructions: XOR and BIT-SHIFT.
Is there a way to XOR a char with the highest byte of an int that will be translated to single processor instruction in C++?

If you are confident about the byte-order of the system, for example by checking __BYTE_ORDER__ or equivalent macro on your system, you can do something like this:
#if // Somehow determing if little endian, so biggest byte at the end
*(&reinterpret_cast<char&>(i) + sizeof i - 1) ^= c
#else
// Is big endian, biggest byte at the beginning
reinterpret_cast<char&>(i) ^= c
#endif

Compilers are really smart about such simple arithmetic and bitwise operations. They don't do that simply because they can't, as there are no such instructions on those architectures. It doesn't worth wasting valuable opcode space for rarely used operations like that. Most operations are done in the whole register anyway, and working on just a part of a register is very inefficient for the CPU, because the out-of-order execution or register renaming units will need to work a lot harder. That's the reason why x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register, or why modifying the low part of the register in x86 (like AL or AX) can be slower than modifying the whole RAX. INC can also be slower than ADD 1 because of the partial flag update
That said, there are architectures that can do combined SHIFT and XOR in a single instruction like ARM (eor rX, rX, rY, lsl #24), because ARM designers spent a big part of the instruction encoding for the predication and shifting part, trading off for a smaller number of registers. But again your premise is wrong, because the fact that something can be executed in a single instruction doesn't mean that it'll be faster. Modern CPUs are very complex because every instruction has different latency, throughput and the number of execution ports. For example if a CPU can execute 4 pairs of SHIFT-then-XOR in parallel then obviously it'll be faster than another CPU that can run 4 single SHIFT-XOR instructions sequentially, provided that clock cycle is the same
This is a very typical XY problem, because what you thought is simply the wrong way to do. For operations that need to be done thousands, millions of times or more then it's the job of the GPU or the SIMD unit
For example this is what the Clang compiler emits for a loop XORing the top byte of i with c on an x86 CPU with AVX-512
vpslld zmm0, zmm0, 24 # shift
vpslld zmm1, zmm1, 24
vpslld zmm2, zmm2, 24
vpslld zmm3, zmm3, 24
vpxord zmm0, zmm0, zmmword ptr [rdi + 4*rdx] # xor
vpxord zmm1, zmm1, zmmword ptr [rdi + 4*rdx + 64]
vpxord zmm2, zmm2, zmmword ptr [rdi + 4*rdx + 128]
vpxord zmm3, zmm3, zmmword ptr [rdi + 4*rdx + 192]
By doing that it achieves 16 SHIFT-and-XOR operations with just 2 instruction. Imagine how fast that is. You can unroll more to 32 zmm registers to achieve even higher performance until you saturate the RAM bandwidth. That's why all high-performance architectures have some kind of SIMD which is easier to do fast parallel operations, rather than a useless SHIFT-XOR instruction. Even on ARM with a single-instruction SHIFT-XOR then the compiler will be smart enough to know that SIMD is faster than a series of eor rX, rX, rY, lsl #24. The output is like this
shl v3.4s, v3.4s, 24 # shift
shl v2.4s, v2.4s, 24
shl v1.4s, v1.4s, 24
shl v0.4s, v0.4s, 24
eor v3.16b, v3.16b, v7.16b # xor
eor v2.16b, v2.16b, v6.16b
eor v1.16b, v1.16b, v4.16b
eor v0.16b, v0.16b, v5.16b
Here's a demo for the above snippets
That'll be even faster when running in parallel in multiple cores. The GPU is also able to do very high level or parallelism, hence modern cryptography and intense mathematical problems are often done on the GPU. It can break a password or encrypt a file faster than a general purpose CPU with SIMD

Don't assume that the compiler will generate a shift with the above code. Most modern compilers are smarter than that:
int i = 0; // global in memory
char c = 0;
void foo ()
{
c ^= (i >> 24);
}
Compiles (https://godbolt.org/z/b6l8qk) with GCC (trunk) -O3 to this asm for x86-64:
foo():
movsx eax, BYTE PTR i[rip+3] # sign-extending byte load
xor BYTE PTR c[rip], al # memory-destination byte xor
ret
clang, ICC, and MSVC all emit equivalent code (some using a movzx load which is more efficient on some CPUs than movsx).
This only really helps if the int is in memory, not a register, which is hopefully not the case in a tight loop unless it's different integers (like in an array).

Related

Use intrinsics to load bytes as 32-bit SIMD elements [duplicate]

I am optimizing an algorithm for Gaussian blur on an image and I want to replace the usage of a float buffer[8] in the code below with an __m256 intrinsic variable. What series of instructions is best suited for this task?
// unsigned char *new_image is loaded with data
...
float buffer[8];
buffer[x ] = new_image[x];
buffer[x + 1] = new_image[x + 1];
buffer[x + 2] = new_image[x + 2];
buffer[x + 3] = new_image[x + 3];
buffer[x + 4] = new_image[x + 4];
buffer[x + 5] = new_image[x + 5];
buffer[x + 6] = new_image[x + 6];
buffer[x + 7] = new_image[x + 7];
// buffer is then used for further operations
...
//What I want instead in pseudocode:
__m256 b = [float(new_image[x+7]), float(new_image[x+6]), ... , float(new_image[x])];

If you're using AVX2, you can use PMOVZX to zero-extend your chars into 32-bit integers in a 256b register. From there, conversion to float can happen in-place.
; rsi = new_image
VPMOVZXBD ymm0, [rsi] ; or SX to sign-extend (Byte to DWord)
VCVTDQ2PS ymm0, ymm0 ; convert to packed foat
This is a good strategy even if you want to do this for multiple vectors, but even better might be a 128-bit broadcast load to feed vpmovzxbd ymm,xmm and vpshufb ymm (_mm256_shuffle_epi8) for the high 64 bits, because Intel SnB-family CPUs don't micro-fuse a vpmovzx ymm,mem, only only vpmovzx xmm,mem. (https://agner.org/optimize/). Broadcast loads are single uop with no ALU port required, running purely in a load port. So this is 3 total uops to bcast-load + vpmovzx + vpshufb.
(TODO: write an intrinsics version of that. It also sidesteps the problem of missed optimizations for _mm_loadl_epi64 -> _mm256_cvtepu8_epi32.)
Of course this requires a shuffle control vector in another register, so it's only worth it if you can use that multiple times.
vpshufb is usable because the data needed for each lane is there from the broadcast, and the high bit of the shuffle-control will zero the corresponding element.
This broadcast + shuffle strategy might be good on Ryzen; Agner Fog doesn't list uop counts for vpmovsx/zx ymm on it.
Do not do something like a 128-bit or 256-bit load and then shuffle that to feed further vpmovzx instructions. Total shuffle throughput will probably already be a bottleneck because vpmovzx is a shuffle. Intel Haswell/Skylake (the most common AVX2 uarches) have 1-per-clock shuffles but 2-per-clock loads. Using extra shuffle instructions instead of folding separate memory operands into vpmovzxbd is terrible. Only if you can reduce total uop count like I suggested with broadcast-load + vpmovzxbd + vpshufb is it a win.
My answer on Scaling byte pixel values (y=ax+b) with SSE2 (as floats)? may be relevant for converting back to uint8_t. The pack-back-to-bytes afterward part is semi-tricky if doing it with AVX2 packssdw/packuswb, because they work in-lane, unlike vpmovzx.
With only AVX1, not AVX2, you should do:
VPMOVZXBD xmm0, [rsi]
VPMOVZXBD xmm1, [rsi+4]
VINSERTF128 ymm0, ymm0, xmm1, 1 ; put the 2nd load of data into the high128 of ymm0
VCVTDQ2PS ymm0, ymm0 ; convert to packed float. Yes, works without AVX2
You of course never need an array of float, just __m256 vectors.
GCC / MSVC missed optimizations for VPMOVZXBD ymm,[mem] with intrinsics
GCC and MSVC are bad at folding a _mm_loadl_epi64 into a memory operand for vpmovzx*. (But at least there is a load intrinsic of the right width, unlike for pmovzxbq xmm, word [mem].)
We get a vmovq load and then a separate vpmovzx with an XMM input. (With ICC and clang3.6+ we get safe + optimal code from using _mm_loadl_epi64, like from gcc9+)
But gcc8.3 and earlier can fold a _mm_loadu_si128 16-byte load intrinsic into an 8-byte memory operand. This gives optimal asm at -O3 on GCC, but is unsafe at -O0 where it compiles to an actual vmovdqu load that touches more data that we actually load, and could go off the end of a page.
Two gcc bugs submitted because of this answer:
SSE/AVX movq load (_mm_cvtsi64_si128) not being folded into pmovzx (fixed for gcc9, but the fix breaks load folding for a 128-bit load so the workaround hack for old GCC makes gcc9 do worse.)
No intrinsic for x86 MOVQ m64, %xmm in 32bit mode. (TODO: report this for clang/LLVM as well?)
There's no intrinsic to use SSE4.1 pmovsx / pmovzx as a load, only with a __m128i source operand. But the asm instructions only read the amount of data they actually use, not a 16-byte __m128i memory source operand. Unlike punpck*, you can use this on the last 8B of a page without faulting. (And on unaligned addresses even with the non-AVX version).
So here's the evil solution I've come up with. Don't use this, #ifdef __OPTIMIZE__ is Bad, making it possible to create bugs that only happen in the debug build or only in the optimized build!
#if !defined(__OPTIMIZE__)
// Making your code compile differently with/without optimization is a TERRIBLE idea
// great way to create Heisenbugs that disappear when you try to debug them.
// Even if you *plan* to always use -Og for debugging, instead of -O0, this is still evil
#define USE_MOVQ
#endif
__m256 load_bytes_to_m256(uint8_t *p)
{
#ifdef USE_MOVQ // compiles to an actual movq then movzx ymm, xmm with gcc8.3 -O3
__m128i small_load = _mm_loadl_epi64( (const __m128i*)p);
#else // USE_LOADU // compiles to a 128b load with gcc -O0, potentially segfaulting
__m128i small_load = _mm_loadu_si128( (const __m128i*)p );
#endif
__m256i intvec = _mm256_cvtepu8_epi32( small_load );
//__m256i intvec = _mm256_cvtepu8_epi32( *(__m128i*)p ); // compiles to an aligned load with -O0
return _mm256_cvtepi32_ps(intvec);
}
With USE_MOVQ enabled, gcc -O3 (v5.3.0) emits. (So does MSVC)
load_bytes_to_m256(unsigned char*):
vmovq xmm0, QWORD PTR [rdi]
vpmovzxbd ymm0, xmm0
vcvtdq2ps ymm0, ymm0
ret
The stupid vmovq is what we want to avoid. If you let it use the unsafe loadu_si128 version, it will make good optimized code.
GCC9, clang, and ICC emit:
load_bytes_to_m256(unsigned char*):
vpmovzxbd ymm0, qword ptr [rdi] # ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
vcvtdq2ps ymm0, ymm0
ret
Writing the AVX1-only version with intrinsics is left as an un-fun exercise for the reader. You asked for "instructions", not "intrinsics", and this is one place where there's a gap in the intrinsics. Having to use _mm_cvtsi64_si128 to avoid potentially loading from out-of-bounds addresses is stupid, IMO. I want to be able to think of intrinsics in terms of the instructions they map to, with the load/store intrinsics as informing the compiler about alignment guarantees or lack thereof. Having to use the intrinsic for an instruction I don't want is pretty dumb.
Also note that if you're looking in the Intel insn ref manual, there are two separate entries for movq:
movd/movq, the version that can have an integer register as a src/dest operand (66 REX.W 0F 6E (or VEX.128.66.0F.W1 6E) for (V)MOVQ xmm, r/m64). That's where you'll find the intrinsic that can accept a 64-bit integer, _mm_cvtsi64_si128. (Some compilers don't define it in 32-bit mode.)
movq: the version that can have two xmm registers as operands. This one is an extension of the MMXreg -> MMXreg instruction, which can also load/store, like MOVDQU. Its opcode F3 0F 7E (VEX.128.F3.0F.WIG 7E) for MOVQ xmm, xmm/m64).
The asm ISA ref manual only lists the m128i _mm_mov_epi64(__m128i a) intrinsic for zeroing the high 64b of a vector while copying it. But the intrinsics guide does list _mm_loadl_epi64(__m128i const* mem_addr) which has a stupid prototype (pointer to a 16-byte __m128i type when it really only loads 8 bytes). It is available on all 4 of the major x86 compilers, and should actually be safe. Note that the __m128i* is just passed to this opaque intrinsic, not actually dereferenced.
The more sane _mm_loadu_si64 (void const* mem_addr) is also listed, but gcc is missing that one.

Why are clang and GCC not using xchg to implement std::swap?

I have the following code:
char swap(char reg, char* mem) {
std::swap(reg, *mem);
return reg;
}
I expected this to compile down to:
swap(char, char*):
xchg dil, byte ptr [rsi]
mov al, dil
ret
But what it actually compiles to is (at -O3 -march=haswell -std=c++20):
swap(char, char*):
mov al, byte ptr [rsi]
mov byte ptr [rsi], dil
ret
See here for a live demo.
From the documentation of xchg, the first form should be perfectly possible:
XCHG - Exchange Register/Memory with Register
Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general-purpose registers or a register and a memory location.
So is there any particular reason why it's not possible for the compiler to use xchg here? I have tried other examples too, such as swapping pointers, swapping three operands, swapping types other than char but I never get an xchg in the compile output. How come?

TL:DR: because compilers optimize for speed, not for names that sound similar. There are lots of other terrible ways they also could have implemented it, but chose not to.
xchg with mem has an implicit lock prefix (on 386 and later) so it's horribly slow. You always want to avoid it unless you need an atomic exchange, or are optimizing completely for code-size without caring at all for performance, in cases where you do want the result in the same register as the original value. Sometimes seen in naive (performance oblivious) or code-golfed hand-written Bubble Sort as part of swapping 2 memory locations.
Possibly clang -Oz could go that crazy, IDK, but hopefully wouldn't in this case because your xchg way is larger code size, needing a REX prefix on both instructions to access DIL, vs. the 2-mov way being a 2-byte and a 3-byte instruction. clang -Oz does do stuff like push 1 / pop rax instead of mov eax, 1 to save 2 bytes of code size.
GCC -Os won't use xchg for swaps that don't need to be atomic because -Os still cares some about speed.
Also, IDK why would you think xchg + dependent mov would be faster or a better choice than two independent mov instructions that can run in parallel. (The store buffer makes sure that the store is correctly ordered after the load, regardless of which uop finds its execution port free first).
See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info
Seriously, I just don't see any plausible reason why you'd think a compiler might want to use xchg, especially given that the calling convention doesn't pass an arg in RAX so you still need 2 instructions. Even for registers, xchg reg,reg on Intel CPUs is 3 uops, and they're microcode uops that can't benefit from mov-elimination. (Some AMD CPUs have 2-uop xchg reg,reg. Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?)
I also guess you're looking at clang output; GCC will avoid partial register shenanigans (like false dependencies) by using a movzx eax, byte ptr [rsi] load even though the return value is only the low byte. Zero-extending loads are cheaper than merging into the old value of RAX. So that's another downside to xchg.

So is there any particular reason why it's not possible for the compiler to use xchg here?
Because mov is faster than xchg and compilers optimize for speed.
See:
Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?
Why does GCC use mov/mfence instead of xchg to implement C11's atomic_store?
Use xchg for -Os
Bug 47949 - Missed optimization for -Os using xchg instead of mov

Atomic double floating point or SSE/AVX vector load/store on x86_64

Here (and in a few SO questions) I see that C++ doesn't support something like lock-free std::atomic<double> and can't yet support something like atomic AVX/SSE vector because it's CPU-dependent (though nowadays of CPUs I know, ARM, AArch64 and x86_64 have vectors).
But is there assembly-level support for atomic operations on doubles or vectors in x86_64? If so, which operations are supported (like load, store, add, subtract, multiply maybe)? Which operations does MSVC++2017 implement lock-free in atomic<double>?

C++ doesn't support something like lock-free std::atomic<double>
Actually, C++11 std::atomic<double> is lock-free on typical C++ implementations, and does expose nearly everything you can do in asm for lock-free programming with float/double on x86 (e.g. load, store, and CAS are enough to implement anything: Why isn't atomic double fully implemented). Current compilers don't always compile atomic<double> efficiently, though.
C++11 std::atomic doesn't have an API for Intel's transactional-memory extensions (TSX) (for FP or integer). TSX could be a game-changer especially for FP / SIMD, since it would remove all overhead of bouncing data between xmm and integer registers. If the transaction doesn't abort, whatever you just did with double or vector loads/stores happens atomically.
Some non-x86 hardware supports atomic add for float/double, and C++ p0020 is a proposal to add fetch_add and operator+= / -= template specializations to C++'s std::atomic<float> / <double>.
Hardware with LL/SC atomics instead of x86-style memory-destination instruction, such as ARM and most other RISC CPUs, can do atomic RMW operations on double and float without a CAS, but you still have to get the data from FP to integer registers because LL/SC is usually only available for integer regs, like x86's cmpxchg. However, if the hardware arbitrates LL/SC pairs to avoid/reduce livelock, it would be significantly more efficient than with a CAS loop in very-high-contention situations. If you've designed your algorithms so contention is rare, there's maybe only a small code-size difference between an LL/add/SC retry-loop for fetch_add vs. a load + add + LL/SC CAS retry loop.
x86 natually-aligned loads and stores are atomic up to 8 bytes, even x87 or SSE. (For example movsd xmm0, [some_variable] is atomic, even in 32-bit mode). In fact, gcc uses x87 fild/fistp or SSE 8B loads/stores to implement std::atomic<int64_t> load and store in 32-bit code.
Ironically, compilers (gcc7.1, clang4.0, ICC17, MSVC CL19) do a bad job in 64-bit code (or 32-bit with SSE2 available), and bounce data through integer registers instead of just doing movsd loads/stores directly to/from xmm regs (see it on Godbolt):
#include <atomic>
std::atomic<double> ad;
void store(double x){
ad.store(x, std::memory_order_release);
}
// gcc7.1 -O3 -mtune=intel:
// movq rax, xmm0 # ALU xmm->integer
// mov QWORD PTR ad[rip], rax
// ret
double load(){
return ad.load(std::memory_order_acquire);
}
// mov rax, QWORD PTR ad[rip]
// movq xmm0, rax
// ret
Without -mtune=intel, gcc likes to store/reload for integer->xmm. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 and related bugs I reported. This is a poor choice even for -mtune=generic. AMD has high latency for movq between integer and vector regs, but it also has high latency for a store/reload. With the default -mtune=generic, load() compiles to:
// mov rax, QWORD PTR ad[rip]
// mov QWORD PTR [rsp-8], rax # store/reload integer->xmm
// movsd xmm0, QWORD PTR [rsp-8]
// ret
Moving data between xmm and integer register brings us to the next topic:
Atomic read-modify-write (like fetch_add) is another story: there is direct support for integers with stuff like lock xadd [mem], eax (see Can num++ be atomic for 'int num'? for more details). For other things, like atomic<struct> or atomic<double>, the only option on x86 is a retry loop with cmpxchg (or TSX).
Atomic compare-and-swap (CAS) is usable as a lock-free building-block for any atomic RMW operation, up to the max hardware-supported CAS width. On x86-64, that's 16 bytes with cmpxchg16b (not available on some first-gen AMD K8, so for gcc you have to use -mcx16 or -march=whatever to enable it).
gcc makes the best asm possible for exchange():
double exchange(double x) {
return ad.exchange(x); // seq_cst
}
movq rax, xmm0
xchg rax, QWORD PTR ad[rip]
movq xmm0, rax
ret
// in 32-bit code, compiles to a cmpxchg8b retry loop
void atomic_add1() {
// ad += 1.0; // not supported
// ad.fetch_or(-0.0); // not supported
// have to implement the CAS loop ourselves:
double desired, expected = ad.load(std::memory_order_relaxed);
do {
desired = expected + 1.0;
} while( !ad.compare_exchange_weak(expected, desired) ); // seq_cst
}
mov rax, QWORD PTR ad[rip]
movsd xmm1, QWORD PTR .LC0[rip]
mov QWORD PTR [rsp-8], rax # useless store
movq xmm0, rax
mov rax, QWORD PTR [rsp-8] # and reload
.L8:
addsd xmm0, xmm1
movq rdx, xmm0
lock cmpxchg QWORD PTR ad[rip], rdx
je .L5
mov QWORD PTR [rsp-8], rax
movsd xmm0, QWORD PTR [rsp-8]
jmp .L8
.L5:
ret
compare_exchange always does a bitwise comparison, so you don't need to worry about the fact that negative zero (-0.0) compares equal to +0.0 in IEEE semantics, or that NaN is unordered. This could be an issue if you try to check that desired == expected and skip the CAS operation, though. For new enough compilers, memcmp(&expected, &desired, sizeof(double)) == 0 might be a good way to express a bitwise comparison of FP values in C++. Just make sure you avoid false positives; false negatives will just lead to an unneeded CAS.
Hardware-arbitrated lock or [mem], 1 is definitely better than having multiple threads spinning on lock cmpxchg retry loops. Every time a core gets access to the cache line but fails its cmpxchg is wasted throughput compared to integer memory-destination operations that always succeed once they get their hands on a cache line.
Some special cases for IEEE floats can be implemented with integer operations. e.g. absolute value of an atomic<double> could be done with lock and [mem], rax (where RAX has all bits except the sign bit set). Or force a float / double to be negative by ORing a 1 into the sign bit. Or toggle its sign with XOR. You could even atomically increase its magnitude by 1 ulp with lock add [mem], 1. (But only if you can be sure it wasn't infinity to start with... nextafter() is an interesting function, thanks to the very cool design of IEEE754 with biased exponents that makes carry from mantissa into exponent actually work.)
There's probably no way to express this in C++ that will let compilers do it for you on targets that use IEEE FP. So if you want it, you might have to do it yourself with type-punning to atomic<uint64_t> or something, and check that FP endianness matches integer endianness, etc. etc. (Or just do it only for x86. Most other targets have LL/SC instead of memory-destination locked operations anyway.)
can't yet support something like atomic AVX/SSE vector because it's CPU-dependent
Correct. There's no way to detect when a 128b or 256b store or load is atomic all the way through the cache-coherency system. (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490). Even a system with atomic transfers between L1D and execution units can get tearing between 8B chunks when transferring cache-lines between caches over a narrow protocol. Real example: a multi-socket Opteron K10 with HyperTransport interconnects appears to have atomic 16B loads/stores within a single socket, but threads on different sockets can observe tearing.
But if you have a shared array of aligned doubles, you should be able to use vector loads/stores on them without risk of "tearing" inside any given double.
Per-element atomicity of vector load/store and gather/scatter?
I think it's safe to assume that an aligned 32B load/store is done with non-overlapping 8B or wider loads/stores, although Intel doesn't guarantee that. For unaligned ops, it's probably not safe to assume anything.
If you need a 16B atomic load, your only option is to lock cmpxchg16b, with desired=expected. If it succeeds, it replaces the existing value with itself. If it fails, then you get the old contents. (Corner-case: this "load" faults on read-only memory, so be careful what pointers you pass to a function that does this.) Also, the performance is of course horrible compared to actual read-only loads that can leave the cache line in Shared state, and that aren't full memory barriers.
16B atomic store and RMW can both use lock cmpxchg16b the obvious way. This makes pure stores much more expensive than regular vector stores, especially if the cmpxchg16b has to retry multiple times, but atomic RMW is already expensive.
The extra instructions to move vector data to/from integer regs are not free, but also not expensive compared to lock cmpxchg16b.
# xmm0 -> rdx:rax, using SSE4
movq rax, xmm0
pextrq rdx, xmm0, 1
# rdx:rax -> xmm0, again using SSE4
movq xmm0, rax
pinsrq xmm0, rdx, 1
In C++11 terms:
atomic<__m128d> would be slow even for read-only or write-only operations (using cmpxchg16b), even if implemented optimally. atomic<__m256d> can't even be lock-free.
alignas(64) atomic<double> shared_buffer[1024]; would in theory still allow auto-vectorization for code that reads or writes it, only needing to movq rax, xmm0 and then xchg or cmpxchg for atomic RMW on a double. (In 32-bit mode, cmpxchg8b would work.) You would almost certainly not get good asm from a compiler for this, though!
You can atomically update a 16B object, but atomically read the 8B halves separately. (I think this is safe with respect to memory-ordering on x86: see my reasoning at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835).
However, compilers don't provide any clean way to express this. I hacked up a union type-punning thing that works for gcc/clang: How can I implement ABA counter with c++11 CAS?. But gcc7 and later won't inline cmpxchg16b, because they're re-considering whether 16B objects should really present themselves as "lock-free". (https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html).

On x86-64 atomic operations are implemented via the LOCK prefix. The Intel Software Developer's Manual (Volume 2, Instruction Set Reference) states
The LOCK prefix can be prepended only to the following instructions and only to those forms of the instructions
where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B,
CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG.
Neither of those instructions operates on floating point registers (like the XMM, YMM or FPU registers).
This means that there is no natural way to implement atomic float/double operations on x86-64. While most of those operations could be implemented by loading the bit representation of the floating point value into a general purpose (i.e. integer) register, doing so would severely degrade performance so the compiler authors opted not to implement it.
As pointed out by Peter Cordes in the comments, the LOCK prefix is not required for loads and stores, as those are always atomic on x86-64. However the Intel SDM (Volume 3, System Programming Guide) only guarantees that the following loads/stores are atomic:
Instructions that read or write a single byte.
Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.
In particular, atomicity of loads/stores from/to the larger XMM and YMM vector registers is not guaranteed.

Loading 8 chars from memory into an __m256 variable as packed single precision floats

I am optimizing an algorithm for Gaussian blur on an image and I want to replace the usage of a float buffer[8] in the code below with an __m256 intrinsic variable. What series of instructions is best suited for this task?
// unsigned char *new_image is loaded with data
...
float buffer[8];
buffer[x ] = new_image[x];
buffer[x + 1] = new_image[x + 1];
buffer[x + 2] = new_image[x + 2];
buffer[x + 3] = new_image[x + 3];
buffer[x + 4] = new_image[x + 4];
buffer[x + 5] = new_image[x + 5];
buffer[x + 6] = new_image[x + 6];
buffer[x + 7] = new_image[x + 7];
// buffer is then used for further operations
...
//What I want instead in pseudocode:
__m256 b = [float(new_image[x+7]), float(new_image[x+6]), ... , float(new_image[x])];

If you're using AVX2, you can use PMOVZX to zero-extend your chars into 32-bit integers in a 256b register. From there, conversion to float can happen in-place.
; rsi = new_image
VPMOVZXBD ymm0, [rsi] ; or SX to sign-extend (Byte to DWord)
VCVTDQ2PS ymm0, ymm0 ; convert to packed foat
This is a good strategy even if you want to do this for multiple vectors, but even better might be a 128-bit broadcast load to feed vpmovzxbd ymm,xmm and vpshufb ymm (_mm256_shuffle_epi8) for the high 64 bits, because Intel SnB-family CPUs don't micro-fuse a vpmovzx ymm,mem, only only vpmovzx xmm,mem. (https://agner.org/optimize/). Broadcast loads are single uop with no ALU port required, running purely in a load port. So this is 3 total uops to bcast-load + vpmovzx + vpshufb.
(TODO: write an intrinsics version of that. It also sidesteps the problem of missed optimizations for _mm_loadl_epi64 -> _mm256_cvtepu8_epi32.)
Of course this requires a shuffle control vector in another register, so it's only worth it if you can use that multiple times.
vpshufb is usable because the data needed for each lane is there from the broadcast, and the high bit of the shuffle-control will zero the corresponding element.
This broadcast + shuffle strategy might be good on Ryzen; Agner Fog doesn't list uop counts for vpmovsx/zx ymm on it.
Do not do something like a 128-bit or 256-bit load and then shuffle that to feed further vpmovzx instructions. Total shuffle throughput will probably already be a bottleneck because vpmovzx is a shuffle. Intel Haswell/Skylake (the most common AVX2 uarches) have 1-per-clock shuffles but 2-per-clock loads. Using extra shuffle instructions instead of folding separate memory operands into vpmovzxbd is terrible. Only if you can reduce total uop count like I suggested with broadcast-load + vpmovzxbd + vpshufb is it a win.
My answer on Scaling byte pixel values (y=ax+b) with SSE2 (as floats)? may be relevant for converting back to uint8_t. The pack-back-to-bytes afterward part is semi-tricky if doing it with AVX2 packssdw/packuswb, because they work in-lane, unlike vpmovzx.
With only AVX1, not AVX2, you should do:
VPMOVZXBD xmm0, [rsi]
VPMOVZXBD xmm1, [rsi+4]
VINSERTF128 ymm0, ymm0, xmm1, 1 ; put the 2nd load of data into the high128 of ymm0
VCVTDQ2PS ymm0, ymm0 ; convert to packed float. Yes, works without AVX2
You of course never need an array of float, just __m256 vectors.
GCC / MSVC missed optimizations for VPMOVZXBD ymm,[mem] with intrinsics
GCC and MSVC are bad at folding a _mm_loadl_epi64 into a memory operand for vpmovzx*. (But at least there is a load intrinsic of the right width, unlike for pmovzxbq xmm, word [mem].)
We get a vmovq load and then a separate vpmovzx with an XMM input. (With ICC and clang3.6+ we get safe + optimal code from using _mm_loadl_epi64, like from gcc9+)
But gcc8.3 and earlier can fold a _mm_loadu_si128 16-byte load intrinsic into an 8-byte memory operand. This gives optimal asm at -O3 on GCC, but is unsafe at -O0 where it compiles to an actual vmovdqu load that touches more data that we actually load, and could go off the end of a page.
Two gcc bugs submitted because of this answer:
SSE/AVX movq load (_mm_cvtsi64_si128) not being folded into pmovzx (fixed for gcc9, but the fix breaks load folding for a 128-bit load so the workaround hack for old GCC makes gcc9 do worse.)
No intrinsic for x86 MOVQ m64, %xmm in 32bit mode. (TODO: report this for clang/LLVM as well?)
There's no intrinsic to use SSE4.1 pmovsx / pmovzx as a load, only with a __m128i source operand. But the asm instructions only read the amount of data they actually use, not a 16-byte __m128i memory source operand. Unlike punpck*, you can use this on the last 8B of a page without faulting. (And on unaligned addresses even with the non-AVX version).
So here's the evil solution I've come up with. Don't use this, #ifdef __OPTIMIZE__ is Bad, making it possible to create bugs that only happen in the debug build or only in the optimized build!
#if !defined(__OPTIMIZE__)
// Making your code compile differently with/without optimization is a TERRIBLE idea
// great way to create Heisenbugs that disappear when you try to debug them.
// Even if you *plan* to always use -Og for debugging, instead of -O0, this is still evil
#define USE_MOVQ
#endif
__m256 load_bytes_to_m256(uint8_t *p)
{
#ifdef USE_MOVQ // compiles to an actual movq then movzx ymm, xmm with gcc8.3 -O3
__m128i small_load = _mm_loadl_epi64( (const __m128i*)p);
#else // USE_LOADU // compiles to a 128b load with gcc -O0, potentially segfaulting
__m128i small_load = _mm_loadu_si128( (const __m128i*)p );
#endif
__m256i intvec = _mm256_cvtepu8_epi32( small_load );
//__m256i intvec = _mm256_cvtepu8_epi32( *(__m128i*)p ); // compiles to an aligned load with -O0
return _mm256_cvtepi32_ps(intvec);
}
With USE_MOVQ enabled, gcc -O3 (v5.3.0) emits. (So does MSVC)
load_bytes_to_m256(unsigned char*):
vmovq xmm0, QWORD PTR [rdi]
vpmovzxbd ymm0, xmm0
vcvtdq2ps ymm0, ymm0
ret
The stupid vmovq is what we want to avoid. If you let it use the unsafe loadu_si128 version, it will make good optimized code.
GCC9, clang, and ICC emit:
load_bytes_to_m256(unsigned char*):
vpmovzxbd ymm0, qword ptr [rdi] # ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
vcvtdq2ps ymm0, ymm0
ret
Writing the AVX1-only version with intrinsics is left as an un-fun exercise for the reader. You asked for "instructions", not "intrinsics", and this is one place where there's a gap in the intrinsics. Having to use _mm_cvtsi64_si128 to avoid potentially loading from out-of-bounds addresses is stupid, IMO. I want to be able to think of intrinsics in terms of the instructions they map to, with the load/store intrinsics as informing the compiler about alignment guarantees or lack thereof. Having to use the intrinsic for an instruction I don't want is pretty dumb.
Also note that if you're looking in the Intel insn ref manual, there are two separate entries for movq:
movd/movq, the version that can have an integer register as a src/dest operand (66 REX.W 0F 6E (or VEX.128.66.0F.W1 6E) for (V)MOVQ xmm, r/m64). That's where you'll find the intrinsic that can accept a 64-bit integer, _mm_cvtsi64_si128. (Some compilers don't define it in 32-bit mode.)
movq: the version that can have two xmm registers as operands. This one is an extension of the MMXreg -> MMXreg instruction, which can also load/store, like MOVDQU. Its opcode F3 0F 7E (VEX.128.F3.0F.WIG 7E) for MOVQ xmm, xmm/m64).
The asm ISA ref manual only lists the m128i _mm_mov_epi64(__m128i a) intrinsic for zeroing the high 64b of a vector while copying it. But the intrinsics guide does list _mm_loadl_epi64(__m128i const* mem_addr) which has a stupid prototype (pointer to a 16-byte __m128i type when it really only loads 8 bytes). It is available on all 4 of the major x86 compilers, and should actually be safe. Note that the __m128i* is just passed to this opaque intrinsic, not actually dereferenced.
The more sane _mm_loadu_si64 (void const* mem_addr) is also listed, but gcc is missing that one.

x86 MUL Instruction from VS 2008/2010

Will modern (2008/2010) incantations of Visual Studio or Visual C++ Express produce x86 MUL instructions (unsigned multiply) in the compiled code? I cannot seem to find or contrive an example where they appear in compiled code, even when using unsigned types.
If VS does not compile using MUL, is there a rationale why?

imul (signed) and mul (unsigned) both have a one-operand form that does edx:eax = eax * src. i.e. a 32x32b => 64b full multiply (or 64x64b => 128b).
186 added an imul dest(reg), src(reg/mem), immediate form, and 386 added an imul r32, r/m32 form, both of which which only compute the lower half of the result. (According to NASM's appendix B, see also the x86 tag wiki)
When multiplying two 32-bit values, the least significant 32 bits of the result are the same, whether you consider the values to be signed or unsigned. In other words, the difference between a signed and an unsigned multiply becomes apparent only if you look at the "upper" half of the result, which one-operand imul/mul puts in edx and two or three operand imul puts nowhere. Thus, the multi-operand forms of imul can be used on signed and unsigned values, and there was no need for Intel to add new forms of mul as well. (They could have made multi-operand mul a synonym for imul, but that would make disassembly output not match the source.)
In C, results of arithmetic operations have the same type as the operands (after integer promotion for narrow integer types). If you multiply two int together, you get an int, not a long long: the "upper half" is not retained. Hence, the C compiler only needs what imul provides, and since imul is easier to use than mul, the C compiler uses imul to avoid needing mov instructions to get data into / out of eax.
As a second step, since C compilers use the multiple-operand form of imul a lot, Intel and AMD invest effort into making it as fast as possible. It only writes one output register, not e/rdx:e/rax, so it was possible for CPUs to optimize it more easily than the one-operand form. This makes imul even more attractive.
The one-operand form of mul/imul is useful when implementing big number arithmetic. In C, in 32-bit mode, you should get some mul invocations by multiplying unsigned long long values together. But, depending on the compiler and OS, those mul opcodes may be hidden in some dedicated function, so you will not necessarily see them. In 64-bit mode, long long has only 64 bits, not 128, and the compiler will simply use imul.

There's three different types of multiply instructions on x86. The first is MUL reg, which does an unsigned multiply of EAX by reg and puts the (64-bit) result into EDX:EAX. The second is IMUL reg, which does the same with a signed multiply. The third type is either IMUL reg1, reg2 (multiplies reg1 with reg2 and stores the 32-bit result into reg1) or IMUL reg1, reg2, imm (multiplies reg2 by imm and stores the 32-bit result into reg1).
Since in C, multiplies of two 32-bit values produce 32-bit results, compilers normally use the third type (signedness doesn't matter, the low 32 bits agree between signed and unsigned 32x32 multiplies). VC++ will generate the "long multiply" versions of MUL/IMUL if you actually use the full 64-bit results, e.g. here:
unsigned long long prod(unsigned int a, unsigned int b)
{
return (unsigned long long) a * b;
}
The 2-operand (and 3-operand) versions of IMUL are faster than the one-operand versions simply because they don't produce a full 64-bit result. Wide multipliers are large and slow; it's much easier to build a smaller multiplier and synthesize long multiplies using Microcode if necessary. Also, MUL/IMUL writes two registers, which again is usually resolved by breaking it into multiple instructions internally - it's much easier for the instruction reordering hardware to keep track of two dependent instructions that each write one register (most x86 instructions look like that internally) than it is to keep track of one instruction that writes two.

According to http://gmplib.org/~tege/x86-timing.pdf, the IMUL instruction has a lower latency and higher throughput (if I'm reading the table correctly). Perhaps VS is simply using the faster instruction (that's assuming that IMUL and MUL always produce the same output).
I don't have Visual Studio handy, so I tried to get something else with GCC. I also always get some variation of IMUL.
This:
unsigned int func(unsigned int a, unsigned int b)
{
return a * b;
}
Assembles to this (with -O2):
_func:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl %esi, %eax
imull %edi, %eax
movzbl %al, %eax
leave
ret

My intuition tells me that the compiler chose IMUL arbitrarily (or whichever was faster of the two) since the bits will be the same whether it uses unsigned MUL or signed IMUL. Any 32-bit integer multiplication will be 64-bits spanning two registers, EDX:EAX. The overflow goes into EDX which is essentially ignored since we only care about the 32-bit result in EAX. Using IMUL will sign-extend into EDX as necessary but again, we don't care since we're only interested in the 32-bit result.

Right after I looked at this question I found MULQ in my generated code when dividing.
The full code is turning a large binary number into chunks of a billion so that it can be easily converted to a string.
C++ code:
for_each(TempVec.rbegin(), TempVec.rend(), [&](Short & Num){
Remainder <<= 32;
Remainder += Num;
Num = Remainder / 1000000000;
Remainder %= 1000000000;//equivalent to Remainder %= DecimalConvert
});
Optimized Generated Assembly
00007FF7715B18E8 lea r9,[rsi-4]
00007FF7715B18EC mov r13,12E0BE826D694B2Fh
00007FF7715B18F6 nop word ptr [rax+rax]
00007FF7715B1900 shl r8,20h
00007FF7715B1904 mov eax,dword ptr [r9]
00007FF7715B1907 add r8,rax
00007FF7715B190A mov rax,r13
00007FF7715B190D mul rax,r8
00007FF7715B1910 mov rcx,r8
00007FF7715B1913 sub rcx,rdx
00007FF7715B1916 shr rcx,1
00007FF7715B1919 add rcx,rdx
00007FF7715B191C shr rcx,1Dh
00007FF7715B1920 imul rax,rcx,3B9ACA00h
00007FF7715B1927 sub r8,rax
00007FF7715B192A mov dword ptr [r9],ecx
00007FF7715B192D lea r9,[r9-4]
00007FF7715B1931 lea rax,[r9+4]
00007FF7715B1935 cmp rax,r14
00007FF7715B1938 jne NumToString+0D0h (07FF7715B1900h)
Notice the MUL instruction 5 lines down.
This generated code is extremely unintuitive, I know, in fact it looks nothing like the compiled code but DIV is extremely slow ~25 cycles for a 32 bit div, and ~75 according to this chart on modern PCs compared with MUL or IMUL (around 3 or 4 cycles) and so it makes sense to try to get rid of DIV even if you have to add all sorts of extra instructions.
I don't fully understand the optimization here, but if you would like to see a rational and a mathematical explanation of using compile time and multiplication to divide constants, see this paper.
This is an example of is the compiler making use of the performance and capability of the full 64 by 64 bit untruncated multiply without showing the c++ coder any sign of it.

As already explained C/C++ does not do word*word to double-word operations which is what the mul instruction is best for. But there are cases where you want word*word to double-word so you need an extension to C/C++.
GCC, Clang, and ICC provide provide a builtin type __int128 which you can use to indirectly get the mul instruction.
With MSVC it provides the _umul128 intrinsic (since at least VS 2010) which generates the mul instruction. With this intrinsic along with the _addcarry_u64 intrinsic you could build your own efficient __int128 type with MSVC.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js