Intel store instructions on deliberately overlapping memory regions - c++

I have to store the lower 3 doubles of a YMM register into an unaligned double array of size 3 (that is, I cannot write a 4th element). But being a bit naughty, I'm wondering if the AVX intrinsic _mm256_storeu2_m128d can do the trick. I had
reg = _mm256_permute4x64_pd(reg, 0b10010100); // [0 1 1 2]
_mm256_storeu2_m128d(vec, vec + 1, reg);
and compiling with clang gives
vmovupd xmmword ptr [rsi + 8], xmm1 # reg in ymm1 after perm
vextractf128 xmmword ptr [rsi], ymm0, 1
If storeu2 had semantics like memcpy then it would most definitely trigger undefined behavior. But with the generated instructions, would this be free of race conditions (or other potential problems)?
Other ways to store a YMM register into a size-3 array are welcome as well.

There isn't really a formal spec for Intel's intrinsics, AFAIK, other than what Intel has published as documentation. e.g. their intrinsics guide. Also examples from their whitepapers and so on; e.g. examples that need to work are one way GCC/clang know they have to define __m128 with __attribute__((may_alias)).
It's all within one thread, fully synchronous, so definitely no "race condition". In your case it doesn't even matter which order the stores happen in (assuming they don't overlap with the __m256d reg object itself! That would be the equivalent of an overlapping memcpy problem.) What you're doing might be like two indeterminately sequenced memcpy to overlapping destinations: they definitely happen in one order or the other, and the compiler could pick either.
The observable difference for order of stores is performance: if you want to do a SIMD reload very soon after, then store forwarding will work better if the 16-byte reload takes its data from one 16-byte store, not the overlap of two stores.
In general overlapping stores are fine for performance, though; the store buffer will absorb them. It means one of them is unaligned, though, and crossing a cache-line boundary would be more expensive.
However, that's all moot: Intel's intrinsics guide does list an "operation" section for that compound intrinsic:
Operation
MEM[loaddr+127:loaddr] := a[127:0]
MEM[hiaddr+127:hiaddr] := a[255:128]
So it's strictly defined as low address store first (the second arg; I think you got this backwards).
And all of that is also moot because there's a more efficient way
Your way costs 1 lane-crossing shuffle + vmovups + vextractf128 [mem], ymm, 1. Depending on how it compiles, neither store can start until after the shuffle. (Although it looks like clang might have avoided that problem).
On Intel CPUs, vextractf128 [mem], ymm, imm costs 2 uops for the front-end, not micro-fused into one. (Also 2 uops on Zen for some reason.)
On AMD CPUs before Zen 2, lane-crossing shuffles are more than 1 uop, so _mm256_permute4x64_pd is more expensive than necessary.
You just want to store the low lane of the input vector, and the low element of the high lane. The cheapest shuffle is vextractf128 xmm, ymm, 1 - 1 uop / 1c latency on Zen (which splits YMM vectors into two 128-bit halves anyway). It's as cheap as any other lane-crossing shuffle on Intel.
The asm you want the compiler to make is probably this, which only requires AVX1. AVX2 doesn't have any useful instructions for this.
vextractf128 xmm1, ymm0, 1 ; single uop everywhere
vmovupd [rdi], xmm0 ; single uop everywhere
vmovsd [rdi+2*8], xmm1 ; single uop everywhere
So you want something like this, which should compile efficiently.
_mm_storeu_pd(vec, _mm256_castpd256_pd128(reg)); // low half (unaligned store)
__m128d hi = _mm256_extractf128_pd(reg, 1);
_mm_store_sd(vec+2, hi);
// or vec[2] = _mm_cvtsd_f64(hi);
vmovlps (_mm_storel_pi) would also work, but with AVX VEX encoding it doesn't save any code size, and would require even more casting to keep compilers happy.
There's unfortunately no vpextractq [mem], ymm, only with an XMM source so that doesn't help.
Masked store:
As discussed in comments, yes you could do vmaskmovps but it's unfortunately not as efficient as we might like on all CPUs. Until AVX512 makes masked loads/stores first-class citizens, it may be best to shuffle and do 2 stores. Or pad your array / struct so you can at least temporarily step on later stuff.
Zen has 2-uop vmaskmovpd ymm loads, but very expensive vmaskmovpd stores (42 uops, 1 per 11 cycles for YMM). Or Zen+ and Zen2 are 18 or 19 uops, 6 cycle throughput. If you care at all about Zen, avoid vmaskmov.
On Intel Broadwell and earlier, vmaskmov stores are 4 uops according to Agner Fog's testing, so that's 1 more fused-domain uop than we get from shuffle + movups + movsd. But still, Haswell and later do manage 1/clock throughput so if that's a bottleneck then it beats the 2-cycle throughput of 2 stores. SnB/IvB of course take 2 cycles for a 256-bit store, even without masking.
On Skylake, vmaskmov mem, ymm, ymm is only 3 uops. (Agner Fog lists 4, but his spreadsheets are hand-edited and have been wrong before. I think it's safe to assume uops.info's automated testing is right.) And that makes sense: Skylake-client is basically the same core as Skylake-AVX512, just without actually enabling AVX512, so they could implement vmaskmovpd by decoding it into a test-into-mask-register uop (1 uop) + a masked store (2 more uops without micro-fusion).
So if you only care about Skylake and later, and can amortize the cost of loading a mask into a vector register (reusable for loads and stores), vmaskmovpd is actually pretty good. Same front-end cost but cheaper in the back-end: only 1 each store-address and store-data uops, instead of 2 separate stores. Note the 1/clock throughput on Haswell and later vs. the 2-cycle throughput for doing 2 separate stores.
vmaskmovpd might even store-forward efficiently to a masked reload; I think Intel mentioned something about this in their optimization manual.
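For reference, the masked-store version is simple to write with intrinsics. This is a sketch (the function name and mask choice are mine); the mask vector is a good candidate to hoist out of a loop and reuse:
#include <immintrin.h>

// Sketch: store the low 3 doubles of reg to vec[0..2] with vmaskmovpd.
static inline void store3_maskstore(double *vec, __m256d reg)
{
    __m256i mask = _mm256_setr_epi64x(-1, -1, -1, 0); // high bit set = store elements 0,1,2; skip element 3
    _mm256_maskstore_pd(vec, mask, reg);
}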

Related

Arbitrary position 2-input shuffling using SSE

I have two 4 component vectors which I load into two __m128 variables.
Then I need to shuffle those so that the result looks like this:
Given:
__m128 mmMin = _mm_load_ps(&glm::vec4(-1.0f,-2.0f,-3.0f,-4.0f)[0]);
__m128 mmMax = _mm_load_ps(&glm::vec4(1.0f,2.0f,3.0f,4.0f)[0]);
I want the result of the shuffle to look like this:
// {mmMin.x,mmMax.x,mmMin.x,mmMax.x}
But I see it is not possible to do with _mm_shuffle_ps.
From the SSE docs I see that the _mm_shuffle_ps mask always inserts 2 values taken from the first __m128 into the result's lower 2 components, then 2 taken from the second into the high 2 components.
SPU intrinsics have si_shufb method which allows defining qword based mask and shuffle whatever position I wish. Is there a similar method in SSE?
I am using SSE2, but will be happy also to see how it can be done with other versions, including AVX.
With only SSE2 I think you need 2 shuffles: unpcklps to interleave and then unpcklpd same,same or shufps same,same to broadcast the low 64 bits.
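For example, a sketch of that 2-shuffle SSE2 version (the function name is mine):
#include <xmmintrin.h>

// Sketch: produce {min.x, max.x, min.x, max.x} with two shuffles.
__m128 interleave_x_sse2(__m128 mmMin, __m128 mmMax)
{
    __m128 lo = _mm_unpacklo_ps(mmMin, mmMax);              // {min.x, max.x, min.y, max.y}
    return _mm_shuffle_ps(lo, lo, _MM_SHUFFLE(1, 0, 1, 0)); // broadcast the low 64 bits
}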
With AVX512F, vpermt2ps can do this in one shuffle (using a control vector); I don't think there are any 2-source shuffles in AVX2 or earlier with fine enough granularity and flexible source locations before that. And no fixed shuffles that duplicate an element along with interleaving.
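A sketch of the AVX512VL version (untested; the control-vector constants are mine):
#include <immintrin.h>

// Sketch: one 2-source shuffle; index values 4..7 select from the second source.
__m128 interleave_x_avx512(__m128 mmMin, __m128 mmMax)
{
    return _mm_permutex2var_ps(mmMin, _mm_setr_epi32(0, 4, 0, 4), mmMax); // {min.x, max.x, min.x, max.x}
}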
2-source shuffles are rare until AVX512: mostly fixed shuffles like unpckl/h* and palignr. It's mostly just [v]shufps / [v]shufpd until then. Variable-control shuffles are also rare: until AVX, the only one is pshufb. AVX1/2 added some variable-control dword-element shuffles, but only for 1 source. There are no variable-control 2-source shuffles until AVX512.
Immediate shuffles would need more than 4 groups of 2-bit indices to handle arbitrary indexing into the concatenation of two 4-element vectors. But x86 SIMD instructions always have at most an 8-bit immediate operand. Unfortunately no broadcast-immediate like ARM has that could efficiently create a vector of 1.0f or whatever.
AVX
Since you only need 1 element from each vector, instead of loading a whole vector you can use an AVX broadcast-load of each scalar and then vblendps.
Broadcast-loads are the same cost as normal loads on Intel CPUs (don't cost you a uop for the shuffle port, purely handled in the load port). They can't fold into memory operands for ALU instructions until AVX512F, but they do avoid shuffle-port bottlenecks. AMD CPUs may still need an ALU uop but they have more shuffle ALUs so shuffle throughput isn't a bottleneck nearly as much. (https://agner.org/optimize/)
Ryzen vbroadcastss xmm, [mem] is 2 separate uops for the front-end unfortunately, but it still has 2-per-clock throughput.
Blend-immediate on dword and wider elements is very efficient and can run on any port on Haswell and later, or 2 ports on SnB/IvB and Ryzen. But it's still a single uop / 1c latency even on Nehalem.
#include <immintrin.h>
__m128 broadcast_interleave_scalars_avx(const float *min, const float *max) {
    __m128 minx = _mm_broadcast_ss(min);
    __m128 maxx = _mm_broadcast_ss(max);
    return _mm_blend_ps(minx, maxx, 0b1010);
}
On Godbolt, clang's asm comments confirm that I got the blend constant right:
vbroadcastss xmm0, dword ptr [rdi]
vbroadcastss xmm1, dword ptr [rsi]
vblendps xmm0, xmm0, xmm1, 10 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
If your data was already in registers, not freshly loaded, you might want to just use 2 shuffles.
With SSE4.1 you might be able to do 2x movddup loads to broadcast 64 bits from memory (including the 32 bits you care about) then blendps. The first load will load 32 bits past the float you care about, the 2nd will load 32 bits before the float you care about.
To get a C++ compiler to emit this for you, you'll have to pointer-cast to double* for the __m128d _mm_loaddup_pd (double const* mem_addr) loads, and then use _mm_castpd_ps to get __m128 from __m128d.
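A sketch of that (my own, untested). It assumes reading 4 bytes past *min and 4 bytes before *max is safe, e.g. because they're the .x members of larger vec4 objects rather than sitting right at the edge of a page:
#include <smmintrin.h> // SSE4.1 for _mm_blend_ps; SSE3 for _mm_loaddup_pd

__m128 broadcast_interleave_scalars_sse41(const float *min, const float *max)
{
    __m128 minx = _mm_castpd_ps(_mm_loaddup_pd((const double *)min));       // {min[0], min[1], min[0], min[1]}
    __m128 maxx = _mm_castpd_ps(_mm_loaddup_pd((const double *)(max - 1))); // {max[-1], max[0], max[-1], max[0]}
    return _mm_blend_ps(minx, maxx, 0b1010);                                // {min[0], max[0], min[0], max[0]}
}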
https://www.felixcloutier.com/x86/movsldup could also be useful to set up for unpcklps.

fastest way to convert two-bit number to low-memory representation

I have a 56-bit number with potentially two set bits, e.g., 00000000 00000000 00000000 00000000 00000000 00000000 00000011. In other words, two bits are distributed among 56 bits, so that we have bin(56,2)=1540 possible combinations.
I am now looking for a loss-free mapping of such a 56-bit number to an 11-bit number that can represent 2048 values and therefore also 1540. Knowing the structure, this 11-bit number is enough to store the value of my low-density (of ones) 56-bit number.
I want to maximize performance (this function should run millions or even billions of times per second if possible). So far, I only came up with some loop:
int inputNumber = 24; // 11000
int bitMask = 1;
int bit1 = 0, bit2 = 0;
for (int n = 0; n < 54; ++n, bitMask *= 2)
{
    if ((inputNumber & bitMask) != 0)
    {
        if (bit1 == 0)
            bit1 = n;
        else
        {
            bit2 = n;
            break;
        }
    }
}
and using these two bit positions, I can easily generate a number within the 1540 possible values.
But is there no faster version than using such a loop?
Most ISAs have hardware support for a bit-scan instruction that finds the position of a set bit. Use that instead of a naive loop or bithack for any architecture where you care about this running fast. https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious has some tricks that are better than nothing, but those are all still much worse than a single efficient asm instruction.
But ISO C++ doesn't portably expose clz/ctz operations; it's only available via intrinsics / builtins for various implementations. (And the x86 intrinsics have quirks for all-zero input, corresponding to the asm instruction behaviour.)
For some ISAs, it's a count-leading-zeros giving you 31 - highbit_index. For others, it's a CTZ count trailing zeros operation, giving you the index of the low bit. x86 has both. (And its high-bit finder actually directly finds the high-bit index, not a leading-zero count, unless you use BMI1 lzcnt instead of traditional bsr) https://en.wikipedia.org/wiki/Find_first_set has a table of what different ISAs have.
GCC portably provides __builtin_clz and __builtin_ctz; on ISAs without hardware support, they compile to a call to a helper function. See What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C? and Implementation of __builtin_clz.
(For 64-bit integers, you want the long long versions, like __builtin_ctzll; see the GCC manual.)
If we only have a CLZ, use high=63-CLZ(n) and low= 63-CLZ((-n) & n) to isolate the low bit. Note that x86's bsr instruction actually produces 63-CLZ(), i.e. the bit-index instead of the leading-zero count. So 63-__builtin_clzll(n) can compile to a single instruction on x86; IIRC gcc does notice this. Or 2 instructions if GCC uses an extra xor-zeroing to avoid the inconvenient false dependency.
If we only have CTZ, do low = CTZ(n) and high = CTZ(n & (n - 1)) to clear the lowest set bit. (Leaving the high bit, assuming the number has exactly 2 set bits).
If we have both, low = CTZ(n) and high = 63-CLZ(n). I'm not sure what GCC does on non-x86 ISAs where they aren't both available natively. The GCC builtins are always available even when targeting HW that doesn't have it. But the internal implementation can't use the above tricks because it doesn't know there are always exactly 2 bits set.
(I wrote out the full formulas; an earlier version of this answer had CLZ and CTZ reversed in this part. I find that happens to me easily, especially when I also have to keep track of x86's bsr and bsf (bit-scan reverse and forward) and remember that those are leading and trailing, respectively.)
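As a C sketch of the both-available case (the function name is mine; it assumes a non-zero input with exactly 2 bits set), packing (low_bit << 6) | high_bit like the asm later in this answer:
#include <stdint.h>

static inline unsigned encode2(uint64_t n)
{
    unsigned low  = __builtin_ctzll(n);      // index of the lowest set bit
    unsigned high = 63 - __builtin_clzll(n); // index of the highest set bit
    return (low << 6) | high;                // 12-bit result
}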
So if you just use both CTZ and CLZ, you might end up with slow emulation for one of them. Or fast emulation on ARM with rbit to bit-reverse for clz, which is 100% fine.
AVX512CD has SIMD VPLZCNTQ for 64-bit integers, so you could encode 2, 4, or 8x 64-bit integers in parallel with that on recent Intel CPUs. For SSSE3 or AVX2, you can build a SIMD lzcnt by using pshufb _mm_shuffle_epi8 byte-shuffle as a 4-bit LUT and combining with _mm_max_epu8. There was a recent Q&A about this but I can't find it. (It might have been for 16-bit integers only; wider requires more work.)
With this, a Skylake-X or Cascade Lake CPU could maybe compress 8x 64-bit integers per 2 or 3 clock cycles once you factor in the throughput cost of packing the results. At ~3 or 4 GHz clock speed, that could maybe get you over 10 billion per second with a single thread, but only if the inputs come from contiguous memory. Depending on what you want to do with the results, it might cost a few more cycles to do more than just pack them down to 16-bit integers, e.g. to pack them into a bitstream. But SIMD should be good for that, with variable-shift instructions that can line up the 11 or 12 bits from each register into the right position to OR together after shuffling.
There's a tradeoff between coding efficiency and encode performance. Using 12 bits for two 6-bit indices (of bit positions) is very simple both to compress and decompress, at least on hardware that has bit-scan instructions.
Or instead of bit-indices, one or both could be leading-zero counts, so decoding would be (1ULL << 63) >> a. 1ULL << 63 is a fixed constant that you can actually right-shift, or the compiler could turn it into a left-shift 1ULL << (63-a), which IIRC optimizes to a shift by (-a) in assembly for ISAs like x86 where shift instructions mask the shift count (look only at the low 6 bits).
Also, 2x 12 bits is a whole number of bytes, but 11 bits only gives you a whole number of bytes every 8 outputs, if you're packing them. So indexing a bit-packed array is simpler.
0 is still a special case: maybe handle that by using all-ones bit-indices (i.e. index = bit 63, which is outside the low 56 bits). On decode/decompress, you set the 2 bit positions (1ULL<<a) | (1ULL<<b) and then & mask to clear high bits. Or bias your bit indices and have decode right shift by 1.
If we didn't have to handle zero then a modern x86 CPU could do 1 or 2 billion encodes per second if it didn't have to do anything else. e.g. Skylake has 1 per clock throughput for bit-scan instructions and should be able to encode at 1 number per 2 clocks, just bottlenecked on that. (Or maybe better with SIMD.) With just 4 scalar instructions, we can get the low and high indices (64-bit tzcnt + bsr), shift by 6 bits, and OR together (see footnote 1). Or on AMD, avoid bsr / bsf and manually do 63-lzcnt.
A branchy or branchless check for input == 0 to set the final result to whatever hard-coded constant (like 63, 63) should be cheap, though.
Compression on other ISAs like AArch64 is also cheap. It has clz but not ctz. Probably your best bet there is use an intrinsic for rbit to bit-reverse a number (so clz on the bit-reversed number directly gives you the bit-index of the low bit. Which is now the high bit of the reversed version.) Assuming rbit is as fast as add / sub, this is cheaper than using multiple instructions to clear the low bit.
If you really want 11 bits then you need to avoid the redundancy of 2x 6-bit being able to have either index larger than the other. Like maybe have 6-bit a and 5-bit b, and have a<=b mean something special like b+=32. I haven't thought this through fully. You need to be able to encode 2 adjacent bits either near the top or bottom of the registers, or the 2 set bits could be as far apart as 28 bits, if we consider wrapping at the boundaries like a 56-bit rotate.
Melpomene's suggestion to isolate the low and high set bits might be useful as part of something else, but is only useful for encoding on targets where you only have one direction of bit-scan available, not both. Even so, you wouldn't actually use both expressions. Leading-zero count doesn't require you to isolate the low bit, you just need to clear it to get at the high bit.
Footnote 1: decoding on x86 is also cheap: x |= (1<<a) is 1 instruction: bts. But many compilers have missed optimizations and don't notice this, instead actually shifting a 1. bts reg, reg is 1 uop / 1 cycle latency on Intel since PPro, or sometimes 2 uops on AMD. (Only the memory destination version is slow.) https://agner.org/optimize/
Best encoding performance on AMD CPUs requires BMI1 tzcnt / lzcnt because bsr and bsf are slower (6 uops instead of 1 https://agner.org/optimize/). On Ryzen, lzcnt is 1 uop, 1c latency, 4 per clock throughput. But tzcnt is 2 uops.
With BMI1, the compiler could use blsr to clear the lowest set bit of a register (and copy it), i.e. modern x86 has an instruction for dst = (SRC-1) & SRC that's single-uop on Intel but 2 uops on AMD.
But with lzcnt being more efficient than tzcnt on AMD Ryzen, probably the best asm for AMD doesn't use it.
Or maybe something like this (assuming exactly 2 bits, which apparently we can do).
(This asm is what you'd like to get your compiler to emit. Don't actually use inline asm!)
Ryzen_encode_scalar: ; input in RDI, output in EAX
lzcnt rcx, rdi ; 63-high bit index
tzcnt rdx, rdi ; low bit
mov eax, 63
sub eax, ecx
shl edx, 6
or eax, edx ; (low_bit << 6) | high_bit
ret ; goes away with inlining.
Shifting the low bit-index balances the lengths of the critical path, allowing better instruction-level parallelism, if we need 63-CLZ for the high bit.
Throughput: 7 uops total, and no execution-unit bottlenecks. So at 5 uops per clock pipeline width, that's better than 1 per 2 clocks.
Skylake_encode_scalar: ; input in RDI, output in EAX
tzcnt rax, rdi ; low bit. No false dependency on Skylake. GCC will probably xor-zero RAX because there is one on Broadwell and earlier.
bsr rdi, rdi ; high bit index. same,same reg avoids false dep
shl eax, 6
or eax, edi ; (low_bit << 6) | high_bit
ret ; goes away with inlining.
This has 5 cycle latency from input to output: bitscan instructions are 3 cycles on Intel vs. 1 on AMD. SHL + OR each add 1 cycle.
For throughput, we only bottleneck on one bit-scan per cycle (execution port 1), so we can do one encode per 2 cycles with 4 uops of front-end bandwidth left over for load, store, and loop overhead (or something else), assuming we have multiple independent encodes to do.
(But for the multiple independent encode case, SIMD may still be better for both AMD and Intel, if a cheap emulation of vplzcntq exists and the data is coming from memory.)
Scalar decode can be something like this:
decode: ;; input in EDI, output in RAX
xor eax, eax ; RAX=0
bts rax, rdi ; RAX |= 1ULL << (high_bit_idx & 63)
shr edi, 6 ; extract low_bit_idx
bts rax, rdi ; RAX |= 1ULL << low_bit_idx
ret
This has 3 shifts (including the bts) which on Skylake can only run on port0 or port6. So on Intel it only costs 4 uops for the front-end (so 1 per clock as part of doing something else). But if doing only this, it bottlenecks on shift throughput at 1 decode per 1.5 clock cycles.
On a 4GHz CPU, that's 2.666 billion decodes per second, so yeah we're doing pretty well hitting your targets :)
On Ryzen, bts reg,reg is 2 uops, with 0.5c throughput, but shr can run on any port. So it doesn't steal throughput from bts, and the whole thing is 6 uops (vs. Ryzen's pipeline being 5-wide at the narrowest point). So 1 decode per 1.2 clock cycles, just bottlenecked on front-end cost.
With BMI2 available, starting with a 1 in a register and using shlx rax, rbx, rdi can replace the xor-zeroing + first BTS with a single uop, assuming the 1 in a register can be reused in a loop.
(This optimization is totally dependent on your compiler to find; flag-less shifts are just more efficient ways to copy-and-shift that become available with -march=haswell or -march=znver1, or other targets that have BMI2.)
Either way you're just going to write retval = 1ULL << (packed & 63) for decoding the first bit. But if you're wondering which compilers make nice code here, this is what you're looking for.
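As a complete C sketch of the decode (the function name is mine), matching the (low_bit << 6) | high_bit packing from the encode above:
#include <stdint.h>

static inline uint64_t decode2(unsigned packed)
{
    uint64_t r = 1ULL << (packed & 63); // high-bit index is in the low 6 bits
    r |= 1ULL << ((packed >> 6) & 63);  // low-bit index is in the next 6 bits
    return r;
}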

Get sum of values stored in __m256d with SSE/AVX

Is there a way to get the sum of the values stored in a __m256d variable? I have this code.
acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec));
//acc in this point contains {2.0, 8.0, 18.0, 32.0}
acc = _mm256_hadd_pd(acc, acc);
result[i] = ((double*)&acc)[0] + ((double*)&acc)[2];
This code works, but I want to replace it with SSE/AVX instructions.
It appears that you're doing a horizontal sum for every element of an output array. (Perhaps as part of a matmul?) This is usually sub-optimal; try to vectorize over the 2nd-from-inner loop so you can produce result[i + 0..3] in a vector and not need a horizontal sum at all.
For a dot-product of an array larger than one vector, sum vertically (into multiple accumulators), only hsumming once at the end.
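For example, a plain array sum with two accumulators might look like this sketch (my own; it assumes n is a multiple of 8 and uses the hsum_double_avx helper defined below):
#include <immintrin.h>
#include <stddef.h>

double sum_array(const double *a, size_t n)
{
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 8) { // 2 accumulators hide FP-add latency
        acc0 = _mm256_add_pd(acc0, _mm256_loadu_pd(a + i));
        acc1 = _mm256_add_pd(acc1, _mm256_loadu_pd(a + i + 4));
    }
    return hsum_double_avx(_mm256_add_pd(acc0, acc1)); // hsum only once, at the end
}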
For horizontal reductions in general, see Fastest way to do horizontal SSE vector sum (or other reduction) - extract the high half and add to the low half. Repeat until you're down to 1 element.
If you're using this inside an inner loop, you definitely don't want to be using hadd(same,same). That costs 2 shuffle uops instead of 1, unless your compiler saves you from yourself. (And gcc/clang don't.) hadd is good for code-size but pretty much nothing else when you only have 1 vector. It can be useful and efficient with two different inputs.
For AVX, this means the only 256-bit operation we need is an extract, which is fast on AMD and Intel. Then the rest is all 128-bit:
#include <immintrin.h>
inline double hsum_double_avx(__m256d v) {
    __m128d vlow  = _mm256_castpd256_pd128(v);
    __m128d vhigh = _mm256_extractf128_pd(v, 1); // high 128
    vlow = _mm_add_pd(vlow, vhigh);              // reduce down to 128
    __m128d high64 = _mm_unpackhi_pd(vlow, vlow);
    return _mm_cvtsd_f64(_mm_add_sd(vlow, high64)); // reduce to scalar
}
If you wanted the result broadcast to every element of a __m256d, you'd use vshufpd and vperm2f128 to swap high/low halves (if tuning for Intel). And use 256-bit FP add the whole time. If you cared about early Ryzen at all, you might reduce to 128, use _mm_shuffle_pd to swap, then vinsertf128 to get a 256-bit vector. Or with AVX2, vbroadcastsd on the final result of this. But that would be slower on Intel than staying 256-bit the whole time while still avoiding vhaddpd.
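A sketch of that Intel-tuned, all-256-bit broadcast version (untested; the function name is mine):
#include <immintrin.h>

__m256d hsum_bcast_avx(__m256d v)
{
    __m256d swapped = _mm256_permute2f128_pd(v, v, 0x01);  // swap the 128-bit halves
    __m256d sum     = _mm256_add_pd(v, swapped);           // each lane holds {v0+v2, v1+v3}
    __m256d shuf    = _mm256_shuffle_pd(sum, sum, 0b0101); // swap the pair within each lane
    return _mm256_add_pd(sum, shuf);                       // total in all 4 elements
}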
Compiled with gcc7.3 -O3 -march=haswell on the Godbolt compiler explorer
vmovapd xmm1, xmm0 # silly compiler, vextract to xmm1 instead
vextractf128 xmm0, ymm0, 0x1
vaddpd xmm0, xmm1, xmm0
vunpckhpd xmm1, xmm0, xmm0 # no wasted code bytes on an immediate for vpermilpd or vshufpd or anything
vaddsd xmm0, xmm0, xmm1 # scalar means we never raise FP exceptions for results we don't use
vzeroupper
ret
After inlining (which you definitely want it to), vzeroupper sinks to the bottom of the whole function, and hopefully the vmovapd optimizes away, with vextractf128 into a different register instead of destroying xmm0 which holds the _mm256_castpd256_pd128 result.
On first-gen Ryzen (Zen 1 / 1+), according to Agner Fog's instruction tables, vextractf128 is 1 uop with 1c latency, and 0.33c throughput.
@PaulR's version is unfortunately terrible on AMD before Zen 2; it's like something you might find in an Intel library or compiler output as a "cripple AMD" function. (I don't think Paul did that on purpose, I'm just pointing out how ignoring AMD CPUs can lead to code that runs slower on them.)
On Zen 1, vperm2f128 is 8 uops, 3c latency, and one per 3c throughput. vhaddpd ymm is 8 uops (vs. the 6 you might expect), 7c latency, one per 3c throughput. Agner says it's a "mixed domain" instruction. And 256-bit ops always take at least 2 uops.
# Paul's version # Ryzen # Skylake
vhaddpd ymm0, ymm0, ymm0 # 8 uops # 3 uops
vperm2f128 ymm1, ymm0, ymm0, 49 # 8 uops # 1 uop
vaddpd ymm0, ymm0, ymm1 # 2 uops # 1 uop
# total uops: # 18 # 5
vs.
# my version with vmovapd optimized out: extract to a different reg
vextractf128 xmm1, ymm0, 0x1 # 1 uop # 1 uop
vaddpd xmm0, xmm1, xmm0 # 1 uop # 1 uop
vunpckhpd xmm1, xmm0, xmm0 # 1 uop # 1 uop
vaddsd xmm0, xmm0, xmm1 # 1 uop # 1 uop
# total uops: # 4 # 4
Total uop throughput is often the bottleneck in code with a mix of loads, stores, and ALU, so I expect the 4-uop version is likely to be at least a little better on Intel, as well as much better on AMD. It should also make slightly less heat, and thus allow slightly higher turbo / use less battery power. (But hopefully this hsum is a small enough part of your total loop that this is negligible!)
The latency is not worse, either, so there's really no reason to use an inefficient hadd / vpermf128 version.
Zen 2 and later have 256-bit wide vector registers and execution units (including shuffle). They don't have to split lane-crossing shuffles into many uops, but conversely vextractf128 is no longer about as cheap as vmovdqa xmm. Zen 2 is a lot closer to Intel's cost model for 256-bit vectors.
You can do it like this:
acc = _mm256_hadd_pd(acc, acc); // horizontal add top lane and bottom lane
acc = _mm256_add_pd(acc, _mm256_permute2f128_pd(acc, acc, 0x31)); // add lanes
result[i] = _mm256_cvtsd_f64(acc); // extract double
Note: if this is in a "hot" (i.e. performance-critical) part of your code (especially if running on an AMD CPU) then you might instead want to look at Peter Cordes's answer regarding more efficient implementations.
In gcc and clang SIMD types are built-in vector types. E.g.:
# avxintrin.h
typedef double __m256d __attribute__((__vector_size__(32), __aligned__(32)));
These built-in vectors support indexing, so you can write it conveniently and leave it up to the compiler to make good code:
double hsum_double_avx2(__m256d v) {
    return v[0] + v[1] + v[2] + v[3];
}
clang-14 -O3 -march=znver3 -ffast-math generates the same assembly as it does for Peter Cordes's intrinsics:
# clang -O3 -ffast-math
hsum_double_avx2:
vextractf128 xmm1, ymm0, 1
vaddpd xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
vaddsd xmm0, xmm0, xmm1
vzeroupper
ret
Unfortunately gcc does much worse: it generates sub-optimal instructions, not taking advantage of the freedom to re-associate the 3 + operations, and uses vhaddpd xmm to do the v[0] + v[1] part, which costs 4 uops on Zen 3. (Or 3 uops on Intel CPUs, 2 shuffles + an add.)
-ffast-math is of course necessary for the compiler to be able to do a good job, unless you write it as (v[0]+v[2]) + (v[1]+v[3]). With that, clang still makes the same asm with -O3 -march=icelake-server without -ffast-math.
Ideally, I want to write plain code as I did above and let the compiler use a CPU-specific cost model to emit optimal instructions in right order for that specific CPU.
One reason being that a labour-intensive hand-coded optimal version for Haswell may well be suboptimal for Zen 3. For this problem specifically, that's not really the case: starting by narrowing to 128-bit with vextractf128 + vaddpd is optimal everywhere. There are minor variations in shuffle throughput on different CPUs; for example Ice Lake and later Intel can run vshufps on port 1 or 5, but some shuffles like vpermilps/pd or vunpckhpd still only on port 5. Zen 3 (like Zen 2 and 4) has good throughput for either of those shuffles, so clang's asm happens to be good there. But it's unfortunate that clang -march=icelake-server still uses vpermilpd.
A frequent use-case nowadays is computing in the cloud with diverse CPU models and generations, compiling the code on that host with -march=native -mtune=native for best performance.
In theory, if compilers were smarter, this would optimize short sequences like this to ideal asm, as well as making generally good choices for heuristics like inlining and unrolling. It's usually the best choice for a binary that will run on only one machine, but as GCC demonstrates here, the results are often far from optimal. Fortunately modern AMD and Intel aren't too different most of the time, having different throughputs for some instructions but usually being single-uop for the same instructions.

Fallback implementation for conflict detection in AVX2

AVX512CD contains the intrinsic _mm512_conflict_epi32(__m512i a). It returns a vector where, for every element in a, bits are set for the earlier elements that have the same value. Is there a way to do something similar in AVX2?
I'm not interested in the exact bits, I just need to know which elements are duplicates of the elements to their left (or right). I simply need to know if a scatter would conflict.
Basically I need an AVX2 equivalent for
__m256i detect_conflict(__m256i a) {
    __m256i cd = _mm256_conflict_epi32(a);
    return _mm256_cmpgt_epi32(cd, _mm256_set1_epi32(0));
}
The only way I could think of is to use _mm256_permutevar8x32_epi32() to shift each value right by 1 (across the lanes) and then do seven compares, mask out the unused bits and then _mm256_or_si256() them together, which is horribly slow.
TL:DR: Since full detection of which elements conflict is expensive, it's probably worth doing more fall-back work in exchange for cheaper detection. This depends on your conflict-handling options / strategies.
I came up with a fairly efficient way to check for presence/absence of conflicts without finding their locations, like this answer for 64-bit integer elements. It's actually faster than Skylake-AVX512's micro-coded vpconflictd ymm, but of course it gives you much less information. (KNL has fast vpconflictd.)
You could use a fully-scalar fallback for all the elements if there are any conflicts. This would work well if conflicts are rare enough that branch-mispredicts don't kill performance. (AVX2 doesn't have scatter instructions in the first place, though, so I'm not sure exactly what you need this for.)
The only-left or only-right behaviour is hard, but my method can give you a mask of which elements have conflicts with any other element (e.g. v[0] == v[3] would result in both conflict[0] and conflict[3] being true). This costs only 1 extra shuffle, or maybe 0 with a redesign with this goal in mind.
(I misread the question at first; I thought you wanted to check both directions, rather than talking about two different implementation options for most of what vpconflictd does. Actually at first I thought you just wanted a presence/absence check, like bool any_conflicts(__m256i).)
Finding presence/absence of any conflicts: bool any_conflicts32(__m256i)
8 choose 2 is 28 total scalar comparisons. That's 3.5 vectors of packed comparisons. We should aim to do it with 4 vector compares, which leaves room for some redundancy.
Creating inputs for those compares will require shuffles, and some of those will have to be lane-crossing. 4 unique comparisons require at least 4 vectors (including the initial unshuffled copy), since 3 choose 2 is only 3.
Ideally as few as possible of the shuffles are lane-crossing, and there is lots of ILP for the compares and ORing of compare results. Also nice if the shuffles don't need a vector shuffle-control, just an imm8. Also good if they're not slow on AMD Ryzen, where 256b instructions are decoded into multiple 128b uops. (Some shuffles are worse than others for this, e.g. vperm2i128 is very bad; much worse than vpermq for swapping the high and low halves of a single vector. Unfortunately clang gets this wrong even with -mtune=znver1, and compiles _mm256_permute4x64_epi64 into vperm2i128 whenever it can).
I found a solution pretty early that achieves most of these goals: 3 shuffles, 4 compares. One of the shuffles is in-lane. All of them use an immediate control byte instead of a vector.
// returns a 0 or non-zero truth value
int any_conflicts32(__m256i v)
{
    __m256i hilo = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2)); // vpermq is much more efficient than vperm2i128 on Ryzen and KNL, same on HSW/SKL.
    __m256i inlane_rotr1 = _mm256_shuffle_epi32(v, _MM_SHUFFLE(0,3,2,1));
    __m256i full_rotl2 = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(2,1,0,3));

    __m256i v_ir1 = _mm256_cmpeq_epi32(v, inlane_rotr1);
    __m256i v_hilo = _mm256_cmpeq_epi32(v, hilo); // only really needs to be a 128b operation on the low lane, leaving the upper lane zero.
        // But there's no ideal way to express that with intrinsics, since _mm256_castsi128_si256 technically leaves the high lane undefined.
        // It's extremely likely that casting down and back up would always compile to correct code, though (using the result in a zero-extended register).
    __m256i hilo_ir1 = _mm256_cmpeq_epi32(hilo, inlane_rotr1);
    __m256i v_fl2 = _mm256_cmpeq_epi32(v, full_rotl2);

    __m256i t1 = _mm256_or_si256(v_ir1, v_hilo);
    __m256i t2 = _mm256_or_si256(t1, v_fl2);
    __m256i conflicts = _mm256_or_si256(t2, hilo_ir1); // A serial dep chain instead of a tree is probably good because of resource conflicts from limited shuffle throughput

    // if you're going to branch on this, movemask/test/jcc is more efficient than ptest/jcc
    unsigned conflict_bitmap = _mm256_movemask_epi8(conflicts); // With these shuffles, positions in the bitmap aren't actually meaningful
    return (bool)conflict_bitmap;
    // or: return conflict_bitmap; if the caller wants the raw (position-scrambled) bitmap
}
How I designed this:
I made a table of all the element-pairs that needed to be checked, and made columns for which shuffled operands could take care of that requirement.
I started with a few shuffles that could be done cheaply, and it turned out my early guesses worked well enough.
My design notes:
/*  7 6 5 4 | 3 2 1 0
    h g f e | d c b a
    e h g f | a d c b   // inlanerotr1 = vpshufd(v)
    f e d c | b a h g   // fullrotl2 = vpermq(v)
    d c b a | h g f e   // hilo = vperm2i128(v) or vpermq. v:hilo has lots of redundancy. The low half has all the information.
v:lrot1 v:frotr2 lrotr1:frotl2 (incomplete)
* ab [0]v:lrotr1 [3]lr1:fl2
* ac [2]v:frotl2
* ad [3]v:lrotr1 [2]lr1:fl2
* ae [0,4]v:hilo
* af [4]hilo:lrotr1
* ag [0]v:frotl2
* ah [3]hilo:lrotr1
* bc [1]v:lrotr1
* bd [3]v:frotl2 [5]hilo:frotl2
* be [0]hilo:lrotr1
* bf [1,5]v:hilo
* bg [0]lr1:fl2 [5]hilo:lrotr1
* bh [1]v:frotl2
* cd [2]v:lrotr1
* ce [4]v:frotl2 [4]lr1:fl2
* cf [1]hilo:lrotr1
* cg [2,6]v:hilo
* ch [1]lr1:fl2 [6]hilo:lrotr1
* de [7]hilo:lrotr1
* df [5]v:frotl2 [7]hilo:frotl2
* dg [5]lr1:fl2 [2]hilo:lrotr1
* dh [3,7]v:hilo
* ef [4]v:lrotr1 [7]lr1:fl2
* eg [6]v:frotl2
* eh [7]v:lrotr1 [6]lr1:fl2
* fg [5]v:lrotr1
* fh [7]v:frotl2
* gh [6]v:lrotr1
*/
It turns out that in-lane rotr1 == full rotl2 has a lot of redundancy, so it's not worth using. It also turns out that having all the allowed redundancy in v==hilo works fine.
If you care about which result is in which element (rather than just checking for presence/absence), then v == swap_hilo(lrotr1) could work instead of lrotr1 == hilo. But we also need swap_hilo(v), so this would mean an extra shuffle. We could instead shuffle after hilo==lrotr1, for better ILP. Or maybe there's a different set of shuffles that gives us everything. Maybe if we consider VPERMD with a vector shuffle-control...
Compiler asm output vs. optimal asm
gcc6.3 -O3 -march=haswell produces:
Haswell has one shuffle unit (on port5).
# assume ymm0 ready on cycle 0
vpermq ymm2, ymm0, 78 # hilo ready on cycle 3 (execution started on cycle 0)
vpshufd ymm3, ymm0, 57 # lrotr1 ready on cycle 2 (started on cycle 1)
vpermq ymm1, ymm0, 147 # frotl2 ready on cycle 5 (started on 2)
vpcmpeqd ymm4, ymm2, ymm0 # starts on 3, ready on 4
vpcmpeqd ymm1, ymm1, ymm0 # starts on 5, ready on 6
vpcmpeqd ymm2, ymm2, ymm3 # starts on 3, ready on 4
vpcmpeqd ymm0, ymm0, ymm3 # starts on 2, ready on 3
vpor ymm1, ymm1, ymm4 # starts on 6, ready on 7
vpor ymm0, ymm0, ymm2 # starts on 4, ready on 5
vpor ymm0, ymm1, ymm0 # starts on 7, ready on 8
# a different ordering of VPOR merging could have saved a cycle here. /scold gcc
vpmovmskb eax, ymm0
vzeroupper
ret
So the best-case latency is 8 cycles to have a single vector ready, given resource conflicts from other instructions in this sequence but assuming no conflicts with past instructions still in the pipeline. (Should have been 7 cycles, but gcc re-ordered the dependency structure of my intrinsics putting more stuff dependent on the compare of the last shuffle result.)
This is faster than Skylake-AVX512's vpconflictd ymm, which has 17c latency, one per 10c throughput. (Of course, that gives you much more information, and @harold's emulation of it takes many more instructions.)
Fortunately gcc didn't re-order the shuffles and introduce a potential write-back conflict. (e.g. putting the vpshufd last would mean that dispatching the shuffle uops to port5 in oldest-first order would have the vpshufd ready in the same cycle as the first vpermq (1c latency vs. 3c).) gcc did this for one version of the code (where I compared the wrong variable), so it seems that gcc -mtune=haswell doesn't take this into account. (Maybe it's not a big deal, I haven't measured to see what the real effect on latency is. I know the scheduler is smart about picking uops from the Reservation Station to avoid actual write-back conflicts, but IDK how smart it is, i.e. whether it would run the vpshufd ahead of a later vpermq to avoid a write-back conflict, since it would have to look-ahead to even see the upcoming writeback conflict. More likely it would just delay the vpshufd for an extra cycle before dispatching it.)
Anyway, this is why I put _mm_shuffle_epi32 in the middle in the C source, where it makes things easy for OOO execution.
Clang 4.0 goes berserk and packs each compare result down to 128b vectors (with vextracti128 / vpacksswb), then expands back to 256b after three vpor xmm before pmovmskb. I thought at first it was doing this because of -mtune=znver1, but it does it with -mtune=haswell as well. It does this even if we return a bool, which would let it just pmovmskb / test on the packed vector. /facepalm. It also pessimizes the hilo shuffle to vperm2i128, even with -mtune=znver1 (Ryzen), where vperm2i128 is 8 uops but vpermq is 3. (Agner Fog's insn tables are for some reason missing those, so I took those numbers from the FP equivalents vperm2f128 and vpermpd.)
@harold says that using add instead of or stops clang from packing/unpacking, but vpaddd has lower throughput than vpor on Intel pre-Skylake.
Even better for Ryzen, the v == hilo compare can do only the low half. (i.e. use vpcmpeqd xmm2, xmm2, xmm3, which is only 1 uop instead of 2). We still need the full hilo for hilo == lrot1, though. So we can't just use vextracti128 xmm2, xmm0, 1 instead of the vpermq shuffle. vextracti128 has excellent performance on Ryzen: 1 uop, 1c latency, 0.33c throughput (can run on any of P0/1/3).
Since we're ORing everything together, it's fine to have zeros instead of redundant compare results in the high half.
As I noted in comments, IDK how to safely write this with intrinsics. The obvious way would be to use _mm256_castsi128_si256 (_mm_cmpeq_epi32(v, hilo)), but that technically leaves the high lane undefined, rather than zero. There's no sane way a compiler would do anything other than use the full-width ymm register that contains the xmm register with the 128b compare result, but it would be legal according to Intel's docs for a Deathstation-9000 compiler to put garbage there. Any explicit way of getting zeros in the high half would depend on the compiler optimizing it away. Maybe _mm256_setr_m128i(cmpresult, _mm_setzero_si128());.
There are no current CPUs with AVX512F but not AVX512CD. But if that combo is interesting or relevant, clang makes some interesting asm from my code with -mavx512f -mavx512vl. It uses EVEX vpcmpeqd into mask registers, and korw to merge them. But then it expands that back into a vector to set up for vpmovmaskb, instead of just optimizing away the movemask and using the korw result. /facepalm.

Sorting 64-bit structs using AVX?

I have a 64-bit struct which represents several pieces of data, one of which is a floating point value:
struct MyStruct{
    uint16_t a;
    uint16_t b;
    float f;
};
and I have four of these structs in, let's say, an std::array<MyStruct, 4>.
Is it possible to use AVX to sort the array in terms of the float member MyStruct::f?
Sorry this answer is messy; it didn't all get written at once and I'm lazy. There is some duplication.
I have 4 separate ideas:
Normal sorting, but moving the struct as a 64bit unit
Vectorized insertion-sort as a building block for qsort
Sorting networks, with a comparator implementation using cmpps / blendvpd instead of minps/maxps. The extra overhead might kill the speedup, though.
Sorting networks: load some structs, then shuffle/blend to get some registers of just floats and some registers of just payload. Use Timothy Furtak's technique of doing a normal minps/maxps comparator and then cmpeqps min,orig -> masked xor-swap on the payload. This sorts twice as much data per comparator, but does require matching shuffles on two registers between comparators. Also requires re-interleaving when you're done (but that's easy with unpcklps / unpckhps, if you arrange your comparators so those in-lane unpacks will put the final data in the right order).
This also avoids potential slowdowns that some CPUs may have when doing FP comparisons on bit patterns in the payload that represent denormals, NaNs, or infinities, without resorting to setting the denormals-are-zero bit in MXCSR.
Furtak's paper suggests doing a scalar cleanup after getting things mostly sorted with vectors, which would reduce the amount of shuffling a lot.
Normal sorting
There's at least a small speedup to be gained when using normal sorting algorithms, by moving the whole struct around with 64bit loads/stores, and doing a scalar FP compare on the FP element. For this idea to work as well as possible, order your struct with the float value first, then you could movq a whole struct into an xmm reg, and the float value would be in the low32 for ucomiss. Then you (or maybe a smart compiler) could store the struct with a movq.
Looking at the asm output that Kerrek SB linked to, compilers seem to do a rather bad job of efficiently copying structs around:
icc seems to movzx the two uint values separately, rather than scooping up the whole struct in a 64b load. Maybe it doesn't pack the struct? gcc 5.1 doesn't seem to have that problem most of the time.
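As a sketch of that struct-reorder idea (my own; the reordered struct and function names are hypothetical), putting f first so each struct is one 64-bit chunk with the sort key in the low 32 bits:
#include <algorithm>
#include <array>
#include <cstdint>

struct MyStructF {
    float f;        // sort key first: the whole struct can move as one 64-bit load/store
    uint16_t a, b;  // payload in the high 32 bits
};

void sort4(std::array<MyStructF, 4> &arr)
{
    // The scalar compare on f compiles to a plain comiss/ucomiss on the low 32 bits.
    std::sort(arr.begin(), arr.end(),
              [](const MyStructF &x, const MyStructF &y) { return x.f < y.f; });
}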
Speeding up insertion-sort
Big sorts usually divide-and-conquer with insertion sort for small-enough problems. Insertion sort copies array elements over by one, stopping only when we find we've reached the spot where the current element belongs. So we need to compare one element to a sequence of packed elements, stopping if the comparison is true for any. Do you smell vectors? I smell vectors.
# RSI points to struct { float f; uint... payload; } buf[];
# RDI points to the next element to be inserted into the sorted portion
# [ rsi to rdi ) is sorted, the rest isn't.
##### PROOF OF CONCEPT: debug / finish writing before using! ######
.new_elem:
vbroadcastsd ymm0, [rdi] # broadcast the whole struct
mov rdx, rdi
.search_loop:
sub rdx, 32
vmovups ymm1, [rdx] # load some sorted data
vcmplt_oqps ymm2, ymm0, ymm1 # all-ones in any element where ymm0[i] < ymm1[i] (FP compare, false if either is NaN).
vmovups [rdx+8], ymm1 # shuffle it over to make space, usual insertion-sort style
cmp rdx, rsi
jbe .endsearch # below-or-equal (addresses are unsigned)
movmskps eax, ymm2
test al, 0b01010101 # test only the compare results for the float elements, not the payload halves
jz .search_loop # [rdi] wasn't less than any of the 4 elements
.endsearch:
# TODO: scalar loop to find out where the new element goes.
# All we know is that it's less than one of the elements in ymm1, but not which
add rdi, 8
vmovsd [rdx], xmm0
cmp rdi, r8 # pointer to the end of the buf
jle .new_elem
# worse alternative to movmskps / test:
# vtestps ymm2, ymm7 # where ymm7 is loaded with 1s in the odd (float) elements, and 0s in the even (payload) elements.
# vtestps is like PTEST, but only tests the high bit. If the struct was in the other order, with the float high, vtestpd against a register of all-1s would work, as that's more convenient to generate.
This is certainly full of bugs, and I should have just written it in C with intrinsics.
This is an insertion sort with probably more overhead than most, so it might lose to a scalar version for very small problem sizes, due to the extra complexity of handling the first few elements (which don't fill a vector), and of figuring out where to put the new element after breaking out of the vector search loop that checked multiple elements.
Probably pipelining the loop so we haven't stored ymm1 until the next iteration (or after breaking out) would save a redundant store. Doing the compares in registers by shifting / shuffling them, instead of literally doing scalar load/compares would probably be a win. This could end up with way too many unpredictable branches, and I'm not seeing a nice way to end up with the high 4 packed in a reg for vmovups, and the low one in another reg for vmovsd.
I may have invented an insertion sort that's the worst of both worlds: slow for small arrays because of more work after breaking out of the search loop, but still insertion sort: slow for large arrays because of O(n^2). However, if the code outside the search loop can be made non-horrible, this could be useful as the small-array endpoint for qsort / mergesort.
Anyway, if anyone does develop this idea into actual debugged and working code, let us know.
update: Timothy Furtak's paper describes an SSE implementation for sorting short arrays (for use as a building block for bigger sorts, like this insertion sort). He suggests producing a partially-ordered result with SSE, and then doing a cleanup with scalar ops. (insertion-sort on a mostly-sorted array is fast.)
Which leads us to:
Sorting Networks
There might not be any speedup here. Xiaochen, Rocki, and Suda only report a 3.7x speedup from scalar -> AVX-512 for 32bit (int) elements, for single-threaded mergesort, on a Xeon Phi card. With wider elements, fewer fit in a vector reg. (That's a factor of 4 for us: 64b elements in 256b, vs. 32b elements in 512b.) They also take advantage of AVX512 masks to only compare some lanes, a feature not available in AVX. Plus, with a slower comparator function that competes for the shuffle/blend unit, we're already in worse shape.
Sorting networks can be constructed using SSE/AVX packed-compare instructions. (More usually, with a pair of min/max instructions that effectively do a set of packed 2-element sorts.) Larger sorts can be built up out of an operation that does pairwise sorts. This paper by Tian Xiaochen, Kamil Rocki and Reiji Suda at U of Tokyo has some real AVX code for sorting (without payloads), and discussion of how it's tricky with vector registers because you can't compare two elements that are in the same register (so the sorting network has to be designed to not require that). They use pshufd to line up elements for the next comparison, to build up a larger sort out of sorting just a few registers full of data.
Now, the trick is to do a sort of pairs of packed 64b elements, based on the comparison of only half an element. (i.e. Keeping the payload with the sort key.) We could potentially sort other things this way, by sorting an array of (key, payload) pairs, where the payload can be an index or 32bit pointer (mmap(MAP_32bit), or x32 ABI).
So let's build ourselves a comparator. In sorting-network parlance, that's an operation that sorts a pair of inputs. So it either swaps elements between registers, or it doesn't.
# AVX comparator for SnB/IvB
# struct { uint16_t a, b; float f; } inputs in ymm0, ymm1
# NOTE: struct order with f second saves a shuffle to extend the mask
vcmpps ymm7, ymm0, ymm1, _CMP_LT_OQ # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
# ymm7 32bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
# vblendvpd checks the high bit of the 64b element, so mask *doesn't* need to be extended to the low32
vblendvpd ymm2, ymm1, ymm0, ymm7
vblendvpd ymm3, ymm0, ymm1, ymm7
# result: !(ymm2[i] > ymm3[i]) (i.e. ymm2[i] < ymm3[i], or they're equal or unordered (NaN).)
# UNTESTED
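An intrinsics version of that comparator might look like this sketch (my own untested translation; it treats each packed struct as one 64-bit lane of a __m256d, with the float key in the high 32 bits as noted above):
#include <immintrin.h>

// Sketch: 4-wide comparator keyed on f; the payload travels with its key.
static inline void comparator_avx(__m256d x, __m256d y, __m256d *lo, __m256d *hi)
{
    // Compare all 8 float lanes; only the key lanes matter, because vblendvpd
    // looks at bit 63 of each 64-bit element, i.e. the sign of the key compare.
    __m256d lt = _mm256_castps_pd(_mm256_cmp_ps(_mm256_castpd_ps(x), _mm256_castpd_ps(y), _CMP_LT_OQ));
    *lo = _mm256_blendv_pd(y, x, lt); // takes x[i] where x[i].f < y[i].f
    *hi = _mm256_blendv_pd(x, y, lt);
}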
You might need to set the MXCSR to make sure that int bits don't slow down your FP ops if they happen to represent a denormal or NaN float. I'm not sure if that happens only for mul/div, or if it would affect compare.
Intel Haswell: Latency: 5 cycles for ymm2 to be ready, 7 cycles for ymm3. Throughput: one per 4 cycles. (p5 bottleneck).
Intel Sandybridge/Ivybridge: Latency: 5 cycles for ymm2 to be ready, 6 cycles for ymm3. Throughput: one per 2 cycles. (p0/p5 bottleneck).
AMD Bulldozer/Piledriver: (vblendvpd ymm: 2c lat, 2c recip tput): lat: 4c for ymm2, 6c for ymm3. Or worse, with bypass delays between cmpps and blend. tput: one per 4c. (bottleneck on vector P1)
AMD Steamroller: (vblendvpd ymm: 2c lat, 1c recip tput): lat: 4c for ymm2, 5c for ymm3. or maybe 1 higher because of bypass delays. tput: one per 3c (bottleneck on vector ports P0/1, for cmp and blend).
VBLENDVPD is 2 uops. (It has 3 reg inputs, so it can't be 1 uop :/). Both uops can only run on shuffle ports. On Haswell, that's only port5. On SnB, that's p0/p5. (IDK why Haswell halved the shuffle / blend throughput compared to SnB/IvB.)
If AMD designs had 256b-wide vector units, their lower-latency FP compare and single-macro-op decoding of 3-input instructions would put them ahead.
The usual minps/maxps pair is 3 and 4 cycles latency (ymm2/3), and one per 2 cycles throughput (Intel). (p1 bottleneck on the FP add/sub/compare unit). The most fair comparison is probably to sorting 64bit doubles. The extra latency may hurt if there aren't multiple pairs of independent registers to be compared. The halved throughput on Haswell will cut into any speedups pretty heavily.
Also keep in mind that shuffles are needed between comparator operations to get the right elements lined up for comparison. min/maxps leave the shuffle ports unused, but my cmpps/blendv version saturates them, meaning the shuffling can't overlap with comparing, except as something to fill gaps left by data dependencies.
With hyperthreading, another thread that can keep the other ports busy (e.g. port 0/1 fp mul/add units, or integer code) would share a core quite nicely with this blend-bottlenecked version.
I attempted another version for Haswell, which does the blends "manually" using bitwise AND/OR operations. It ended up slower, though, because both sources have to get masked both ways before combining.
# AVX2 comparator for Haswell
# struct { float f; uint16_t a, b; } inputs in ymm0, ymm1
#
vcmpps ymm7, ymm0, ymm1, _CMP_LT_OQ # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
# ymm7 32bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
vshufps ymm7, ymm7, ymm7, mask(0, 0, 2, 2) # extend the mask to the payload part. There's no mask function, I just don't want to work out the result in my head.
vpand ymm10, ymm7, ymm0 # ymm10 = ymm0 keeping elements where ymm0[i] < ymm1[i]
vpandn ymm11, ymm7, ymm1 # ymm11 = ymm1 keeping elements where !(ymm0[i] < ymm1[i])
vpor ymm2, ymm10, ymm11 # ymm2 = min_packed_mystruct(ymm0, ymm1)
vpandn ymm10, ymm7, ymm0 # ymm10 = ymm0 keeping elements where !(ymm0[i] < ymm1[i])
vpand ymm11, ymm7, ymm1 # ymm11 = ymm1 keeping elements where ymm0[i] < ymm1[i]
vpor ymm3, ymm10, ymm11 # ymm3 = max_packed_mystruct(ymm0, ymm1)
# result: !(ymm2[i] > ymm3[i])
# UNTESTED
This is 8 uops, compared to 5 for the blendv version. There's a lot of parallelism in the last 6 and/andn/or instructions. cmpps has 3 cycle latency, though. I think ymm2 will be ready in 6 cycles, while ymm3 is ready in 7. (And can overlap with operations on ymm2). The insns following a comparator op will probably be shuffles, to put the data in the right elements for the next compare. There's no forwarding delay to/from the shuffle unit for integer-domain logicals, even for a vshufps, but the result should come out in the FP domain, ready for a vcmpps. Using vpand instead of vandps is essential for throughput.
Timothy Furtak's paper suggests an approach for sorting keys with a payload: don't pack the payload pointers with the keys, but instead generate a mask from the compare, and use it on both the keys and the payload the same way. This means you have to separate the payload from the keys either in your data structure, or every time you load a struct.
See the appendix of his paper (Fig. 12). He uses the standard min/max on the keys, and then uses cmpps to see which elements CHANGED. Then he ANDs that mask in the middle of an xor-swap to end up only swapping the payloads for the keys that swapped.
Unfortunately, original AVX has very limited shuffling across its 128-bit halves (i.e. lanes), so it is hard to sort the contents of a full 256-bit register. However, AVX2 has shuffling operations without such limitations, so we can perform a sort of 4 structs in a vectorized way.
I'll use the idea of this solution. In order to sort an array we have to do enough element comparisons to surely determine the permutation we need to apply. Given that no element is NaN, it is enough to check for each pair of different elements a and b whether a < b and whether a > b. Having this information, we can fully compare any two elements, which must be enough to determine the final sorting order. This is 6 pairs of 32-bit elements and two comparison modes, so we can end up doing two shuffles and two comparisons in AVX. If you are absolutely sure that all the elements are distinct, then you can avoid the a > b comparisons and reduce the size of the LUT.
For repacking of elements within a register we can use _mm256_permutevar8x32_ps. One instruction allows us to do an arbitrary shuffle at 32-bit granularity. Note that in the code I assume that the sorting key f is the first member of your struct (just as @PeterCordes proposed), but you can trivially use this solution for your current struct if you change the shuffling mask accordingly.
After we perform the comparisons, we have two AVX registers containing boolean results as 32-bit masks. The first six masks in each register are important, the last two are not. Then we want to convert these masks to a small integer in a general-purpose register, to be used as an index into a lookup table. In the general case we may have to create a perfect hash for it, but it is not necessary here. We can use _mm256_movemask_ps to get an 8-bit integer mask in a general-purpose register from an AVX register. Since the last two masks per register are not important, we can ensure that they are always zero. Then the resulting index would be in range [0..2^12).
Finally, we load a shuffling mask from the precomputed LUT with 4096 elements and pass it to _mm256_permutevar8x32_ps. As a result we obtain an AVX register with 4 properly sorted structs of your type. Precomputing the LUT is your home assignment =)
Here is the final code:
__m256i lut[4096]; // LUT of 128 KB size must be precomputed

__m256 Sort4(__m256 val) {
    __m256 aaabbcaa = _mm256_permutevar8x32_ps(val, _mm256_setr_epi32(0, 0, 0, 2, 2, 4, 0, 0));
    __m256 bcdcddaa = _mm256_permutevar8x32_ps(val, _mm256_setr_epi32(2, 4, 6, 4, 6, 6, 0, 0));
    __m256 cmpLt = _mm256_cmp_ps(aaabbcaa, bcdcddaa, _CMP_LT_OQ);
    __m256 cmpGt = _mm256_cmp_ps(aaabbcaa, bcdcddaa, _CMP_GT_OQ);
    int idxLt = _mm256_movemask_ps(cmpLt);
    int idxGt = _mm256_movemask_ps(cmpGt);
    __m256i shuf = lut[idxGt * 64 + idxLt];
    __m256 res = _mm256_permutevar8x32_ps(val, shuf);
    return res;
}
Here you can see the generated assembly. There are 14 instructions in total, 2 of them are for loading constant shuffling masks, and one of them is due to a useless 32-bit -> 64-bit conversion of the movemask results. So in a tight loop it would be 11-12 instructions. IACA says that four calls in a loop have 16.40 cycles throughput on Haswell, so it seems to achieve a throughput of 4.1 cycles per call.
Of course a 128 KB lookup table is too much unless you are going to process even more input data in one batch. It may be possible to reduce the LUT size by adding perfect hashing (sacrificing speed of course). It is hard to say how many orderings are possible on four elements, but clearly less than 4! * 2^3 = 192. I think a 256-element LUT is possible, maybe even a 128-element LUT. With perfect hashing it may be faster to combine two AVX registers into one with shift and xor, then do _mm256_movemask_epi8 once (instead of doing two _mm256_movemask_ps and combining them afterwards).