Arbitrary position 2-input shuffling using SSE - c++

I have two 4 component vectors which I load into two __m128 variables.
Then I need to shuffle those so that the result looks like this:
Given:
__m128 mmMin = _mm_load_ps(&glm::vec4(-1.0f,-2.0f,-3.0f,-4.0f)[0]);
__m128 mmMax = _mm_load_ps(&glm::vec4(1.0f,2.0f,3.0f,4.0f)[0]);
I want the result of the shuffle to look like this:
// {mmMin.x,mmMax.x,mmMin.x,mmMax.x}
But I see it is not possible to do with _mm_shuffle_ps.
From SSE docs I see _mm_shuffle_ps masks always
inserts into result 2 values from the lower 2 components of __m128 first,then 2 from the high 2 components.
SPU intrinsics have si_shufb method which allows defining qword based mask and shuffle whatever position I wish. Is there a similar method in SSE?
I am using SSE2, but will be happy also to see how it can be done with other versions, including AVX.

With only SSE2 I think you need 2 shuffles: unpcklps to interleave and then unpcklpd same,same or shufps same,same to broadcast the low 64 bits.
With AVX512F, vpermt2ps can do this in one shuffle (using a control vector); I don't think there are any 2-source shuffles in AVX2 or earlier with fine enough granularity and flexible source locations before that. And no fixed shuffles that duplicate an element along with interleaving.
2-source shuffles are rare until AVX512: mostly fixed shuffles like unpckl/h* and palignr. It's mostly just [v]shufps / [v]shufpd until then. Variable-control shuffles are also rare: until AVX, the only one is pshufb. AVX1/2 added some variable-control dword-element shuffles, but only for 1 source. There are no variable-control 2-source shuffles until AVX512.
Immediate shuffles would need more than 4 groups of 2-bit indices to handle arbitrary indexing into the concatenation of two 4-element vectors. But x86 SIMD instructions always have at most an 8-bit immediate operand. Unfortunately no broadcast-immediate like ARM has that could efficiently create a vector of 1.0f or whatever.
AVX
Since you only need 1 element from each vector, instead of loading a whole vector you can use an AVX broadcast-load and then vblendps
Broadcast-loads are the same cost as normal loads on Intel CPUs (don't cost you a uop for the shuffle port, purely handled in the load port). They can't fold into memory operands for ALU instructions until AVX512F, but they do avoid shuffle-port bottlenecks. AMD CPUs may still need an ALU uop but they have more shuffle ALUs so shuffle throughput isn't a bottleneck nearly as much. (https://agner.org/optimize/)
Ryzen vbroadcastss xmm, [mem] is 2 separate uops for the front-end unfortunately, but it still has 2-per-clock throughput.
blend-immediate on dword and later elements is very efficient and can run on any port on Haswell and later, or 2 ports on SnB/IvB and Ryzen. But still single uop / 1c latency even on Nehalem.
#include <immintrin.h>
__m128 broadcast_interleave_scalars_avx(const float *min, const float *max) {
__m128 minx = _mm_broadcast_ss(min);
__m128 maxx = _mm_broadcast_ss(max);
return _mm_blend_ps(minx, maxx, 0b1010);
}
On Godbolt, clang's asm comments confirm that I got the blend constant right:
vbroadcastss xmm0, dword ptr [rdi]
vbroadcastss xmm1, dword ptr [rsi]
vblendps xmm0, xmm0, xmm1, 10 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
If your data was already in registers, not freshly loaded, you might want to just use 2 shuffles.
With SSE4.1 you might be able to do 2x movddup loads to broadcast 64 bits from memory (including the 32 bits you care about) then blendps. The first load will load 32 bits past the float you care about, the 2nd will load 32 bits before the float you care about.
To get a C++ compiler to emit this for you you'll have to pointer-cast to double* for the __m128d _mm_loaddup_pd (double const* mem_addr) loads, and then use _mm_castpd_ps to get __m128 from __m128d.
https://www.felixcloutier.com/x86/movsldup could also be useful to set up for unpcklps.

Related

Reinterpret casting from __m256i to __m256 [duplicate]

Why does _mm_extract_ps return an int instead of a float?
What's the proper way to read a single float from an XMM register in C?
Or rather, a different way to ask it is: What's the opposite of the _mm_set_ps instruction?
None of the answers appear to actually answer the question, why does it return int.
The reason is, the extractps instruction actually copies a component of the vector to a general register. It does seem pretty silly for it to return an int but that's what's actually happening - the raw floating point value ends up in a general register (which hold integers).
If your compiler is configured to generate SSE for all floating point operations, then the closest thing to "extracting" a value to a register would be to shuffle the value into the low component of the vector, then cast it to a scalar float. This should cause that component of the vector to remain in an SSE register:
/* returns the second component of the vector */
float foo(__m128 b)
{
return _mm_cvtss_f32(_mm_shuffle_ps(b, b, _MM_SHUFFLE(0, 0, 0, 2)));
}
The _mm_cvtss_f32 intrinsic is free, it does not generate instructions, it only makes the compiler reinterpret the xmm register as a float so it can be returned as such.
The _mm_shuffle_ps gets the desired value into the lowest component. The _MM_SHUFFLE macro generates an immediate operand for the resulting shufps instruction.
The 2 in the example gets the float from bit 95:64 of the 127:0 register (the 3rd 32 bit component from the beginning, in memory order) and places it in the 31:0 component of the register (the beginning, in memory order).
The resulting generated code will most likely return the value naturally in a register, like any other floating point value return, with no inefficient writing out to memory and reading it back.
If you're generating code that uses the x87 FPU for floating point (for normal C code that isn't SSE optimized), this would probably result in inefficient code being generated - the compiler would probably store out the component of the SSE vector then use fld to read it back into the x87 register stack. In general 64-bit platforms don't use x87 (they use SSE for all floating point, mostly scalar instructions unless the compiler is vectorizing).
I should add that I always use C++, so I'm not sure whether it is more efficient to pass __m128 by value or by pointer in C. In C++ I would use a const __m128 & and this kind of code would be in a header, so the compiler can inline.
Confusingly, int _mm_extract_ps() is not for getting a scalar float element from a vector. The intrinsic doesn't expose the memory-destination form of the instruction (which can be useful for that purpose). This is not the only case where the intrinsics can't directly express everything an instruction is useful for. :(
gcc and clang know how the asm instruction works and will use it that way for you when compiling other shuffles; type-punning the _mm_extract_ps result to float usually results in horrible asm from gcc (extractps eax, xmm0, 2 / mov [mem], eax).
The name makes sense if you think of _mm_extract_ps as extracting an IEEE 754 binary32 float bit pattern out of the FP domain of the CPU into the integer domain (as a C scalar int), instead of manipulating FP bit patterns with integer vector ops. According to my testing with gcc, clang, and icc (see below), this is the only "portable" use-case where _mm_extract_ps compiles into good asm across all compilers. Anything else is just a compiler-specific hack to get the asm you want.
The corresponding asm instruction is EXTRACTPS r/m32, xmm, imm8. Notice that the destination can be memory or an integer register, but not another XMM register. It's the FP equivalent of PEXTRD r/m32, xmm, imm8 (also in SSE4.1), where the integer-register-destination form is more obviously useful. EXTRACTPS is not the reverse of INSERTPS xmm1, xmm2/m32, imm8.
Perhaps this similarity with PEXTRD makes the internal implementation simpler without hurting the extract-to-memory use-case (for asm, not intrinsics), or maybe the SSE4.1 designers at Intel thought it was actually more useful this way than as a non-destructive FP-domain copy-and-shuffle (which x86 seriously lacks without AVX). There are FP-vector instructions that have an XMM source and a memory-or-xmm destination, like MOVSS xmm2/m32, xmm, so this kind of instruction would not be new. Fun fact: the opcodes for PEXTRD and EXTRACTPS differ only in the last bit.
In assembly, a scalar float is just the low element of an XMM register (or 4 bytes in memory). The upper elements of the XMM don't even have to be zeroed for instructions like ADDSS to work without raising any extra FP exceptions. In calling conventions that pass/return FP args in XMM registers (e.g. all the usual x86-64 ABIs), float foo(float a) must assume that the upper elements of XMM0 hold garbage on entry, but can leave garbage in the high elements of XMM0 on return. (More info).
As #doug points out, other shuffle instructions can be used to get a float element of a vector into the bottom of an xmm register. This was already a mostly-solved problem in SSE1/SSE2, and it seems EXTRACTPS and INSERTPS weren't trying to solve it for register operands.
SSE4.1 INSERTPS xmm1, xmm2/m32, imm8 is one of the best ways for compilers to implement _mm_set_ss(function_arg) when the scalar float is already in a register and they can't/don't optimize away zeroing the upper elements. (Which is most of the time for compilers other than clang). That linked question also further discusses the failure of intrinsics to expose the load or store versions of instructions like EXTRACTPS, INSERTPS, and PMOVZX that have a memory operand narrower than 128b (thus not requiring alignment even without AVX). It can be impossible to write safe code that compiles as efficiently as what you can do in asm.
Without AVX 3-operand SHUFPS, x86 doesn't provide a fully efficient and general-purpose way to copy-and-shuffle an FP vector the way integer PSHUFD can. SHUFPS is a different beast unless used in-place with src=dst. Preserving the original requires a MOVAPS, which costs a uop and latency on CPUs before IvyBridge, and always costs code-size. Using PSHUFD between FP instructions costs latency (bypass delays). (See this horizontal-sum answer for some tricks, like using SSE3 MOVSHDUP).
SSE4.1 INSERTPS can extract one element into a separate register, but AFAIK it still has a dependency on the previous value of the destination even when all the original values are replaced. False dependencies like that are bad for out-of-order execution. xor-zeroing a register as a destination for INSERTPS would still be 2 uops, and have lower latency than MOVAPS+SHUFPS on SSE4.1 CPUs without mov-elimination for zero-latency MOVAPS (only Penryn, Nehalem, Sandybridge. Also Silvermont if you include low-power CPUs). The code-size is slightly worse, though.
Using _mm_extract_ps and then type-punning the result back to float (as suggested in the currently-accepted answer and its comments) is a bad idea. It's easy for your code to compile to something horrible (like EXTRACTPS to memory and then load back into an XMM register) on either gcc or icc. Clang seems to be immune to braindead behaviour and does its usual shuffle-compiling with its own choice of shuffle instructions (including appropriate use of EXTRACTPS).
I tried these examples with gcc5.4 -O3 -msse4.1 -mtune=haswell, clang3.8.1, and icc17, on the Godbolt compiler explorer. I used C mode, not C++, but union-based type punning is allowed in GNU C++ as an extension to ISO C++. Pointer-casting for type-punning violates strict aliasing in C99 and C++, even with GNU extensions.
#include <immintrin.h>
// gcc:bad clang:good icc:good
void extr_unsafe_ptrcast(__m128 v, float *p) {
// violates strict aliasing
*(int*)p = _mm_extract_ps(v, 2);
}
gcc: # others extractps with a memory dest
extractps eax, xmm0, 2
mov DWORD PTR [rdi], eax
ret
// gcc:good clang:good icc:bad
void extr_pun(__m128 v, float *p) {
// union type punning is safe in C99 (and GNU C and GNU C++)
union floatpun { int i; float f; } fp;
fp.i = _mm_extract_ps(v, 2);
*p = fp.f; // compiles to an extractps straight to memory
}
icc:
vextractps eax, xmm0, 2
mov DWORD PTR [rdi], eax
ret
// gcc:good clang:good icc:horrible
void extr_gnu(__m128 v, float *p) {
// gcc uses extractps with a memory dest, icc does extr_store
*p = v[2];
}
gcc/clang:
extractps DWORD PTR [rdi], xmm0, 2
icc:
vmovups XMMWORD PTR [-24+rsp], xmm0
mov eax, DWORD PTR [-16+rsp] # reload from red-zone tmp buffer
mov DWORD PTR [rdi], eax
// gcc:good clang:good icc:poor
void extr_shuf(__m128 v, float *p) {
__m128 e2 = _mm_shuffle_ps(v,v, 2);
*p = _mm_cvtss_f32(e2); // gcc uses extractps
}
icc: (others: extractps right to memory)
vshufps xmm1, xmm0, xmm0, 2
vmovss DWORD PTR [rdi], xmm1
When you want the final result in an xmm register, it's up to the compiler to optimize away your extractps and do something completely different. Gcc and clang both succeed, but ICC doesn't.
// gcc:good clang:good icc:bad
float ret_pun(__m128 v) {
union floatpun { int i; float f; } fp;
fp.i = _mm_extract_ps(v, 2);
return fp.f;
}
gcc:
unpckhps xmm0, xmm0
clang:
shufpd xmm0, xmm0, 1
icc17:
vextractps DWORD PTR [-8+rsp], xmm0, 2
vmovss xmm0, DWORD PTR [-8+rsp]
Note that icc did poorly for extr_pun, too, so it doesn't like union-based type-punning for this.
The clear winner here is doing the shuffle "manually" with _mm_shuffle_ps(v,v, 2), and using _mm_cvtss_f32. We got optimal code from every compiler for both register and memory destinations, except for ICC which failed to use EXTRACTPS for the memory-dest case. With AVX, SHUFPS + separate store is still only 2 uops on Intel CPUs, just larger code size and needs a tmp register. Without AVX, though, it would cost a MOVAPS to not destroy the original vector :/
According to Agner Fog's instruction tables, all Intel CPUs except Nehalem implement the register-destination versions of both PEXTRD and EXTRACTPS with multiple uops: Usually just a shuffle uop + a MOVD uop to move data from the vector domain to gp-integer. Nehalem register-destination EXTRACTPS is 1 uop for port 5, with 1+2 cycle latency (1 + bypass delay).
I have no idea why they managed to implement EXTRACTPS as a single uop but not PEXTRD (which is 2 uops, and runs in 2+1 cycle latency). Nehalem MOVD is 1 uop (and runs on any ALU port), with 1+1 cycle latency. (The +1 is for the bypass delay between vec-int and general-purpose integer regs, I think).
Nehalem cares a lot of about vector FP vs. integer domains; SnB-family CPUs have smaller (sometimes zero) bypass delay latencies between domains.
The memory-dest versions of PEXTRD and EXTRACTPS are both 2 uops on Nehalem.
On Broadwell and later, memory-destination EXTRACTPS and PEXTRD are 2 uops, but on Sandybridge through Haswell, memory-destination EXTRACTPS is 3 uops. Memory-destination PEXTRD is 2 uops on everything except Sandybridge, where it's 3. This seems odd, and Agner Fog's tables do sometimes have errors, but it's possible. Micro-fusion doesn't work with some instructions on some microarchitectures.
If either instruction had turned out to be extremely useful for anything important (e.g. inside inner loops), CPU designers would build execution units that could do the whole thing as one uop (or maybe 2 for the memory-dest). But that potentially requires more bits in the internal uop format (which Sandybridge simplified).
Fun fact: _mm_extract_epi32(vec, 0) compiles (on most compilers) to movd eax, xmm0 which is shorter and faster than pextrd eax, xmm0, 0.
Interestingly, they perform differently on Nehalem (which cares a lot of about vector FP vs. integer domains, and came out soon after SSE4.1 was introduced in Penryn (45nm Core2)). EXTRACTPS with a register destination is 1 uop, with 1+2 cycle latency (the +2 from a bypass delay between FP and integer domain). PEXTRD is 2 uops, and runs in 2+1 cycle latency.
From the MSDN docs, I believe you can cast the result to a float.
Note from their example, the 0xc0a40000 value is equivalent to -5.125 (a.m128_f32[1]).
Update: I strongly recommend the answers from #doug65536 and #PeterCordes (below) in lieu of mine, which apparently generates poorly performing code on many compilers.
Try _mm_storeu_ps, or any of the variations of SSE store operations.

Intel store instructions on delibrately overlapping memory regions

I have to store the lower 3 doubles in YMM register into an unaligned double array of size 3 (that is, cannot write the 4th element). But being a bit naughty, I'm wondering if the AVX intrinsic _mm256_storeu2_m128d can do the trick. I had
reg = _mm256_permute4x64_pd(reg, 0b10010100); // [0 1 1 2]
_mm256_storeu2_m128d(vec, vec + 1, reg);
and compiling by clang gives
vmovupd xmmword ptr [rsi + 8], xmm1 # reg in ymm1 after perm
vextractf128 xmmword ptr [rsi], ymm0, 1
If storeu2 had semantics like memcpy then it most definitely triggers undefined behavior. But with the generated instructions, would this be free of race conditions (or other potential problems)?
Other ways to store YMM into size 3 arrays are welcomed as well.
There isn't really a formal spec for Intel's intrinsics, AFAIK, other than what Intel has published as documentation. e.g. their intrinsics guide. Also examples from their whitepapers and so on; e.g. examples that need to work are one way GCC/clang know they have to define __m128 with __attribute__((may_alias)).
It's all within one thread, fully synchronous, so definitely no "race condition". In your case it doesn't even matter which order the stores happen in (assuming they don't overlap with the __m256d reg object itself! That would be the equivalent of an overlapping memcpy problem.) What you're doing might be like two indeterminately sequenced memcpy to overlapping destinations: they definitely happen in one order or the other, and the compiler could pick either.
The observable difference for order of stores is performance: if you want to do a SIMD reload very soon after, then store forwarding will work better if the 16-byte reload takes its data from one 16-byte store, not the overlap of two stores.
In general overlapping stores are fine for performance, though; the store buffer will absorb them. It means one of them is unaligned, though, and crossing a cache-line boundary would be more expensive.
However, that's all moot: Intel's intrinsics guide does list an "operation" section for that compound intrinsic:
Operation
MEM[loaddr+127:loaddr] := a[127:0]
MEM[hiaddr+127:hiaddr] := a[255:128]
So it's strictly defined as low address store first (the second arg; I think you got this backwards).
And all of that is also moot because there's a more efficient way
Your way costs 1 lane-crossing shuffle + vmovups + vextractf128 [mem], ymm, 1. Depending on how it compiles, neither store can start until after the shuffle. (Although it looks like clang might have avoided that problem).
On Intel CPUs, vextractf128 [mem], ymm, imm costs 2 uops for the front-end, not micro-fused into one. (Also 2 uops on Zen for some reason.)
On AMD CPUs before Zen 2, lane-crossing shuffles are more than 1 uop, so _mm256_permute4x64_pd is more expensive than necessary.
You just want to store the low lane of the input vector, and the low element of the high lane. The cheapest shuffle is vextractf128 xmm, ymm, 1 - 1 uop / 1c latency on Zen (which splits YMM vectors into two 128-bit halves anyway). It's as cheap as any other lane-crossing shuffle on Intel.
The asm you want the compiler to make is probably this, which only requires AVX1. AVX2 doesn't have any useful instructions for this.
vextractf128 xmm1, ymm0, 1 ; single uop everywhere
vmovupd [rdi], xmm0 ; single uop everywhere
vmovsd [rdi+2*8], xmm1 ; single uop everywhere
So you want something like this, which should compile efficiently.
_mm_store_pd(vec, _mm256_castpd256_pd128(reg)); // low half
__m128d hi = _mm256_extractf128_pd(reg, 1);
_mm_store_sd(vec+2, hi);
// or vec[2] = _mm_cvtsd_f64(hi);
vmovlps (_mm_storel_pi) would also work, but with AVX VEX encoding it doesn't save any code size, and would require even more casting to keep compilers happy.
There's unfortunately no vpextractq [mem], ymm, only with an XMM source so that doesn't help.
Masked store:
As discussed in comments, yes you could do vmaskmovps but it's unfortunately not as efficient as we might like on all CPUs. Until AVX512 makes masked loads/stores first-class citizens, it may be best to shuffle and do 2 stores. Or pad your array / struct so you can at least temporarily step on later stuff.
Zen has 2-uop vmaskmovpd ymm loads, but very expensive vmaskmovpd stores (42 uops, 1 per 11 cycles for YMM). Or Zen+ and Zen2 are 18 or 19 uops, 6 cycle throughput. If you care at all about Zen, avoid vmaskmov.
On Intel Broadwell and earlier, vmaskmov stores are 4 uops according to Agner's Fog's testing, so that's 1 more fused-domain uop than we get from shuffle + movups + movsd. But still, Haswell and later do manage 1/clock throughput so if that's a bottleneck then it beats the 2-cycle throughput of 2 stores. SnB/IvB of course take 2 cycles for a 256-bit store, even without masking.
On Skylake, vmaskmov mem, ymm, ymm is only 3 uops (Agner Fog lists 4, but his spreadsheets are hand-edited and have been wrong before. I think it's safe to assume uops.info's automated testing is right. And that makes sense; Skylake-client is basically the same core as Skylake-AVX512, just without actually enabling AVX512. So they could implement vmaskmovpd by decoding it into test into a mask register (1 uop) + masked store (2 more uops without micro-fusion).
So if you only care about Skylake and later, and can amortize the cost of loading a mask into a vector register (reusable for loads and stores), vmaskmovpd is actually pretty good. Same front-end cost but cheaper in the back-end: only 1 each store-address and store-data uops, instead of 2 separate stores. Note the 1/clock throughput on Haswell and later vs. the 2-cycle throughput for doing 2 separate stores.
vmaskmovpd might even store-forward efficiently to a masked reload; I think Intel mentioned something about this in their optimization manual.

Efficient (on Ryzen) way to extract the odd elements of a __m256 into a __m128?

Is there an intrinsic or another efficient way for repacking high/low 32-bit components of 64-bit components of AVX register into an SSE register? A solution using AVX2 is ok.
So far I'm using the following code, but profiler says it's slow on Ryzen 1800X:
// Global constant
const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
// ...
// function code
__m256i x = /* computed here */;
const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(x),
gHigh32Permute); // This seems to take 3 cycles
On Intel, your code would be optimal. One 1-uop instruction is the best you will get. (Except you might want to use vpermps to avoid any risk for int / FP bypass delay, if your input vector was created by a pd instruction rather than a load or something. Using the result of an FP shuffle as an input to integer instructions is usually fine on Intel, but I'm less sure about feeding the result of an FP instruction to an integer shuffle.)
Although if tuning for Intel, you might try changing the surrounding code so you can shuffle into the bottom 64-bits of each 128b lane, to avoid using a lane-crossing shuffle. (Then you could just use vshufps ymm, or if tuning for KNL, vpermilps since 2-input vshufps is slower.)
With AVX512, there's _mm256_cvtepi64_epi32 (vpmovqd) which packs elements across lanes, with truncation.
On Ryzen, lane-crossing shuffles are slow. Agner Fog doesn't have numbers for vpermd, but he lists vpermps (which probably uses the same hardware internally) at 3 uops, 5c latency, one per 4c throughput.
vextractf128 xmm, ymm, 1 is very efficient on Ryzen (1c latency, 0.33c throughput), not surprising since it tracks 256b registers as two 128b halves already. shufps is also efficient (1c latency, 0.5c throughput), and will let you shuffle the two 128b registers into the result you want.
This also saves you 2 registers for the 2 vpermps shuffle masks you don't need anymore.
So I'd suggest:
__m256d x = /* computed here */;
// Tuned for Ryzen. Sub-optimal on Intel
__m128 hi = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));
__m128 lo = _mm_castpd_ps(_mm256_castpd256_pd128(x));
__m128 odd = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3,1,3,1));
__m128 even = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2,0,2,0));
On Intel, using 3 shuffles instead of 2 gives you 2/3rds of the optimal throughput, with 1c extra latency for the first result.

Fallback implementation for conflict detection in AVX2

AVX512CD contains the intrinsic _mm512_conflict_epi32(__m512i a) it returns a vector where for every element in a a bit is set if it has the same value. Is there a way to do something similar in AVX2?
I'm not interested in the extact bits I just need to know which elements are duplicates of the elements to their left (or right). I simply need to know if a scatter would conflict.
Basically I need an AVX2 equivalent for
__mm256i detect_conflict(__mm256i a) {
__mm256i cd = _mm256_conflict_epi32(a);
return _mm256_cmpgt_epi32(cd, _mm256_set1_epi32(0));
}
The only way I could think of is to use _mm256_permutevar8x32_epi32() shift each value right by 1 (across the lanes) and than do seven compares, mask out the unsed bits and than _mm256_or_si256() them together which is horribly slow.
TL:DR: Since full detection of which elements conflict is expensive, it's probably worth doing more fall-back work in exchange for cheaper detection. This depends on your conflict-handling options / strategies.
I came up with a fairly efficient way check for presence/absence of conflicts without finding their locations, like this answer for 64-bit integer elements. It's actually faster than Skylake-AVX512's micro-coded vpconflictd ymm, but of course it gives you much less information. (KNL has fast vpconflictd).
You could use a fully-scalar fallback for all the elements if there are any conflicts. This would work well if conflicts are rare enough that branch-mispredicts don't kill performance. (AVX2 doesn't have scatter instructions in the first place, though, so I'm not sure exactly what you need this for.)
The only-left or only-right behaviour is hard, but my method can give you a mask of which elements have conflicts with any other element (e.g. v[0] == v[3] would result in both conflict[0] and conflict[3] being true). This costs only 1 extra shuffle, or maybe 0 with a redesign with this goal in mind.
(I misread the question at first; I thought you wanted to check both directions, rather than talking about two different implementation options for most of what vpconflictd does. Actually at first I thought you just wanted a presence/absence check, like bool any_conflicts(__m256i).)
Finding presence/absence of any conflicts: bool any_conflicts32(__m256i)
8 choose 2 is 28 total scalar comparisons. That's 3.5 vectors of packed comparisons. We should aim to do it with 4 vector compares, which leaves room for some redundancy.
Creating inputs for those compares will require shuffles, and some of those will have to be lane-crossing. 4 unique comparisons require at least 4 vectors (including the initial unshuffled copy), since 3 choose 2 is only 3.
Ideally as few as possible of the shuffles are lane-crossing, and there is lots of ILP for the compares and ORing of compare results. Also nice if the shuffles don't need a vector shuffle-control, just an imm8. Also good if they're not slow on AMD Ryzen, where 256b instructions are decoded into multiple 128b uops. (Some shuffles are worse than others for this, e.g. vperm2i128 is very bad; much worse than vpermq for swapping the high and low halves of a single vector. Unfortunately clang gets this wrong even with -mtune=znver1, and compiles _mm256_permute4x64_epi64 into vperm2i128 whenever it can).
I found a solution pretty early that achieves most of these goals: 3 shuffles, 4 compares. One of the shuffles is in-lane. All of them use an immediate control byte instead of a vector.
// returns a 0 or non-zero truth value
int any_conflicts32(__m256i v)
{
__m256i hilo = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2)); // vpermq is much more efficient than vperm2i128 on Ryzen and KNL, same on HSW/SKL.
__m256i inlane_rotr1 = _mm256_shuffle_epi32(v, _MM_SHUFFLE(0,3,2,1));
__m256i full_rotl2 = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(2,1,0,3));
__m256i v_ir1 = _mm256_cmpeq_epi32(v, inlane_rotr1);
__m256i v_hilo= _mm256_cmpeq_epi32(v, hilo); // only really needs to be a 128b operation on the low lane, with leaving the upper lane zero.
// But there's no ideal way to express that with intrinsics, since _mm256_castsi128_si256 technically leaves the high lane undefined
// It's extremely likely that casting down and back up would always compile to correct code, though (using the result in a zero-extended register).
__m256i hilo_ir1 = _mm256_cmpeq_epi32(hilo, inlane_rotr1);
__m256i v_fl2 = _mm256_cmpeq_epi32(v, full_rotl2);
__m256i t1 = _mm256_or_si256(v_ir1, v_hilo);
__m256i t2 = _mm256_or_si256(t1, v_fl2);
__m256i conflicts = _mm256_or_si256(t2, hilo_ir1); // A serial dep chain instead of a tree is probably good because of resource conflicts from limited shuffle throughput
// if you're going to branch on this, movemask/test/jcc is more efficient than ptest/jcc
unsigned conflict_bitmap = _mm256_movemask_epi8(conflicts); // With these shuffles, positions in the bitmap aren't actually meaningful
return (bool)conflict_bitmap;
return conflict_bitmap;
}
How I designed this:
I made a table of all the element-pairs that needed to be checked, and made columns for which shuffled operands could take care of that requirement.
I started with a few shuffles that could be done cheaply, and it turned out my early guesses worked well enough.
My design notes:
// 7 6 5 4 | 3 2 1 0
// h g f e | d c b a
// e h g f | a d c b // inlanerotr1 = vpshufd(v)
// f e d c | b a h g // fullrotl2 = vpermq(v)
// d c b a | h g f e // hilo = vperm2i128(v) or vpermq. v:hilo has lots of redundancy. The low half has all the information.
v:lrot1 v:frotr2 lrotr1:frotl2 (incomplete)
* ab [0]v:lrotr1 [3]lr1:fl2
* ac [2]v:frotl2
* ad [3]v:lrotr1 [2]lr1:fl2
* ae [0,4]v:hilo
* af [4]hilo:lrotr1
* ag [0]v:frotl2
* ah [3]hilo:lrotr1
* bc [1]v:lrotr1
* bd [3]v:frotl2 [5]hilo:frotl2
* be [0]hilo:lrotr1
* bf [1,5]v:hilo
* bg [0]lr1:fl2 [5]hilo:lrotr1
* bh [1]v:frotl2
* cd [2]v:lrotr1
* ce [4]v:frotl2 [4]lr1:fl2
* cf [1]hilo:lrotr1
* cg [2,6]v:hilo
* ch [1]lr1:fl2 [6]hilo:lrotr1
* de [7]hilo:lrotr1
* df [5]v:frotl2 [7]hilo:frotl2
* dg [5]lr1:fl2 [2]hilo:lrotr1
* dh [3,7]v:hilo
* ef [4]v:lrotr1 [7]lr1:fl2
* eg [6]v:frotl2
* eh [7]v:lrotr1 [6]lr1:fl2
* fg [5]v:lrotr1
* fh [7]v:frotl2
* gh [6]v:lrotr1
*/
It turns out that in-lane rotr1 == full rotl2 has a lot of redundancy, so it's not worth using. It also turns out that having all the allowed redundancy in v==hilo works fine.
If you care about which result is in which element (rather than just checking for presence/absence),
then v == swap_hilo(lrotr1) could work instead of lrotr1 == hilo.
But we also need swap_hilo(v), so this would mean an extra shuffle.
We could instead shuffle after hilo==lrotr1, for better ILP.
Or maybe there's a different set of shuffles that gives us everything.
Maybe if we consider VPERMD with a vector shuffle-control...
Compiler asm output vs. optimal asm
gcc6.3 -O3 -march=haswell produces:
Haswell has one shuffle unit (on port5).
# assume ymm0 ready on cycle 0
vpermq ymm2, ymm0, 78 # hilo ready on cycle 3 (execution started on cycle 0)
vpshufd ymm3, ymm0, 57 # lrotr1 ready on cycle 2 (started on cycle 1)
vpermq ymm1, ymm0, 147 # frotl2 ready on cycle 5 (started on 2)
vpcmpeqd ymm4, ymm2, ymm0 # starts on 3, ready on 4
vpcmpeqd ymm1, ymm1, ymm0 # starts on 5, ready on 6
vpcmpeqd ymm2, ymm2, ymm3 # starts on 3, ready on 4
vpcmpeqd ymm0, ymm0, ymm3 # starts on 2, ready on 3
vpor ymm1, ymm1, ymm4 # starts on 6, ready on 7
vpor ymm0, ymm0, ymm2 # starts on 4, ready on 5
vpor ymm0, ymm1, ymm0 # starts on 7, ready on 8
# a different ordering of VPOR merging could have saved a cycle here. /scold gcc
vpmovmskb eax, ymm0
vzeroupper
ret
So the best-case latency is 8 cycles to have a single vector ready, given resource conflicts from other instructions in this sequence but assuming no conflicts with past instructions still in the pipeline. (Should have been 7 cycles, but gcc re-ordered the dependency structure of my intrinsics putting more stuff dependent on the compare of the last shuffle result.)
This is faster than Skylake-AVX512's vpconflictd ymm, which has 17c latency, one per 10c throughput. (Of course, that gives you much more information, and #harold's emulation of it takes many more instructions).
Fortunately gcc didn't re-order the shuffles and introduce a potential write-back conflict. (e.g. putting the vpshufd last would mean that dispatching the shuffle uops to port5 in oldest-first order would have the vpshufd ready in the same cycle as the first vpermq (1c latency vs. 3c).) gcc did this for one version of the code (where I compared the wrong variable), so it seems that gcc -mtune=haswell doesn't take this into account. (Maybe it's not a big deal, I haven't measured to see what the real effect on latency is. I know the scheduler is smart about picking uops from the Reservation Station to avoid actual write-back conflicts, but IDK how smart it is, i.e. whether it would run the vpshufd ahead of a later vpermq to avoid a write-back conflict, since it would have to look-ahead to even see the upcoming writeback conflict. More likely it would just delay the vpshufd for an extra cycle before dispatching it.)
Anyway, this is why I put _mm_shuffle_epi32 in the middle in the C source, where it makes things easy for OOO execution.
Clang 4.0 goes berserk and packs each compare result down to 128b vectors (with vextracti128 / vpacksswb), then expands back to 256b after three vpor xmm before pmovmskb. I thought at first it was doing this because of -mtune=znver1, but it does it with -mtune=haswell as well. It does this even if we return a bool, which would let it just pmovmskb / test on the packed vector. /facepalm. It also pessimizes the hilo shuffle to vperm2i128, even with -mtune=znver1 (Ryzen), where vperm2i128 is 8 uops but vpermq is 3. (Agner Fog's insn tables for some reasons missed those, so I took those numbers from the FP equivalents vperm2f128 and vpermpd)
#harold says that using add instead of or stops clang from packing/unpacking, but vpaddd has lower throughput than vpor on Intel pre-Skylake.
Even better for Ryzen, the v == hilo compare can do only the low half. (i.e. use vpcmpeqd xmm2, xmm2, xmm3, which is only 1 uop instead of 2). We still need the full hilo for hilo == lrot1, though. So we can't just use vextracti128 xmm2, xmm0, 1 instead of the vpermq shuffle. vextracti128 has excellent performance on Ryzen: 1 uop, 1c latency, 0.33c throughput (can run on any of P0/1/3).
Since we're ORing everything together, it's fine to have zeros instead of redundant compare results in the high half.
As I noted in comments, IDK how to safely write this with intrinsics. The obvious way would be to use _mm256_castsi128_si256 (_mm_cmpeq_epi32(v, hilo)), but that technically leaves the high lane undefined, rather than zero. There's no sane way a compiler would do anything other than use the full-width ymm register that contains the xmm register with the 128b compare result, but it would be legal according to Intel's docs for a Deathstation-9000 compiler to put garbage there. Any explicit way of getting zeros in the high half would depend on the compiler optimizing it away. Maybe _mm256_setr_si128(cmpresult, _mm_setzero_si128());.
There are no current CPUs with AVX512F but not AVX512CD. But if that combo is interesting or relevant, clang makes some interesting asm from my code with -mavx512f -mavx512vl. It uses EVEX vpcmpeqd into mask registers, and korw to merge them. But then it expands that back into a vector to set up for vpmovmaskb, instead of just optimizing away the movemask and using the korw result. /facepalm.

Sorting 64-bit structs using AVX?

I have a 64-bit struct which represents several pieces of data, one of which is a floating point value:
struct MyStruct{
uint16_t a;
uint16_t b;
float f;
};
and I have four of these structs in, lets say an std::array<MyStruct, 4>
is it possible to use AVX to sort the array, in terms of the float member MyStruct::f?
Sorry this answer is messy; it didn't all get written at once and I'm lazy. There is some duplication.
I have 4 separate ideas:
Normal sorting, but moving the struct as a 64bit unit
Vectorized insertion-sort as a building block for qsort
Sorting networks, with a comparator implementation using cmpps / blendvpd instead of minps/maxps. The extra overhead might kill the speedup, though.
Sorting networks: load some structs, then shuffle/blend to get some registers of just floats and some registers of just payload. Use Timothy Furtak's technique of doing a normal minps/maxps comparator and then cmpeqps min,orig -> masked xor-swap on the payload. This sorts twice as much data per comparator, but does require matching shuffles on two registers between comparators. Also requires re-interleaving when you're done (but that's easy with unpcklps / unpckhps, if you arrange your comparators so those in-lane unpacks will put the final data in the right order).
This also avoids potential slowdowns that some CPUs may have when doing FP comparisons on bit patterns in the payload that represent denormals, NaNs, or infinities, without resorting to setting the denormals-are-zero bit in MXCSR.
Furtak's paper suggests doing a scalar cleanup after getting things mostly sorted with vectors, which would reduce the amount of shuffling a lot.
Normal sorting
There's at least a small speedup to be gained when using normal sorting algorithms, by moving the whole struct around with 64bit loads/stores, and doing a scalar FP compare on the FP element. For this idea to work as well as possible, order your struct with the float value first, then you could movq a whole struct into an xmm reg, and the float value would be in the low32 for ucomiss. Then you (or maybe a smart compiler) could store the struct with a movq.
Looking at the asm output that Kerrek SB linked to, compilers seem to do a rather bad job of efficiently copying structs around:
icc seems to movzx the two uint values separately, rather than scooping up the whole struct in a 64b load. Maybe it doesn't pack the struct? gcc 5.1 doesn't seem to have that problem most of the time.
Speeding up insertion-sort
Big sorts usually divide-and-conquer with insertion sort for small-enough problems. Insertion sort copies array elements over by one, stopping only when we find we've reached the spot where the current element belongs. So we need to compare one element to a sequence of packed elements, stopping if the comparison is true for any. Do you smell vectors? I smell vectors.
# RSI points to struct { float f; uint... payload; } buf[];
# RDI points to the next element to be inserted into the sorted portion
# [ rsi to rdi ) is sorted, the rest isn't.
##### PROOF OF CONCEPT: debug / finish writing before using! ######
.new_elem:
vbroadcastsd ymm0, [rdi] # broadcast the whole struct
mov rdx, rdi
.search_loop:
sub rdx, 32
vmovups ymm1, [rdx] # load some sorted data
vcmplt_oqps ymm2, ymm0, ymm1 # all-ones in any element where ymm0[i] < ymm1[i] (FP compare, false if either is NaN).
vmovups [rdx+8], ymm1 # shuffle it over to make space, usual insertion-sort style
cmp rdx, rsi
jbe .endsearch # below-or-equal (addresses are unsigned)
movmskps eax, ymm2
test al, 0b01010101 # test only the compare results for
jz .search_loop # [rdi] wasn't less than any of the 4 elements
.endsearch:
# TODO: scalar loop to find out where the new element goes.
# All we know is that it's less than one of the elements in ymm1, but not which
add rdi, 8
vmovsd [rdx], ymm0
cmp rdi, r8 # pointer to the end of the buf
jle .new_elem
# worse alternative to movmskps / test:
# vtestps ymm2, ymm7 # where ymm7 is loaded with 1s in the odd (float) elements, and 0s in the even (payload) elements.
# vtestps is like PTEST, but only tests the high bit. If the struct was in the other order, with the float high, vtestpd against a register of all-1s would work, as that's more convenient to generate.
This is certainly full of bugs, and I should have just written it in C with intrinsics.
This is an insertion sort with probably more overhead than most, that might lose to a scalar version for very small problem sizes, due to the extra complexity of handling the first few element (don't fill a vector), and of figuring out where to put the new element after breaking out of the vector search loop that checked multiple elements.
Probably pipelining the loop so we haven't stored ymm1 until the next iteration (or after breaking out) would save a redundant store. Doing the compares in registers by shifting / shuffling them, instead of literally doing scalar load/compares would probably be a win. This could end up with way too many unpredictable branches, and I'm not seeing a nice way to end up with the high 4 packed in a reg for vmovups, and the low one in another reg for vmovsd.
I may have invented an insertion sort that's the worst of both worlds: slow for small arrays because of more work after breaking out of the search loop, but it's still insertion sort: slow for large arrays because of O(n^2). However, if the code outside the searchloop can be made non-horrible, this could be a useful as the small-array endpoint for qsort / mergesort.
Anyway, if anyone does develop this idea into actual debugged and working code, let us know.
update: Timothy Furtak's paper describes an SSE implementation for sorting short arrays (for use as a building block for bigger sorts, like this insertion sort). He suggests producing a partially-ordered result with SSE, and then doing a cleanup with scalar ops. (insertion-sort on a mostly-sorted array is fast.)
Which leads us to:
Sorting Networks
There might not be any speedup here. Xiaochen, Rocki, and Suda only report a 3.7x speedup from scalar -> AVX-512 for 32bit (int) elements, for single-threaded mergesort, on a Xeon Phi card. With wider elements, fewer fit in a vector reg. (That's a factor of 4 for us: 64b elements in 256b, vs. 32b elements in 512b.) They also take advantage of AVX512 masks to only compare some lanes, a feature not available in AVX. Plus, with a slower comparator function that competes for the shuffle/blend unit, we're already in worse shape.
Sorting networks can be constructed using SSE/AVX packed-compare instructions. (More usually, with a pair of min/max instructions that effectively do a set of packed 2-element sorts.) Larger sorts can be built up out of an operation that does pairwise sorts. This paper by Tian Xiaochen, Kamil Rocki and Reiji Suda at U of Tokyo has some real AVX code for sorting (without payloads), and discussion of how it's tricky with vector registers because you can't compare two elements that are in the same register (so the sorting network has to be designed to not require that). They use pshufd to line up elements for the next comparison, to build up a larger sort out of sorting just a few registers full of data.
Now, the trick is to do a sort of pairs of packed 64b elements, based on the comparison of only half an element. (i.e. Keeping the payload with the sort key.) We could potentially sort other things this way, by sorting an array of (key, payload) pairs, where the payload can be an index or 32bit pointer (mmap(MAP_32bit), or x32 ABI).
So let's build ourselves a comparator. In sorting-network parlance, that's an operation that sorts a pair of inputs. So it either swaps an elements between registers, or not.
# AVX comparator for SnB/IvB
# struct { uint16_t a, b; float f; } inputs in ymm0, ymm1
# NOTE: struct order with f second saves a shuffle to extend the mask
vcmpps ymm7, ymm0, ymm1, _CMP_LT_OQ # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
# ymm7 32bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
# vblendvpd checks the high bit of the 64b element, so mask *doesn't* need to be extended to the low32
vblendvpd ymm2, ymm1, ymm0, ymm7
vblendvpd ymm3, ymm0, ymm1, ymm7
# result: !(ymm2[i] > ymm3[i]) (i.e. ymm2[i] < ymm3[i], or they're equal or unordered (NaN).)
# UNTESTED
You might need to set the MXCSR to make sure that int bits don't slow down your FP ops if they happen to represent a denormal or NaN float. I'm not sure if that happens only for mul/div, or if it would affect compare.
Intel Haswell: Latency: 5 cycles for ymm2 to be ready, 7 cycles for ymm3. Throughput: one per 4 cycles. (p5 bottleneck).
Intel Sandybridge/Ivybridge: Latency: 5 cycles for ymm2 to be ready, 6 cycles for ymm3. Throughput: one per 2 cycles. (p0/p5 bottleneck).
AMD Bulldozer/Piledriver: (vblendvpd ymm: 2c lat, 2c recip tput): lat: 4c for ymm2, 6c for ymm3. Or worse, with bypass delays between cmpps and blend. tput: one per 4c. (bottleneck on vector P1)
AMD Steamroller: (vblendvpd ymm: 2c lat, 1c recip tput): lat: 4c for ymm2, 5c for ymm3. or maybe 1 higher because of bypass delays. tput: one per 3c (bottleneck on vector ports P0/1, for cmp and blend).
VBLENDVPD is 2 uops. (It has 3 reg inputs, so it can't be 1 uop :/). Both uops can only run on shuffle ports. On Haswell, that's only port5. On SnB, that's p0/p5. (IDK why Haswell halved the shuffle / blend throughput compared to SnB/IvB.)
If AMD designs had 256b-wide vector units, their lower-latency FP compare and single-macro-op decoding of 3-input instructions would put them ahead.
The usual minps/maxps pair is 3 and 4 cycles latency (ymm2/3), and one per 2 cycles throughput (Intel). (p1 bottleneck on the FP add/sub/compare unit). The most fair comparison is probably to sorting 64bit doubles. The extra latency, may hurt if there aren't multiple pairs of independent registers to be compared. The halved throughput on Haswell will cut into any speedups pretty heavily.
Also keep in mind that shuffles are needed between comparator operations to get the right elements lined up for comparison. min/maxps leave the shuffle ports unused, but my cmpps/blendv version saturates them, meaning the shuffling can't overlap with comparing, except as something to fill gaps left by data dependencies.
With hyperthreading, another thread that can keep the other ports busy (e.g. port 0/1 fp mul/add units, or integer code) would share a core quite nicely with this blend-bottlenecked version.
I attempted another version for Haswell, which does the blends "manually" using bitwise AND/OR operations. It ended up slower, though, because both sources have to get masked both ways before combining.
# AVX2 comparator for Haswell
# struct { float f; uint16_t a, b; } inputs in ymm0, ymm1
#
vcmpps ymm7, ymm0, ymm1, _CMP_LT_OQ # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
# ymm7 32bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
vshufps ymm7, ymm7, ymm7, mask(0, 0, 2, 2) # extend the mask to the payload part. There's no mask function, I just don't want to work out the result in my head.
vpand ymm10, ymm7, ymm0 # ymm10 = ymm0 keeping elements where ymm0[i] < ymm1[i]
vpandn ymm11, ymm7, ymm1 # ymm11 = ymm1 keeping elements where !(ymm0[i] < ymm1[i])
vpor ymm2, ymm10, ymm11 # ymm2 = min_packed_mystruct(ymm0, ymm1)
vpandn ymm10, ymm7, ymm0 # ymm10 = ymm0 keeping elements where !(ymm0[i] < ymm1[i])
vpand ymm11, ymm7, ymm1 # ymm11 = ymm1 keeping elements where ymm0[i] < ymm1[i]
vpor ymm3, ymm10, ymm11 # ymm2 = max_packed_mystruct(ymm0, ymm1)
# result: !(ymm2[i] > ymm3[i])
# UNTESTED
This is 8 uops, compared to 5 for the blendv version. There's a lot of parallelism in the last 6 and/andn/or instructions. cmpps has 3 cycle latency, though. I think ymm2 will be ready in 6 cycles, while ymm3 is ready in 7. (And can overlap with operations on ymm2). The insns following a comparator op will probably be shuffles, to put the data in the right elements for the next compare. There's no forwarding delay to/from the shuffle unit for integer-domain logicals, even for a vshufps, but the result should come out in the FP domain, ready for a vcmpps. Using vpand instead of vandps is essential for throughput.
Timothy Furtak's paper suggests an approach for sorting keys with a payload: don't pack the payload pointers with the keys, but instead generate a mask from the compare, and use it on both the keys and the payload the same way. This means you have to separate the payload from the keys either in your data structure, or every time you load a struct.
See the appendix of his paper (Fig. 12). He uses the standard min/max on the keys, and then uses cmpps to see which elements CHANGED. Then he ANDs that mask in the middle of an xor-swap to end up only swapping the payloads for the keys that swapped.
Unfortunately, original AVX has very limited shuffling across its 128-bit halves (i.e. lanes), so it is hard to sort contents of a full 256-bit register. However, AVX2 has shuffling operations without such limitations, so we can perform a sort of 4 structs in vectorized way.
I'll use the idea of this solution. In order to sort an array we have to do enough element comparisons to surely determine the permutation we need to apply. Given that no element is NaN, it is enough to check for each pair of different elements a and b whether a < b and whether a > b. Having this information, we can fully compare any two elements, which must be enough to determine final sorting order. This is 6 pairs of 32-bit elements and two comparison modes, so we can end up doing two shuffles and two comparisons in AVX. If you are absolutely sure that all the elements are distinct, then you can avoid a > b comparisons and reduce size of LUT.
For repacking of elements within register we can use _mm256_permutevar8x32_ps. One instruction allows to do arbitrary shuffle on 32-bit granularity. Note that in the code I assume that sorting key f is the first member of your struct (just as #PeterCordes proposed), but you can trivially use this solution for you current struct if you change shuffling mask accordingly.
After we perform the comparisons, we have a two AVX registers containing boolean results as 32-bit masks. The first six masks in each register are important, the last two are not. Then we want to convert these masks to a small integer in general-purpose register to be used as index in a lookup table. In general case we may have to create perfect hashing for it, but it is not necessary here. We can use _mm256_movemask_ps to get a 8-bit integer mask in general purpose register from AVX register. Since the last two masks per register are not important, we can ensure that they are always zero. Then the resulting index would be in range [0..2^12).
Finally, we load a shuffling mask from precomputed LUT with 4096 elements and pass it to _mm256_permutevar8x32_ps. As a result we obtain an AVX register with 4 properly sorted structs of your type. Precomputing the LUT is your home assignment =)
Here is the final code:
__m256i lut[4096]; //LUT of 128Kb size must be precomputed
__m256 Sort4(__m256 val) {
__m256 aaabbcaa = _mm256_permutevar8x32_ps(val, _mm256_setr_epi32(0, 0, 0, 2, 2, 4, 0, 0));
__m256 bcdcddaa = _mm256_permutevar8x32_ps(val, _mm256_setr_epi32(2, 4, 6, 4, 6, 6, 0, 0));
__m256 cmpLt = _mm256_cmp_ps(aaabbcaa, bcdcddaa, _CMP_LT_OQ);
__m256 cmpGt = _mm256_cmp_ps(aaabbcaa, bcdcddaa, _CMP_GT_OQ);
int idxLt = _mm256_movemask_ps(cmpLt);
int idxGt = _mm256_movemask_ps(cmpGt);
__m256i shuf = lut[idxGt * 64 + idxLt];
__m256 res = _mm256_permutevar8x32_ps(val, shuf);
return res;
}
Here you can see generated assembly. There are 14 instructions in total, 2 of them are for loading constant shuffling masks, and one of them is due to useless 32-bit->64-bit conversion of movemask results. So in a tight loop it would be 11-12 instructions. IACA says that four calls in a loop have 16.40 cycles throughput on Haswell, so it seems to achieve throughput 4.1 cycles per call.
Of course 128 Kb lookup table is too much unless you are going to process even more input data in one batch. It may be possible to reduce LUT size with adding perfect hashing (sacrificing speed of course). It is hard to say how much orderings are possible on four elements, but clearly less than 4! * 2^3 = 192. I think 256-element LUT is possible, maybe even 128-element LUT. With perfect hashing it may be faster to combine two AVX registers into one with shift and xor, then do _mm256_movemask_epi8 once (instead of doing two _mm256_movemask_ps and combining them afterwards).