Sorting 64-bit structs using AVX? - c++

I have a 64-bit struct which represents several pieces of data, one of which is a floating point value:
struct MyStruct{
uint16_t a;
uint16_t b;
float f;
};
and I have four of these structs in, let's say, an std::array<MyStruct, 4>.
Is it possible to use AVX to sort the array in terms of the float member MyStruct::f?

Sorry this answer is messy; it didn't all get written at once and I'm lazy. There is some duplication.
I have 4 separate ideas:
Normal sorting, but moving the struct as a 64bit unit
Vectorized insertion-sort as a building block for qsort
Sorting networks, with a comparator implementation using cmpps / blendvpd instead of minps/maxps. The extra overhead might kill the speedup, though.
Sorting networks: load some structs, then shuffle/blend to get some registers of just floats and some registers of just payload. Use Timothy Furtak's technique of doing a normal minps/maxps comparator and then cmpeqps min,orig -> masked xor-swap on the payload. This sorts twice as much data per comparator, but does require matching shuffles on two registers between comparators. Also requires re-interleaving when you're done (but that's easy with unpcklps / unpckhps, if you arrange your comparators so those in-lane unpacks will put the final data in the right order).
This also avoids potential slowdowns that some CPUs may have when doing FP comparisons on bit patterns in the payload that represent denormals, NaNs, or infinities, without resorting to setting the denormals-are-zero bit in MXCSR.
Furtak's paper suggests doing a scalar cleanup after getting things mostly sorted with vectors, which would reduce the amount of shuffling a lot.
Normal sorting
There's at least a small speedup to be gained when using normal sorting algorithms, by moving the whole struct around with 64bit loads/stores, and doing a scalar FP compare on the FP element. For this idea to work as well as possible, order your struct with the float value first, then you could movq a whole struct into an xmm reg, and the float value would be in the low32 for ucomiss. Then you (or maybe a smart compiler) could store the struct with a movq.
Looking at the asm output that Kerrek SB linked to, compilers seem to do a rather bad job of efficiently copying structs around:
icc seems to movzx the two uint values separately, rather than scooping up the whole struct in a 64b load. Maybe it doesn't pack the struct? gcc 5.1 doesn't seem to have that problem most of the time.
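For illustration, a minimal sketch of that key-first layout (my own example, not from the question; whether the compiler actually moves whole structs with single 64-bit movq copies is up to it, as the asm output discussed above shows):
#include <algorithm>
#include <array>
#include <cstdint>

struct MyStructKeyFirst {   // reordered: key in the low 32 bits
    float f;
    uint16_t a, b;
};

void sort4(std::array<MyStructKeyFirst, 4> &arr) {
    // whole 8-byte structs get copied around; the compare is a scalar
    // float compare (comiss/ucomiss) on the low dword
    std::sort(arr.begin(), arr.end(),
              [](const MyStructKeyFirst &x, const MyStructKeyFirst &y) {
                  return x.f < y.f;
              });
}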
Speeding up insertion-sort
Big sorts usually divide-and-conquer with insertion sort for small-enough problems. Insertion sort copies array elements over by one, stopping only when we find we've reached the spot where the current element belongs. So we need to compare one element to a sequence of packed elements, stopping if the comparison is true for any. Do you smell vectors? I smell vectors.
# RSI points to struct { float f; uint... payload; } buf[];
# RDI points to the next element to be inserted into the sorted portion
# [ rsi to rdi ) is sorted, the rest isn't.
##### PROOF OF CONCEPT: debug / finish writing before using! ######
.new_elem:
vbroadcastsd ymm0, [rdi] # broadcast the whole struct
mov rdx, rdi
.search_loop:
sub rdx, 32
vmovups ymm1, [rdx] # load some sorted data
vcmplt_oqps ymm2, ymm0, ymm1 # all-ones in any element where ymm0[i] < ymm1[i] (FP compare, false if either is NaN).
vmovups [rdx+8], ymm1 # shuffle it over to make space, usual insertion-sort style
cmp rdx, rsi
jbe .endsearch # below-or-equal (addresses are unsigned)
vmovmskps eax, ymm2
test al, 0b01010101 # test only the compare results for the float (key) elements
jz .search_loop # [rdi] wasn't less than any of the 4 elements
.endsearch:
# TODO: scalar loop to find out where the new element goes.
# All we know is that it's less than one of the elements in ymm1, but not which
add rdi, 8
vmovsd [rdx], xmm0 # vmovsd only takes an XMM source; the struct is the low 64 bits of the broadcast
cmp rdi, r8 # pointer to the end of the buf
jbe .new_elem # below-or-equal (addresses are unsigned)
# worse alternative to movmskps / test:
# vtestps ymm2, ymm7 # where ymm7 is loaded with 1s in the odd (float) elements, and 0s in the even (payload) elements.
# vtestps is like PTEST, but only tests the high bit. If the struct was in the other order, with the float high, vtestpd against a register of all-1s would work, as that's more convenient to generate.
This is certainly full of bugs, and I should have just written it in C with intrinsics.
This is an insertion sort with probably more overhead than most, that might lose to a scalar version for very small problem sizes, due to the extra complexity of handling the first few elements (which don't fill a vector), and of figuring out where to put the new element after breaking out of the vector search loop that checked multiple elements.
Probably pipelining the loop so we haven't stored ymm1 until the next iteration (or after breaking out) would save a redundant store. Doing the compares in registers by shifting / shuffling them, instead of literally doing scalar load/compares would probably be a win. This could end up with way too many unpredictable branches, and I'm not seeing a nice way to end up with the high 4 packed in a reg for vmovups, and the low one in another reg for vmovsd.
I may have invented an insertion sort that's the worst of both worlds: slow for small arrays because of more work after breaking out of the search loop, but still insertion sort: slow for large arrays because of O(n^2). However, if the code outside the search loop can be made non-horrible, this could be useful as the small-array endpoint for qsort / mergesort.
Anyway, if anyone does develop this idea into actual debugged and working code, let us know.
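In the meantime, here's a rough C-with-intrinsics sketch of just the search part (untested; it assumes the float-key-first layout and no NaNs, uses my own elem / insert_sorted names, and does the shifting with a plain memmove after finding the spot instead of fusing the store-shift into the search loop like the asm above):
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct elem { float f; uint16_t a, b; };    // 8 bytes, key in the low dword

static void insert_sorted(struct elem *buf, size_t sorted_len, struct elem newelem)
{
    // broadcast the whole 8-byte struct; only its key lanes get tested
    double bits;
    memcpy(&bits, &newelem, sizeof(bits));
    __m256 vnew = _mm256_castpd_ps(_mm256_set1_pd(bits));

    // scan the sorted region backwards, 4 structs (32 bytes) at a time
    size_t i = sorted_len;
    while (i >= 4) {
        __m256 chunk = _mm256_loadu_ps((const float *)(buf + i - 4));
        __m256 lt = _mm256_cmp_ps(vnew, chunk, _CMP_LT_OQ);
        int keymask = _mm256_movemask_ps(lt) & 0b01010101;  // key lanes only
        if (keymask != 0b01010101)   // not less than every key in this chunk
            break;
        i -= 4;
    }
    // scalar cleanup: pin down the exact spot, then shift and store
    while (i > 0 && newelem.f < buf[i - 1].f)
        --i;
    memmove(buf + i + 1, buf + i, (sorted_len - i) * sizeof(struct elem));
    buf[i] = newelem;
}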
update: Timothy Furtak's paper describes an SSE implementation for sorting short arrays (for use as a building block for bigger sorts, like this insertion sort). He suggests producing a partially-ordered result with SSE, and then doing a cleanup with scalar ops. (insertion-sort on a mostly-sorted array is fast.)
Which leads us to:
Sorting Networks
There might not be any speedup here. Xiaochen, Rocki, and Suda only report a 3.7x speedup from scalar -> AVX-512 for 32bit (int) elements, for single-threaded mergesort, on a Xeon Phi card. With wider elements, fewer fit in a vector reg. (That's a factor of 4 for us: 64b elements in 256b, vs. 32b elements in 512b.) They also take advantage of AVX512 masks to only compare some lanes, a feature not available in AVX. Plus, with a slower comparator function that competes for the shuffle/blend unit, we're already in worse shape.
Sorting networks can be constructed using SSE/AVX packed-compare instructions. (More usually, with a pair of min/max instructions that effectively do a set of packed 2-element sorts.) Larger sorts can be built up out of an operation that does pairwise sorts. This paper by Tian Xiaochen, Kamil Rocki and Reiji Suda at U of Tokyo has some real AVX code for sorting (without payloads), and discussion of how it's tricky with vector registers because you can't compare two elements that are in the same register (so the sorting network has to be designed to not require that). They use pshufd to line up elements for the next comparison, to build up a larger sort out of sorting just a few registers full of data.
Now, the trick is to do a sort of pairs of packed 64b elements, based on the comparison of only half an element. (i.e. keeping the payload with the sort key.) We could potentially sort other things this way, by sorting an array of (key, payload) pairs, where the payload can be an index or 32bit pointer (mmap(MAP_32BIT), or the x32 ABI).
So let's build ourselves a comparator. In sorting-network parlance, that's an operation that sorts a pair of inputs. So it either swaps a pair of elements between registers, or it doesn't.
# AVX comparator for SnB/IvB
# struct { uint16_t a, b; float f; } inputs in ymm0, ymm1
# NOTE: struct order with f second saves a shuffle to extend the mask
vcmpps ymm7, ymm0, ymm1, _CMP_LT_OQ # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
# ymm7 32bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
# vblendvpd checks the high bit of the 64b element, so mask *doesn't* need to be extended to the low32
vblendvpd ymm2, ymm1, ymm0, ymm7
vblendvpd ymm3, ymm0, ymm1, ymm7
# result: !(ymm2[i] > ymm3[i]) (i.e. ymm2[i] < ymm3[i], or they're equal or unordered (NaN).)
# UNTESTED
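For reference, the same comparator as intrinsics might look like this (untested sketch; cmp_blendv_comparator is my own name, and it assumes the f-second layout from the note above, so bit 63 of each qword is the key's compare result):
#include <immintrin.h>

// lo gets the packed min-structs, hi the packed max-structs
static inline void cmp_blendv_comparator(__m256 a, __m256 b, __m256 *lo, __m256 *hi)
{
    __m256 lt = _mm256_cmp_ps(a, b, _CMP_LT_OQ);   // per-dword compare mask
    __m256d m = _mm256_castps_pd(lt);              // vblendvpd only looks at bit 63
    *lo = _mm256_castpd_ps(_mm256_blendv_pd(_mm256_castps_pd(b), _mm256_castps_pd(a), m));
    *hi = _mm256_castpd_ps(_mm256_blendv_pd(_mm256_castps_pd(a), _mm256_castps_pd(b), m));
}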
You might need to set the MXCSR to make sure that int bits don't slow down your FP ops if they happen to represent a denormal or NaN float. I'm not sure if that happens only for mul/div, or if it would affect compare.
Intel Haswell: Latency: 5 cycles for ymm2 to be ready, 7 cycles for ymm3. Throughput: one per 4 cycles. (p5 bottleneck).
Intel Sandybridge/Ivybridge: Latency: 5 cycles for ymm2 to be ready, 6 cycles for ymm3. Throughput: one per 2 cycles. (p0/p5 bottleneck).
AMD Bulldozer/Piledriver: (vblendvpd ymm: 2c lat, 2c recip tput): lat: 4c for ymm2, 6c for ymm3. Or worse, with bypass delays between cmpps and blend. tput: one per 4c. (bottleneck on vector P1)
AMD Steamroller: (vblendvpd ymm: 2c lat, 1c recip tput): lat: 4c for ymm2, 5c for ymm3. or maybe 1 higher because of bypass delays. tput: one per 3c (bottleneck on vector ports P0/1, for cmp and blend).
VBLENDVPD is 2 uops. (It has 3 reg inputs, so it can't be 1 uop :/). Both uops can only run on shuffle ports. On Haswell, that's only port5. On SnB, that's p0/p5. (IDK why Haswell halved the shuffle / blend throughput compared to SnB/IvB.)
If AMD designs had 256b-wide vector units, their lower-latency FP compare and single-macro-op decoding of 3-input instructions would put them ahead.
The usual minps/maxps pair is 3 and 4 cycles latency (ymm2/3), and one per 2 cycles throughput (Intel). (p1 bottleneck on the FP add/sub/compare unit). The most fair comparison is probably to sorting 64bit doubles. The extra latency may hurt if there aren't multiple pairs of independent registers to be compared. The halved throughput on Haswell will cut into any speedups pretty heavily.
Also keep in mind that shuffles are needed between comparator operations to get the right elements lined up for comparison. min/maxps leave the shuffle ports unused, but my cmpps/blendv version saturates them, meaning the shuffling can't overlap with comparing, except as something to fill gaps left by data dependencies.
With hyperthreading, another thread that can keep the other ports busy (e.g. port 0/1 fp mul/add units, or integer code) would share a core quite nicely with this blend-bottlenecked version.
I attempted another version for Haswell, which does the blends "manually" using bitwise AND/OR operations. It ended up slower, though, because both sources have to get masked both ways before combining.
# AVX2 comparator for Haswell
# struct { float f; uint16_t a, b; } inputs in ymm0, ymm1
#
vcmpps ymm7, ymm0, ymm1, _CMP_LT_OQ # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
# ymm7 32bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
vshufps ymm7, ymm7, ymm7, mask(0, 0, 2, 2) # extend the mask to the payload part. (mask() isn't a real function; I just didn't want to work out the immediate in my head.)
vpand ymm10, ymm7, ymm0 # ymm10 = ymm0 keeping elements where ymm0[i] < ymm1[i]
vpandn ymm11, ymm7, ymm1 # ymm11 = ymm1 keeping elements where !(ymm0[i] < ymm1[i])
vpor ymm2, ymm10, ymm11 # ymm2 = min_packed_mystruct(ymm0, ymm1)
vpandn ymm10, ymm7, ymm0 # ymm10 = ymm0 keeping elements where !(ymm0[i] < ymm1[i])
vpand ymm11, ymm7, ymm1 # ymm11 = ymm1 keeping elements where ymm0[i] < ymm1[i]
vpor ymm3, ymm10, ymm11 # ymm3 = max_packed_mystruct(ymm0, ymm1)
# result: !(ymm2[i] > ymm3[i])
# UNTESTED
This is 8 uops, compared to 5 for the blendv version. There's a lot of parallelism in the last 6 and/andn/or instructions. cmpps has 3 cycle latency, though. I think ymm2 will be ready in 6 cycles, while ymm3 is ready in 7. (And can overlap with operations on ymm2). The insns following a comparator op will probably be shuffles, to put the data in the right elements for the next compare. There's no forwarding delay to/from the shuffle unit for integer-domain logicals, even for a vshufps, but the result should come out in the FP domain, ready for a vcmpps. Using vpand instead of vandps is essential for throughput.
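The same AND/ANDN/OR version expressed with intrinsics might look roughly like this (untested sketch, my own function name; it assumes the float-key-first layout from the comment block above):
#include <immintrin.h>

static inline void cmp_andor_comparator(__m256i a, __m256i b, __m256i *lo, __m256i *hi)
{
    __m256 lt = _mm256_cmp_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(b), _CMP_LT_OQ);
    // copy each key's compare result up into its payload dword (in-lane shuffle)
    __m256i mask = _mm256_castps_si256(_mm256_shuffle_ps(lt, lt, _MM_SHUFFLE(2, 2, 0, 0)));
    *lo = _mm256_or_si256(_mm256_and_si256(mask, a), _mm256_andnot_si256(mask, b));
    *hi = _mm256_or_si256(_mm256_andnot_si256(mask, a), _mm256_and_si256(mask, b));
}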
Timothy Furtak's paper suggests an approach for sorting keys with a payload: don't pack the payload pointers with the keys, but instead generate a mask from the compare, and use it on both the keys and the payload the same way. This means you have to separate the payload from the keys either in your data structure, or every time you load a struct.
See the appendix of his paper (Fig. 12). He uses the standard min/max on the keys, and then uses cmpps to see which elements CHANGED. Then he ANDs that mask in the middle of an xor-swap to end up only swapping the payloads for the keys that swapped.
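A rough intrinsics sketch of that technique (untested, my own names; keys and their dword payloads kept in separate registers, AVX2 for the integer booleans):
#include <immintrin.h>

static inline void minmax_with_payload(__m256 *ka, __m256 *kb, __m256i *pa, __m256i *pb)
{
    __m256 kmin = _mm256_min_ps(*ka, *kb);
    __m256 kmax = _mm256_max_ps(*ka, *kb);
    // elements where min == original ka did NOT swap
    __m256i unchanged = _mm256_castps_si256(_mm256_cmp_ps(kmin, *ka, _CMP_EQ_OQ));
    // masked xor-swap: only swap payloads where the keys actually swapped
    __m256i x = _mm256_andnot_si256(unchanged, _mm256_xor_si256(*pa, *pb));
    *pa = _mm256_xor_si256(*pa, x);
    *pb = _mm256_xor_si256(*pb, x);
    *ka = kmin;
    *kb = kmax;
}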

Unfortunately, original AVX has very limited shuffling across its 128-bit halves (i.e. lanes), so it is hard to sort the contents of a full 256-bit register. However, AVX2 has shuffling operations without such limitations, so we can perform a sort of 4 structs in a vectorized way.
I'll use the idea of this solution. In order to sort an array we have to do enough element comparisons to determine for certain the permutation we need to apply. Given that no element is NaN, it is enough to check, for each pair of different elements a and b, whether a < b and whether a > b. Having this information, we can fully compare any two elements, which must be enough to determine the final sorting order. This is 6 pairs of 32-bit elements and two comparison modes, so we end up doing two shuffles and two comparisons in AVX. If you are absolutely sure that all the elements are distinct, then you can avoid the a > b comparisons and reduce the size of the LUT.
For repacking of elements within a register we can use _mm256_permutevar8x32_ps. One instruction allows an arbitrary shuffle with 32-bit granularity. Note that in the code I assume that the sorting key f is the first member of your struct (just as @PeterCordes proposed), but you can trivially use this solution for your current struct if you change the shuffling masks accordingly.
After we perform the comparisons, we have two AVX registers containing boolean results as 32-bit masks. The first six masks in each register are important, the last two are not. Then we want to convert these masks to a small integer in a general-purpose register to be used as an index into a lookup table. In the general case we might have to create a perfect hash for it, but that is not necessary here. We can use _mm256_movemask_ps to get an 8-bit integer mask into a general-purpose register from an AVX register. Since the last two masks per register are not important, we can ensure that they are always zero. Then the resulting index is in the range [0..2^12).
Finally, we load a shuffling mask from a precomputed LUT with 4096 elements and pass it to _mm256_permutevar8x32_ps. As a result we obtain an AVX register with the 4 structs of your type properly sorted. Precomputing the LUT is your home assignment =)
Here is the final code:
__m256i lut[4096]; //LUT of 128Kb size must be precomputed
__m256 Sort4(__m256 val) {
__m256 aaabbcaa = _mm256_permutevar8x32_ps(val, _mm256_setr_epi32(0, 0, 0, 2, 2, 4, 0, 0));
__m256 bcdcddaa = _mm256_permutevar8x32_ps(val, _mm256_setr_epi32(2, 4, 6, 4, 6, 6, 0, 0));
__m256 cmpLt = _mm256_cmp_ps(aaabbcaa, bcdcddaa, _CMP_LT_OQ);
__m256 cmpGt = _mm256_cmp_ps(aaabbcaa, bcdcddaa, _CMP_GT_OQ);
int idxLt = _mm256_movemask_ps(cmpLt);
int idxGt = _mm256_movemask_ps(cmpGt);
__m256i shuf = lut[idxGt * 64 + idxLt];
__m256 res = _mm256_permutevar8x32_ps(val, shuf);
return res;
}
Here you can see the generated assembly. There are 14 instructions in total: 2 of them load the constant shuffling masks, and one of them is a useless 32-bit -> 64-bit conversion of a movemask result. So in a tight loop it would be 11-12 instructions. IACA says that four calls in a loop have 16.40 cycles throughput on Haswell, so it seems to achieve a throughput of about 4.1 cycles per call.
Of course a 128 KB lookup table is too much unless you are going to process even more input data in one batch. It may be possible to reduce the LUT size by adding perfect hashing (sacrificing some speed, of course). It is hard to say how many orderings are possible for four elements, but clearly fewer than 4! * 2^3 = 192. I think a 256-element LUT is possible, maybe even a 128-element LUT. With perfect hashing it may be faster to combine the two AVX registers into one with a shift and xor, then do _mm256_movemask_epi8 once (instead of doing two _mm256_movemask_ps and combining them afterwards).
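For what it's worth, one possible way to precompute the LUT (my own untested sketch, not part of the original answer): decode the pairwise comparison bits, sort the four struct indices with them, and store the resulting 8-dword permutation. Index combinations that can't arise from a real comparison just get entries that are never looked up.
#include <immintrin.h>
#include <utility>

extern __m256i lut[4096];   // the LUT declared above

void BuildLut() {
    // bit k of each movemask corresponds to the k-th compared pair, in the
    // order the shuffles above generate them: (a,b) (a,c) (a,d) (b,c) (b,d) (c,d)
    static const int pairs[6][2] = {{0,1},{0,2},{0,3},{1,2},{1,3},{2,3}};
    for (int idxGt = 0; idxGt < 64; ++idxGt)
    for (int idxLt = 0; idxLt < 64; ++idxLt) {
        auto less = [&](int x, int y) -> bool {
            for (int k = 0; k < 6; ++k) {
                if (pairs[k][0] == x && pairs[k][1] == y) return ((idxLt >> k) & 1) != 0;
                if (pairs[k][0] == y && pairs[k][1] == x) return ((idxGt >> k) & 1) != 0;
            }
            return false;   // comparing an element with itself
        };
        int idx[4] = {0, 1, 2, 3};
        for (int i = 1; i < 4; ++i)                     // stable insertion sort
            for (int j = i; j > 0 && less(idx[j], idx[j - 1]); --j)
                std::swap(idx[j], idx[j - 1]);
        int perm[8];
        for (int i = 0; i < 4; ++i) {                   // each struct is 2 dwords
            perm[2 * i]     = 2 * idx[i];
            perm[2 * i + 1] = 2 * idx[i] + 1;
        }
        lut[idxGt * 64 + idxLt] = _mm256_loadu_si256((const __m256i *)perm);
    }
}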

Related

Intel store instructions on deliberately overlapping memory regions

I have to store the lower 3 doubles of a YMM register into an unaligned double array of size 3 (that is, I cannot write the 4th element). But being a bit naughty, I'm wondering if the AVX intrinsic _mm256_storeu2_m128d can do the trick. I had
reg = _mm256_permute4x64_pd(reg, 0b10010100); // [0 1 1 2]
_mm256_storeu2_m128d(vec, vec + 1, reg);
and compiling by clang gives
vmovupd xmmword ptr [rsi + 8], xmm1 # reg in ymm1 after perm
vextractf128 xmmword ptr [rsi], ymm0, 1
If storeu2 had semantics like memcpy then it most definitely triggers undefined behavior. But with the generated instructions, would this be free of race conditions (or other potential problems)?
Other ways to store a YMM register into a size-3 array are welcome as well.
There isn't really a formal spec for Intel's intrinsics, AFAIK, other than what Intel has published as documentation. e.g. their intrinsics guide. Also examples from their whitepapers and so on; e.g. examples that need to work are one way GCC/clang know they have to define __m128 with __attribute__((may_alias)).
It's all within one thread, fully synchronous, so definitely no "race condition". In your case it doesn't even matter which order the stores happen in (assuming they don't overlap with the __m256d reg object itself! That would be the equivalent of an overlapping memcpy problem.) What you're doing might be like two indeterminately sequenced memcpy to overlapping destinations: they definitely happen in one order or the other, and the compiler could pick either.
The observable difference for order of stores is performance: if you want to do a SIMD reload very soon after, then store forwarding will work better if the 16-byte reload takes its data from one 16-byte store, not the overlap of two stores.
In general overlapping stores are fine for performance, though; the store buffer will absorb them. It means one of them is unaligned, though, and crossing a cache-line boundary would be more expensive.
However, that's all moot: Intel's intrinsics guide does list an "operation" section for that compound intrinsic:
Operation
MEM[loaddr+127:loaddr] := a[127:0]
MEM[hiaddr+127:hiaddr] := a[255:128]
So it's strictly defined as low address store first (the second arg; I think you got this backwards).
And all of that is also moot because there's a more efficient way
Your way costs 1 lane-crossing shuffle + vmovups + vextractf128 [mem], ymm, 1. Depending on how it compiles, neither store can start until after the shuffle. (Although it looks like clang might have avoided that problem).
On Intel CPUs, vextractf128 [mem], ymm, imm costs 2 uops for the front-end, not micro-fused into one. (Also 2 uops on Zen for some reason.)
On AMD CPUs before Zen 2, lane-crossing shuffles are more than 1 uop, so _mm256_permute4x64_pd is more expensive than necessary.
You just want to store the low lane of the input vector, and the low element of the high lane. The cheapest shuffle is vextractf128 xmm, ymm, 1 - 1 uop / 1c latency on Zen (which splits YMM vectors into two 128-bit halves anyway). It's as cheap as any other lane-crossing shuffle on Intel.
The asm you want the compiler to make is probably this, which only requires AVX1. AVX2 doesn't have any useful instructions for this.
vextractf128 xmm1, ymm0, 1 ; single uop everywhere
vmovupd [rdi], xmm0 ; single uop everywhere
vmovsd [rdi+2*8], xmm1 ; single uop everywhere
So you want something like this, which should compile efficiently.
_mm_store_pd(vec, _mm256_castpd256_pd128(reg)); // low half
__m128d hi = _mm256_extractf128_pd(reg, 1);
_mm_store_sd(vec+2, hi);
// or vec[2] = _mm_cvtsd_f64(hi);
vmovlps (_mm_storel_pi) would also work, but with AVX VEX encoding it doesn't save any code size, and would require even more casting to keep compilers happy.
There's unfortunately no vpextractq [mem], ymm; it only exists with an XMM source, so that doesn't help.
Masked store:
As discussed in comments, yes you could do vmaskmovps but it's unfortunately not as efficient as we might like on all CPUs. Until AVX512 makes masked loads/stores first-class citizens, it may be best to shuffle and do 2 stores. Or pad your array / struct so you can at least temporarily step on later stuff.
Zen has 2-uop vmaskmovpd ymm loads, but very expensive vmaskmovpd stores (42 uops, 1 per 11 cycles for YMM). Or Zen+ and Zen2 are 18 or 19 uops, 6 cycle throughput. If you care at all about Zen, avoid vmaskmov.
On Intel Broadwell and earlier, vmaskmov stores are 4 uops according to Agner Fog's testing, so that's 1 more fused-domain uop than we get from shuffle + movups + movsd. But still, Haswell and later do manage 1/clock throughput, so if that's a bottleneck then it beats the 2-cycle throughput of 2 stores. SnB/IvB of course take 2 cycles for a 256-bit store, even without masking.
On Skylake, vmaskmov mem, ymm, ymm is only 3 uops. (Agner Fog lists 4, but his spreadsheets are hand-edited and have been wrong before; I think it's safe to assume uops.info's automated testing is right.) And that makes sense: Skylake-client is basically the same core as Skylake-AVX512, just without AVX512 actually enabled, so they could implement vmaskmovpd by decoding it into a test into a mask register (1 uop) + a masked store (2 more uops without micro-fusion).
So if you only care about Skylake and later, and can amortize the cost of loading a mask into a vector register (reusable for loads and stores), vmaskmovpd is actually pretty good. Same front-end cost but cheaper in the back-end: only 1 each store-address and store-data uops, instead of 2 separate stores. Note the 1/clock throughput on Haswell and later vs. the 2-cycle throughput for doing 2 separate stores.
vmaskmovpd might even store-forward efficiently to a masked reload; I think Intel mentioned something about this in their optimization manual.
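If you do go the masked-store route on a CPU where it's worth it, a minimal sketch (assuming AVX; hoist the mask constant out of any loop and reuse it):
#include <immintrin.h>

void store3_masked(double *vec, __m256d reg) {
    // high bit set in the low 3 elements selects them for the store
    const __m256i mask = _mm256_setr_epi64x(-1, -1, -1, 0);
    _mm256_maskstore_pd(vec, mask, reg);   // vmaskmovpd [vec], mask, reg
}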

Split a number into several numbers, each with only one significant bit

Is there any efficient algorithm (or processor instruction) that will help split a number (32-bit or 64-bit) into several numbers in which there is only one 1-bit?
I want to isolate each set bit in a number. For example,
input:
01100100
output:
01000000
00100000
00000100
The only thing that comes to mind is number & mask.
Assembly or C++.
Yes, in a similar way as Brian Kernighan's algorithm to count set bits, except instead of counting the bits we extract and use the lowest set bit in every intermediary result:
while (number) {
// extract lowest set bit in number
uint64_t m = number & -number;
/// use m
...
// remove lowest set bit from number
number &= number - 1;
}
In modern x64 assembly, number & -number may be compiled to blsi, and number &= number - 1 may be compiled to blsr which are both fast, so this would only take a couple of efficient instructions to implement.
Since m is available, resetting the lowest set bit may also be done with number ^= m, but that may make it harder for the compiler to see that it can use blsr, which is a better choice because it depends directly only on number, shortening the loop-carried dependency chain.
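A self-contained version of that loop, for reference (my own example; it just collects the isolated bits into an array, and compiles to blsi/blsr with -mbmi):
#include <stdint.h>
#include <stddef.h>

size_t isolate_bits(uint64_t number, uint64_t out[64]) {
    size_t n = 0;
    while (number) {
        out[n++] = number & -number;   // lowest set bit (blsi)
        number &= number - 1;          // clear lowest set bit (blsr)
    }
    return n;
}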
The standard way is
while (num) {
unsigned mask = num ^ (num & (num-1)); // This will have just one bit set
...
num ^= mask;
}
For example, starting with num = 2019 you will get, in order:
1
2
32
64
128
256
512
1024
If you are going to iterate over the single-bit-isolated masks one at a time, generating them one at a time is efficient; see @harold's answer.
But if you truly just want all the masks, x86 with AVX512F can usefully parallelize this. (At least potentially useful depending on surrounding code. More likely this is just a fun exercise in applying AVX512 and not useful for most use-cases).
The key building block is AVX512F vpcompressd: given a mask (e.g. from a SIMD compare) it will shuffle the selected dword elements to contiguous elements at the bottom of a vector.
An AVX512 ZMM / __m512i vector holds 16x 32-bit integers, so we only need 2 vectors to hold every possible single-bit mask. Our input number is a mask that selects which of those elements should be part of the output. (No need to broadcast it into a vector and vptestmd or anything like that; we can just kmov it into a mask register and use it directly.)
See also my AVX512 answer on "AVX2 what is the most efficient way to pack left based on a mask?"
#include <stdint.h>
#include <immintrin.h>
// suggest 64-byte alignment for out_array
// returns count of set bits = length stored
unsigned bit_isolate_avx512(uint32_t out_array[32], uint32_t x)
{
const __m512i bitmasks_lo = _mm512_set_epi32(
1UL << 15, 1UL << 14, 1UL << 13, 1UL << 12,
1UL << 11, 1UL << 10, 1UL << 9, 1UL << 8,
1UL << 7, 1UL << 6, 1UL << 5, 1UL << 4,
1UL << 3, 1UL << 2, 1UL << 1, 1UL << 0
);
const __m512i bitmasks_hi = _mm512_slli_epi32(bitmasks_lo, 16); // compilers actually do constprop and load another 64-byte constant, but this is more readable in the source.
__mmask16 set_lo = x;
__mmask16 set_hi = x>>16;
int count_lo = _mm_popcnt_u32(set_lo); // doesn't actually cost a kmov, __mmask16 is really just uint16_t
_mm512_mask_compressstoreu_epi32(out_array, set_lo, bitmasks_lo);
_mm512_mask_compressstoreu_epi32(out_array+count_lo, set_hi, bitmasks_hi);
return _mm_popcnt_u32(x);
}
Compiles nicely with clang on Godbolt, and with gcc other than a couple minor sub-optimal choices with mov, movzx, and popcnt, and making a frame pointer for no reason. (It also can compile with -march=knl; it doesn't depend on AVX512BW or DQ.)
# clang9.0 -O3 -march=skylake-avx512
bit_isolate_avx512(unsigned int*, unsigned int):
movzx ecx, si
popcnt eax, esi
shr esi, 16
popcnt edx, ecx
kmovd k1, ecx
vmovdqa64 zmm0, zmmword ptr [rip + .LCPI0_0] # zmm0 = [1,2,4,8,16,32,64,128,256,512,1024,2048,4096,8192,16384,32768]
vpcompressd zmmword ptr [rdi] {k1}, zmm0
kmovd k1, esi
vmovdqa64 zmm0, zmmword ptr [rip + .LCPI0_1] # zmm0 = [65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728,268435456,536870912,1073741824,2147483648]
vpcompressd zmmword ptr [rdi + 4*rdx] {k1}, zmm0
vzeroupper
ret
On Skylake-AVX512, vpcompressd zmm{k1}, zmm is 2 uops for port 5. Latency from input vector -> output is 3 cycles, but latency from input mask -> output is 6 cycles. (https://www.uops.info/table.html / https://www.uops.info/html-instr/VPCOMPRESSD_ZMM_K_ZMM.html). The memory destination version is 4 uops: 2p5 + the usual store-address and store-data uops which can't micro-fuse when part of a larger instruction.
It might be better to compress into a ZMM reg and then store, at least for the first compress, to save total uops. The 2nd should probably still take advantage of the masked-store feature of vpcompressd [mem]{k1} so the output array doesn't need padding for it to step on. IDK if that helps with cache-line splits, i.e. whether masking can avoid replaying the store uop for the part with an all-zero mask in the 2nd cache line.
On KNL, vpcompressd zmm{k1} is only a single uop. Agner Fog didn't test it with a memory destination (https://agner.org/optimize/).
This is 14 fused-domain uops for the front-end on Skylake-X for the real work (e.g. after inlining into a loop over multiple x values, so we could hoist the vmovdqa64 loads out of the loop. Otherwise that's another 2 uops). So front-end bottleneck = 14 / 4 = 3.5 cycles.
Back-end port pressure: 6 uops for port 5 (2x kmov(1) + 2x vpcompressd(2)): 1 iteration per 6 cycles. (Even on IceLake (instlatx64), vpcompressd is still 2c throughput, unfortunately, so apparently ICL's extra shuffle port doesn't handle either of those uops. And kmovw k, r32 is still 1/clock, so presumably still port 5 as well.)
(Other ports are fine: popcnt runs on port 1, and that port's vector ALU is shut down when 512-bit uops are in flight. But not its scalar ALU, the only one that handles 3-cycle latency integer instructions. movzx dword, word can't be eliminated, only movzx dword, byte can do that, but it runs on any port.)
Latency: integer result is just one popcnt (3 cycles). First part of the memory result is stored about 7 cycles after the mask is ready. (kmov -> vpcompressd). The vector source for vpcompressd is a constant so OoO exec can get it ready plenty early unless it misses in cache.
Compacting the 1<<0..15 constant by building it with a shift would be possible, but probably not worth it. e.g. loading 16-byte _mm_setr_epi8(0..15) with vpmovzxbd, then using that with vpsllvd on a vector of set1(1) (which you can get from a broadcast or generate on the fly with vpternlogd+shift). But that's probably not worth it even if you're writing by hand in asm (so it's your choice instead of the compiler) since this already uses a lot of shuffles, and constant-generation would take at least 3 or 4 instructions (each of which is at least 6 bytes long; EVEX prefixes alone are 4 bytes each).
I would generate the hi part with a shift from lo, instead of loading it separately, though. Unless the surrounding code bottlenecks hard on port 0, an ALU uop isn't worse than a load uop. One 64-byte constant fills a whole cache line.
You could compress the lo constant with a vpmovzxwd load: each element fits in 16 bits. Worth considering if you can hoist that outside of a loop so it doesn't cost an extra shuffle per operation.
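A sketch of that compaction (untested; AVX512F, a hypothetical helper name, and a 32-byte table instead of 64 bytes):
#include <immintrin.h>
#include <stdint.h>

static inline __m512i load_bitmasks_lo_compact(void) {
    static const uint16_t lo16[16] = {
        1u<<0, 1u<<1, 1u<<2,  1u<<3,  1u<<4,  1u<<5,  1u<<6,  1u<<7,
        1u<<8, 1u<<9, 1u<<10, 1u<<11, 1u<<12, 1u<<13, 1u<<14, 1u<<15
    };
    // should compile to a vpmovzxwd zmm, [mem]: zero-extend 16x u16 to 16x u32
    return _mm512_cvtepu16_epi32(_mm256_loadu_si256((const __m256i *)lo16));
}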
If you wanted the result in a SIMD vector instead of stored to memory, you could 2x vpcompressd into registers and maybe use count_lo to look up a shuffle control vector for vpermt2d. Possibly from a sliding-window on an array instead of 16x 64-byte vectors? But the result isn't guaranteed to fit in one vector unless you know your input had 16 or fewer bits set.
Things are much worse for 64-bit integers: 8x 64-bit elements means we need 8 vectors. So it's maybe not worth it vs. scalar, unless your inputs have lots of bits set.
You can do it in a loop, though, using vpslld by 8 to move bits in vector elements. You'd think kshiftrq would be good, but with 4 cycle latency that's a long loop-carried dep chain. And you need scalar popcnt of each 8-bit chunk anyway to adjust the pointer. So your loop should use shr / kmov and movzx / popcnt. (Using a counter += 8 and bzhi to feed popcnt would cost more uops).
The loop-carried dependencies are all short (and the loop only runs 8 iterations to cover all 64 mask bits), so out-of-order exec should be able to nicely overlap work for multiple iterations. Especially if we unroll by 2 so the vector and mask dependencies can get ahead of the pointer update.
vector: vpslld immediate, starting from the vector constant
mask: shr r64, 8 starting with x. (Could stop looping when this becomes 0 after shifting out all the bits. This 1-cycle dep chain is short enough for OoO exec to zip through it and hide most of the mispredict penalty, when it happens.)
pointer: lea rdi, [rdi + rax*4] where RAX holds a popcnt result.
The rest of the work is all independent across iterations. Depending on surrounding code, we probably bottleneck on port 5 with the vpcompressd shuffles and kmov.

Arbitrary position 2-input shuffling using SSE

I have two 4-component vectors which I load into two __m128 variables.
Then I need to shuffle those so that the result looks like this:
Given:
__m128 mmMin = _mm_load_ps(&glm::vec4(-1.0f,-2.0f,-3.0f,-4.0f)[0]);
__m128 mmMax = _mm_load_ps(&glm::vec4(1.0f,2.0f,3.0f,4.0f)[0]);
I want the result of the shuffle to look like this:
// {mmMin.x,mmMax.x,mmMin.x,mmMax.x}
But I see it is not possible to do with _mm_shuffle_ps.
From the SSE docs I see that the _mm_shuffle_ps mask always inserts 2 values taken from the first __m128 into the low half of the result, then 2 values from the second __m128 into the high half.
SPU intrinsics have the si_shufb method, which lets me define a qword-based mask and shuffle into whatever positions I wish. Is there a similar method in SSE?
I am using SSE2, but will be happy also to see how it can be done with other versions, including AVX.
With only SSE2 I think you need 2 shuffles: unpcklps to interleave and then unpcklpd same,same or shufps same,same to broadcast the low 64 bits.
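In intrinsics, that might look like this (untested sketch, using the shufps same,same variant):
#include <xmmintrin.h>

__m128 interleave_broadcast_sse(__m128 mmMin, __m128 mmMax) {
    __m128 lo = _mm_unpacklo_ps(mmMin, mmMax);              // {min.x, max.x, min.y, max.y}
    return _mm_shuffle_ps(lo, lo, _MM_SHUFFLE(1, 0, 1, 0)); // {min.x, max.x, min.x, max.x}
}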
With AVX512F, vpermt2ps can do this in one shuffle (using a control vector); I don't think there are any 2-source shuffles in AVX2 or earlier with fine enough granularity and flexible source locations before that. And no fixed shuffles that duplicate an element along with interleaving.
2-source shuffles are rare until AVX512: mostly fixed shuffles like unpckl/h* and palignr. It's mostly just [v]shufps / [v]shufpd until then. Variable-control shuffles are also rare: until AVX, the only one is pshufb. AVX1/2 added some variable-control dword-element shuffles, but only for 1 source. There are no variable-control 2-source shuffles until AVX512.
Immediate shuffles would need more than 4 groups of 2-bit indices to handle arbitrary indexing into the concatenation of two 4-element vectors. But x86 SIMD instructions always have at most an 8-bit immediate operand. Unfortunately no broadcast-immediate like ARM has that could efficiently create a vector of 1.0f or whatever.
AVX
Since you only need 1 element from each vector, instead of loading a whole vector you can use an AVX broadcast-load and then vblendps
Broadcast-loads are the same cost as normal loads on Intel CPUs (don't cost you a uop for the shuffle port, purely handled in the load port). They can't fold into memory operands for ALU instructions until AVX512F, but they do avoid shuffle-port bottlenecks. AMD CPUs may still need an ALU uop but they have more shuffle ALUs so shuffle throughput isn't a bottleneck nearly as much. (https://agner.org/optimize/)
Ryzen vbroadcastss xmm, [mem] is 2 separate uops for the front-end unfortunately, but it still has 2-per-clock throughput.
Blend-immediate on dword and wider elements is very efficient and can run on any vector ALU port on Haswell and later, or on 2 ports on SnB/IvB and Ryzen. It's still a single uop with 1c latency even on Nehalem.
#include <immintrin.h>
__m128 broadcast_interleave_scalars_avx(const float *min, const float *max) {
__m128 minx = _mm_broadcast_ss(min);
__m128 maxx = _mm_broadcast_ss(max);
return _mm_blend_ps(minx, maxx, 0b1010);
}
On Godbolt, clang's asm comments confirm that I got the blend constant right:
vbroadcastss xmm0, dword ptr [rdi]
vbroadcastss xmm1, dword ptr [rsi]
vblendps xmm0, xmm0, xmm1, 10 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
If your data was already in registers, not freshly loaded, you might want to just use 2 shuffles.
With SSE4.1 you might be able to do 2x movddup loads to broadcast 64 bits from memory (including the 32 bits you care about) then blendps. The first load will load 32 bits past the float you care about, the 2nd will load 32 bits before the float you care about.
To get a C++ compiler to emit this for you you'll have to pointer-cast to double* for the __m128d _mm_loaddup_pd (double const* mem_addr) loads, and then use _mm_castpd_ps to get __m128 from __m128d.
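A sketch of that (untested; note it reads 4 bytes past min and 4 bytes before max, which is only OK if you know those bytes are part of an accessible object, e.g. inside the surrounding vec4s):
#include <smmintrin.h>

__m128 interleave_via_movddup(const float *min, const float *max) {
    __m128d lo = _mm_loaddup_pd((const double *)min);        // {min[0], min[1]} x2
    __m128d hi = _mm_loaddup_pd((const double *)(max - 1));  // {max[-1], max[0]} x2
    // even elements come from lo (min[0]), odd elements from hi (max[0])
    return _mm_blend_ps(_mm_castpd_ps(lo), _mm_castpd_ps(hi), 0b1010);
}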
https://www.felixcloutier.com/x86/movsldup could also be useful to set up for unpcklps.

Fallback implementation for conflict detection in AVX2

AVX512CD contains the intrinsic _mm512_conflict_epi32(__m512i a). It returns a vector where, for every element in a, a bit is set for each earlier element that has the same value. Is there a way to do something similar in AVX2?
I'm not interested in the exact bits; I just need to know which elements are duplicates of the elements to their left (or right). I simply need to know if a scatter would conflict.
Basically I need an AVX2 equivalent for
__m256i detect_conflict(__m256i a) {
__m256i cd = _mm256_conflict_epi32(a);
return _mm256_cmpgt_epi32(cd, _mm256_set1_epi32(0));
}
The only way I could think of is to use _mm256_permutevar8x32_epi32() to shift each value right by 1 (across the lanes) and then do seven compares, mask out the unused bits, and then _mm256_or_si256() them together, which is horribly slow.
TL:DR: Since full detection of which elements conflict is expensive, it's probably worth doing more fall-back work in exchange for cheaper detection. This depends on your conflict-handling options / strategies.
I came up with a fairly efficient way to check for presence/absence of conflicts without finding their locations, like this answer for 64-bit integer elements. It's actually faster than Skylake-AVX512's micro-coded vpconflictd ymm, but of course it gives you much less information. (KNL has fast vpconflictd.)
You could use a fully-scalar fallback for all the elements if there are any conflicts. This would work well if conflicts are rare enough that branch-mispredicts don't kill performance. (AVX2 doesn't have scatter instructions in the first place, though, so I'm not sure exactly what you need this for.)
The only-left or only-right behaviour is hard, but my method can give you a mask of which elements have conflicts with any other element (e.g. v[0] == v[3] would result in both conflict[0] and conflict[3] being true). This costs only 1 extra shuffle, or maybe 0 with a redesign with this goal in mind.
(I misread the question at first; I thought you wanted to check both directions, rather than talking about two different implementation options for most of what vpconflictd does. Actually at first I thought you just wanted a presence/absence check, like bool any_conflicts(__m256i).)
Finding presence/absence of any conflicts: bool any_conflicts32(__m256i)
8 choose 2 is 28 total scalar comparisons. That's 3.5 vectors of packed comparisons. We should aim to do it with 4 vector compares, which leaves room for some redundancy.
Creating inputs for those compares will require shuffles, and some of those will have to be lane-crossing. 4 unique comparisons require at least 4 vectors (including the initial unshuffled copy), since 3 choose 2 is only 3.
Ideally as few as possible of the shuffles are lane-crossing, and there is lots of ILP for the compares and ORing of compare results. Also nice if the shuffles don't need a vector shuffle-control, just an imm8. Also good if they're not slow on AMD Ryzen, where 256b instructions are decoded into multiple 128b uops. (Some shuffles are worse than others for this, e.g. vperm2i128 is very bad; much worse than vpermq for swapping the high and low halves of a single vector. Unfortunately clang gets this wrong even with -mtune=znver1, and compiles _mm256_permute4x64_epi64 into vperm2i128 whenever it can).
I found a solution pretty early that achieves most of these goals: 3 shuffles, 4 compares. One of the shuffles is in-lane. All of them use an immediate control byte instead of a vector.
// returns a 0 or non-zero truth value
int any_conflicts32(__m256i v)
{
__m256i hilo = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2)); // vpermq is much more efficient than vperm2i128 on Ryzen and KNL, same on HSW/SKL.
__m256i inlane_rotr1 = _mm256_shuffle_epi32(v, _MM_SHUFFLE(0,3,2,1));
__m256i full_rotl2 = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(2,1,0,3));
__m256i v_ir1 = _mm256_cmpeq_epi32(v, inlane_rotr1);
__m256i v_hilo= _mm256_cmpeq_epi32(v, hilo); // only really needs to be a 128b operation on the low lane, with leaving the upper lane zero.
// But there's no ideal way to express that with intrinsics, since _mm256_castsi128_si256 technically leaves the high lane undefined
// It's extremely likely that casting down and back up would always compile to correct code, though (using the result in a zero-extended register).
__m256i hilo_ir1 = _mm256_cmpeq_epi32(hilo, inlane_rotr1);
__m256i v_fl2 = _mm256_cmpeq_epi32(v, full_rotl2);
__m256i t1 = _mm256_or_si256(v_ir1, v_hilo);
__m256i t2 = _mm256_or_si256(t1, v_fl2);
__m256i conflicts = _mm256_or_si256(t2, hilo_ir1); // A serial dep chain instead of a tree is probably good because of resource conflicts from limited shuffle throughput
// if you're going to branch on this, movemask/test/jcc is more efficient than ptest/jcc
unsigned conflict_bitmap = _mm256_movemask_epi8(conflicts); // With these shuffles, positions in the bitmap aren't actually meaningful
return conflict_bitmap; // non-zero if any conflicts (bit positions aren't meaningful, see above)
}
How I designed this:
I made a table of all the element-pairs that needed to be checked, and made columns for which shuffled operands could take care of that requirement.
I started with a few shuffles that could be done cheaply, and it turned out my early guesses worked well enough.
My design notes:
// 7 6 5 4 | 3 2 1 0
// h g f e | d c b a
// e h g f | a d c b // inlanerotr1 = vpshufd(v)
// f e d c | b a h g // fullrotl2 = vpermq(v)
// d c b a | h g f e // hilo = vperm2i128(v) or vpermq. v:hilo has lots of redundancy. The low half has all the information.
v:lrot1 v:frotr2 lrotr1:frotl2 (incomplete)
* ab [0]v:lrotr1 [3]lr1:fl2
* ac [2]v:frotl2
* ad [3]v:lrotr1 [2]lr1:fl2
* ae [0,4]v:hilo
* af [4]hilo:lrotr1
* ag [0]v:frotl2
* ah [3]hilo:lrotr1
* bc [1]v:lrotr1
* bd [3]v:frotl2 [5]hilo:frotl2
* be [0]hilo:lrotr1
* bf [1,5]v:hilo
* bg [0]lr1:fl2 [5]hilo:lrotr1
* bh [1]v:frotl2
* cd [2]v:lrotr1
* ce [4]v:frotl2 [4]lr1:fl2
* cf [1]hilo:lrotr1
* cg [2,6]v:hilo
* ch [1]lr1:fl2 [6]hilo:lrotr1
* de [7]hilo:lrotr1
* df [5]v:frotl2 [7]hilo:frotl2
* dg [5]lr1:fl2 [2]hilo:lrotr1
* dh [3,7]v:hilo
* ef [4]v:lrotr1 [7]lr1:fl2
* eg [6]v:frotl2
* eh [7]v:lrotr1 [6]lr1:fl2
* fg [5]v:lrotr1
* fh [7]v:frotl2
* gh [6]v:lrotr1
*/
It turns out that in-lane rotr1 == full rotl2 has a lot of redundancy, so it's not worth using. It also turns out that having all the allowed redundancy in v==hilo works fine.
If you care about which result is in which element (rather than just checking for presence/absence),
then v == swap_hilo(lrotr1) could work instead of lrotr1 == hilo.
But we also need swap_hilo(v), so this would mean an extra shuffle.
We could instead shuffle after hilo==lrotr1, for better ILP.
Or maybe there's a different set of shuffles that gives us everything.
Maybe if we consider VPERMD with a vector shuffle-control...
Compiler asm output vs. optimal asm
gcc6.3 -O3 -march=haswell produces the following. (Haswell has one shuffle unit, on port 5.)
# assume ymm0 ready on cycle 0
vpermq ymm2, ymm0, 78 # hilo ready on cycle 3 (execution started on cycle 0)
vpshufd ymm3, ymm0, 57 # lrotr1 ready on cycle 2 (started on cycle 1)
vpermq ymm1, ymm0, 147 # frotl2 ready on cycle 5 (started on 2)
vpcmpeqd ymm4, ymm2, ymm0 # starts on 3, ready on 4
vpcmpeqd ymm1, ymm1, ymm0 # starts on 5, ready on 6
vpcmpeqd ymm2, ymm2, ymm3 # starts on 3, ready on 4
vpcmpeqd ymm0, ymm0, ymm3 # starts on 2, ready on 3
vpor ymm1, ymm1, ymm4 # starts on 6, ready on 7
vpor ymm0, ymm0, ymm2 # starts on 4, ready on 5
vpor ymm0, ymm1, ymm0 # starts on 7, ready on 8
# a different ordering of VPOR merging could have saved a cycle here. /scold gcc
vpmovmskb eax, ymm0
vzeroupper
ret
So the best-case latency is 8 cycles to have a single vector ready, given resource conflicts from other instructions in this sequence but assuming no conflicts with past instructions still in the pipeline. (Should have been 7 cycles, but gcc re-ordered the dependency structure of my intrinsics putting more stuff dependent on the compare of the last shuffle result.)
This is faster than Skylake-AVX512's vpconflictd ymm, which has 17c latency, one per 10c throughput. (Of course, that gives you much more information, and @harold's emulation of it takes many more instructions).
Fortunately gcc didn't re-order the shuffles and introduce a potential write-back conflict. (e.g. putting the vpshufd last would mean that dispatching the shuffle uops to port5 in oldest-first order would have the vpshufd ready in the same cycle as the first vpermq (1c latency vs. 3c).) gcc did this for one version of the code (where I compared the wrong variable), so it seems that gcc -mtune=haswell doesn't take this into account. (Maybe it's not a big deal, I haven't measured to see what the real effect on latency is. I know the scheduler is smart about picking uops from the Reservation Station to avoid actual write-back conflicts, but IDK how smart it is, i.e. whether it would run the vpshufd ahead of a later vpermq to avoid a write-back conflict, since it would have to look-ahead to even see the upcoming writeback conflict. More likely it would just delay the vpshufd for an extra cycle before dispatching it.)
Anyway, this is why I put _mm_shuffle_epi32 in the middle in the C source, where it makes things easy for OOO execution.
Clang 4.0 goes berserk and packs each compare result down to 128b vectors (with vextracti128 / vpacksswb), then expands back to 256b after three vpor xmm before pmovmskb. I thought at first it was doing this because of -mtune=znver1, but it does it with -mtune=haswell as well. It does this even if we return a bool, which would let it just pmovmskb / test on the packed vector. /facepalm. It also pessimizes the hilo shuffle to vperm2i128, even with -mtune=znver1 (Ryzen), where vperm2i128 is 8 uops but vpermq is 3. (Agner Fog's insn tables for some reason missed those, so I took those numbers from the FP equivalents vperm2f128 and vpermpd.)
@harold says that using add instead of or stops clang from packing/unpacking, but vpaddd has lower throughput than vpor on Intel pre-Skylake.
Even better for Ryzen, the v == hilo compare can do only the low half. (i.e. use vpcmpeqd xmm2, xmm2, xmm3, which is only 1 uop instead of 2). We still need the full hilo for hilo == lrot1, though. So we can't just use vextracti128 xmm2, xmm0, 1 instead of the vpermq shuffle. vextracti128 has excellent performance on Ryzen: 1 uop, 1c latency, 0.33c throughput (can run on any of P0/1/3).
Since we're ORing everything together, it's fine to have zeros instead of redundant compare results in the high half.
As I noted in comments, IDK how to safely write this with intrinsics. The obvious way would be to use _mm256_castsi128_si256 (_mm_cmpeq_epi32(v, hilo)), but that technically leaves the high lane undefined, rather than zero. There's no sane way a compiler would do anything other than use the full-width ymm register that contains the xmm register with the 128b compare result, but it would be legal according to Intel's docs for a Deathstation-9000 compiler to put garbage there. Any explicit way of getting zeros in the high half would depend on the compiler optimizing it away. Maybe _mm256_setr_m128i(cmpresult, _mm_setzero_si128());
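Something like this sketch (untested) expresses the idea; whether it actually compiles down to just a vpcmpeqd xmm with the upper lane implicitly zeroed depends on the compiler folding the explicit zeroing away:
#include <immintrin.h>

static inline __m256i cmpeq_lo128_zext(__m256i v, __m256i hilo) {
    __m128i cmp = _mm_cmpeq_epi32(_mm256_castsi256_si128(v),
                                  _mm256_castsi256_si128(hilo));
    return _mm256_set_m128i(_mm_setzero_si128(), cmp);   // high lane = 0, low = cmp
}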
There are no current CPUs with AVX512F but not AVX512CD. But if that combo is interesting or relevant, clang makes some interesting asm from my code with -mavx512f -mavx512vl. It uses EVEX vpcmpeqd into mask registers, and korw to merge them. But then it expands that back into a vector to set up for vpmovmaskb, instead of just optimizing away the movemask and using the korw result. /facepalm.

8-bit FFT for CPU architectures?

I am looking for an FFT engine that can handle 8-bit real to complex transforms (of size 65K). The need for this is to accelerate a real-time signal processing engine. It is currently limited by 8-bit -> FP32 and FP32 -> 8-bit conversions, as well as the actual FFT being memory bandwidth bound (we're using FFTW at the moment).
I thought that the Spiral project might be able to do this http://spiral.net, but the only code that seems to be available on their webpage is for single or double transforms.
Anyone know of any C or C++ libraries that can do this?
Some time ago I encountered the same problem. FFTW for my data frame executed in 14 ms (forward transform, some calculations, and backward transform), while the straightforward byte (or short) to float array conversion took 12-19 ms. So I wrote an SSE function to convert bytes to floats (4 elements per iteration) and got a significant speed gain: the conversion now takes 2.2-5 ms.
If your compiler can auto-vectorize, try that first.
If not, write a simple conversion function with intrinsics.
I used inline assembler (a MOVD, PUNPCKLBW, PUNPCKLWD, CVTDQ2PS, MOVAPS instruction sequence).
procedure BytesToSingles(Src, Dst: Pointer; Count: Integer);
asm
//EAX = Src pointer to byte array
//EDX = Dst pointer to float array !!! 16 byte-aligned !!!
//ECX = Count (multiple of four)
SHR ECX, 2 // 4 elements per cycle
JZ @@Exit
PXOR XMM7, XMM7 // zeros
@@Cycle:
MOVD XMM1, [EAX] // load 4 bytes
PUNPCKLBW XMM1, XMM7 // unpack to words
PUNPCKLWD XMM1, XMM7 // words to int32
CVTDQ2PS XMM0, XMM1 // convert integers to 4 floats
MOVAPS [EDX], XMM0 // store 4 floats to destination array
ADD EAX, 4 // move array pointers
ADD EDX, 16
LOOP @@Cycle
@@Exit:
end;
Note that an FFT implementation on 8-bit data will suffer from numerical error issues, as Paul R wrote in a comment.
You do not want to do all the processing in fixed point. Your data will turn to mush in an FFT of that size. Technically, you could use 32-bit fixed point and keep all your dynamic range, but you'd still have to convert the data, and it would be slower than using floats (you tagged SSE, so I assume you are on an Intel machine with an FPU). I base my opinion on my work creating kissfft.
Focus instead on speeding up the type conversion.
I've not run MBo's assembly code, but it looks like the right approach. I think unrolling might make it faster.
If you are not accustomed to assembly, use SSE2 compiler intrinsics instead. It will be just as fast (assuming a decent compiler) and it will make your code more readable and maintainable. This answer will give you most of what you need.
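For reference, a rough SSE2-intrinsics translation of MBo's inner loop (untested sketch; same assumptions: 16-byte-aligned destination and a count that's a multiple of 4):
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

void bytes_to_floats(const uint8_t *src, float *dst, size_t count) {
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < count; i += 4) {
        int32_t four;                                   // MOVD: load 4 bytes
        memcpy(&four, src + i, sizeof(four));
        __m128i b = _mm_cvtsi32_si128(four);
        __m128i w = _mm_unpacklo_epi8(b, zero);         // PUNPCKLBW: bytes -> words
        __m128i d = _mm_unpacklo_epi16(w, zero);        // PUNPCKLWD: words -> dwords
        _mm_store_ps(dst + i, _mm_cvtepi32_ps(d));      // CVTDQ2PS + MOVAPS
    }
}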