Shuffling by mask with Intel AVX - c++

I'm new to AVX programming. I have a register which needs to be shuffled. I want to shuffle several bytes from a 256-bit register, R1, to an empty register R2. I want to define a mask which tells the shuffle operation which byte from the old register(R1) should be copied at which place in the new register.
The mask should look like this(Src:Byte Pos in R1, Target:Byte Pos in R2):
{(0,0),(1,1),(1,4),(2,5),...}
This means several bytes are copied twice.
I'm not 100% sure which function I should use for this. I tried a bit with these two AVX functions, the second just uses 2 lanes.
__m256 _mm256_permute_ps (__m256 a, int imm8)
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)
I'm totally confused about the Shuffle Mask in imm8 and how to design it so that it would work as described above.
I had a look in this slides(page 26) were _MM_SHUFFLE is described but I can't find a solution to my problem.
Are there any tutorials on how to design such a mask? Or example functions for the two methods to understand them in depth?
Thanks in advance for hints

TL:DR: you probably either need multiple shuffles to handle lane-crossing, or if your pattern continues exactly like that you can use _mm256_cvtepu16_epi32 (vpmovzxwd) and then _mm256_blend_epi16.
For x86 shuffles (like most SIMD instruction-sets, I think), the destination position is implicit. A shuffle-control constant just has source indices in destination order, whether it's an imm8 that gets compiled+assembled right into an asm instruction or whether it's a vector with an index in each element.
Each destination position reads exactly one source position, but the same source position can be read more than once. Each destination element gets a value from the shuffle source.
See Convert _mm_shuffle_epi32 to C expression for the permutation? for a plain-C version of dst = _mm_shuffle_epi32(src, _MM_SHUFFLE(d,c,b,a)), showing how the control byte is used.
(For pshufb / _mm_shuffle_epi8, an element with the high bit set zeros that destination position instead of reading any source element, but other x86 shuffles ignore all the high bits in shuffle-control vectors.)
Without AVX512 merge-masking, there are no shuffles that also blend into a destination. There are some two-source shuffles like _mm256_shuffle_ps (vshufps) which can shuffle together elements from two sources to produce a single result vector. If you wanted to leave some destination elements unwritten, you'll probably have to shuffle and then blend, e.g. with _mm256_blendv_epi8, or if you can use blend with 16-bit granularity you can use a more efficient immediate blend _mm256_blend_epi16, or even better _mm256_blend_epi32 (AVX2 vpblendd is as cheap as _mm256_and_si256 on Intel CPUs, and is the best choice if you do need to blend at all, if it can get the job done; see http://agner.org/optimize/)
For your problem (without AVX512VBMI vpermb in Cannonlake), you can't shuffle single bytes from the low 16 "lane" into the high 16 "lane" of a __m256i vector with a single operation.
AVX shuffles are not like a full 256-bit SIMD, they're more like two 128-bit operations in parallel. The only exceptions are some AVX2 lane-crossing shuffles with 32-bit granularity or larger, like vpermd (_mm256_permutevar8x32_epi32). And also the AVX2 versions of pmovzx / pmovsx, e.g. pmovzxbq does zero-extend the low 4 bytes of an XMM register into the 4 qwords of a YMM register, rather than the low 2 bytes of each half of a YMM register. This makes it much more useful with a memory source operand.
But anyway, the AVX2 version of pshufb (_mm256_shuffle_epi8) does two separate 16x16 byte shuffles in the two lanes of a 256-bit vector.
You're probably going to want something like this:
// Intrinsics have different types for integer, float and double vectors
// the asm uses the same registers either way
__m256i shuffle_and_blend(__m256i dst, __m256i src)
{
// setr takes element in low to high order, like a C array init
// unlike the standard Intel notation where high element is first
const __m256i shuffle_control = _mm256_setr_epi8(
0, 1, -1, -1, 1, 2, ...);
// {(0,0), (1,1), (zero) (1,4), (2,5),...} in your src,dst notation
// Use -1 or 0x80 or anything with the high bit set
// for positions you want to leave unmodified in dst
// blendv uses the high bit as a blend control, so the same vector can do double duty
// maybe need some lane-crossing stuff depending on the pattern of your shuffle.
__m256i shuffled = _mm256_shuffle_epi8(src, shuffle_control);
// or if the pattern continues, and you're just leaving 2 bytes between every 2-byte group:
shuffled = _mm256_cvtepu16_epi32(src); // if src is a __m128i
__m256i blended = _mm256_blendv_epi8(shuffled, dst, shuffle_control);
// blend dst elements we want to keep into the shuffled src result.
return blended;
}
Note that the pshufb numbering restarts from 0 for the 2nd 16 bytes. The two halves of the __m256i can be different, but they can't read elements from the other half. If you need positions in the high lane to get bytes from the low lane, you'll need more shuffling + blending (e.g. including vinserti128 or vperm2i128, or maybe a vpermd lane-crossing dword shuffle) to get all the bytes you need into one 16-byte group in some order.
(Actually _mm256_shuffle_epi8 (PSHUFB) ignores bits 4..6 in a shuffle index, so writing 17 is the same as 1, but very misleading. It's effectively doing a %16, as long as the high bit isn't set. If the high bit is set in the shuffle-control vector, it zeros that element. We don't need that functionality here; _mm256_blendv_epi8 doesn't care about the old value of the element it's replacing)
Anyway, this simple 2-instruction example only works if the pattern doesn't continue. If you want help designing your real shuffles, you'll have to ask a more specific question.
And BTW, I notice that your blend pattern used 2 new bytes then 2 skipped 2. If that continues, you could use vpblendw _mm256_blend_epi16 instead of blendv, because that instruction runs in only 1 uop instead of 2 on Intel CPUs. It would also allow you to use AVX512BW vpermw, a 16-bit shuffle available in current Skylake-AVX512 CPUs, instead of the probably-even-slower AVX512VBMI vpermb.
Or actually, it would maybe let you use vpmovzxwd (_mm256_cvtepu16_epi32) to zero-extend 16-bit elements to 32-bit, as a lane-crossing shuffle. Then blend with dst.

Related

What is "MAX" referring to in the intel intrinsics documentation?

Within the intel intrinsics guide some operations are defined using a term "MAX". An example is __m256 _mm256_mask_permutexvar_ps (__m256 src, __mmask8 k, __m256i idx, __m256 a), which is defined as
FOR j := 0 to 7
i := j*32
id := idx[i+2:i]*32
IF k[j]
dst[i+31:i] := a[id+31:id]
ELSE
dst[i+31:i] := 0
FI
ENDFOR
dst[MAX:256] := 0
. Please take note of the last line within this definition: dst[MAX:256] := 0. What is MAX referring to and is this line even adding any valuable information? If I had to make assumptions, then MAX probably means the amount of bits within the vector, which is 256 in case of _mm256. This however does not seem to change anything for the definition of the operation and might as well have been omitted. But why is it there then?
This pseudo-code only makes sense for assembly documentation, where it was copied from, not for intrinsics. (HTML scrape of Intel's vol.2 PDF documenting the corresponding vpermps asm instruction.)
...
ENDFOR
DEST[MAXVL-1:VL] ← 0
(The same asm doc entry covers VL = 128, 256, and 512-bit versions, the vector width of the instruction.)
In asm, a YMM register is the low half of a ZMM register, and writing a YMM zeroes the upper bits out to the CPU's max supported vector width (just like writing EAX zero-extends into RAX).
The intrinsic you picked is for the masked version, so it requires AVX-512 (EVEX encoding), thus VLMAX is at least 5121. If the mask is a constant all-ones, it could get optimized to the AVX2 VEX encoding, but both still zero high bits of the full register out to VLMAX.
This is meaningless for intrinsics
The intrinsics API just has __m256 and __m512 types; an __m256 is not implicitly the low half of an __m512. You can use _mm512_castps256_ps512 to get a __m512 with your __m256 as the low half, but the API documentation says "the upper 256 bits of the result are undefined". So if you use it on a function arg, it doesn't force it to vmovaps ymm7, ymm0 or something to zero-extend into a ZMM register in case the caller left high garbage.
If you use _mm512_castps256_ps512 on a __m256 that came from an intrinsic in this function, it pretty much always will happen to compile with a zeroed high half whether it stayed in a reg or got stored/reloaded, but that's not guaranteed by the API. (If the compiler chose to combine a previous calculation with something else, using a 512-bit operation, you could plausibly end up with a non-zero high half.) If you want high zeros, there's no equivalent to _mm256_set_m128 (__m128 hi, __m128 lo), so you need some other explicit way.
Footnote 1: Or with some hypothetical future extension, VLMAX aka MAXVL could be even wider. It's determined by the current value of XCR0. This documentation is telling you these instructions will still zero out to whatever that is.
(I haven't looked into whether changing VLMAX is possible on a machine supporting AVX-512, or if it's read-only. IDK how the CPU would handle it if you can change it, like maybe not running 512-bit instructions at all. Mainstream OSes certainly don't do this even if it's possible with privileged operations.)
SSE didn't have any defined mechanism for extension to wider vectors, and some existing code (notably Windows kernel drivers) manually saved/restored a few XMM registers for their own use. To support that, AVX decided that legacy SSE would leave the high part of YMM/ZMM registers unmodified. But to run existing machine code using non-VEX legacy SSE encodings efficiently, it needed expensive state transitions (Haswell and Ice Lake) and/or false dependencies (Skylake): Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
Intel wasn't going to make this mistake again, so they defined AVX as zeroing out to whatever vector width the CPU supports, and document it clearly in every AVX and AVX-512 instruction encoding. Thus VEX and EVEX can be mixed freely, even being useful to save machine-code size:
What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?
What is the penalty of mixing EVEX and VEX encoded scheme? (none), with an answer discussing more details of why SSE/AVX penalties are a thing.
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/301853 Agner Fog's 2008 post on Intel's forums about AVX, when it was first announced, pointing out the problem created by the lack of foresight with SSE.
Does vzeroall zero registers ymm16 to ymm31? - interestingly no; since they're not accessible via legacy SSE instructions, they can't be part of a dirty-uppers problem.
Bits in the registers are numbered with high indices on the “left” and low indices on the “right”. This matches how we write and talk about binary numerals: 100102 is the binary numeral for 18, with bit number 4, representing 24 = 16, on the left and bit number 0, representing 20 = 1, on the right.
R[m:n] denotes the set of bits of register R from m down to n, with m being the “left” end of the set and n being the “right” end. If m is less than n, then it is the empty set. Therefore, for registers with 512 bits, dst[511:256] := 0 says to set bits 511 to 256 to zero, and, for registers with 256 bits, dst[255:256] := 0 says to do nothing.
dst[MAX:256] := 0 sets all bits above (and including) 256th bit to zero. It is only relevant to registers having more than 256 bits. So MAX can be 256 if the register is 256 bits long or 512 if the processor is using 512 bits registers.

Analog of _mm256_cmp_epi32_mask for AVX2

I have 8 32-bit integers packed into __m256i registers. Now I need to compare corresponding 32-bit values in two registers. Tried
__mmask8 m = _mm256_cmp_epi32_mask(r1, r2, _MM_CMPINT_EQ);
that flags the equal pairs. That would be great, but I got an "illegal instruction" exception, likely because my processor doesn't support AVX512.
Looking for an analogous intrinsic to quickly get indexes of the equal pairs.
Found a work-around (there is no _mm256_movemask_epi32); is the cast legal here?
__m256i diff = _mm256_cmpeq_epi32(m1, m2);
__m256 m256 = _mm256_castsi256_ps(diff);
int i = _mm256_movemask_ps(m256);
Yes, cast intrinsics are just a reinterpret of the bits in the YMM registers, it's 100% legal and yes the asm you want the compiler to emit is vpcmpeqd / vmovmaskps.
Or if you can deal with each bit being repeated 4 times, vpmovmskb also works, _mm256_movemask_epi8. e.g. if you just want to test for any matches (i != 0) or all-matches (i == 0xffffffff) you can avoid using a ps instruction on an integer result which might cost 1 extra cycle of bypass latency in the critical path.
But if that would cost you extra instructions to e.g. scale by 4 after using _mm_tzcnt_u32 to find the element index instead of byte index of the first 1, then use the _ps movemask. The extra instruction will definitely cost latency, and a slot in the pipeline for throughput.

SIMD: more generic shuffle function

I think the SIMD shuffle fucntion is not real shuffle for int32_t case the left and right part would be shuffled separately.
I want a real shuffle function as following:
Assumed we got __m256i and we want to shuffle 8 int32_t.
__m256i to_shuffle = _mm256_set_epi32(17, 18, 20, 21, 25, 26, 29, 31);
const int imm8 = 0b10101100;
__m256i shuffled _mm256_shuffle(to_shuffle, imm8);
I hope the shuffled = {17, 20, 25, 26, -, -, -, -}, where the - represents the not relevant value and they can be anything.
So I hope the int at the position with set bit with 1 would be placed in shuffled.
(In our case: 17, 20, 25, 26 are sitting at the positions with a 1 in the imm8).
Is such function offered by the Intel?
How could such function be implemented efficiently?
EDIT: - could be ignored. Only the int with set bit 1 is needed.
(I'm assuming you got your immediate backwards (selector for 17 should be the low bit, not high bit) and your vectors are actually written in low-element-first order).
How could such function be implemented efficiently?
In this case with AVX2 vpermd ( _mm256_permutevar8x32_epi32 ). It needs a control vector not an immediate, to hold 8 selectors for the 8 output elements. So you'd have to load a constant and use that as the control operand.
Since you only care about the low half of your output vector, your vector constant can be only __m128i, saving space. vmovdqa xmm, [mem] zero-extends into the corresponding YMM vector. It's probably inconvenient to write this in C with intrinsics but _mm256_castsi128_si256 should work. Or even _mm256_broadcastsi128_si256 because a broadcast-load would be just as cheap. Still, some compilers might pessimize it to an actual 32-byte constant in memory by doing constant-propagation. If you know assembly, compiler output is frequently disappointing.
If you want to take an actual integer bitmap in your source, you could probably use C++ templates to convert that at compile time into the right vector constant. Agner Fog's Vector Class Library (now Apache-licensed, previously GPL) has some related things like that, turning integer constants into a single blend or sequence of blend instructions depending on the constant and what target ISA is supported, using C++ templates. But its shuffle template takes a list of indices, not a bitmap.
But I think you're trying to ask about why / how x86 shuffles are designed the way they are.
Is such function offered by the Intel?
Yes, in hardware with AVX512F (plus AVX512VL to use it on 256-bit vectors).
You're looking for vpcompressd, the vector-element equivalent of BMI2 pext. (But it takes the control operand as a mask register value, not an immediate constant.) The intrinsic is
__m256i _mm256_maskz_compress_epi32( __mmask8 c, __m256i a);
It's also available in a version that merges into the bottom of an existing vector instead of zeroing the top elements.
As an immediate shuffle, no.
All x86 shuffles use a control operand that has indices into the source, not a bitmap of which elements to keep. (Except vpcompressd/q and vpexpandd/q). Or they use an implicit control, like _mm256_unpacklo_epi32 for example which interleaves 32-bit elements from 2 inputs (in-lane in the low and high halves).
If you're going to provide a shuffle with a control operand at all, it's usually most useful if any element can end up at any position. So the output doesn't have to be in the same order as the input. Your compress shuffle doesn't have that property.
Also, having a source index for each output element is what shuffle hardware naturally wants. My understanding is that each output element is fed by its own MUX (multiplexer), where the MUX takes N input elements and one binary selector to select which one to output. (And is as wide as the element width of course.) See Where is VPERMB in AVX2? for more discussion of building muxers.
Having the control operand in some format other than a list of selectors would require preprocessing before it could be fed to shuffle hardware.
For an immediate, the format is either 2x1-bit or 4x2-bit fields, or a byte-shift count for _mm_bslli_si128 and _mm_alignr_epi8. Or index + zeroing bitmask for insertps. There are no SIMD instructions with an immediate wider than 8 bits. Presumably this keeps the hardware decoders simple.
(Or 1x1-bit for vextractf128 xmm, ymm, 0 or 1, which in hindsight would be better with no immediate at all. Using it with 0 is always worse than vmovdqa xmm, xmm. Although AVX512 does use the same opcode for vextractf32x4 with an EVEX prefix for the 1x2-bit immediate, so maybe this had some benefit for decoder complexity. Anyway, there are no immediate shuffles with selector fields wider than 2 bits because 8x 3-bit would be 24 bits.)
For wider 4x2 in-lane shuffles like _mm256_shuffle_ps (vshufps ymm, ymm, ymm, imm8), the same 4x2-bit selector pattern is reused for both lanes. For wider 2x1 in-lane shuffles like _mm256_shuffle_pd (vshufpd ymm, ymm, ymm, imm8), we get 4x 1-bit immediate fields that still select in-lane.
There are lane-crossing shuffles with 4x 2-bit selectors, vpermq and vpermpd. Those work exactly like pshufd xmm (_mm_shuffle_epi32) but with 4x qword elements across a 256-bit register instead of 4x dword elements across a 128-bit register.
As far as narrowing / only caring about part of the output:
A normal immediate would need 4x 3-bit selectors to each index one of the 8x 32-bit source elements. But much more likely 8x 3-bit selectors = 24 bits, because why design a shuffle instruction that can only ever write half a half-width output? (Other than vextractf128 xmm, ymm, 1).
General the paradigm for more-granular shuffles is to take a control vector, rather than some funky immediate encoding.
AVX512 did add some narrowing shuffles like VPMOVDB xmm/[mem], x/y/zmm that truncate (or signed/unsigned saturate) 32-bit elements down to 8-bit. (And all other combinations of sizes are available).
They're interesting because they're available with a memory destination. Perhaps this is motivated by some CPUs (like Xeon Phi KNL / KNM) not having AVX512VL, so they can only use AVX512 instructions with ZMM vectors. Still, they have AVX1 and 2 so you could compress into an xmm reg and use a normal VEX-encoded store. But it does allow doing a narrow byte-masked store with AVX512F, which would only be possible with AVX512BW if you had the packed data in an XMM register.
There are some 2-input shuffles like shufps that treat the low and high half of the output separately, e.g. the low half of the output can select from elements of the first source, the high half of the output can select from elements of the second source register.

How to create a 8 bit mask from lsb of __m64 value?

I have a use case, where I have array of bits each bit is represented as 8 bit integer for example uint8_t data[] = {0,1,0,1,0,1,0,1}; I want to create a single integer by extracting only lsb of each value. I know that using int _mm_movemask_pi8 (__m64 a) function I can create a mask but this intrinsic only takes a msb of a byte not lsb. Is there a similar intrinsic or efficient method to extract lsb to create single 8 bit integer?
There is no direct way to do it, but obviously you can simply shift the lsb into the msb and then extract it:
_mm_movemask_pi8(_mm_slli_si64(x, 7))
Using MMX these days is strange and should probably be avoided.
Here is an SSE2 version, still reading only 8 bytes:
int lsb_mask8(uint8_t* bits) {
__m128i x = _mm_loadl_epi64((__m128i*)bits);
return _mm_movemask_epi8(_mm_slli_epi64(x, 7));
}
Using SSE2 instead of MMX avoids the needs for EMMS
If you have efficient BMI2 pext (e.g. Haswell and newer, same as AVX2), then use the inverse of #wim's answer on your question about going the other direction (How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD).
unsigned extract8LSB(uint8_t *arr) {
uint64_t bytes;
memcpy(&bytes, arr, 8);
unsigned LSBs = _pext_u64(bytes ,0x0101010101010101);
return LSBs;
}
This compiles like you'd expect to a qword load + a pext instruction. Compilers will hoist the 0x01... constant setup out of a loop after inlining.
pext / pdep are efficient on Intel CPUs that support them (3 cycle latency / 1c throughput, 1 uop, same as a multiply). But they're not efficient on AMD, like 18c latency and throughput. (https://agner.org/optimize/). If you care about AMD, you should definitely use #harold's pmovmskb answer.
Or if you have multiple contiguous blocks of 8 bytes, do them with a single wide vector, and get a 32-bit bitmap. You can split that up if needed, or unroll the loop using by 4, to right-shift the bitmap to get all 4 single-byte results.
If you're just storing this to memory right away, then you should probably have done this extraction in the loop that wrote the source data, instead of a separate loop, so it would still be hot in cache. AVX2 _mm256_movemask_epi8 is a single uop (on Intel CPUs) with low latency, so if your data isn't hot in L1d cache then a loop that just does this would not be keeping its execution units busy while waiting for memory.

Efficient (on Ryzen) way to extract the odd elements of a __m256 into a __m128?

Is there an intrinsic or another efficient way for repacking high/low 32-bit components of 64-bit components of AVX register into an SSE register? A solution using AVX2 is ok.
So far I'm using the following code, but profiler says it's slow on Ryzen 1800X:
// Global constant
const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
// ...
// function code
__m256i x = /* computed here */;
const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(x),
gHigh32Permute); // This seems to take 3 cycles
On Intel, your code would be optimal. One 1-uop instruction is the best you will get. (Except you might want to use vpermps to avoid any risk for int / FP bypass delay, if your input vector was created by a pd instruction rather than a load or something. Using the result of an FP shuffle as an input to integer instructions is usually fine on Intel, but I'm less sure about feeding the result of an FP instruction to an integer shuffle.)
Although if tuning for Intel, you might try changing the surrounding code so you can shuffle into the bottom 64-bits of each 128b lane, to avoid using a lane-crossing shuffle. (Then you could just use vshufps ymm, or if tuning for KNL, vpermilps since 2-input vshufps is slower.)
With AVX512, there's _mm256_cvtepi64_epi32 (vpmovqd) which packs elements across lanes, with truncation.
On Ryzen, lane-crossing shuffles are slow. Agner Fog doesn't have numbers for vpermd, but he lists vpermps (which probably uses the same hardware internally) at 3 uops, 5c latency, one per 4c throughput.
vextractf128 xmm, ymm, 1 is very efficient on Ryzen (1c latency, 0.33c throughput), not surprising since it tracks 256b registers as two 128b halves already. shufps is also efficient (1c latency, 0.5c throughput), and will let you shuffle the two 128b registers into the result you want.
This also saves you 2 registers for the 2 vpermps shuffle masks you don't need anymore.
So I'd suggest:
__m256d x = /* computed here */;
// Tuned for Ryzen. Sub-optimal on Intel
__m128 hi = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));
__m128 lo = _mm_castpd_ps(_mm256_castpd256_pd128(x));
__m128 odd = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3,1,3,1));
__m128 even = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2,0,2,0));
On Intel, using 3 shuffles instead of 2 gives you 2/3rds of the optimal throughput, with 1c extra latency for the first result.