How to create an 8-bit mask from the lsb of each byte of an __m64 value? - c++

I have a use case where I have an array of bits, with each bit stored as an 8-bit integer, for example uint8_t data[] = {0,1,0,1,0,1,0,1}; I want to create a single integer by extracting only the lsb of each value. I know that I can create a mask with the int _mm_movemask_pi8 (__m64 a) function, but this intrinsic only takes the msb of each byte, not the lsb. Is there a similar intrinsic or an efficient method to extract the lsbs into a single 8-bit integer?

There is no direct way to do it, but obviously you can simply shift the lsb into the msb and then extract it:
_mm_movemask_pi8(_mm_slli_si64(x, 7))
Using MMX these days is strange and should probably be avoided.
Here is an SSE2 version, still reading only 8 bytes:
int lsb_mask8(uint8_t* bits) {
__m128i x = _mm_loadl_epi64((__m128i*)bits);
return _mm_movemask_epi8(_mm_slli_epi64(x, 7));
}
Using SSE2 instead of MMX avoids the need for EMMS.
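To sanity-check the shift-then-movemask idea, here is a minimal sketch that pairs the SSE2 version with a plain scalar reference. The names lsb_mask8_ref and example_bits, and the use of the __SSE2__ macro (defined by GCC and Clang) to pick a code path, are assumptions made for illustration and are not part of the answer above.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Scalar reference: bit i of the result is the LSB of bits[i].
inline unsigned lsb_mask8_ref(const uint8_t* bits) {
    unsigned mask = 0;
    for (int i = 0; i < 8; ++i)
        mask |= (bits[i] & 1u) << i;
    return mask;
}

// SSE2 version from the answer; falls back to the scalar reference
// when the compiler does not advertise SSE2 (assumption: __SSE2__ macro).
inline unsigned lsb_mask8(const uint8_t* bits) {
#if defined(__SSE2__)
    // Load 8 bytes into the low half of an XMM register (upper half is zeroed).
    __m128i x = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(bits));
    // Shift each byte's bit 0 up to bit 7, then collect the top bit of every byte.
    return static_cast<unsigned>(_mm_movemask_epi8(_mm_slli_epi64(x, 7)));
#else
    return lsb_mask8_ref(bits);
#endif
}

// The example input from the question.
static const uint8_t example_bits[8] = {0, 1, 0, 1, 0, 1, 0, 1};
```

With the question's input, element i maps to bit i of the result, so the bytes at indices 1, 3, 5 and 7 produce 0xAA.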

If you have efficient BMI2 pext (e.g. Haswell and newer, same as AVX2), then use the inverse of @wim's answer on your question about going the other direction (How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD).
unsigned extract8LSB(uint8_t *arr) {
uint64_t bytes;
memcpy(&bytes, arr, 8);
unsigned LSBs = _pext_u64(bytes ,0x0101010101010101);
return LSBs;
}
This compiles like you'd expect to a qword load + a pext instruction. Compilers will hoist the 0x01... constant setup out of a loop after inlining.
pext / pdep are efficient on Intel CPUs that support them (3 cycle latency / 1c throughput, 1 uop, same as a multiply). But they're not efficient on AMD, with something like 18c latency and throughput. (https://agner.org/optimize/). If you care about AMD, you should definitely use @harold's pmovmskb answer.
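For readers who want to see what pext computes without needing BMI2 hardware, here is a rough portable model. The helper names pext_soft and extract8LSB_soft are made up for this sketch, and the bit-at-a-time loop is only meant to show the semantics; it is not a suggestion for fast code.

```cpp
#include <cstdint>

// Portable model of BMI2 pext: gather the bits of `src` selected by `mask`
// into the low bits of the result, preserving their order.
inline uint64_t pext_soft(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t out_bit = 1; mask != 0; out_bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  // isolate the lowest set bit of mask
        if (src & lowest)
            result |= out_bit;
        mask &= mask - 1;                      // clear that bit and move on
    }
    return result;
}

// Same job as extract8LSB, but taking the 8 bytes already loaded as a qword.
inline unsigned extract8LSB_soft(uint64_t bytes) {
    return static_cast<unsigned>(pext_soft(bytes, 0x0101010101010101ULL));
}
```

On a little-endian machine like x86, the question's array {0,1,0,1,0,1,0,1} loads as 0x0100010001000100, and extracting bit 0 of every byte gives 0xAA.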
Or if you have multiple contiguous blocks of 8 bytes, do them with a single wide vector, and get a 32-bit bitmap. You can split that up if needed, or unroll the loop by 4, right-shifting the bitmap to get all 4 single-byte results.
If you're just storing this to memory right away, then you should probably have done this extraction in the loop that wrote the source data, instead of a separate loop, so it would still be hot in cache. AVX2 _mm256_movemask_epi8 is a single uop (on Intel CPUs) with low latency, so if your data isn't hot in L1d cache then a loop that just does this would not be keeping its execution units busy while waiting for memory.

Related

What is "MAX" referring to in the intel intrinsics documentation?

Within the intel intrinsics guide some operations are defined using a term "MAX". An example is __m256 _mm256_mask_permutexvar_ps (__m256 src, __mmask8 k, __m256i idx, __m256 a), which is defined as
FOR j := 0 to 7
i := j*32
id := idx[i+2:i]*32
IF k[j]
dst[i+31:i] := a[id+31:id]
ELSE
dst[i+31:i] := 0
FI
ENDFOR
dst[MAX:256] := 0
Please take note of the last line within this definition: dst[MAX:256] := 0. What is MAX referring to and is this line even adding any valuable information? If I had to make assumptions, then MAX probably means the amount of bits within the vector, which is 256 in case of _mm256. This however does not seem to change anything for the definition of the operation and might as well have been omitted. But why is it there then?
This pseudo-code only makes sense for assembly documentation, where it was copied from, not for intrinsics. (HTML scrape of Intel's vol.2 PDF documenting the corresponding vpermps asm instruction.)
...
ENDFOR
DEST[MAXVL-1:VL] ← 0
(The same asm doc entry covers VL = 128, 256, and 512-bit versions, the vector width of the instruction.)
In asm, a YMM register is the low half of a ZMM register, and writing a YMM zeroes the upper bits out to the CPU's max supported vector width (just like writing EAX zero-extends into RAX).
The intrinsic you picked is for the masked version, so it requires AVX-512 (EVEX encoding), thus VLMAX is at least 512 (see footnote 1). If the mask is a constant all-ones, it could get optimized to the AVX2 VEX encoding, but both still zero high bits of the full register out to VLMAX.
This is meaningless for intrinsics
The intrinsics API just has __m256 and __m512 types; an __m256 is not implicitly the low half of an __m512. You can use _mm512_castps256_ps512 to get a __m512 with your __m256 as the low half, but the API documentation says "the upper 256 bits of the result are undefined". So if you use it on a function arg, it doesn't force it to vmovaps ymm7, ymm0 or something to zero-extend into a ZMM register in case the caller left high garbage.
If you use _mm512_castps256_ps512 on a __m256 that came from an intrinsic in this function, it pretty much always will happen to compile with a zeroed high half whether it stayed in a reg or got stored/reloaded, but that's not guaranteed by the API. (If the compiler chose to combine a previous calculation with something else, using a 512-bit operation, you could plausibly end up with a non-zero high half.) If you want high zeros, there's no equivalent to _mm256_set_m128 (__m128 hi, __m128 lo), so you need some other explicit way.
Footnote 1: Or with some hypothetical future extension, VLMAX aka MAXVL could be even wider. It's determined by the current value of XCR0. This documentation is telling you these instructions will still zero out to whatever that is.
(I haven't looked into whether changing VLMAX is possible on a machine supporting AVX-512, or if it's read-only. IDK how the CPU would handle it if you can change it, like maybe not running 512-bit instructions at all. Mainstream OSes certainly don't do this even if it's possible with privileged operations.)
SSE didn't have any defined mechanism for extension to wider vectors, and some existing code (notably Windows kernel drivers) manually saved/restored a few XMM registers for their own use. To support that, AVX decided that legacy SSE would leave the high part of YMM/ZMM registers unmodified. But to run existing machine code using non-VEX legacy SSE encodings efficiently, it needed expensive state transitions (Haswell and Ice Lake) and/or false dependencies (Skylake): Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
Intel wasn't going to make this mistake again, so they defined AVX as zeroing out to whatever vector width the CPU supports, and document it clearly in every AVX and AVX-512 instruction encoding. Thus VEX and EVEX can be mixed freely, even being useful to save machine-code size:
What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?
What is the penalty of mixing EVEX and VEX encoded scheme? (none), with an answer discussing more details of why SSE/AVX penalties are a thing.
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/301853 Agner Fog's 2008 post on Intel's forums about AVX, when it was first announced, pointing out the problem created by the lack of foresight with SSE.
Does vzeroall zero registers ymm16 to ymm31? - interestingly no; since they're not accessible via legacy SSE instructions, they can't be part of a dirty-uppers problem.
Bits in the registers are numbered with high indices on the “left” and low indices on the “right”. This matches how we write and talk about binary numerals: 10010₂ is the binary numeral for 18, with bit number 4, representing 2⁴ = 16, on the left and bit number 0, representing 2⁰ = 1, on the right.
R[m:n] denotes the set of bits of register R from m down to n, with m being the “left” end of the set and n being the “right” end. If m is less than n, then it is the empty set. Therefore, for registers with 512 bits, dst[511:256] := 0 says to set bits 511 to 256 to zero, and, for registers with 256 bits, dst[255:256] := 0 says to do nothing.
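As a toy illustration of that notation, the sketch below models R[m:n] := 0 on a 64-bit value, with scaled-down widths standing in for the 256-bit and 512-bit registers. The function name clear_range and the 64-bit "register" are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Model of the pseudocode assignment R[m:n] := 0 on a 64-bit "register".
// Bits are numbered from 0 (least significant); m < n denotes the empty range.
inline uint64_t clear_range(uint64_t r, int m, int n) {
    assert(m >= 0 && m <= 63 && n >= 0 && n <= 64);  // n == m + 1 is allowed (empty)
    if (m < n)
        return r;                                    // empty set: nothing to do
    int width = m - n + 1;
    uint64_t ones = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    return r & ~(ones << n);
}
```

Treating a 16-bit value as the "wide" register and 8 bits as the "narrow" one, clear_range(0xFFFF, 15, 8) zeroes the upper half, while clear_range(0xFF, 7, 8) is the do-nothing case that corresponds to dst[255:256] := 0.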
dst[MAX:256] := 0 sets all bits above (and including) 256th bit to zero. It is only relevant to registers having more than 256 bits. So MAX can be 256 if the register is 256 bits long or 512 if the processor is using 512 bits registers.

Analog of _mm256_cmp_epi32_mask for AVX2

I have 8 32-bit integers packed into __m256i registers. Now I need to compare corresponding 32-bit values in two registers. Tried
__mmask8 m = _mm256_cmp_epi32_mask(r1, r2, _MM_CMPINT_EQ);
that flags the equal pairs. That would be great, but I got an "illegal instruction" exception, likely because my processor doesn't support AVX512.
Looking for an analogous intrinsic to quickly get indexes of the equal pairs.
Found a work-around (there is no _mm256_movemask_epi32); is the cast legal here?
__m256i diff = _mm256_cmpeq_epi32(m1, m2);
__m256 m256 = _mm256_castsi256_ps(diff);
int i = _mm256_movemask_ps(m256);
Yes, cast intrinsics are just a reinterpret of the bits in the YMM registers, it's 100% legal and yes the asm you want the compiler to emit is vpcmpeqd / vmovmaskps.
Or if you can deal with each bit being repeated 4 times, vpmovmskb also works, _mm256_movemask_epi8. e.g. if you just want to test for any matches (i != 0) or all-matches (i == 0xffffffff) you can avoid using a ps instruction on an integer result which might cost 1 extra cycle of bypass latency in the critical path.
But if that would cost you extra instructions to e.g. scale by 4 after using _mm_tzcnt_u32 to find the element index instead of byte index of the first 1, then use the _ps movemask. The extra instruction will definitely cost latency, and a slot in the pipeline for throughput.
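To make the difference between the two movemask flavours concrete, here is a scalar model of both. The type alias Vec8i and the function names are invented for this sketch; __builtin_ctz is a GCC/Clang builtin, so that part assumes one of those compilers.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec8i = std::array<int32_t, 8>;

// Model of vpcmpeqd + vmovmskps: bit i is set when a[i] == b[i].
inline unsigned eq_mask_ps(const Vec8i& a, const Vec8i& b) {
    unsigned mask = 0;
    for (int i = 0; i < 8; ++i)
        if (a[i] == b[i])
            mask |= 1u << i;
    return mask;
}

// Model of vpcmpeqd + vpmovmskb: every matching element contributes
// 4 identical bits, one per byte of the 32-bit element.
inline uint32_t eq_mask_epi8(const Vec8i& a, const Vec8i& b) {
    uint32_t mask = 0;
    for (int i = 0; i < 8; ++i)
        if (a[i] == b[i])
            mask |= 0xFu << (4 * i);
    return mask;
}

// With the byte mask, the bit-scan result is a byte index and needs the
// extra divide-by-4 to become an element index.
inline int first_match_from_byte_mask(uint32_t byte_mask) {
    assert(byte_mask != 0);  // __builtin_ctz is undefined for 0
    return __builtin_ctz(byte_mask) / 4;
}
```

The any-match and all-match tests mentioned above work on either mask without adjustment; only the index lookup pays for the wider mask.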

SIMD: more generic shuffle function

I think the SIMD shuffle function is not a real shuffle for the int32_t case: the left and right halves are shuffled separately.
I want a real shuffle function as following:
Assumed we got __m256i and we want to shuffle 8 int32_t.
__m256i to_shuffle = _mm256_set_epi32(17, 18, 20, 21, 25, 26, 29, 31);
const int imm8 = 0b10101100;
__m256i shuffled = _mm256_shuffle(to_shuffle, imm8);
I hope the shuffled = {17, 20, 25, 26, -, -, -, -}, where the - represents the not relevant value and they can be anything.
So I hope the int at the position with set bit with 1 would be placed in shuffled.
(In our case: 17, 20, 25, 26 are sitting at the positions with a 1 in the imm8).
Is such function offered by the Intel?
How could such function be implemented efficiently?
EDIT: - could be ignored. Only the int with set bit 1 is needed.
(I'm assuming you got your immediate backwards (selector for 17 should be the low bit, not high bit) and your vectors are actually written in low-element-first order).
How could such function be implemented efficiently?
In this case with AVX2 vpermd ( _mm256_permutevar8x32_epi32 ). It needs a control vector not an immediate, to hold 8 selectors for the 8 output elements. So you'd have to load a constant and use that as the control operand.
Since you only care about the low half of your output vector, your vector constant can be only __m128i, saving space. vmovdqa xmm, [mem] zero-extends into the corresponding YMM vector. It's probably inconvenient to write this in C with intrinsics but _mm256_castsi128_si256 should work. Or even _mm256_broadcastsi128_si256 because a broadcast-load would be just as cheap. Still, some compilers might pessimize it to an actual 32-byte constant in memory by doing constant-propagation. If you know assembly, compiler output is frequently disappointing.
If you want to take an actual integer bitmap in your source, you could probably use C++ templates to convert that at compile time into the right vector constant. Agner Fog's Vector Class Library (now Apache-licensed, previously GPL) has some related things like that, turning integer constants into a single blend or sequence of blend instructions depending on the constant and what target ISA is supported, using C++ templates. But its shuffle template takes a list of indices, not a bitmap.
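A rough sketch of that compile-time conversion, using a constexpr function rather than templates, might look like the following. Everything here is hypothetical illustration: the Vec8i struct, keep_bitmap_to_indices, and the scalar permutevar8x32_model stand in for a real __m256i constant and vpermd, and the bitmap uses bit i for element i (low element first). It needs C++14 or later for the loops in constexpr functions.

```cpp
#include <cstdint>

struct Vec8i { int32_t v[8]; };

// Hypothetical helper: turn a "keep" bitmap (bit i = keep element i) into a
// vpermd-style index vector. Unused trailing slots are left as index 0.
constexpr Vec8i keep_bitmap_to_indices(unsigned bitmap) {
    Vec8i idx{};
    int out = 0;
    for (int i = 0; i < 8; ++i)
        if (bitmap & (1u << i))
            idx.v[out++] = i;
    return idx;
}

// Scalar model of vpermd (_mm256_permutevar8x32_epi32):
// each output element picks any source element by its low 3 index bits.
constexpr Vec8i permutevar8x32_model(const Vec8i& src, const Vec8i& idx) {
    Vec8i dst{};
    for (int i = 0; i < 8; ++i)
        dst.v[i] = src.v[idx.v[i] & 7];
    return dst;
}

// The index vector is available as a compile-time constant.
static_assert(keep_bitmap_to_indices(0x35u).v[3] == 5, "bitmap 0x35 keeps 0,2,4,5");
```

For the question's data in low-element-first order, the bitmap 0x35 selects elements 0, 2, 4 and 5, so the low four outputs are 17, 20, 25, 26.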
But I think you're trying to ask about why / how x86 shuffles are designed the way they are.
Is such function offered by the Intel?
Yes, in hardware with AVX512F (plus AVX512VL to use it on 256-bit vectors).
You're looking for vpcompressd, the vector-element equivalent of BMI2 pext. (But it takes the control operand as a mask register value, not an immediate constant.) The intrinsic is
__m256i _mm256_maskz_compress_epi32( __mmask8 c, __m256i a);
It's also available in a version that merges into the bottom of an existing vector instead of zeroing the top elements.
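For anyone without AVX-512 hardware to experiment on, here is a scalar model of what the zero-masking compress does. The alias Vec8i and the name maskz_compress_model are assumptions for this sketch; only the behaviour (selected elements packed to the bottom in order, the rest zeroed) is taken from the instruction's description.

```cpp
#include <array>
#include <cstdint>

using Vec8i = std::array<int32_t, 8>;

// Scalar model of _mm256_maskz_compress_epi32(k, a):
// elements whose mask bit is set are packed contiguously into the low
// end of the result, in their original order; the remainder is zero.
inline Vec8i maskz_compress_model(unsigned k, const Vec8i& a) {
    Vec8i dst{};  // zero-initialized, so unused upper elements stay 0
    int out = 0;
    for (int i = 0; i < 8; ++i)
        if (k & (1u << i))
            dst[out++] = a[i];
    return dst;
}
```

With the question's vector written low element first and the mask 0x35 (bits 0, 2, 4, 5), the result starts with 17, 20, 25, 26.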
As an immediate shuffle, no.
All x86 shuffles use a control operand that has indices into the source, not a bitmap of which elements to keep. (Except vpcompressd/q and vpexpandd/q). Or they use an implicit control, like _mm256_unpacklo_epi32 for example which interleaves 32-bit elements from 2 inputs (in-lane in the low and high halves).
If you're going to provide a shuffle with a control operand at all, it's usually most useful if any element can end up at any position. So the output doesn't have to be in the same order as the input. Your compress shuffle doesn't have that property.
Also, having a source index for each output element is what shuffle hardware naturally wants. My understanding is that each output element is fed by its own MUX (multiplexer), where the MUX takes N input elements and one binary selector to select which one to output. (And is as wide as the element width of course.) See Where is VPERMB in AVX2? for more discussion of building muxers.
Having the control operand in some format other than a list of selectors would require preprocessing before it could be fed to shuffle hardware.
For an immediate, the format is either 2x1-bit or 4x2-bit fields, or a byte-shift count for _mm_bslli_si128 and _mm_alignr_epi8. Or index + zeroing bitmask for insertps. There are no SIMD instructions with an immediate wider than 8 bits. Presumably this keeps the hardware decoders simple.
(Or 1x1-bit for vextractf128 xmm, ymm, 0 or 1, which in hindsight would be better with no immediate at all. Using it with 0 is always worse than vmovdqa xmm, xmm. Although AVX512 does use the same opcode for vextractf32x4 with an EVEX prefix for the 1x2-bit immediate, so maybe this had some benefit for decoder complexity. Anyway, there are no immediate shuffles with selector fields wider than 2 bits because 8x 3-bit would be 24 bits.)
For wider 4x2 in-lane shuffles like _mm256_shuffle_ps (vshufps ymm, ymm, ymm, imm8), the same 4x2-bit selector pattern is reused for both lanes. For wider 2x1 in-lane shuffles like _mm256_shuffle_pd (vshufpd ymm, ymm, ymm, imm8), we get 4x 1-bit immediate fields that still select in-lane.
There are lane-crossing shuffles with 4x 2-bit selectors, vpermq and vpermpd. Those work exactly like pshufd xmm (_mm_shuffle_epi32) but with 4x qword elements across a 256-bit register instead of 4x dword elements across a 128-bit register.
As far as narrowing / only caring about part of the output:
A normal immediate would need 4x 3-bit selectors to each index one of the 8x 32-bit source elements. But much more likely 8x 3-bit selectors = 24 bits, because why design a shuffle instruction that can only ever write half a half-width output? (Other than vextractf128 xmm, ymm, 1).
Generally, the paradigm for more-granular shuffles is to take a control vector, rather than some funky immediate encoding.
AVX512 did add some narrowing shuffles like VPMOVDB xmm/[mem], x/y/zmm that truncate (or signed/unsigned saturate) 32-bit elements down to 8-bit. (And all other combinations of sizes are available).
They're interesting because they're available with a memory destination. Perhaps this is motivated by some CPUs (like Xeon Phi KNL / KNM) not having AVX512VL, so they can only use AVX512 instructions with ZMM vectors. Still, they have AVX1 and 2 so you could compress into an xmm reg and use a normal VEX-encoded store. But it does allow doing a narrow byte-masked store with AVX512F, which would only be possible with AVX512BW if you had the packed data in an XMM register.
There are some 2-input shuffles like shufps that treat the low and high half of the output separately, e.g. the low half of the output can select from elements of the first source, the high half of the output can select from elements of the second source register.

fastest way to convert two-bit number to low-memory representation

I have a 56-bit number with potentially two set bits, e.g., 00000000 00000000 00000000 00000000 00000000 00000000 00000011. In other words, two bits are distributed among 56 bits, so that we have bin(56,2)=1540 possible combinations.
I am now looking for a loss-free mapping of such a 56-bit number to an 11-bit number, which can represent 2048 values and therefore also 1540. Knowing the structure, this 11-bit number is enough to store the value of my low-density (of ones) 56-bit number.
I want to maximize performance (this function should run millions or even billions of times per second if possible). So far, I only came up with some loop:
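unsigned long long inputNumber = 24; // 11000
unsigned long long bitMask = 1;
int bit1 = -1, bit2 = -1; // -1 means "not found yet" (0 is a valid bit index)
for(int n = 0; n < 56; ++n, bitMask *= 2)
{
    if((inputNumber & bitMask) != 0)
    {
        if(bit1 < 0)
            bit1 = n; // first (lowest) set bit
        else
        {
            bit2 = n; // second set bit: done
            break;
        }
    }
}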
and using these two bits, I can easily generate some 1540 max number.
But is there no faster version than using such a loop?
Most ISAs have hardware support for a bit-scan instruction that finds the position of a set bit. Use that instead of a naive loop or bithack for any architecture where you care about this running fast. https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious has some tricks that are better than nothing, but those are all still much worse than a single efficient asm instruction.
But ISO C++ doesn't portably expose clz/ctz operations; they're only available via intrinsics / builtins for various implementations. (And the x86 intrinsics have quirks for all-zero input, corresponding to the asm instruction behaviour).
For some ISAs, it's a count-leading-zeros giving you 31 - highbit_index. For others, it's a CTZ count trailing zeros operation, giving you the index of the low bit. x86 has both. (And its high-bit finder actually directly finds the high-bit index, not a leading-zero count, unless you use BMI1 lzcnt instead of traditional bsr) https://en.wikipedia.org/wiki/Find_first_set has a table of what different ISAs have.
GCC portably provides __builtin_clz and __builtin_ctz; on ISAs without hardware support, they compile to a call to a helper function. See What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C? and Implementation of __builtin_clz
(For 64-bit integers, you want the long long versions, like __builtin_ctzll; see the GCC manual.)
If we only have a CLZ, use high=63-CLZ(n) and low= 63-CLZ((-n) & n) to isolate the low bit. Note that x86's bsr instruction actually produces 63-CLZ(), i.e. the bit-index instead of the leading-zero count. So 63-__builtin_clzll(n) can compile to a single instruction on x86; IIRC gcc does notice this. Or 2 instructions if GCC uses an extra xor-zeroing to avoid the inconvenient false dependency.
If we only have CTZ, do low = CTZ(n) and high = CTZ(n & (n - 1)) to clear the lowest set bit. (Leaving the high bit, assuming the number has exactly 2 set bits).
If we have both, low = CTZ(n) and high = 63-CLZ(n). I'm not sure what GCC does on non-x86 ISAs where they aren't both available natively. The GCC builtins are always available even when targeting HW that doesn't have it. But the internal implementation can't use the above tricks because it doesn't know there are always exactly 2 bits set.
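The three cases above can be sketched with the GCC/Clang builtins as follows. The struct BitPair and the function names are invented for this example; the formulas are the ones given in the text, and the asserts spell out the assumptions (non-zero input, and two set bits for the CTZ-only variant).

```cpp
#include <cassert>
#include <cstdint>

struct BitPair { int low; int high; };

// Both scans available: low = CTZ(n), high = 63 - CLZ(n).
inline BitPair find_bits_both(uint64_t n) {
    assert(n != 0);  // the builtins are undefined for 0
    return { __builtin_ctzll(n), 63 - __builtin_clzll(n) };
}

// CTZ only: n & (n - 1) clears the lowest set bit, leaving the high one.
// Assumes exactly two set bits.
inline BitPair find_bits_ctz_only(uint64_t n) {
    assert(n != 0 && (n & (n - 1)) != 0);
    return { __builtin_ctzll(n), __builtin_ctzll(n & (n - 1)) };
}

// CLZ only: (-n) & n isolates the lowest set bit.
inline BitPair find_bits_clz_only(uint64_t n) {
    assert(n != 0);
    uint64_t lowest = n & (~n + 1);  // same value as (-n) & n
    return { 63 - __builtin_clzll(lowest), 63 - __builtin_clzll(n) };
}
```

For the question's example value 24 (binary 11000), all three variants report low = 3 and high = 4.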
(I wrote out the full formulas; an earlier version of this answer had CLZ and CTZ reversed in this part. I find that happens to me easily, especially when I also have to keep track of x86's bsr and bsf (bit-scan reverse and forward) and remember that those are leading and trailing, respectively.)
So if you just use both CTZ and CLZ, you might end up with slow emulation for one of them. Or fast emulation on ARM with rbit to bit-reverse for clz, which is 100% fine.
AVX512CD has SIMD VPLZCNTQ for 64-bit integers, so you could encode 2, 4, or 8x 64-bit integers in parallel with that on recent Intel CPUs. For SSSE3 or AVX2, you can build a SIMD lzcnt by using pshufb _mm_shuffle_epi8 byte-shuffle as a 4-bit LUT and combining with _mm_max_epu8. There was a recent Q&A about this but I can't find it. (It might have been for 16-bit integers only; wider requires more work.)
With this, a Skylake-X or Cascade Lake CPU could maybe compress 8x 64-bit integers per 2 or 3 clock cycles once you factor in the throughput cost of packing the results. SIMD is certainly useful for packing 12-bit or 11-bit results into a contiguous bitstream, e.g. with variable-shift instructions, if that's what you want to do with the results. At ~3 or 4GHz clock speed, that could maybe get you over 10 billion per second with a single thread. But only if the inputs come from contiguous memory. Depending what you want to do with the results, it might cost a few more cycles to do more than just pack them down to 16-bit integers. e.g. to pack into a bitstream. But SIMD should be good for that with variable-shift instructions that can line up the 11 or 12 bits from each register into the right position to OR together after shuffling.
There's a tradeoff between coding efficiency and encode performance. Using 12 bits for two 6-bit indices (of bit positions) is very simple both to compress and decompress, at least on hardware that has bit-scan instructions.
Or instead of bit-indices, one or both could be leading zero counts, so decoding would be (1ULL << 63) >> a. 1ULL << 63 is a fixed constant that you can actually right-shift, or the compiler could turn it into a left-shift of 1ULL << (63-a), which can use ~a as the shift count in assembly (63-a and ~a agree in their low 6 bits) for ISAs like x86 where shift instructions mask the shift count (look only at the low 6 bits).
Also, 2x 12 bits is a whole number of bytes, but 11 bits only gives you a whole number of bytes every 8 outputs, if you're packing them. So indexing a bit-packed array is simpler.
0 is still a special case: maybe handle that by using all-ones bit-indices (i.e. index = bit 63, which is outside the low 56 bits). On decode/decompress, you set the 2 bit positions (1ULL<<a) | (1ULL<<b) and then & mask to clear high bits. Or bias your bit indices and have decode right shift by 1.
If we didn't have to handle zero then a modern x86 CPU could do 1 or 2 billion encodes per second if it didn't have to do anything else. e.g. Skylake has 1 per clock throughput for bit-scan instructions and should be able to encode at 1 number per 2 clocks just bottlenecked on that. (Or maybe better with SIMD). With just 4 scalar instructions, we can get the low and high indices (64-bit tzcnt + bsr), shift by 6 bits, and OR together (see footnote 1). Or on AMD, avoid bsr / bsf and manually do 63-lzcnt.
A branchy or branchless check for input == 0 to set the final result to whatever hard-coded constant (like 63 , 63) should be cheap, though.
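Putting the pieces together, a minimal sketch of the 12-bit scheme could look like this. The names encode12, decode12, kMask56 and kZeroCode are made up for the example, the field layout (low index in the upper 6 bits, high index in the lower 6 bits) is just one possible choice, and the builtins assume GCC or Clang.

```cpp
#include <cstdint>

constexpr uint64_t kMask56 = (1ULL << 56) - 1;
constexpr unsigned kZeroCode = (63u << 6) | 63u;  // both indices 63: reserved for input == 0

// Pack the two bit indices into 12 bits as (low << 6) | high.
inline unsigned encode12(uint64_t n) {
    if (n == 0)
        return kZeroCode;  // branchy special case; the builtins are undefined for 0
    unsigned low  = static_cast<unsigned>(__builtin_ctzll(n));
    unsigned high = 63u - static_cast<unsigned>(__builtin_clzll(n));
    return (low << 6) | high;
}

// Set both bit positions, then mask to 56 bits so the reserved
// index 63 decodes back to 0.
inline uint64_t decode12(unsigned code) {
    unsigned high = code & 63u;
    unsigned low  = (code >> 6) & 63u;
    return ((1ULL << high) | (1ULL << low)) & kMask56;
}
```

An input with a single set bit also round-trips, because both indices are then equal and decode sets the same bit twice.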
Compression on other ISAs like AArch64 is also cheap. It has clz but not ctz. Probably your best bet there is use an intrinsic for rbit to bit-reverse a number (so clz on the bit-reversed number directly gives you the bit-index of the low bit. Which is now the high bit of the reversed version.) Assuming rbit is as fast as add / sub, this is cheaper than using multiple instructions to clear the low bit.
If you really want 11 bits then you need to avoid the redundancy of 2x 6-bit being able to have either index larger than the other. Like maybe have 6-bit a and 5-bit b, and have a<=b mean something special like b+=32. I haven't thought this through fully. You need to be able to encode 2 adjacent bits either near the top or bottom of the registers, or the 2 set bits could be as far apart as 28 bits, if we consider wrapping at the boundaries like a 56-bit rotate.
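One way to actually reach 11 bits, different from the 6-bit + 5-bit idea floated above, is to rank the pair of bit positions in the combinatorial number system: with low < high, the rank C(high, 2) + low is unique and covers exactly 0..1539. The sketch below assumes exactly two set bits within the low 56 bits and GCC/Clang builtins; the names encode11 and decode11 are hypothetical. The decode loop is only there to keep the example short, and a 1540-entry lookup table would be the obvious alternative.

```cpp
#include <cassert>
#include <cstdint>

// rank = C(high, 2) + low, where 0 <= low < high <= 55. Range: 0..1539.
inline unsigned encode11(uint64_t n) {
    assert(__builtin_popcountll(n) == 2 && (n >> 56) == 0);  // assumption: exactly 2 bits
    unsigned low  = static_cast<unsigned>(__builtin_ctzll(n));
    unsigned high = 63u - static_cast<unsigned>(__builtin_clzll(n));
    return high * (high - 1) / 2 + low;
}

inline uint64_t decode11(unsigned rank) {
    assert(rank < 1540);
    // Find the largest `high` with C(high, 2) <= rank.
    unsigned high = 1;
    while ((high + 1) * high / 2 <= rank)
        ++high;
    unsigned low = rank - high * (high - 1) / 2;  // always < high
    return (1ULL << high) | (1ULL << low);
}
```

This leaves codes 1540..2047 free, so the all-zero input or single-bit inputs could be given reserved codes if they need to be representable.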
Melpomene's suggestion to isolate the low and high set bits might be useful as part of something else, but is only useful for encoding on targets where you only have one direction of bit-scan available, not both. Even so, you wouldn't actually use both expressions. Leading-zero count doesn't require you to isolate the low bit, you just need to clear it to get at the high bit.
Footnote 1: decoding on x86 is also cheap: x |= (1<<a) is 1 instruction: bts. But many compilers have missed optimizations and don't notice this, instead actually shifting a 1. bts reg, reg is 1 uop / 1 cycle latency on Intel since PPro, or sometimes 2 uops on AMD. (Only the memory destination version is slow.) https://agner.org/optimize/
Best encoding performance on AMD CPUs requires BMI1 tzcnt / lzcnt because bsr and bsf are slower (6 uops instead of 1 https://agner.org/optimize/). On Ryzen, lzcnt is 1 uop, 1c latency, 4 per clock throughput. But tzcnt is 2 uops.
With BMI1, the compiler could use blsr to clear the lowest set bit of a register (and copy it). i.e. modern x86 has an instruction for dst = (SRC-1) bitwiseAND ( SRC ); it is single-uop on Intel but 2 uops on AMD.
But with lzcnt being more efficient than tzcnt on AMD Ryzen, probably the best asm for AMD doesn't use it.
Or maybe something like this (assuming exactly 2 bits, which apparently we can do).
(This asm is what you'd like to get your compiler to emit. Don't actually use inline asm!)
Ryzen_encode_scalar: ; input in RDI, output in EAX
lzcnt rcx, rdi ; 63-high bit index
tzcnt rdx, rdi ; low bit
mov eax, 63
sub eax, ecx
shl edx, 6
or eax, edx ; (low_bit << 6) | high_bit
ret ; goes away with inlining.
Shifting the low bit-index balances the lengths of the critical path, allowing better instruction-level parallelism, if we need 63-CLZ for the high bit.
Throughput: 7 uops total, and no execution-unit bottlenecks. So at 5 uops per clock pipeline width, that's better than 1 per 2 clocks.
Skylake_encode_scalar: ; input in RDI, output in EAX
tzcnt rax, rdi ; low bit. No false dependency on Skylake. GCC will probably xor-zero RAX because there is on Broadwell and earlier.
bsr rdi, rdi ; high bit index. same,same reg avoids false dep
shl eax, 6
or eax, edi ; (low_bit << 6) | high_bit (bsr wrote the high-bit index to RDI)
ret ; goes away with inlining.
This has 5 cycle latency from input to output: bitscan instructions are 3 cycles on Intel vs. 1 on AMD. SHL + OR each add 1 cycle.
For throughput, we only bottleneck on one bit-scan per cycle (execution port 1), so we can do one encode per 2 cycles with 4 uops of front-end bandwidth left over for load, store, and loop overhead (or something else), assuming we have multiple independent encodes to do.
(But for the multiple independent encode case, SIMD may still be better for both AMD and Intel, if a cheap emulation of vplzcntq exists and the data is coming from memory.)
Scalar decode can be something like this:
decode: ;; input in EDI, output in RAX
xor eax, eax ; RAX=0
bts rax, rdi ; RAX |= 1ULL << (high_bit_idx & 63)
shr edi, 6 ; extract low_bit_idx
bts rax, rdi ; RAX |= 1ULL << low_bit_idx
ret
This has 3 shifts (including the bts) which on Skylake can only run on port0 or port6. So on Intel it only costs 4 uops for the front-end (so 1 per clock as part of doing something else). But if doing only this, it bottlenecks on shift throughput at 1 decode per 1.5 clock cycles.
On a 4GHz CPU, that's 2.666 billion decodes per second, so yeah we're doing pretty well hitting your targets :)
On Ryzen, bts reg,reg is 2 uops, with 0.5c throughput, but shr can run on any port. So it doesn't steal throughput from bts, and the whole thing is 6 uops (vs. Ryzen's pipeline being 5-wide at the narrowest point). So 1 decode per 1.2 clock cycles, just bottlenecked on front-end cost.
With BMI2 available, starting with a 1 in a register and using shlx rax, rbx, rdi can replace the xor-zeroing + first BTS with a single uop, assuming the 1 in a register can be reused in a loop.
(This optimization is totally dependent on your compiler to find; flag-less shifts are just more efficient ways to copy-and-shift that become available with -march=haswell or -march=znver1, or other targets that have BMI2.)
Either way you're just going to write retval = 1ULL << (packed & 63) for decoding the first bit. But if you're wondering which compilers make nice code here, this is what you're looking for.

Shuffling by mask with Intel AVX

I'm new to AVX programming. I have a register which needs to be shuffled. I want to shuffle several bytes from a 256-bit register, R1, to an empty register R2. I want to define a mask which tells the shuffle operation which byte from the old register(R1) should be copied at which place in the new register.
The mask should look like this(Src:Byte Pos in R1, Target:Byte Pos in R2):
{(0,0),(1,1),(1,4),(2,5),...}
This means several bytes are copied twice.
I'm not 100% sure which function I should use for this. I tried a bit with these two AVX functions, the second just uses 2 lanes.
__m256 _mm256_permute_ps (__m256 a, int imm8)
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)
I'm totally confused about the Shuffle Mask in imm8 and how to design it so that it would work as described above.
I had a look at these slides (page 26) where _MM_SHUFFLE is described, but I can't find a solution to my problem.
Are there any tutorials on how to design such a mask? Or example functions for the two methods to understand them in depth?
Thanks in advance for hints
TL:DR: you probably either need multiple shuffles to handle lane-crossing, or if your pattern continues exactly like that you can use _mm256_cvtepu16_epi32 (vpmovzxwd) and then _mm256_blend_epi16.
For x86 shuffles (like most SIMD instruction-sets, I think), the destination position is implicit. A shuffle-control constant just has source indices in destination order, whether it's an imm8 that gets compiled+assembled right into an asm instruction or whether it's a vector with an index in each element.
Each destination position reads exactly one source position, but the same source position can be read more than once. Each destination element gets a value from the shuffle source.
See Convert _mm_shuffle_epi32 to C expression for the permutation? for a plain-C version of dst = _mm_shuffle_epi32(src, _MM_SHUFFLE(d,c,b,a)), showing how the control byte is used.
(For pshufb / _mm_shuffle_epi8, an element with the high bit set zeros that destination position instead of reading any source element, but other x86 shuffles ignore all the high bits in shuffle-control vectors.)
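As a small illustration of "source indices in destination order", here is a scalar model of a 4x 2-bit immediate shuffle. The alias Vec4i and the names mm_shuffle and shuffle_epi32_model are invented for this sketch; mm_shuffle uses the same bit layout as the _MM_SHUFFLE(d, c, b, a) macro, where a is the selector for destination element 0.

```cpp
#include <array>
#include <cstdint>

using Vec4i = std::array<int32_t, 4>;

// Same bit layout as _MM_SHUFFLE(d, c, b, a): two bits per destination
// element, with element 0's selector in the lowest two bits.
constexpr unsigned mm_shuffle(unsigned d, unsigned c, unsigned b, unsigned a) {
    return (d << 6) | (c << 4) | (b << 2) | a;
}

// Scalar model of _mm_shuffle_epi32(src, imm8):
// destination position i is implicit; its 2-bit field names the source.
inline Vec4i shuffle_epi32_model(const Vec4i& src, unsigned imm8) {
    Vec4i dst{};
    for (int i = 0; i < 4; ++i)
        dst[i] = src[(imm8 >> (2 * i)) & 3];
    return dst;
}
```

Nothing stops two fields from naming the same source: mm_shuffle(1, 1, 0, 0) duplicates elements 0 and 1, which is how a byte or element ends up "copied twice" as in the question.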
Without AVX512 merge-masking, there are no shuffles that also blend into a destination. There are some two-source shuffles like _mm256_shuffle_ps (vshufps) which can shuffle together elements from two sources to produce a single result vector. If you wanted to leave some destination elements unwritten, you'll probably have to shuffle and then blend, e.g. with _mm256_blendv_epi8, or if you can use blend with 16-bit granularity you can use a more efficient immediate blend _mm256_blend_epi16, or even better _mm256_blend_epi32 (AVX2 vpblendd is as cheap as _mm256_and_si256 on Intel CPUs, and is the best choice if you do need to blend at all, if it can get the job done; see http://agner.org/optimize/)
For your problem (without AVX512VBMI vpermb in Cannonlake), you can't shuffle single bytes from the low 16 "lane" into the high 16 "lane" of a __m256i vector with a single operation.
AVX shuffles are not like a full 256-bit SIMD, they're more like two 128-bit operations in parallel. The only exceptions are some AVX2 lane-crossing shuffles with 32-bit granularity or larger, like vpermd (_mm256_permutevar8x32_epi32). And also the AVX2 versions of pmovzx / pmovsx, e.g. pmovzxbq does zero-extend the low 4 bytes of an XMM register into the 4 qwords of a YMM register, rather than the low 2 bytes of each half of a YMM register. This makes it much more useful with a memory source operand.
But anyway, the AVX2 version of pshufb (_mm256_shuffle_epi8) does two separate 16x16 byte shuffles in the two lanes of a 256-bit vector.
You're probably going to want something like this:
// Intrinsics have different types for integer, float and double vectors
// the asm uses the same registers either way
__m256i shuffle_and_blend(__m256i dst, __m256i src)
{
// setr takes element in low to high order, like a C array init
// unlike the standard Intel notation where high element is first
const __m256i shuffle_control = _mm256_setr_epi8(
0, 1, -1, -1, 1, 2, ...);
// {(0,0), (1,1), (zero) (1,4), (2,5),...} in your src,dst notation
// Use -1 or 0x80 or anything with the high bit set
// for positions you want to leave unmodified in dst
// blendv uses the high bit as a blend control, so the same vector can do double duty
// maybe need some lane-crossing stuff depending on the pattern of your shuffle.
__m256i shuffled = _mm256_shuffle_epi8(src, shuffle_control);
// or if the pattern continues, and you're just leaving 2 bytes between every 2-byte group:
shuffled = _mm256_cvtepu16_epi32(src); // if src is a __m128i
__m256i blended = _mm256_blendv_epi8(shuffled, dst, shuffle_control);
// blend dst elements we want to keep into the shuffled src result.
return blended;
}
Note that the pshufb numbering restarts from 0 for the 2nd 16 bytes. The two halves of the __m256i can be different, but they can't read elements from the other half. If you need positions in the high lane to get bytes from the low lane, you'll need more shuffling + blending (e.g. including vinserti128 or vperm2i128, or maybe a vpermd lane-crossing dword shuffle) to get all the bytes you need into one 16-byte group in some order.
(Actually _mm256_shuffle_epi8 (PSHUFB) ignores bits 4..6 in a shuffle index, so writing 17 is the same as 1, but very misleading. It's effectively doing a %16, as long as the high bit isn't set. If the high bit is set in the shuffle-control vector, it zeros that element. We don't need that functionality here; _mm256_blendv_epi8 doesn't care about the old value of the element it's replacing)
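The in-lane behaviour and the double duty of the control vector are easier to see in a scalar model. The sketch below is an illustration only: Vec32b, make_iota and the two *_model functions are names made up here, standing in for __m256i, test data, _mm256_shuffle_epi8 and _mm256_blendv_epi8.

```cpp
#include <array>
#include <cstdint>

using Vec32b = std::array<uint8_t, 32>;

// Scalar model of _mm256_shuffle_epi8: two independent 16-byte shuffles.
inline Vec32b shuffle_epi8_model(const Vec32b& src, const Vec32b& ctrl) {
    Vec32b dst{};
    for (int i = 0; i < 32; ++i) {
        int lane_base = i & 16;                      // 0 for the low lane, 16 for the high lane
        if (ctrl[i] & 0x80)
            dst[i] = 0;                              // high bit set: zero this byte
        else
            dst[i] = src[lane_base + (ctrl[i] & 15)]; // only the low 4 index bits are used
    }
    return dst;
}

// Scalar model of _mm256_blendv_epi8(a, b, mask):
// take b where the mask byte's high bit is set, otherwise a.
inline Vec32b blendv_epi8_model(const Vec32b& a, const Vec32b& b, const Vec32b& mask) {
    Vec32b dst{};
    for (int i = 0; i < 32; ++i)
        dst[i] = (mask[i] & 0x80) ? b[i] : a[i];
    return dst;
}

// Test data helper: element i holds start + i.
inline Vec32b make_iota(uint8_t start) {
    Vec32b v{};
    for (int i = 0; i < 32; ++i)
        v[i] = static_cast<uint8_t>(start + i);
    return v;
}
```

Calling blendv_epi8_model(shuffled, dst, ctrl) with the same control vector mirrors the shuffle_and_blend function above: positions marked with the high bit keep their old dst byte, and all others get the shuffled src byte.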
Anyway, this simple 2-instruction example only works if the pattern doesn't continue. If you want help designing your real shuffles, you'll have to ask a more specific question.
And BTW, I notice that your blend pattern used 2 new bytes then skipped 2. If that continues, you could use vpblendw _mm256_blend_epi16 instead of blendv, because that instruction runs in only 1 uop instead of 2 on Intel CPUs. It would also allow you to use AVX512BW vpermw, a 16-bit shuffle available in current Skylake-AVX512 CPUs, instead of the probably-even-slower AVX512VBMI vpermb.
Or actually, it would maybe let you use vpmovzxwd (_mm256_cvtepu16_epi32) to zero-extend 16-bit elements to 32-bit, as a lane-crossing shuffle. Then blend with dst.