Most efficient way to check if all __m128i components are 0 [using <= SSE4.1 intrinsics] - c++

I am using SSE intrinsics to determine if a rectangle (defined by four int32 values) has changed:
__m128i oldRect; // contains old left, top, right, bottom packed to 128 bits
__m128i newRect; // contains new left, top, right, bottom packed to 128 bits
__m128i xor = _mm_xor_si128(oldRect, newRect);
At this point, the resulting xor value will be all zeros if the rectangle hasn't changed. What is then the most efficient way of determining that?
Currently I am doing it like this:
if (xor.m128i_u64[0] | xor.m128i_u64[1])
{
// rectangle changed
}
But I assume there's a smarter way (possibly using some SSE instruction that I haven't found yet).
I am targeting SSE4.1 on x64 and I am coding C++ in Visual Studio 2013.
Edit: The question is not quite the same as Is an __m128i variable zero?, as that specifies "on SSE-2-and-earlier processors" (although Antonio did add an answer "for completeness" that addresses 4.1 some time after this question was posted and answered).

You can use the PTEST instruction via the _mm_testz_si128 intrinsic (SSE4.1), like this:
#include "smmintrin.h" // SSE4.1 header
if (!_mm_testz_si128(xor, xor))
{
// rectangle has changed
}
Note that _mm_testz_si128 returns 1 if the bitwise AND of the two arguments is zero.

Ironically, the ptest instruction from SSE4.1 may be slower than pmovmskb from SSE2 in some cases. I suggest simply using:
__m128i cmp = _mm_cmpeq_epi32(oldRect, newRect);
if (_mm_movemask_epi8(cmp) != 0xFFFF)
// registers are different
Note that if you really need that xor value, you'll have to compute it separately.
For Intel processors like Ivy Bridge, the version by PaulR with xor and _mm_testz_si128 translates into 4 uops, while the suggested version without computing xor translates into 3 uops (see also this thread). This may give my version better throughput.
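For reference, the cmpeq + movemask idea can be wrapped up as a small SSE2-only helper (the function name and test values here are mine, not from the question):

```cpp
#include <emmintrin.h>  // SSE2

// Sketch of the cmpeq + movemask approach as a helper.
// _mm_cmpeq_epi32 sets each 32-bit lane to all-ones where equal;
// _mm_movemask_epi8 collects one bit per byte, so all-equal => 0xFFFF.
static bool rect_changed(__m128i oldRect, __m128i newRect)
{
    __m128i cmp = _mm_cmpeq_epi32(oldRect, newRect);
    return _mm_movemask_epi8(cmp) != 0xFFFF;
}
```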

Related

What is "MAX" referring to in the Intel intrinsics documentation?

Within the Intel intrinsics guide some operations are defined using the term "MAX". An example is __m256 _mm256_mask_permutexvar_ps (__m256 src, __mmask8 k, __m256i idx, __m256 a), which is defined as
FOR j := 0 to 7
i := j*32
id := idx[i+2:i]*32
IF k[j]
dst[i+31:i] := a[id+31:id]
ELSE
dst[i+31:i] := 0
FI
ENDFOR
dst[MAX:256] := 0
Please take note of the last line within this definition: dst[MAX:256] := 0. What is MAX referring to, and is this line even adding any valuable information? If I had to make an assumption, MAX probably means the number of bits within the vector, which is 256 in the case of _mm256. This, however, does not seem to change anything for the definition of the operation and might as well have been omitted. But why is it there then?
This pseudo-code only makes sense for assembly documentation, where it was copied from, not for intrinsics. (HTML scrape of Intel's vol.2 PDF documenting the corresponding vpermps asm instruction.)
...
ENDFOR
DEST[MAXVL-1:VL] ← 0
(The same asm doc entry covers VL = 128, 256, and 512-bit versions, the vector width of the instruction.)
In asm, a YMM register is the low half of a ZMM register, and writing a YMM zeroes the upper bits out to the CPU's max supported vector width (just like writing EAX zero-extends into RAX).
The intrinsic you picked is for the masked version, so it requires AVX-512 (EVEX encoding), thus VLMAX is at least 512 (see footnote 1). If the mask is a constant all-ones, it could get optimized to the AVX2 VEX encoding, but both still zero high bits of the full register out to VLMAX.
This is meaningless for intrinsics
The intrinsics API just has __m256 and __m512 types; an __m256 is not implicitly the low half of an __m512. You can use _mm512_castps256_ps512 to get a __m512 with your __m256 as the low half, but the API documentation says "the upper 256 bits of the result are undefined". So if you use it on a function arg, it doesn't force it to vmovaps ymm7, ymm0 or something to zero-extend into a ZMM register in case the caller left high garbage.
If you use _mm512_castps256_ps512 on a __m256 that came from an intrinsic in this function, it pretty much always will happen to compile with a zeroed high half whether it stayed in a reg or got stored/reloaded, but that's not guaranteed by the API. (If the compiler chose to combine a previous calculation with something else, using a 512-bit operation, you could plausibly end up with a non-zero high half.) If you want high zeros, there's no equivalent to _mm256_set_m128 (__m128 hi, __m128 lo), so you need some other explicit way.
Footnote 1: Or with some hypothetical future extension, VLMAX aka MAXVL could be even wider. It's determined by the current value of XCR0. This documentation is telling you these instructions will still zero out to whatever that is.
(I haven't looked into whether changing VLMAX is possible on a machine supporting AVX-512, or if it's read-only. IDK how the CPU would handle it if you can change it, like maybe not running 512-bit instructions at all. Mainstream OSes certainly don't do this even if it's possible with privileged operations.)
SSE didn't have any defined mechanism for extension to wider vectors, and some existing code (notably Windows kernel drivers) manually saved/restored a few XMM registers for their own use. To support that, AVX decided that legacy SSE would leave the high part of YMM/ZMM registers unmodified. But to run existing machine code using non-VEX legacy SSE encodings efficiently, it needed expensive state transitions (Haswell and Ice Lake) and/or false dependencies (Skylake): Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
Intel wasn't going to make this mistake again, so they defined AVX as zeroing out to whatever vector width the CPU supports, and document it clearly in every AVX and AVX-512 instruction encoding. Thus VEX and EVEX can be mixed freely, even being useful to save machine-code size:
What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?
What is the penalty of mixing EVEX and VEX encoded scheme? (none), with an answer discussing more details of why SSE/AVX penalties are a thing.
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/301853 Agner Fog's 2008 post on Intel's forums about AVX, when it was first announced, pointing out the problem created by the lack of foresight with SSE.
Does vzeroall zero registers ymm16 to ymm31? - interestingly no; since they're not accessible via legacy SSE instructions, they can't be part of a dirty-uppers problem.
Bits in the registers are numbered with high indices on the “left” and low indices on the “right”. This matches how we write and talk about binary numerals: 10010₂ is the binary numeral for 18, with bit number 4, representing 2⁴ = 16, on the left and bit number 0, representing 2⁰ = 1, on the right.
R[m:n] denotes the set of bits of register R from m down to n, with m being the “left” end of the set and n being the “right” end. If m is less than n, then it is the empty set. Therefore, for registers with 512 bits, dst[511:256] := 0 says to set bits 511 to 256 to zero, and, for registers with 256 bits, dst[255:256] := 0 says to do nothing.
dst[MAX:256] := 0 sets all bits above (and including) 256th bit to zero. It is only relevant to registers having more than 256 bits. So MAX can be 256 if the register is 256 bits long or 512 if the processor is using 512 bits registers.
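In plain C++ terms, the R[m:n] bit-range notation could be modeled like this for a 64-bit value (an illustration only; the function name is mine):

```cpp
#include <cstdint>

// Model of the R[m:n] notation: bits m down to n of r, inclusive.
// Empty (zero) when m < n, matching the convention described above.
static uint64_t bit_range(uint64_t r, int m, int n)
{
    if (m < n) return 0;                     // empty set of bits
    int width = m - n + 1;
    uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    return (r >> n) & mask;
}
```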

How to unset N right-most set bits

There is a relatively well-known trick for unsetting a single right-most bit:
y = x & (x - 1) // 0b001011100 & 0b001011011 = 0b001011000 :)
I'm finding myself with a tight loop to clear n right-most bits, but is there a simpler algebraic trick?
Assume relatively large n (n has to be < 64 for 64-bit integers, but it's often on the order of 20-30).
// x = 0b001011100 n=2
for (auto i=0; i<n; i++) x &= x - 1;
// x = 0b001010000
I've thumbed through my TAOCP Vol. 4A a few times, but can't find any inspiration.
Maybe there is some hardware support for it?
For Intel x86 CPUs with BMI2, pext and pdep are fast. AMD before Zen3 has very slow microcoded PEXT/PDEP (https://uops.info/) so be careful with this; other options might be faster on AMD, maybe even blsi in a loop, or better a binary-search on popcount (see below).
Only Intel has dedicated hardware execution units for the mask-controlled pack/unpack that pext/pdep do, making it constant-time: 1 uop, 3 cycle latency, can only run on port 1.
I'm not aware of other ISAs having a similar bit-packing hardware operation.
pdep basics: pdep(-1ULL, a) == a. Taking the low popcnt(a) bits from the first operand, and depositing them at the places where a has set bits, will give you a back again.
But if, instead of all-ones, your source of bits has the low N bits cleared, the first N set bits in a will grab a 0 instead of 1. This is exactly what you want.
uint64_t unset_first_n_bits_bmi2(uint64_t a, int n){
return _pdep_u64(-1ULL << n, a);
}
-1ULL << n works for n=0..63 in C. x86 asm scalar shift instructions mask their count (effectively &63), so that's probably what will happen for the C undefined-behaviour of a larger n. If you care, use n&63 in the source so the behaviour is well-defined in C, and it can still compile to a shift instruction that uses the count directly.
On Godbolt with a simple looping reference implementation, showing that they produce the same result for a sample input a and n.
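A looping reference along those lines (the exact name here is mine) just applies the x & (x - 1) trick n times:

```cpp
#include <cstdint>

// Reference implementation: clear the n lowest set bits one at a
// time; x &= x - 1 clears exactly the lowest set bit per iteration.
static uint64_t unset_first_n_bits_ref(uint64_t x, int n)
{
    for (int i = 0; i < n && x; ++i)
        x &= x - 1;
    return x;
}
```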
GCC and clang both compile it the obvious way, as written:
# GCC10.2 -O3 -march=skylake
unset_first_n_bits_bmi2(unsigned long, int):
mov rax, -1
shlx rax, rax, rsi
pdep rax, rax, rdi
ret
(SHLX is single-uop, 1 cycle latency, unlike legacy variable-count shifts that update FLAGS... except if CL=0)
So this has 3 cycle latency from a->output (just pdep)
and 4 cycle latency from n->output (shlx, pdep).
And is only 3 uops for the front-end.
A semi-related BMI2 trick:
pext(a,a) will pack the bits at the bottom, like (1ULL<<popcnt(a)) - 1 but without overflow if all bits are set.
Clearing the low N bits of that with an AND mask, and expanding with pdep would work. But that's an overcomplicated, expensive way to create a source of bits with enough ones above N zeros, which is all that actually matters for pdep. Thanks to @harold for spotting this in the first version of this answer.
Without fast PDEP: perhaps binary search for the right popcount
@Nate's suggestion of a binary search for how many low bits to clear is probably a good alternative to pdep.
Stop when popcount(x>>c) == popcount(x) - N to find out how many low bits to clear, preferably with branchless updating of c. (e.g. c = foo ? a : b often compiles to cmov).
Once you're done searching, x & (-1ULL<<c) uses that count, or just tmp << c to shift back the x>>c result you already have. Using right-shift directly is cheaper than generating a new mask and using it every iteration.
High-performance popcount is relatively widely available on modern CPUs. (Although not baseline for x86-64; you still need to compile with -mpopcnt or -march=native).
Tuning this could involve choosing a likely starting-point, and perhaps using a max initial step size instead of pure binary search. Getting some instruction-level parallelism out of trying some initial guesses could perhaps help shorten the latency bottleneck.
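A sketch of that binary search (my own code, assuming the GCC/Clang __builtin_popcountll builtin): find the smallest shift count c with exactly n set bits below position c, then shift them out.

```cpp
#include <cstdint>

// Binary-search the shift count: popcount(x >> c) drops by at most 1
// per increment of c, so the first c where it reaches total - n has
// exactly n set bits below it. Then (x >> c) << c clears those bits.
static uint64_t unset_first_n_bits_popcnt(uint64_t x, int n)
{
    int total = __builtin_popcountll(x);
    if (n >= total) return 0;              // everything would be cleared
    int lo = 0, hi = 64;
    while (lo < hi) {
        int c = (lo + hi) / 2;
        if (__builtin_popcountll(x >> c) <= total - n)
            hi = c;                        // enough low bits below c: try smaller c
        else
            lo = c + 1;
    }
    return (x >> lo) << lo;
}
```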

Analog of _mm256_cmp_epi32_mask for AVX2

I have 8 32-bit integers packed into __m256i registers. Now I need to compare corresponding 32-bit values in two registers. Tried
__mmask8 m = _mm256_cmp_epi32_mask(r1, r2, _MM_CMPINT_EQ);
that flags the equal pairs. That would be great, but I got an "illegal instruction" exception, likely because my processor doesn't support AVX512.
Looking for an analogous intrinsic to quickly get indexes of the equal pairs.
Found a work-around (there is no _mm256_movemask_epi32); is the cast legal here?
__m256i diff = _mm256_cmpeq_epi32(r1, r2);
__m256 m256 = _mm256_castsi256_ps(diff);
int i = _mm256_movemask_ps(m256);
Yes, cast intrinsics are just a reinterpret of the bits in the YMM registers; it's 100% legal, and yes, the asm you want the compiler to emit is vpcmpeqd / vmovmskps.
Or if you can deal with each bit being repeated 4 times, vpmovmskb also works, _mm256_movemask_epi8. e.g. if you just want to test for any matches (i != 0) or all-matches (i == 0xffffffff) you can avoid using a ps instruction on an integer result which might cost 1 extra cycle of bypass latency in the critical path.
But if that would cost you extra instructions to e.g. scale by 4 after using _mm_tzcnt_u32 to find the element index instead of byte index of the first 1, then use the _ps movemask. The extra instruction will definitely cost latency, and a slot in the pipeline for throughput.
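The index arithmetic from the last two paragraphs, in scalar form (illustration only; names are mine, assuming the GCC/Clang __builtin_ctz builtin): a _ps movemask has one bit per 32-bit element, while the _epi8 mask repeats each element's bit 4 times, so the tzcnt result needs dividing by 4.

```cpp
#include <cstdint>

// One bit per 32-bit element (vmovmskps-style mask):
static int first_match_from_ps_mask(uint32_t m)
{
    return m ? __builtin_ctz(m) : -1;       // element index directly
}

// One bit per byte (vpmovmskb-style mask), each element's bit x4:
static int first_match_from_epi8_mask(uint32_t m)
{
    return m ? __builtin_ctz(m) / 4 : -1;   // byte index / 4 = element index
}
```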

Shuffling by mask with Intel AVX

I'm new to AVX programming. I have a register which needs to be shuffled. I want to shuffle several bytes from a 256-bit register, R1, to an empty register R2. I want to define a mask which tells the shuffle operation which byte from the old register (R1) should be copied to which place in the new register.
The mask should look like this (Src: byte pos in R1, Target: byte pos in R2):
{(0,0),(1,1),(1,4),(2,5),...}
This means several bytes are copied twice.
I'm not 100% sure which function I should use for this. I tried a bit with these two AVX functions; the second one just uses 2 lanes.
__m256 _mm256_permute_ps (__m256 a, int imm8)
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)
I'm totally confused about the Shuffle Mask in imm8 and how to design it so that it would work as described above.
I had a look at these slides (page 26) where _MM_SHUFFLE is described, but I can't find a solution to my problem.
Are there any tutorials on how to design such a mask? Or example functions for the two methods to understand them in depth?
Thanks in advance for hints
TL:DR: you probably either need multiple shuffles to handle lane-crossing, or if your pattern continues exactly like that you can use _mm256_cvtepu16_epi32 (vpmovzxwd) and then _mm256_blend_epi16.
For x86 shuffles (like most SIMD instruction-sets, I think), the destination position is implicit. A shuffle-control constant just has source indices in destination order, whether it's an imm8 that gets compiled+assembled right into an asm instruction or whether it's a vector with an index in each element.
Each destination position reads exactly one source position, but the same source position can be read more than once. Each destination element gets a value from the shuffle source.
See Convert _mm_shuffle_epi32 to C expression for the permutation? for a plain-C version of dst = _mm_shuffle_epi32(src, _MM_SHUFFLE(d,c,b,a)), showing how the control byte is used.
(For pshufb / _mm_shuffle_epi8, an element with the high bit set zeros that destination position instead of reading any source element, but other x86 shuffles ignore all the high bits in shuffle-control vectors.)
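As a concrete model of the pshufb rule just described (plain C++, one 16-byte lane; this is an illustration, not the intrinsic itself):

```cpp
#include <array>
#include <cstdint>

// Scalar model of one lane of pshufb / _mm_shuffle_epi8:
// each destination byte reads src[ctl[i] & 15], or becomes 0
// when the control byte's high bit is set.
static std::array<uint8_t, 16> pshufb_model(const std::array<uint8_t, 16>& src,
                                            const std::array<uint8_t, 16>& ctl)
{
    std::array<uint8_t, 16> dst{};
    for (int i = 0; i < 16; ++i)
        dst[i] = (ctl[i] & 0x80) ? 0 : src[ctl[i] & 15];
    return dst;
}
```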
Without AVX512 merge-masking, there are no shuffles that also blend into a destination. There are some two-source shuffles like _mm256_shuffle_ps (vshufps) which can shuffle together elements from two sources to produce a single result vector. If you wanted to leave some destination elements unwritten, you'll probably have to shuffle and then blend, e.g. with _mm256_blendv_epi8, or if you can use blend with 16-bit granularity you can use a more efficient immediate blend _mm256_blend_epi16, or even better _mm256_blend_epi32 (AVX2 vpblendd is as cheap as _mm256_and_si256 on Intel CPUs, and is the best choice if you do need to blend at all, if it can get the job done; see http://agner.org/optimize/)
For your problem (without AVX512VBMI vpermb in Cannonlake), you can't shuffle single bytes from the low 16-byte "lane" into the high 16-byte "lane" of a __m256i vector with a single operation.
AVX shuffles are not like a full 256-bit SIMD, they're more like two 128-bit operations in parallel. The only exceptions are some AVX2 lane-crossing shuffles with 32-bit granularity or larger, like vpermd (_mm256_permutevar8x32_epi32). And also the AVX2 versions of pmovzx / pmovsx, e.g. pmovzxbq does zero-extend the low 4 bytes of an XMM register into the 4 qwords of a YMM register, rather than the low 2 bytes of each half of a YMM register. This makes it much more useful with a memory source operand.
But anyway, the AVX2 version of pshufb (_mm256_shuffle_epi8) does two separate 16x16 byte shuffles in the two lanes of a 256-bit vector.
You're probably going to want something like this:
// Intrinsics have different types for integer, float and double vectors
// the asm uses the same registers either way
__m256i shuffle_and_blend(__m256i dst, __m256i src)
{
// setr takes element in low to high order, like a C array init
// unlike the standard Intel notation where high element is first
const __m256i shuffle_control = _mm256_setr_epi8(
0, 1, -1, -1, 1, 2, ...);
// {(0,0), (1,1), (zero) (1,4), (2,5),...} in your src,dst notation
// Use -1 or 0x80 or anything with the high bit set
// for positions you want to leave unmodified in dst
// blendv uses the high bit as a blend control, so the same vector can do double duty
// maybe need some lane-crossing stuff depending on the pattern of your shuffle.
__m256i shuffled = _mm256_shuffle_epi8(src, shuffle_control);
// or if the pattern continues, and you're just leaving 2 bytes between every 2-byte group:
shuffled = _mm256_cvtepu16_epi32(_mm256_castsi256_si128(src)); // widens the low 128 bits of src
__m256i blended = _mm256_blendv_epi8(shuffled, dst, shuffle_control);
// blend dst elements we want to keep into the shuffled src result.
return blended;
}
Note that the pshufb numbering restarts from 0 for the 2nd 16 bytes. The two halves of the __m256i can be different, but they can't read elements from the other half. If you need positions in the high lane to get bytes from the low lane, you'll need more shuffling + blending (e.g. including vinserti128 or vperm2i128, or maybe a vpermd lane-crossing dword shuffle) to get all the bytes you need into one 16-byte group in some order.
(Actually _mm256_shuffle_epi8 (PSHUFB) ignores bits 4..6 in a shuffle index, so writing 17 is the same as 1, but very misleading. It's effectively doing a %16, as long as the high bit isn't set. If the high bit is set in the shuffle-control vector, it zeros that element. We don't need that functionality here; _mm256_blendv_epi8 doesn't care about the old value of the element it's replacing)
Anyway, this simple 2-instruction example only works if the pattern doesn't continue. If you want help designing your real shuffles, you'll have to ask a more specific question.
And BTW, I notice that your blend pattern used 2 new bytes then skipped 2. If that continues, you could use vpblendw _mm256_blend_epi16 instead of blendv, because that instruction runs as only 1 uop instead of 2 on Intel CPUs. It would also allow you to use AVX512BW vpermw, a 16-bit shuffle available in current Skylake-AVX512 CPUs, instead of the probably-even-slower AVX512VBMI vpermb.
Or actually, it would maybe let you use vpmovzxwd (_mm256_cvtepu16_epi32) to zero-extend 16-bit elements to 32-bit, as a lane-crossing shuffle. Then blend with dst.

NEON vs Intel SSE - equivalence of certain operations

I'm having some trouble figuring out the NEON equivalents of a couple of Intel SSE operations. It seems that NEON cannot handle an entire Q register at once (a 128-bit data type). I haven't found anything in the arm_neon.h header or in the NEON intrinsics reference.
What I want to do is the following:
// Intel SSE
// shift the entire 128-bit value right by 2 bytes; this is done
// without sign extension by shifting in zeros
__m128i val = _mm_srli_si128(vector_of_8_s16, 2);
// insert the least significant 16 bits of "some_16_bit_val"
// (the whole thing in this case) into the selected 16-bit
// element of vector "val" (the element with index 7 in this case)
val = _mm_insert_epi16(val, some_16_bit_val, 7);
I've looked at the shifting operations provided by NEON but could not find an equivalent way of doing the above (I don't have much experience with NEON). Is it possible to do the above? (I guess it is, I just don't know how.)
Any pointers greatly appreciated.
You want the VEXT instruction. Your example would look something like:
int16x8_t val = vextq_s16(vector_of_8_s16, another_vector_s16, 1);
After this, bits 0-111 of val will contain bits 16-127 of vector_of_8_s16, and bits 112-127 of val will contain bits 0-15 of another_vector_s16.
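Modeled in plain C++ (an illustration of the semantics, not NEON code), vextq_s16(a, b, 1) behaves like:

```cpp
#include <array>
#include <cstdint>

// Scalar model of vextq_s16(a, b, 1): the result takes elements
// 1..7 of a followed by element 0 of b, i.e. a 2-byte right shift
// of a with b's first element inserted at index 7.
static std::array<int16_t, 8> vextq_s16_1(const std::array<int16_t, 8>& a,
                                          const std::array<int16_t, 8>& b)
{
    std::array<int16_t, 8> r{};
    for (int i = 0; i < 7; ++i)
        r[i] = a[i + 1];
    r[7] = b[0];
    return r;
}
```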