NEON vs Intel SSE - equivalence of certain operations

NEON vs Intel SSE - equivalence of certain operations - c++

I'm having some trouble figuring out the NEON equivalence of a couple of Intel SSE operations. It seems that NEON is not capable to handle an entire Q register at once(128 bit value data type). I haven't found anything in the arm_neon.h header or in the NEON intrinsics reference.
What I want to do is the following:
// Intel SSE
// shift the entire 128 bit value with 2 bytes to the right; this is done
// without sign extension by shifting in zeros
__m128i val = _mm_srli_si128(vector_of_8_s16, 2);
// insert the least significant 16 bits of "some_16_bit_val"
// the whole thing in this case, into the selected 16 bit
// integer of vector "val"(the 16 bit element with index 7 in this case)
val = _mm_insert_epi16(val, some_16_bit_val, 7);
I've looked at the shifting operations provided by NEON but could not find an equivalent way of doing the above(I don't have much experience with NEON). Is it possible to do the above(I guess it is I just don't know how)?
Any pointers greatly appreciated.

You want the VEXT instruction. Your example would look something like:
int16x8_t val = vextq_s16(vector_of_8_s16, another_vector_s16, 1);
After this, bits 0-111 of val will contain bits 16-127 of vector_of_8_s16, and bits 112-127 of val will contain bits 0-15 of another_vector_s16.

Related

How to create a 8 bit mask from lsb of __m64 value?

I have a use case, where I have array of bits each bit is represented as 8 bit integer for example uint8_t data[] = {0,1,0,1,0,1,0,1}; I want to create a single integer by extracting only lsb of each value. I know that using int _mm_movemask_pi8 (__m64 a) function I can create a mask but this intrinsic only takes a msb of a byte not lsb. Is there a similar intrinsic or efficient method to extract lsb to create single 8 bit integer?

There is no direct way to do it, but obviously you can simply shift the lsb into the msb and then extract it:
_mm_movemask_pi8(_mm_slli_si64(x, 7))
Using MMX these days is strange and should probably be avoided.
Here is an SSE2 version, still reading only 8 bytes:
int lsb_mask8(uint8_t* bits) {
__m128i x = _mm_loadl_epi64((__m128i*)bits);
return _mm_movemask_epi8(_mm_slli_epi64(x, 7));
}
Using SSE2 instead of MMX avoids the needs for EMMS

If you have efficient BMI2 pext (e.g. Haswell and newer, same as AVX2), then use the inverse of #wim's answer on your question about going the other direction (How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD).
unsigned extract8LSB(uint8_t *arr) {
uint64_t bytes;
memcpy(&bytes, arr, 8);
unsigned LSBs = _pext_u64(bytes ,0x0101010101010101);
return LSBs;
}
This compiles like you'd expect to a qword load + a pext instruction. Compilers will hoist the 0x01... constant setup out of a loop after inlining.
pext / pdep are efficient on Intel CPUs that support them (3 cycle latency / 1c throughput, 1 uop, same as a multiply). But they're not efficient on AMD, like 18c latency and throughput. (https://agner.org/optimize/). If you care about AMD, you should definitely use #harold's pmovmskb answer.
Or if you have multiple contiguous blocks of 8 bytes, do them with a single wide vector, and get a 32-bit bitmap. You can split that up if needed, or unroll the loop using by 4, to right-shift the bitmap to get all 4 single-byte results.
If you're just storing this to memory right away, then you should probably have done this extraction in the loop that wrote the source data, instead of a separate loop, so it would still be hot in cache. AVX2 _mm256_movemask_epi8 is a single uop (on Intel CPUs) with low latency, so if your data isn't hot in L1d cache then a loop that just does this would not be keeping its execution units busy while waiting for memory.

Shuffling by mask with Intel AVX

I'm new to AVX programming. I have a register which needs to be shuffled. I want to shuffle several bytes from a 256-bit register, R1, to an empty register R2. I want to define a mask which tells the shuffle operation which byte from the old register(R1) should be copied at which place in the new register.
The mask should look like this(Src:Byte Pos in R1, Target:Byte Pos in R2):
{(0,0),(1,1),(1,4),(2,5),...}
This means several bytes are copied twice.
I'm not 100% sure which function I should use for this. I tried a bit with these two AVX functions, the second just uses 2 lanes.
__m256 _mm256_permute_ps (__m256 a, int imm8)
__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)
I'm totally confused about the Shuffle Mask in imm8 and how to design it so that it would work as described above.
I had a look in this slides(page 26) were _MM_SHUFFLE is described but I can't find a solution to my problem.
Are there any tutorials on how to design such a mask? Or example functions for the two methods to understand them in depth?
Thanks in advance for hints

TL:DR: you probably either need multiple shuffles to handle lane-crossing, or if your pattern continues exactly like that you can use _mm256_cvtepu16_epi32 (vpmovzxwd) and then _mm256_blend_epi16.
For x86 shuffles (like most SIMD instruction-sets, I think), the destination position is implicit. A shuffle-control constant just has source indices in destination order, whether it's an imm8 that gets compiled+assembled right into an asm instruction or whether it's a vector with an index in each element.
Each destination position reads exactly one source position, but the same source position can be read more than once. Each destination element gets a value from the shuffle source.
See Convert _mm_shuffle_epi32 to C expression for the permutation? for a plain-C version of dst = _mm_shuffle_epi32(src, _MM_SHUFFLE(d,c,b,a)), showing how the control byte is used.
(For pshufb / _mm_shuffle_epi8, an element with the high bit set zeros that destination position instead of reading any source element, but other x86 shuffles ignore all the high bits in shuffle-control vectors.)
Without AVX512 merge-masking, there are no shuffles that also blend into a destination. There are some two-source shuffles like _mm256_shuffle_ps (vshufps) which can shuffle together elements from two sources to produce a single result vector. If you wanted to leave some destination elements unwritten, you'll probably have to shuffle and then blend, e.g. with _mm256_blendv_epi8, or if you can use blend with 16-bit granularity you can use a more efficient immediate blend _mm256_blend_epi16, or even better _mm256_blend_epi32 (AVX2 vpblendd is as cheap as _mm256_and_si256 on Intel CPUs, and is the best choice if you do need to blend at all, if it can get the job done; see http://agner.org/optimize/)
For your problem (without AVX512VBMI vpermb in Cannonlake), you can't shuffle single bytes from the low 16 "lane" into the high 16 "lane" of a __m256i vector with a single operation.
AVX shuffles are not like a full 256-bit SIMD, they're more like two 128-bit operations in parallel. The only exceptions are some AVX2 lane-crossing shuffles with 32-bit granularity or larger, like vpermd (_mm256_permutevar8x32_epi32). And also the AVX2 versions of pmovzx / pmovsx, e.g. pmovzxbq does zero-extend the low 4 bytes of an XMM register into the 4 qwords of a YMM register, rather than the low 2 bytes of each half of a YMM register. This makes it much more useful with a memory source operand.
But anyway, the AVX2 version of pshufb (_mm256_shuffle_epi8) does two separate 16x16 byte shuffles in the two lanes of a 256-bit vector.
You're probably going to want something like this:
// Intrinsics have different types for integer, float and double vectors
// the asm uses the same registers either way
__m256i shuffle_and_blend(__m256i dst, __m256i src)
{
// setr takes element in low to high order, like a C array init
// unlike the standard Intel notation where high element is first
const __m256i shuffle_control = _mm256_setr_epi8(
0, 1, -1, -1, 1, 2, ...);
// {(0,0), (1,1), (zero) (1,4), (2,5),...} in your src,dst notation
// Use -1 or 0x80 or anything with the high bit set
// for positions you want to leave unmodified in dst
// blendv uses the high bit as a blend control, so the same vector can do double duty
// maybe need some lane-crossing stuff depending on the pattern of your shuffle.
__m256i shuffled = _mm256_shuffle_epi8(src, shuffle_control);
// or if the pattern continues, and you're just leaving 2 bytes between every 2-byte group:
shuffled = _mm256_cvtepu16_epi32(src); // if src is a __m128i
__m256i blended = _mm256_blendv_epi8(shuffled, dst, shuffle_control);
// blend dst elements we want to keep into the shuffled src result.
return blended;
}
Note that the pshufb numbering restarts from 0 for the 2nd 16 bytes. The two halves of the __m256i can be different, but they can't read elements from the other half. If you need positions in the high lane to get bytes from the low lane, you'll need more shuffling + blending (e.g. including vinserti128 or vperm2i128, or maybe a vpermd lane-crossing dword shuffle) to get all the bytes you need into one 16-byte group in some order.
(Actually _mm256_shuffle_epi8 (PSHUFB) ignores bits 4..6 in a shuffle index, so writing 17 is the same as 1, but very misleading. It's effectively doing a %16, as long as the high bit isn't set. If the high bit is set in the shuffle-control vector, it zeros that element. We don't need that functionality here; _mm256_blendv_epi8 doesn't care about the old value of the element it's replacing)
Anyway, this simple 2-instruction example only works if the pattern doesn't continue. If you want help designing your real shuffles, you'll have to ask a more specific question.
And BTW, I notice that your blend pattern used 2 new bytes then 2 skipped 2. If that continues, you could use vpblendw _mm256_blend_epi16 instead of blendv, because that instruction runs in only 1 uop instead of 2 on Intel CPUs. It would also allow you to use AVX512BW vpermw, a 16-bit shuffle available in current Skylake-AVX512 CPUs, instead of the probably-even-slower AVX512VBMI vpermb.
Or actually, it would maybe let you use vpmovzxwd (_mm256_cvtepu16_epi32) to zero-extend 16-bit elements to 32-bit, as a lane-crossing shuffle. Then blend with dst.

Most efficient way to check if all __m128i components are 0 [using <= SSE4.1 intrinsics]

I am using SSE intrinsics to determine if a rectangle (defined by four int32 values) has changed:
__m128i oldRect; // contains old left, top, right, bottom packed to 128 bits
__m128i newRect; // contains new left, top, right, bottom packed to 128 bits
__m128i xor = _mm_xor_si128(oldRect, newRect);
At this point, the resulting xor value will be all zeros if the rectangle hasn't changed. What is then the most efficient way of determining that?
Currently I am doing so:
if (xor.m128i_u64[0] | xor.m128i_u64[1])
{
// rectangle changed
}
But I assume there's a smarter way (possibly using some SSE instruction that I haven't found yet).
I am targeting SSE4.1 on x64 and I am coding C++ in Visual Studio 2013.
Edit: The question is not quite the same as Is an __m128i variable zero?, as that specifies "on SSE-2-and-earlier processors" (although Antonio did add an answer "for completeness" that addresses 4.1 some time after this question was posted and answered).

You can use the PTEST instuction via the _mm_testz_si128 intrinsic (SSE4.1), like this:
#include "smmintrin.h" // SSE4.1 header
if (!_mm_testz_si128(xor, xor))
{
// rectangle has changed
}
Note that _mm_testz_si128 returns 1 if the bitwise AND of the two arguments is zero.

Ironically, ptest instruction from SSE 4.1 may be slower than pmovmskb from SSE2 in some cases. I suggest using simply:
__m128i cmp = _mm_cmpeq_epi32(oldRect, newRect);
if (_mm_movemask_epi8(cmp) != 0xFFFF)
//registers are different
Note that if you really need that xor value, you'll have to compute it separately.
For Intel processors like Ivy Bridge, the version by PaulR with xor and _mm_testz_si128 translates into 4 uops, while suggested version without computing xor translates into 3 uops (see also this thread). This may result in better throughput of my version.

Verilog access specific bits

I have problem in accessing 32 most significant and 32 least significant bits in Verilog. I have written the following code but I get the error "Illegal part-select expression" The point here is that I don't have access to a 64 bit register. Could you please help.
`MLT: begin
if (multState==0) begin
{C,Res}<={A*B}[31:0];
multState=1;
end
else
begin
{C,Res}<={A*B}[63:32];
multState=2;
end

Unfortunately the bit-select and part-select features of Verilog are part of expression operands. They are not Verilog operators (see Sec. 5.2.1 of the Verilog 2005 Std. Document, IEEE Std 1364-2005) and can therefore not be applied to arbitrary expressions but only directly to registers or wires.
There are various ways to do what you want but I would recommend using a temporary 64 bit variable:
wire [31:0] A, B;
reg [63:0] tmp;
reg [31:0] ab_lsb, ab_msb;
always #(posedge clk) begin
tmp = A*B;
ab_lsb <= tmp[31:0];
ab_msb <= tmp[63:32];
end
(The assignments to ab_lsb and ab_msb could be conditional. Otherwise a simple "{ab_msb, ab_lsb} <= A*B;" would do the trick as well of course.)
Note that I'm using a blocking assignment to assign 'tmp' as I need the value in the following two lines. This also means that it is unsafe to access 'tmp' from outside
this always block.
Also note that the concatenation hack {A*B} is not needed here, as A*B is assigned to a 64 bit register. This also fits the recommendation in Sec 5.4.1 of IEEE Std 1364-2005:
Multiplication may be performed without losing any overflow bits by assigning the result
to something wide enough to hold it.
However, you said: "The point here is that I don't have access to a 64 bit register".
So I will describe a solution that does not use any Verilog 64 bit registers. This will however not have any impact on the resulting hardware. It will only look different in
the Verilog code.
The idea is to access the MSB bits by shifting the result of A*B. The following naive version of this will not work:
ab_msb <= (A*B) >> 32; // Don't do this -- it won't work!
The reason why this does not work is that the width of A*B is determined by the left hand side of the assignment, which is 32 bits. Therefore the result of A*B will only contain the lower 32 bits of the results.
One way of making the bit width of an operation self-determined is by using the concatenation operator:
ab_msb <= {A*B} >> 32; // Don't do this -- it still won't work!
Now the result width of the multiplication is determined using the max. width of its operands. Unfortunately both operands are 32 bit and therefore we still have a 32 bit multiplication. So we need to extend one operand to be 64 bit, e.g. by appending zeros
(I assume unsigned operands):
ab_msb <= {{32'd0, A}*B} >> 32;
Accessing the lsb bits is easy as this is the default behavior anyways:
ab_lsb <= A*B;
So we end up with the following alternative code:
wire [31:0] A, B;
reg [31:0] ab_lsb, ab_msb;
always #(posedge clk) begin
ab_lsb <= A*B;
ab_msb <= {{32'd0, A}*B} >> 32;
end
Xilinx XST 14.2 generates the same RTL netlist for both versions. I strongly recommend the first version as it is much easier to read and understand. If only 'ab_lsb' or 'ab_msb' is used, the synthesis tool will automatically discard the unused bits of 'tmp'. So there is really no difference.
If this is not the information you where looking for you should probably clarify why and how you "don't have access to 64 bit registers". After all, you try to access the bits [63:32] of a 64 bit value in your code as well. As you can't calculate the upper 32 bits of the product A*B without also performing almost all calculations required for the lower 32 bits, you might be asking for something that is not possible.

You are mixing blocking and non-blocking assignments here:
{C,Res}<={A*B}[63:32]; //< non-blocking
multState=2; //< blocking
this is considered bad practice.
Not sure that a concatenation operation which is just {A*B} is valid. At best it does nothing.
The way you have encoded it looks like you will end up with 2 hardware multipliers. What makes you say you do not have a 64 bit reg, available? reg does not have to imply flip-flops. If you have 2 32bit regs then you could have 1 64 bit one. I would personally do the multiply on 1 line then split the result up and output as 2 32 bit sections.
However :
x <= (a*b)[31:0] is unfortunately not allowed. If x is 32 bits it will take the LSBs, so all you need is :
x <= (a*b)
To take the MSBs you could try:
reg [31:0] throw_away;
{x, throw_away} <= (a*b) ;

Most efficient way to modify a stream of data

I have a stream of 16 bit values, and I need to adjust the 4 least significant bits of each sample. The new values are different for each short, but repeat every X shorts - essentially tagging each short with an ID.
Are there any bit twiddling tricks to do this faster than just a for-loop?
More details
I'm converting a file from one format to another. Currently implemented with FILE* but I could use Windows specific APIs if helpful.
[while data remaining]
{
read X shorts from input
tag 4 LSB's
write modified data to output
}
In addition to bulk operations, I guess I was looking for opinions on the best way to stomp those last 4 bits.
Shift right 4, shift left 4, | in the new values
& in my zero bits, then | in the 1 bits
modulus 16, add new value
We're only supporting win7 (32 or 64) right now, so hardware would be whatever people choose for that.

If you're working on e.g. a 32-bit platform, you can do them 2 at a time. Or on a modern x86 equivalent, you could use SIMD instructions to operate on 128 bits at a time.
Other than that, there are no bit-twiddling methods to avoid looping through your entire data set, given that it sounds like you must modify every element!

Best way to stomp those last 4 bits is your option 2:
int i;
i &= 0xFFF0;
i |= tag;
Doing this on a long would be faster if you know tag values in advance.
You can memcpy 4 shorts in one long and then do the same operations as above on 4 shorts at a time:
long l;
l &= 0xFFF0FFF0FFF0FFF0;
l |= tags;
where tags = (long) tag1 << 48 + (long) tag2 << 32 + (long) tag3 << 16 + (long) tag4;
This has sense if you are reusing this value tags often, not if you have to build it differently for each set of 4 shorts.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js