Convert specific SSE intrinsics to NEON intrinsics - c++

[EDIT: edited to highlight the question in context]
Following are the SSE intrinsics for which I need NEON equivalents, as I am converting some SSE code to run on iOS.
_mm_set_ps
Sets the four single-precision, floating-point values to the four inputs.
(__m128 _mm_set_ps(float z , float y , float x , float w );)
Return Value:
r0 := w
r1 := x
r2 := y
r3 := z
_mm_loadu_ps
Loads four single-precision, floating-point values. The address does not need to be 16-byte aligned.
__m128 _mm_loadu_ps(float * p);
Return Value:
r0 := p[0]
r1 := p[1]
r2 := p[2]
r3 := p[3]
_mm_storeu_ps
Stores four single-precision, floating-point values. The address does not need to be 16-byte aligned.
void _mm_storeu_ps(float *p, __m128 a);
Return Value:
p[0] := a0
p[1] := a1
p[2] := a2
p[3] := a3
_mm_add_epi32
Adds the 4 signed or unsigned 32-bit integers in a to the 4 signed or unsigned 32-bit integers in b.
__m128i _mm_add_epi32 (__m128i a, __m128i b);
Return Value:
r0 := a0 + b0
r1 := a1 + b1
r2 := a2 + b2
r3 := a3 + b3
Note: Avoid unaligned memory access whenever possible. So, I need a way to convert unaligned access to aligned access (probably using padding).

I'm not very familiar with NEON intrinsics, but I can give you the equivalent NEON instructions. You'll find the corresponding intrinsics easily from there.
_mm_set_ps
If the values are already in S registers, you just have to re-interpret them as D registers
Otherwise, you can fill a D register with a vmov instruction:
vmov d0, r0, r1
_mm_loadu_ps
vld1.32 {d0, d1}, [r0]   ; d0/d1 together form q0; no alignment required
_mm_storeu_ps
vst1.32 {d0, d1}, [r0]
_mm_add_epi32
vadd.u32 q0, q1, q2
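For reference, a minimal sketch of the corresponding intrinsics from arm_neon.h (not part of the answer above; the wrapper function demo() and its parameters are only illustrative):
#include <arm_neon.h>
#include <stdint.h>

void demo(float *p, const int32_t *ia, const int32_t *ib, int32_t *ir,
          float z, float y, float x, float w)
{
    // _mm_set_ps(z, y, x, w): lane order is w, x, y, z
    float tmp[4] = { w, x, y, z };
    float32x4_t v_set = vld1q_f32(tmp);

    // _mm_loadu_ps(p): vld1q_f32 has no alignment requirement
    float32x4_t v = vld1q_f32(p);

    // _mm_storeu_ps(p, a): likewise for vst1q_f32
    vst1q_f32(p, v_set);

    // _mm_add_epi32(a, b): lane-wise 32-bit integer add
    int32x4_t a = vld1q_s32(ia);
    int32x4_t b = vld1q_s32(ib);
    vst1q_s32(ir, vaddq_s32(a, b));

    (void)v;   // silence the unused-variable warning in this sketch
}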

Related

Copy bits of uint64_t into two uint64_t at specific location

I have an input uint64_t X and a count N of its least significant bits that I want to write into the target uint64_t values Y and Z, starting at bit index M in Z. Unaffected parts of Y and Z should not be changed. How can I implement this efficiently in C++ for the latest Intel CPUs?
It should be efficient for execution in loops. I guess it requires no branching: the number of instructions used is expected to be constant and as small as possible.
M and N are not fixed at compile time. M can take any value from 0 to 63 (target offset in Z); N is in the range from 0 to 64 (number of bits to copy).
There's at least a four-instruction sequence available on reasonably modern IA processors.
X &= (1ull << N) - 1; // keep the N low bits (the plain C shift is UB for N == 64; bzhi handles it)
// bzhi rax, rdi, rdx
Z = X << M;
// shlx rax, rax, rsi
Y = X >> (64 - M);
// neg sil
// shrx rax, rax, rsi
The value M=0 causes a bit of pain, as Y would need to be zero in that case, and the expression X >> (64 - M) would also need sanitising.
One possibility to overcome this is
x = bzhi(x, n);
y = rol(x,m);
y = bzhi(y, m); // y &= ~(~0ull << m);
z = shlx(x, m); // z = x << m;
As OP actually wants to update the bits, one obvious solution would be to replicate the logic for masks:
xm = bzhi(~0ull, n);
ym = rol(xm, m);
ym = bzhi(ym, m);
zm = shlx(xm, m);
However, clang seems to produce something like 24 instructions total with the masks applied:
Y = (Y & ~xm) | y; // |,+,^ all possible
Z = (Z & ~zm) | z;
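For completeness, a compilable sketch of this mask-based variant, assuming BMI2 (_bzhi_u64) and a C++20 compiler (std::rotl); the function insert_bits and its signature are illustrative, not from the question:
#include <immintrin.h>   // _bzhi_u64 (BMI2)
#include <bit>           // std::rotl (C++20)
#include <cstdint>

// Write the N low bits of x into the pair (y:z), starting at bit M of z;
// all other bits of y and z are preserved. 0 <= m <= 63, 0 <= n <= 64.
static inline void insert_bits(uint64_t x, unsigned m, unsigned n,
                               uint64_t &y, uint64_t &z)
{
    uint64_t xv = _bzhi_u64(x, n);            // value: keep the n low bits
    uint64_t xm = _bzhi_u64(~0ull, n);        // mask of those n bits

    uint64_t zv = xv << m;                    // part that lands in z (shlx)
    uint64_t zm = xm << m;

    uint64_t yv = _bzhi_u64(std::rotl(xv, (int)m), m);  // part that spills into y
    uint64_t ym = _bzhi_u64(std::rotl(xm, (int)m), m);

    z = (z & ~zm) | zv;
    y = (y & ~ym) | yv;
}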
It is then likely better to change the approach:
x2 = x << (64-N); // align x to the left
y2 = y >> y_shift; // align y to right
y = shld(y2,x2, y_shift); // y fixed
Here y_shift = max(0, M+N-64)
Fixing Z is slightly more involved, as Z can be combined of three parts:
zzzzz.....zzzzXXXXXXXzzzzzz, where m=6, n=7
That should be doable with two double shifts as above.

Assembly Code only returns two of possible four values in comparison

I am currently writing a program that has to be implemented in C++ and ARM assembly language. The C++ code uses functions from the assembly file to perform particular tasks on an STM32F4-Discovery board touchscreen. One of those functions returns which quadrant the user is currently touching, and we had to write that function in assembly. No matter what I do, this function always returns quadrant 1 or 3. I think I am having an issue with branching properly, as this is one of my weak points, but I do not know for sure. Here is the function in the context of the C++ code:
//all of this is above main
extern "C" uint32_t getQuad(uint16_t, uint16_t);
TS_DISCO_F429ZI ts; //touch screen sensing object
TS_StateTypeDef TS_State;
//other code here
ts.GetState(&TS_State);
uint16_t x = TS_State.X;
uint16_t y = TS_State.Y;
uint32_t quad = getQuad(x, y);
And here is the function in the assembly file
AREA Program3_F20_Gibbs, CODE, READONLY
GLOBAL getQuad
; uint32_t getQuad ( uint16_t x, uint16_t y)
getQuad LDR R0, [R0]
LDR R1, [R1]
CMP R0, #120
BLE TwoThree
CMP R1, #160
BLE Four
MOV R0, #1
BX LR
TwoThree CMP R1, #160
BLE Three
MOV R0, #2
BX LR
Four MOV R0, #4
BX LR
Three MOV R0, #3
BX LR
I should probably also note that this quadrant system follows an x/y coordinate system with a bit of difference. Quadrant one has x/y coordinates with x > 120 and y > 160, Quadrant two has x/y coordinates with x < 120 and y > 160, Quadrant three has x/y coordinates with x < 120 and y < 160, and Quadrant four has x/y coordinates with x > 120 and y < 160. If I need to give any more info, please let me know, and thank you for any help in advance!
You can get away without any (conditional) branch:
getQuad
cmp r0, #120
adr r12, %f1
addls r12, r12, #2
cmp r1, #160
addls r12, r12, #1
ldrb r0, [r12]
bx lr
1
dcb 1, 4, 2, 3
You really don't need to write it in assembly:
static inline uint32_t getQuad(uint16_t x, uint16_t y)
{
static const uint8_t array[4] = {1, 4, 2, 3};
const uint8_t *pSrc = array;
if (x <= 120) pSrc += 2;
if (y <= 160) pSrc += 1;
return (uint32_t) *pSrc;
}
It's a very short function, so you'd better inline it and put it in the header file.
The pesky function-call overhead is gone then.
PS: le (less or equal) is for signed values. You should use ls (lower or same) instead. le works in this case, but if you are dealing with full-range 32-bit values, it could cause disastrous problems that are hard to debug.
Your logic seems sound, except that some of the comparisons in your description should be <=:
Quadrant one has x/y coordinates with x > 120 and y > 160, Quadrant two has x/y coordinates with x <= 120 and y > 160, Quadrant three has x/y coordinates with x <= 120 and y <= 160, and Quadrant four has x/y coordinates with x > 120 and y <= 160.
However, unless the calling convention is very strange, you probably don't want to be doing this at the start:
LDR R0, [R0]
LDR R1, [R1]
This is what you do if you're passing the addresses of variables and you want to dereference those addresses to get the values. Therefore you will be using the wrong values to figure out which quadrant you're in.
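Putting both remarks together, here is a sketch of the function with the dereferences removed (x and y arrive directly in R0/R1 under the AAPCS) and BLS used instead of BLE, per the note above; this is untested on the board and only illustrative:
; uint32_t getQuad(uint16_t x, uint16_t y) ; x in R0, y in R1
getQuad CMP R0, #120
BLS TwoThree        ; x <= 120 -> quadrant 2 or 3
CMP R1, #160
BLS Four            ; x > 120, y <= 160 -> quadrant 4
MOV R0, #1          ; x > 120, y > 160 -> quadrant 1
BX LR
TwoThree CMP R1, #160
BLS Three           ; x <= 120, y <= 160 -> quadrant 3
MOV R0, #2          ; x <= 120, y > 160 -> quadrant 2
BX LR
Four MOV R0, #4
BX LR
Three MOV R0, #3
BX LR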

Fastest way to perform AVX inner product operations with mixed (float, double) input vectors

I need to build a single-precision floating-point inner product routine for mixed single/double-precision floating-point vectors, exploiting the AVX instruction set for SIMD registers with 256 bits.
Problem: one input vector is float (x), while the other is double (yD).
Hence, before computing the actual inner product, I need to convert my input yD vector data from double to float.
Using the SSE2 instruction set, I was able to implement very fast code doing what I needed, with performance very close to the case where both vectors x and y were float:
void vector_operation(const size_t i)
{
__m128 X = _mm_load_ps(x + i);
__m128 Y = _mm_movelh_ps(_mm_cvtpd_ps(_mm_load_pd(yD + i + 0)), _mm_cvtpd_ps(_mm_load_pd(yD + i + 2)));
//inner-products accumulation
res = _mm_add_ps(res, _mm_mul_ps(X, Y));
}
Now, hoping for a further speed-up, I implemented a corresponding version with the AVX instruction set:
inline void vector_operation(const size_t i)
{
__m256 X = _mm256_load_ps(x + i);
__m128 yD1 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 0));
__m128 yD2 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 2));
__m128 yD3 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 4));
__m128 yD4 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 6));
__m128 Ylow = _mm_movelh_ps(yD1, yD2);
__m128 Yhigh = _mm_movelh_ps(yD3, yD4);
//Pack __m128 data inside __m256
__m256 Y = _mm256_permute2f128_ps(_mm256_castps128_ps256(Ylow), _mm256_castps128_ps256(Yhigh), 0x20);
//inner-products accumulation
res = _mm256_add_ps(res, _mm256_mul_ps(X, Y));
}
I also tested other AVX implementations using, for example, cast and insert operations instead of permuting data. Performance was similarly poor compared to the case where both x and y vectors were float.
The problem with the AVX code is that no matter how I implemented it, its performance is far inferior to that achieved by using only float x and y vectors (i.e. where no double-to-float conversion is needed).
The conversion from double to float for the yD vector seems pretty fast, while a lot of time is lost in the line where the data is inserted into the __m256 Y register.
Do you know if this is a well-known issue with AVX?
Do you have a solution that could preserve good performances?
Thanks in advance!
I rewrote your function and took better advantage of what AVX has to offer. I also used fused multiply-add at the end; if you can't use FMA, just replace that line with addition and multiplication. I only now see that I wrote an implementation that uses unaligned loads and yours uses aligned loads, but I'm not gonna lose any sleep over it. :)
__m256 foo(float*x, double* yD, const size_t i, __m256 res_prev)
{
__m256 X = _mm256_loadu_ps(x + i);
__m128 yD21 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 0));
__m128 yD43 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 4));
__m256 Y = _mm256_set_m128(yD43, yD21);
return _mm256_fmadd_ps(X, Y, res_prev);
}
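For context, one way to drive foo() over whole vectors and reduce the accumulator at the end could look like the sketch below (not part of the answer; it assumes foo() is visible in the same translation unit, n is a multiple of 8, and FMA is available):
#include <immintrin.h>
#include <cstddef>

float inner_product(float *x, double *yD, std::size_t n)   // n assumed to be a multiple of 8
{
    __m256 res = _mm256_setzero_ps();
    for (std::size_t i = 0; i < n; i += 8)
        res = foo(x, yD, i, res);

    // horizontal sum of the 8 partial sums
    __m128 lo = _mm256_castps256_ps128(res);
    __m128 hi = _mm256_extractf128_ps(res, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));            // fold upper pair onto lower pair
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));        // add the remaining lane
    return _mm_cvtss_f32(s);
}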
I did a quick benchmark and compared the running times of your implementation and mine. I tried two different benchmark approaches with several repetitions, and every time my code was around 15% faster. I used the MSVC 14.1 compiler and compiled the program with the /O2 and /arch:AVX2 flags.
EDIT: this is the disassembly of the function:
vcvtpd2ps xmm3,ymmword ptr [rdx+r8*8+20h]
vcvtpd2ps xmm2,ymmword ptr [rdx+r8*8]
vmovups ymm0,ymmword ptr [rcx+r8*4]
vinsertf128 ymm3,ymm2,xmm3,1
vfmadd213ps ymm0,ymm3,ymmword ptr [r9]
EDIT 2: this is the disassembly of your AVX implementation of the same algorithm:
vcvtpd2ps xmm0,xmmword ptr [rdx+r8*8+30h]
vcvtpd2ps xmm1,xmmword ptr [rdx+r8*8+20h]
vmovlhps xmm3,xmm1,xmm0
vcvtpd2ps xmm0,xmmword ptr [rdx+r8*8+10h]
vcvtpd2ps xmm1,xmmword ptr [rdx+r8*8]
vmovlhps xmm2,xmm1,xmm0
vperm2f128 ymm3,ymm2,ymm3,20h
vmulps ymm0,ymm3,ymmword ptr [rcx+r8*4]
vaddps ymm0,ymm0,ymmword ptr [r9]

Improve non-horizontal assignment in AVX

So I've come across another problem when dealing with AVX code. I have a case where I have 4 ymm registers that need to be split vertically into 4 other ymm registers
(i.e. ymm0(ABCD) -> ymm4(A...), ymm5(B...), ymm6(C...), ymm7(D...)).
Here is an example:
// a, b, c, d are __m256 structs with [] operators to access xyzw
__m256d A = _mm256_setr_pd(a[0], b[0], c[0], d[0]);
__m256d B = _mm256_setr_pd(a[1], b[1], c[1], d[1]);
__m256d C = _mm256_setr_pd(a[2], b[2], c[2], d[2]);
__m256d D = _mm256_setr_pd(a[3], b[3], c[3], d[3]);
Just putting Paul's comment into an answer:
My question is about how to do a matrix transposition, which is easily done in AVX as indicated by the link he provided.
Here's my implementation for those who come across here:
void Transpose(__m256d* A, __m256d* T)
{
__m256d t0 = _mm256_shuffle_pd(A[0], A[1], 0b0000);
__m256d t1 = _mm256_shuffle_pd(A[0], A[1], 0b1111);
__m256d t2 = _mm256_shuffle_pd(A[2], A[3], 0b0000);
__m256d t3 = _mm256_shuffle_pd(A[2], A[3], 0b1111);
T[0] = _mm256_permute2f128_pd(t0, t2, 0b0100000);
T[1] = _mm256_permute2f128_pd(t1, t3, 0b0100000);
T[2] = _mm256_permute2f128_pd(t0, t2, 0b0110001);
T[3] = _mm256_permute2f128_pd(t1, t3, 0b0110001);
}
This function cuts the number of instructions roughly in half at full optimization, compared to my previous attempt.
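For anyone who wants to try it, a small usage example (assuming the Transpose() function above is in scope; the values are arbitrary):
#include <immintrin.h>

int main()
{
    __m256d rows[4] = {
        _mm256_setr_pd( 0.0,  1.0,  2.0,  3.0),
        _mm256_setr_pd( 4.0,  5.0,  6.0,  7.0),
        _mm256_setr_pd( 8.0,  9.0, 10.0, 11.0),
        _mm256_setr_pd(12.0, 13.0, 14.0, 15.0)
    };
    __m256d cols[4];
    Transpose(rows, cols);
    // cols[0] now holds {0, 4, 8, 12}, i.e. element 0 of every input register,
    // which is exactly the A/B/C/D split asked about in the question.
    return 0;
}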

Cannot access memory as SSE type on x86 but works fine on x64

I've got some code written using the MSVC SSE intrinsics.
__m128 zero = _mm_setzero_ps();
__m128 center = _mm_load_ps(&sphere.origin.x);
__m128 boxmin = _mm_load_ps(&rhs.BottomLeftClosest.x);
__m128 boxmax = _mm_load_ps(&rhs.TopRightFurthest.x);
__m128 e = _mm_add_ps(_mm_max_ps(_mm_sub_ps(boxmin, center), zero), _mm_max_ps(_mm_sub_ps(center, boxmax), zero));
e = _mm_mul_ps(e, e);
__declspec(align(16)) float arr[4];
_mm_store_ps(arr, e);
float r = sphere.radius;
return (arr[0] + arr[1] + arr[2] <= r * r);
The Math::Vector type (which is the type of sphere.origin, rhs.BottomLeftClosest, and rhs.TopRightFurthest) is effectively an array of 3 floats. I aligned them to 16 bytes, and this code executes fine on x64, but on x86 I get an access violation reading a null pointer. Any advice on where this comes from?
__m128 center = _mm_load_ps(&sphere.origin.x);
_mm_load_ps() requires that the passed pointer is 16-byte aligned. There's no evidence that you ensured that sphere.origin.x is aligned properly. You'll need to use _mm_loadu_ps() instead if you can't provide that guarantee.
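A minimal sketch of the two usual fixes, reusing the variable names from the question (pick one; neither is tested against your actual types):
// Option 1: use unaligned loads; they work regardless of the pointer's alignment.
__m128 center = _mm_loadu_ps(&sphere.origin.x);
__m128 boxmin = _mm_loadu_ps(&rhs.BottomLeftClosest.x);
__m128 boxmax = _mm_loadu_ps(&rhs.TopRightFurthest.x);

// Option 2: guarantee the alignment at the type level and keep the aligned loads, e.g.
// struct __declspec(align(16)) Vector { float x, y, z; };   // or alignas(16) in C++11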