How can I add together two SSE registers (C++)

I have two SSE registers (each 128 bits wide) and I want to add them. I know how to add the corresponding lanes in them, for example _mm_add_epi16 adds 16-bit words, but what I want is something like _mm_add_epi128 (which does not exist), which would treat each register as one big word.
Is there any way to perform this operation, even if multiple instructions are needed?
I was thinking about using _mm_add_epi64, detecting overflow in the low 64-bit word and then adding 1 to the high word if needed, but I would also like this approach to work for 256-bit registers (AVX2), and the carry propagation seems too complicated for that.

To add two 128-bit numbers x and y to give z with SSE, you can do it like this:
z = _mm_add_epi64(x,y);
c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
z = _mm_sub_epi64(z,c);
This is based on this link: how-can-i-add-and-subtract-128-bit-integers-in-c-or-c.
The function unsigned_lessthan is defined below. It's complicated without AMD XOP (actually I found a simpler version for SSE4.2 if XOP is not available - see the end of my answer). Probably some of the other people here can suggest a better method. Here is some code showing this works.
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
inline __m128i unsigned_lessthan(__m128i a, __m128i b) {
#ifdef __XOP__ // AMD XOP instruction set
return _mm_comgt_epu64(b,a);
#else // SSE2 instruction set
__m128i sign32 = _mm_set1_epi32(0x80000000); // sign bit of each dword
__m128i aflip = _mm_xor_si128(a,sign32); // a with sign bits flipped
__m128i bflip = _mm_xor_si128(b,sign32); // b with sign bits flipped
__m128i equal = _mm_cmpeq_epi32(a,b); // a == b, per dword
__m128i bigger = _mm_cmpgt_epi32(bflip,aflip); // b > a unsigned, per dword
__m128i biggerl = _mm_shuffle_epi32(bigger,0xA0); // low-dword results copied to the high dwords
__m128i eqbig = _mm_and_si128(equal,biggerl); // high dwords equal and low dwords bigger
__m128i hibig = _mm_or_si128(bigger,eqbig); // high dwords bigger, or equal with low dwords bigger
__m128i big = _mm_shuffle_epi32(hibig,0xF5); // broadcast the high-dword result (a < b unsigned) to both dwords
return big;
#endif
}
int main() {
__m128i x,y,z,c;
x = _mm_set_epi64x(3,0xffffffffffffffffll);
y = _mm_set_epi64x(1,0x2ll);
z = _mm_add_epi64(x,y);
c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
z = _mm_sub_epi64(z,c);
int out[4];
//int64_t out[2];
_mm_storeu_si128((__m128i*)out, z);
printf("%d %d\n", out[2], out[0]);
}
Edit:
The only potentially efficient way to add 128-bit or 256-bit numbers with SSE is with XOP. The only option with AVX would be XOP2, which does not exist yet. And even if you have XOP it may only be efficient to add two 128-bit or 256-bit numbers in parallel (you could do four with AVX if XOP2 existed) to avoid horizontal instructions such as _mm_unpacklo_epi64.
The best solution in general is to store the registers to memory and use scalar arithmetic. Assuming you have two 256-bit registers x4 and y4 you can add them like this:
__m256i x4, y4, z4;
uint64_t x[4], y[4], z[4];
_mm256_storeu_si256((__m256i*)x, x4);
_mm256_storeu_si256((__m256i*)y, y4);
add_u256(x,y,z);
z4 = _mm256_loadu_si256((__m256i*)z);
void add_u256(uint64_t x[4], uint64_t y[4], uint64_t z[4]) {
uint64_t c1 = 0, c2 = 0, tmp;
//add low 128-bits
z[0] = x[0] + y[0];
z[1] = x[1] + y[1];
c1 += z[1]<x[1];
tmp = z[1];
z[1] += z[0]<x[0];
c1 += z[1]<tmp;
//add high 128-bits + carry from low 128-bits
z[2] = x[2] + y[2];
c2 += z[2]<x[2];
tmp = z[2];
z[2] += c1;
c2 += z[2]<tmp;
z[3] = x[3] + y[3] + c2;
}
int main() {
uint64_t x[4], y[4], z[4];
x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
y[0] = 1; y[1] = 1; y[2] = 1; y[3] = 1;
//z = x + y (z3,z2,z1,z0) = (2,3,1,0)
//x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
//y[0] = 1; y[1] = 0; y[2] = 1; y[3] = 1;
//z = x + y (z3,z2,z1,z0) = (2,3,0,0)
add_u256(x,y,z);
for(int i=3; i>=0; i--) printf("%llu ", (unsigned long long)z[i]); printf("\n");
}
Edit: based on a comment by Stephen Canon at saturated-substraction-avx-or-sse4-2 I discovered there is a more efficient way to compare unsigned 64-bit numbers with SSE4.2 if XOP is not available.
__m128i a,b;
__m128i sign64 = _mm_set1_epi64x(0x8000000000000000L);
__m128i aflip = _mm_xor_si128(a, sign64);
__m128i bflip = _mm_xor_si128(b, sign64);
__m128i cmp = _mm_cmpgt_epi64(aflip,bflip);
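For completeness, here is a sketch (my own, assuming SSE4.2 for _mm_cmpgt_epi64) that folds this comparison into the full 128-bit add from the top of the answer:
#include <stdint.h>
#include <x86intrin.h>

// 128-bit add: x + y, treating each __m128i as one 128-bit integer.
static inline __m128i add_epi128_sse42(__m128i x, __m128i y) {
    const __m128i sign64 = _mm_set1_epi64x((long long)0x8000000000000000ULL);
    __m128i z = _mm_add_epi64(x, y);                      // lane-wise 64-bit add
    // carry detection: z < x (unsigned) per lane, via the biased signed compare above
    __m128i carry = _mm_cmpgt_epi64(_mm_xor_si128(x, sign64),
                                    _mm_xor_si128(z, sign64));
    // keep only the low lane's carry, moved into the high lane (as 0 or -1)
    __m128i c = _mm_unpacklo_epi64(_mm_setzero_si128(), carry);
    return _mm_sub_epi64(z, c);                           // subtracting -1 adds the carry
}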

Related

How to efficiently perform double/int64 conversions with SSE/AVX?

SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers.
_mm_cvtps_epi32()
_mm_cvtepi32_ps()
But there are no equivalents for double-precision and 64-bit integers. In other words, they are missing:
_mm_cvtpd_epi64()
_mm_cvtepi64_pd()
It seems that AVX doesn't have them either.
What is the most efficient way to simulate these intrinsics?
There's no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned. (Also support for conversion to/from 32-bit unsigned). See intrinsics like _mm512_cvtpd_epi64 and the narrower AVX512VL versions, like _mm256_cvtpd_epi64.
If you only have AVX2 or less, you'll need tricks like below for packed-conversion. (For scalar, x86-64 has scalar int64_t <-> double or float from SSE2, but scalar uint64_t <-> FP requires tricks until AVX512 adds unsigned conversions. Scalar 32-bit unsigned can be done by zero-extending to 64-bit signed.)
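For example, with AVX-512DQ and AVX-512VL available the conversions are single intrinsics (a sketch; compile with e.g. -mavx512dq -mavx512vl, and the wrapper names here are just illustrative):
#include <immintrin.h>

__m256i dbl_to_i64(__m256d v) { return _mm256_cvtpd_epi64(v); }  /* double   -> int64_t  */
__m256d i64_to_dbl(__m256i v) { return _mm256_cvtepi64_pd(v); }  /* int64_t  -> double   */
__m256i dbl_to_u64(__m256d v) { return _mm256_cvtpd_epu64(v); }  /* double   -> uint64_t */
__m256d u64_to_dbl(__m256i v) { return _mm256_cvtepu64_pd(v); }  /* uint64_t -> double   */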
If you're willing to cut corners, double <-> int64 conversions can be done in only two instructions:
If you don't care about infinity or NaN.
For double <-> int64_t, you only care about values in the range [-2^51, 2^51].
For double <-> uint64_t, you only care about values in the range [0, 2^52).
double -> uint64_t
// Only works for inputs in the range: [0, 2^52)
__m128i double_to_uint64(__m128d x){
x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));
return _mm_xor_si128(
_mm_castpd_si128(x),
_mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
);
}
double -> int64_t
// Only works for inputs in the range: [-2^51, 2^51]
__m128i double_to_int64(__m128d x){
x = _mm_add_pd(x, _mm_set1_pd(0x0018000000000000));
return _mm_sub_epi64(
_mm_castpd_si128(x),
_mm_castpd_si128(_mm_set1_pd(0x0018000000000000))
);
}
uint64_t -> double
// Only works for inputs in the range: [0, 2^52)
__m128d uint64_to_double(__m128i x){
x = _mm_or_si128(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)));
return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0010000000000000));
}
int64_t -> double
// Only works for inputs in the range: [-2^51, 2^51]
__m128d int64_to_double(__m128i x){
x = _mm_add_epi64(x, _mm_castpd_si128(_mm_set1_pd(0x0018000000000000)));
return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0018000000000000));
}
Rounding Behavior:
For the double -> uint64_t conversion, rounding works correctly following the current rounding mode. (which is usually round-to-even)
For the double -> int64_t conversion, rounding will follow the current rounding mode for all modes except truncation. If the current rounding mode is truncation (round towards zero), it will actually round towards negative infinity.
How does it work?
Despite this trick being only 2 instructions, it's not entirely self-explanatory.
The key is to recognize that for double-precision floating-point, values in the range [2^52, 2^53) have the "binary place" just below the lowest bit of the mantissa. In other words, if you zero out the exponent and sign bits, the mantissa becomes precisely the integer representation.
To convert x from double -> uint64_t, you add the magic number M which is the floating-point value of 2^52. This puts x into the "normalized" range of [2^52, 2^53) and conveniently rounds away the fractional part bits.
Now all that's left is to remove the upper 12 bits. This is easily done by masking it out. The fastest way is to recognize that those upper 12 bits are identical to those of M. So rather than introducing an additional mask constant, we can simply subtract or XOR by M. XOR has more throughput.
Converting from uint64_t -> double is simply the reverse of this process. You add back the exponent bits of M. Then un-normalize the number by subtracting M in floating-point.
The signed integer conversions are slightly trickier since you need to deal with the 2's complement sign-extension. I'll leave those as an exercise for the reader.
Related: A fast method to round a double to a 32-bit int explained
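As a concrete one-value illustration of the double -> uint64_t trick above (a standalone sketch, not part of the original answer):
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double x = 3.0 + 0x1.0p52;                 // move 3.0 into the [2^52, 2^53) range
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);            // reinterpret the double's bits
    printf("%016llx\n", (unsigned long long)bits);                        // 4330000000000003
    printf("%llu\n", (unsigned long long)(bits ^ 0x4330000000000000ull)); // 3
    return 0;
}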
Full Range int64 -> double:
After many years, I finally had a need for this.
5 instructions for uint64_t -> double
6 instructions for int64_t -> double
uint64_t -> double
__m128d uint64_to_double_full(__m128i x){
__m128i xH = _mm_srli_epi64(x, 32);
xH = _mm_or_si128(xH, _mm_castpd_si128(_mm_set1_pd(19342813113834066795298816.))); // 2^84
__m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0xcc); // 2^52
__m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(19342813118337666422669312.)); // 2^84 + 2^52
return _mm_add_pd(f, _mm_castsi128_pd(xL));
}
int64_t -> double
__m128d int64_to_double_full(__m128i x){
__m128i xH = _mm_srai_epi32(x, 16);
xH = _mm_blend_epi16(xH, _mm_setzero_si128(), 0x33);
xH = _mm_add_epi64(xH, _mm_castpd_si128(_mm_set1_pd(442721857769029238784.))); // 3*2^67
__m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0x88); // 2^52
__m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(442726361368656609280.)); // 3*2^67 + 2^52
return _mm_add_pd(f, _mm_castsi128_pd(xL));
}
These work for the entire 64-bit range and are correctly rounded to the current rounding behavior.
These are similar to wim's answer below - but with more abusive optimizations. As such, deciphering these will also be left as an exercise to the reader.
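A quick way to sanity-check the full-range routines above (my own test sketch, pasted after the two functions; _mm_blend_epi16 needs SSE4.1, so compile with e.g. -msse4.1):
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void) {
    // low lane = 2^62, high lane = 0xFFFFFFFFFFFFFFFF (i.e. -1 as int64_t)
    __m128i v = _mm_set_epi64x(-1, (int64_t)1 << 62);
    double out[2];
    _mm_storeu_pd(out, uint64_to_double_full(v));
    printf("%.1f %.1f\n", out[0], out[1]);   // 2^62 and 2^64 (2^64-1 rounds up)
    _mm_storeu_pd(out, int64_to_double_full(v));
    printf("%.1f %.1f\n", out[0], out[1]);   // 2^62 and -1.0
    return 0;
}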
This answer is about 64-bit integer to double conversion without cutting corners. In a previous version of this answer (see the paragraph Fast and accurate conversion by splitting ... below), it was shown that it is quite efficient to split the 64-bit integers into a 32-bit low and a 32-bit high part, convert these parts to double, and compute low + high * 2^32.
The instruction counts of these conversions were:
int64_to_double_full_range 9 instructions (with mul and add as one fma)
uint64_to_double_full_range 7 instructions (with mul and add as one fma)
Inspired by Mysticial's updated answer, with better optimized accurate conversions,
I further optimized the int64_t to double conversion:
int64_to_double_fast_precise: 5 instructions.
uint64_to_double_fast_precise: 5 instructions.
The int64_to_double_fast_precise conversion takes one instruction less than Mysticial's solution.
The uint64_to_double_fast_precise code is essentially identical to Mysticial's solution (but with a vpblendd
instead of vpblendw). It is included here because of its similarities with the
int64_to_double_fast_precise conversion: The instructions are identical, only the constants differ:
#include <stdio.h>
#include <immintrin.h>
#include <stdint.h>
__m256d int64_to_double_fast_precise(const __m256i v)
/* Optimized full range int64_t to double conversion */
/* Emulate _mm256_cvtepi64_pd() */
{
__m256i magic_i_lo = _mm256_set1_epi64x(0x4330000000000000); /* 2^52 encoded as floating-point */
__m256i magic_i_hi32 = _mm256_set1_epi64x(0x4530000080000000); /* 2^84 + 2^63 encoded as floating-point */
__m256i magic_i_all = _mm256_set1_epi64x(0x4530000080100000); /* 2^84 + 2^63 + 2^52 encoded as floating-point */
__m256d magic_d_all = _mm256_castsi256_pd(magic_i_all);
__m256i v_lo = _mm256_blend_epi32(magic_i_lo, v, 0b01010101); /* Blend the 32 lowest significant bits of v with magic_int_lo */
__m256i v_hi = _mm256_srli_epi64(v, 32); /* Extract the 32 most significant bits of v */
v_hi = _mm256_xor_si256(v_hi, magic_i_hi32); /* Flip the msb of v_hi and blend with 0x45300000 */
__m256d v_hi_dbl = _mm256_sub_pd(_mm256_castsi256_pd(v_hi), magic_d_all); /* Compute in double precision: */
__m256d result = _mm256_add_pd(v_hi_dbl, _mm256_castsi256_pd(v_lo)); /* (v_hi - magic_d_all) + v_lo Do not assume associativity of floating point addition !! */
return result; /* With gcc use -O3, then -fno-associative-math is default. Do not use -Ofast, which enables -fassociative-math! */
/* With icc use -fp-model precise */
}
__m256d uint64_to_double_fast_precise(const __m256i v)
/* Optimized full range uint64_t to double conversion */
/* This code is essentially identical to Mysticial's solution. */
/* Emulate _mm256_cvtepu64_pd() */
{
__m256i magic_i_lo = _mm256_set1_epi64x(0x4330000000000000); /* 2^52 encoded as floating-point */
__m256i magic_i_hi32 = _mm256_set1_epi64x(0x4530000000000000); /* 2^84 encoded as floating-point */
__m256i magic_i_all = _mm256_set1_epi64x(0x4530000000100000); /* 2^84 + 2^52 encoded as floating-point */
__m256d magic_d_all = _mm256_castsi256_pd(magic_i_all);
__m256i v_lo = _mm256_blend_epi32(magic_i_lo, v, 0b01010101); /* Blend the 32 lowest significant bits of v with magic_int_lo */
__m256i v_hi = _mm256_srli_epi64(v, 32); /* Extract the 32 most significant bits of v */
v_hi = _mm256_xor_si256(v_hi, magic_i_hi32); /* Blend v_hi with 0x45300000 */
__m256d v_hi_dbl = _mm256_sub_pd(_mm256_castsi256_pd(v_hi), magic_d_all); /* Compute in double precision: */
__m256d result = _mm256_add_pd(v_hi_dbl, _mm256_castsi256_pd(v_lo)); /* (v_hi - magic_d_all) + v_lo Do not assume associativity of floating point addition !! */
return result; /* With gcc use -O3, then -fno-associative-math is default. Do not use -Ofast, which enables -fassociative-math! */
/* With icc use -fp-model precise */
}
int main(){
int i;
uint64_t j;
__m256i j_4;
__m256d v;
double x[4];
double x0, x1, a0, a1;
j = 0ull;
printf("\nAccurate int64_to_double\n");
for (i = 0; i < 260; i++){
j_4= _mm256_set_epi64x(0, 0, -j, j);
v = int64_to_double_fast_precise(j_4);
_mm256_storeu_pd(x,v);
x0 = x[0];
x1 = x[1];
a0 = _mm_cvtsd_f64(_mm_cvtsi64_sd(_mm_setzero_pd(),j));
a1 = _mm_cvtsd_f64(_mm_cvtsi64_sd(_mm_setzero_pd(),-j));
printf(" j =%21li v =%23.1f v=%23.1f -v=%23.1f -v=%23.1f d=%.1f d=%.1f\n", j, x0, a0, x1, a1, x0-a0, x1-a1);
j = j+(j>>2)-(j>>5)+1ull;
}
j = 0ull;
printf("\nAccurate uint64_to_double\n");
for (i = 0; i < 260; i++){
if (i==258){j=-1;}
if (i==259){j=-2;}
j_4= _mm256_set_epi64x(0, 0, -j, j);
v = uint64_to_double_fast_precise(j_4);
_mm256_storeu_pd(x,v);
x0 = x[0];
x1 = x[1];
a0 = (double)((uint64_t)j);
a1 = (double)((uint64_t)-j);
printf(" j =%21li v =%23.1f v=%23.1f -v=%23.1f -v=%23.1f d=%.1f d=%.1f\n", j, x0, a0, x1, a1, x0-a0, x1-a1);
j = j+(j>>2)-(j>>5)+1ull;
}
return 0;
}
The conversions may fail if unsafe math optimization options are enabled. With gcc, -O3 is
safe, but -Ofast may lead to wrong results, because we may not assume associativity
of floating point addition here (the same holds for Mysticial's conversions).
With icc use -fp-model precise.
Fast and accurate conversion by splitting the 64-bit integers in a 32-bit low and a 32-bit high part.
We assume that both the integer input and the double output are in 256 bit wide AVX registers.
Two approaches are considered:
int64_to_double_based_on_cvtsi2sd(): as suggested in the comments on the question, use cvtsi2sd 4 times together with some data shuffling.
Unfortunately both cvtsi2sd and the data shuffling instructions need execution port 5. This limits the performance of this approach.
int64_to_double_full_range(): we can use Mysticial's fast conversion method twice in order to get an accurate conversion for the full 64-bit integer range. The 64-bit integer is split into a 32-bit low and a 32-bit high part, similarly to the answers to this question: How to perform uint32/float conversion with SSE?
Each of these pieces is suitable for Mysticial's integer to double conversion. Finally the high part is multiplied by 2^32 and added to the low part.
The signed conversion is a little bit more complicated than the unsigned conversion (uint64_to_double_full_range()), because srai_epi64() doesn't exist.
Code:
#include <stdio.h>
#include <immintrin.h>
#include <stdint.h>
/*
gcc -O3 -Wall -m64 -mfma -mavx2 -march=broadwell cvt_int_64_double.c
./a.out A
time ./a.out B
time ./a.out C
etc.
*/
inline __m256d uint64_to_double256(__m256i x){ /* Mysticial's fast uint64_to_double. Works for inputs in the range: [0, 2^52) */
x = _mm256_or_si256(x, _mm256_castpd_si256(_mm256_set1_pd(0x0010000000000000)));
return _mm256_sub_pd(_mm256_castsi256_pd(x), _mm256_set1_pd(0x0010000000000000));
}
inline __m256d int64_to_double256(__m256i x){ /* Mysticial's fast int64_to_double. Works for inputs in the range: (-2^51, 2^51) */
x = _mm256_add_epi64(x, _mm256_castpd_si256(_mm256_set1_pd(0x0018000000000000)));
return _mm256_sub_pd(_mm256_castsi256_pd(x), _mm256_set1_pd(0x0018000000000000));
}
__m256d int64_to_double_full_range(const __m256i v)
{
__m256i msk_lo =_mm256_set1_epi64x(0xFFFFFFFF);
__m256d cnst2_32_dbl =_mm256_set1_pd(4294967296.0); /* 2^32 */
__m256i v_lo = _mm256_and_si256(v,msk_lo); /* extract the 32 lowest significant bits of v */
__m256i v_hi = _mm256_srli_epi64(v,32); /* 32 most significant bits of v. srai_epi64 doesn't exist */
__m256i v_sign = _mm256_srai_epi32(v,32); /* broadcast sign bit to the 32 most significant bits */
v_hi = _mm256_blend_epi32(v_hi,v_sign,0b10101010); /* restore the correct sign of v_hi */
__m256d v_lo_dbl = int64_to_double256(v_lo); /* v_lo is within specified range of int64_to_double */
__m256d v_hi_dbl = int64_to_double256(v_hi); /* v_hi is within specified range of int64_to_double */
v_hi_dbl = _mm256_mul_pd(cnst2_32_dbl,v_hi_dbl); /* _mm256_mul_pd and _mm256_add_pd may compile to a single fma instruction */
return _mm256_add_pd(v_hi_dbl,v_lo_dbl); /* rounding occurs if the integer doesn't exist as a double */
}
__m256d int64_to_double_based_on_cvtsi2sd(const __m256i v)
{ __m128d zero = _mm_setzero_pd(); /* to avoid uninitialized variables in _mm_cvtsi64_sd */
__m128i v_lo = _mm256_castsi256_si128(v);
__m128i v_hi = _mm256_extracti128_si256(v,1);
__m128d v_0 = _mm_cvtsi64_sd(zero,_mm_cvtsi128_si64(v_lo));
__m128d v_2 = _mm_cvtsi64_sd(zero,_mm_cvtsi128_si64(v_hi));
__m128d v_1 = _mm_cvtsi64_sd(zero,_mm_extract_epi64(v_lo,1));
__m128d v_3 = _mm_cvtsi64_sd(zero,_mm_extract_epi64(v_hi,1));
__m128d v_01 = _mm_unpacklo_pd(v_0,v_1);
__m128d v_23 = _mm_unpacklo_pd(v_2,v_3);
__m256d v_dbl = _mm256_castpd128_pd256(v_01);
v_dbl = _mm256_insertf128_pd(v_dbl,v_23,1);
return v_dbl;
}
__m256d uint64_to_double_full_range(const __m256i v)
{
__m256i msk_lo =_mm256_set1_epi64x(0xFFFFFFFF);
__m256d cnst2_32_dbl =_mm256_set1_pd(4294967296.0); /* 2^32 */
__m256i v_lo = _mm256_and_si256(v,msk_lo); /* extract the 32 lowest significant bits of v */
__m256i v_hi = _mm256_srli_epi64(v,32); /* 32 most significant bits of v */
__m256d v_lo_dbl = uint64_to_double256(v_lo); /* v_lo is within specified range of uint64_to_double */
__m256d v_hi_dbl = uint64_to_double256(v_hi); /* v_hi is within specified range of uint64_to_double */
v_hi_dbl = _mm256_mul_pd(cnst2_32_dbl,v_hi_dbl);
return _mm256_add_pd(v_hi_dbl,v_lo_dbl); /* rounding may occur for inputs >2^52 */
}
int main(int argc, char **argv){
int i;
uint64_t j;
__m256i j_4, j_inc;
__m256d v, v_acc;
double x[4];
char test = argv[1][0];
if (test=='A'){ /* test the conversions for several integer values */
j = 1ull;
printf("\nint64_to_double_full_range\n");
for (i = 0; i<30; i++){
j_4= _mm256_set_epi64x(j-3,j+3,-j,j);
v = int64_to_double_full_range(j_4);
_mm256_storeu_pd(x,v);
printf("j =%21li v =%23.1f -v=%23.1f v+3=%23.1f v-3=%23.1f \n",j,x[0],x[1],x[2],x[3]);
j = j*7ull;
}
j = 1ull;
printf("\nint64_to_double_based_on_cvtsi2sd\n");
for (i = 0; i<30; i++){
j_4= _mm256_set_epi64x(j-3,j+3,-j,j);
v = int64_to_double_based_on_cvtsi2sd(j_4);
_mm256_storeu_pd(x,v);
printf("j =%21li v =%23.1f -v=%23.1f v+3=%23.1f v-3=%23.1f \n",j,x[0],x[1],x[2],x[3]);
j = j*7ull;
}
j = 1ull;
printf("\nuint64_to_double_full_range\n");
for (i = 0; i<30; i++){
j_4= _mm256_set_epi64x(j-3,j+3,j,j);
v = uint64_to_double_full_range(j_4);
_mm256_storeu_pd(x,v);
printf("j =%21lu v =%23.1f v+3=%23.1f v-3=%23.1f \n",j,x[0],x[2],x[3]);
j = j*7ull;
}
}
else{
j_4 = _mm256_set_epi64x(-123,-4004,-312313,-23412731);
j_inc = _mm256_set_epi64x(1,1,1,1);
v_acc = _mm256_setzero_pd();
switch(test){
case 'B' :{
printf("\nLatency int64_to_double_cvtsi2sd()\n"); /* simple test to get a rough idea of the latency of int64_to_double_cvtsi2sd() */
for (i = 0; i<1000000000; i++){
v =int64_to_double_based_on_cvtsi2sd(j_4);
j_4= _mm256_castpd_si256(v); /* cast without conversion, use output as an input in the next step */
}
_mm256_storeu_pd(x,v);
}
break;
case 'C' :{
printf("\nLatency int64_to_double_full_range()\n"); /* simple test to get a rough idea of the latency of int64_to_double_full_range() */
for (i = 0; i<1000000000; i++){
v = int64_to_double_full_range(j_4);
j_4= _mm256_castpd_si256(v);
}
_mm256_storeu_pd(x,v);
}
break;
case 'D' :{
printf("\nThroughput int64_to_double_cvtsi2sd()\n"); /* simple test to get a rough idea of the throughput of int64_to_double_cvtsi2sd() */
for (i = 0; i<1000000000; i++){
j_4 = _mm256_add_epi64(j_4,j_inc); /* each step a different input */
v = int64_to_double_based_on_cvtsi2sd(j_4);
v_acc = _mm256_xor_pd(v,v_acc); /* use somehow the results */
}
_mm256_storeu_pd(x,v_acc);
}
break;
case 'E' :{
printf("\nThroughput int64_to_double_full_range()\n"); /* simple test to get a rough idea of the throughput of int64_to_double_full_range() */
for (i = 0; i<1000000000; i++){
j_4 = _mm256_add_epi64(j_4,j_inc);
v = int64_to_double_full_range(j_4);
v_acc = _mm256_xor_pd(v,v_acc);
}
_mm256_storeu_pd(x,v_acc);
}
break;
default : {}
}
printf("v =%23.1f -v =%23.1f v =%23.1f -v =%23.1f \n",x[0],x[1],x[2],x[3]);
}
return 0;
}
The actual performance of these functions may depend on the surrounding code and the CPU generation.
Timing results for 1e9 conversions (256-bit wide) with the simple tests B, C, D, and E in the code above, on an Intel Skylake i5-6500 system:
Latency experiment int64_to_double_based_on_cvtsi2sd() (test B) 5.02 sec.
Latency experiment int64_to_double_full_range() (test C) 3.77 sec.
Throughput experiment int64_to_double_based_on_cvtsi2sd() (test D) 2.82 sec.
Throughput experiment int64_to_double_full_range() (test E) 1.07 sec.
The difference in throughput between int64_to_double_full_range() and int64_to_double_based_on_cvtsi2sd() is larger than I expected.
Thanks @Mysticial and @wim for the full-range i64->f64. I came up with a full-range truncating f64->i64 for the Highway SIMD wrapper.
The first version tried to change the rounding mode, but Clang reorders the rounding-mode changes around the conversion and ignores asm volatile, memory/cc clobbers, and even an atomic fence. It's not clear to me how to make that safe; NOINLINE works but causes lots of spilling.
A second version (Compiler Explorer link) emulates FP renormalization and turns out to be faster according to llvm-mca (8-10 cycles rthroughput/total).
// Full-range F64 -> I64 conversion
#include <hwy/highway.h>
namespace hwy {
namespace HWY_NAMESPACE {
HWY_API Vec256<int64_t> I64FromF64(Full256<int64_t> di, const Vec256<double> v) {
const RebindToFloat<decltype(di)> dd;
using VD = decltype(v);
using VI = decltype(Zero(di));
const VI k0 = Zero(di);
const VI k1 = Set(di, 1);
const VI k51 = Set(di, 51);
// Exponent indicates whether the number can be represented as int64_t.
const VI biased_exp = ShiftRight<52>(BitCast(di, v)) & Set(di, 0x7FF);
const VI exp = biased_exp - Set(di, 0x3FF);
const auto in_range = exp < Set(di, 63);
// If we were to cap the exponent at 51 and add 2^52, the number would be in
// [2^52, 2^53) and mantissa bits could be read out directly. We need to
// round-to-0 (truncate), but changing rounding mode in MXCSR hits a
// compiler reordering bug: https://gcc.godbolt.org/z/4hKj6c6qc . We instead
// manually shift the mantissa into place (we already have many of the
// inputs anyway).
const VI shift_mnt = Max(k51 - exp, k0);
const VI shift_int = Max(exp - k51, k0);
const VI mantissa = BitCast(di, v) & Set(di, (1ULL << 52) - 1);
// Include implicit 1-bit; shift by one more to ensure it's in the mantissa.
const VI int52 = (mantissa | Set(di, 1ULL << 52)) >> (shift_mnt + k1);
// For inputs larger than 2^52, insert zeros at the bottom.
const VI shifted = int52 << shift_int;
// Restore the one bit lost when shifting in the implicit 1-bit.
const VI restored = shifted | ((mantissa & k1) << (shift_int - k1));
// Saturate to LimitsMin (unchanged when negating below) or LimitsMax.
const VI sign_mask = BroadcastSignBit(BitCast(di, v));
const VI limit = Set(di, LimitsMax<int64_t>()) - sign_mask;
const VI magnitude = IfThenElse(in_range, restored, limit);
// If the input was negative, negate the integer (two's complement).
return (magnitude ^ sign_mask) - sign_mask;
}
void Test(const double* pd, int64_t* pi) {
Full256<int64_t> di;
Full256<double> dd;
for (size_t i = 0; i < 256; i += Lanes(di)) {
Store(I64FromF64(di, Load(dd, pd + i)), di, pi + i);
}
}
}
}
If anyone sees any potential for simplifying the algorithm, please leave a comment.

Square root of an OpenCV grey image using SSE

Given a grey cv::Mat (CV_8UC1), I want to return another cv::Mat containing the square root of the elements (CV_32FC1), and I want to do it with SSE2 intrinsics. I am having some problems with the conversion from 8-bit values to 32-bit float values needed to perform the square root. I would really appreciate any help. This is my code for now (it does not give correct values):
uchar *source = (uchar *)cv::alignPtr(image.data, 16);
float *sqDataPtr = cv::alignPtr((float *)Squared.data, 16);
for (x = 0; x < (pixels - 16); x += 16) {
__m128i a0 = _mm_load_si128((__m128i *)(source + x));
__m128i first8 = _mm_unpacklo_epi8(a0, _mm_set1_epi8(0));
__m128i last8 = _mm_unpackhi_epi8(a0, _mm_set1_epi8(0));
__m128i first4i = _mm_unpacklo_epi16(first8, _mm_set1_epi16(0));
__m128i second4i = _mm_unpackhi_epi16(first8, _mm_set1_epi16(0));
__m128 first4 = _mm_cvtepi32_ps(first4i);
__m128 second4 = _mm_cvtepi32_ps(second4i);
__m128i third4i = _mm_unpacklo_epi16(last8, _mm_set1_epi16(0));
__m128i fourth4i = _mm_unpackhi_epi16(last8, _mm_set1_epi16(0));
__m128 third4 = _mm_cvtepi32_ps(third4i);
__m128 fourth4 = _mm_cvtepi32_ps(fourth4i);
// Store
_mm_store_ps(sqDataPtr + x, _mm_sqrt_ps(first4));
_mm_store_ps(sqDataPtr + x + 4, _mm_sqrt_ps(second4));
_mm_store_ps(sqDataPtr + x + 8, _mm_sqrt_ps(third4));
_mm_store_ps(sqDataPtr + x + 12, _mm_sqrt_ps(fourth4));
}
The SSE code looks OK, except that you're not processing the last 16 pixels:
for (x = 0; x < (pixels - 16); x += 16)
should be:
for (x = 0; x <= (pixels - 16); x += 16)
Note that if your image width is not a multiple of 16 then you will need to take care of any remaining pixels after the last full vector.
Also note that you are taking the sqrt of values in the range 0..255. It may be that you want normalised values in the range 0..1.0, in which case you'll want to scale the values accordingly.
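For those remaining pixels a plain scalar tail is enough; a minimal sketch (sqrt_tail is a hypothetical helper, called with the index just past the last full vector and the total pixel count):
#include <math.h>

// scalar tail: handle the pixels after the last full 16-pixel vector
static void sqrt_tail(const unsigned char *src, float *dst, size_t begin, size_t end) {
    for (size_t x = begin; x < end; ++x)
        dst[x] = sqrtf((float)src[x]);   // scale by 1.0f/255 here if you want 0..1 output
}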
I have no experience with SSE2, but I think that if performance is the issue you should use a look-up table. Creating the look-up table is fast since there are only 256 possible values. Copying 4 bytes from the look-up table into the destination matrix should be a very efficient operation.
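A minimal sketch of that look-up-table idea (plain C; the names are illustrative):
#include <math.h>

static float sqrt_lut[256];

static void init_sqrt_lut(void) {
    for (int i = 0; i < 256; ++i)
        sqrt_lut[i] = sqrtf((float)i);   // or sqrtf(i / 255.0f) for 0..1 output
}

// per pixel:  dst[x] = sqrt_lut[src[x]];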

packing 10 bit values into a byte stream with SIMD [duplicate]

This question already has answers here:
Keep only the 10 useful bits in 16-bit words
I'm trying to pack 10-bit pixels into a continuous byte stream using SIMD instructions. The code below does it "in principle" but the SIMD version is slower than the scalar version.
The problem seems to be that I can't find good gather/scatter operations that load the register efficiently.
Any suggestions for improvement?
// SIMD_test.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include "Windows.h"
#include <tmmintrin.h>
#include <stdint.h>
#include <string.h>
// reference non-SIMD implementation that "works"
// 4 uint16 at a time as input, and 5 uint8 as output per loop iteration
void packSlow(uint16_t* ptr, uint8_t* streamBuffer, uint32_t NCOL)
{
for(uint32_t j=0;j<NCOL;j+=4)
{
streamBuffer[0] = (uint8_t)(ptr[0]);
streamBuffer[1] = (uint8_t)(((ptr[0]&0x3FF)>>8) | ((ptr[1]&0x3F) <<2));
streamBuffer[2] = (uint8_t)(((ptr[1]&0x3FF)>>6) | ((ptr[2]&0x0F) <<4));
streamBuffer[3] = (uint8_t)(((ptr[2]&0x3FF)>>4) | ((ptr[3]&0x03) <<6));
streamBuffer[4] = (uint8_t)((ptr[3]&0x3FF)>>2) ;
streamBuffer += 5;
ptr += 4;
}
}
// poorly written SIMD implementation. Attempts to do the same
// as the packSlow, but 8 iterations at a time
void packFast(uint16_t* ptr, uint8_t* streamBuffer, uint32_t NCOL)
{
const __m128i maska = _mm_set_epi16(0x3FF,0x3FF,0x3FF,0x3FF,0x3FF,0x3FF,0x3FF,0x3FF);
const __m128i maskb = _mm_set_epi16(0x3F,0x3F,0x3F,0x3F,0x3F,0x3F,0x3F,0x3F);
const __m128i maskc = _mm_set_epi16(0x0F,0x0F,0x0F,0x0F,0x0F,0x0F,0x0F,0x0F);
const __m128i maskd = _mm_set_epi16(0x03,0x03,0x03,0x03,0x03,0x03,0x03,0x03);
for(uint32_t j=0;j<NCOL;j+=4*8)
{
_mm_prefetch((const char*)(ptr+j),_MM_HINT_T0);
}
for(uint32_t j=0;j<NCOL;j+=4*8)
{
// this "fetch" stage is costly. Each term takes 2 cycles
__m128i ptr0 = _mm_set_epi16(ptr[0],ptr[4],ptr[8],ptr[12],ptr[16],ptr[20],ptr[24],ptr[28]);
__m128i ptr1 = _mm_set_epi16(ptr[1],ptr[5],ptr[9],ptr[13],ptr[17],ptr[21],ptr[25],ptr[29]);
__m128i ptr2 = _mm_set_epi16(ptr[2],ptr[6],ptr[10],ptr[14],ptr[18],ptr[22],ptr[26],ptr[30]);
__m128i ptr3 = _mm_set_epi16(ptr[3],ptr[7],ptr[11],ptr[15],ptr[19],ptr[23],ptr[27],ptr[31]);
// I think this part is fairly well optimized
__m128i streamBuffer0 = ptr0;
__m128i streamBuffer1 = _mm_or_si128(_mm_srl_epi16 (_mm_and_si128 (ptr0 , maska), _mm_set_epi32(0, 0, 0,8)) , _mm_sll_epi16 (_mm_and_si128 (ptr1 , maskb) , _mm_set_epi32(0, 0, 0,2)));
__m128i streamBuffer2 = _mm_or_si128(_mm_srl_epi16 (_mm_and_si128 (ptr1 , maska), _mm_set_epi32(0, 0, 0,6)) , _mm_sll_epi16 (_mm_and_si128 (ptr2 , maskc) , _mm_set_epi32(0, 0, 0,4)));
__m128i streamBuffer3 = _mm_or_si128(_mm_srl_epi16 (_mm_and_si128 (ptr2 , maska), _mm_set_epi32(0, 0, 0,4)) , _mm_sll_epi16 (_mm_and_si128 (ptr3 , maskd) , _mm_set_epi32(0, 0, 0,6)));
__m128i streamBuffer4 = _mm_srl_epi16 (_mm_and_si128 (ptr3 , maska), _mm_set_epi32(0, 0, 0,2)) ;
// this again is terribly slow. ~2 cycles per byte output
for(int j=15;j>=0;j-=2)
{
streamBuffer[0] = streamBuffer0.m128i_u8[j];
streamBuffer[1] = streamBuffer1.m128i_u8[j];
streamBuffer[2] = streamBuffer2.m128i_u8[j];
streamBuffer[3] = streamBuffer3.m128i_u8[j];
streamBuffer[4] = streamBuffer4.m128i_u8[j];
streamBuffer += 5;
}
ptr += 32;
}
}
int _tmain(int argc, _TCHAR* argv[])
{
uint16_t pixels[512];
uint8_t packed1[512*10/8];
uint8_t packed2[512*10/8];
for(int i=0;i<512;i++)
{
pixels[i] = i;
}
LARGE_INTEGER t0,t1,t2;
QueryPerformanceCounter(&t0);
for(int k=0;k<1000;k++) packSlow(pixels,packed1,512);
QueryPerformanceCounter(&t1);
for(int k=0;k<1000;k++) packFast(pixels,packed2,512);
QueryPerformanceCounter(&t2);
printf("%d %d\n",t1.QuadPart-t0.QuadPart,t2.QuadPart-t1.QuadPart);
if (memcmp(packed1,packed2,sizeof(packed1)))
{
printf("failed\n");
}
return 0;
}
On re-reading your code, it looks like you are almost definitely murdering your load/store unit, which wouldn't even get complete relief with the new AVX2 VGATHER[D/Q]P[D/S] instruction family. Even Haswell's architecture still requires a uop per load element, each hitting the L1D TLB and cache, regardless of locality, with efficiency improvements showing in Skylake ca. 2016 at earliest.
Your best recourse at present is probably to do 16B register reads and manually construct your streamBuffer values with register copies, _mm_shuffle_epi8(), and _mm_or_si128() calls, and the inverse for the finishing stores.
In the near future, AVX2 will provide (and does for newer desktops already) VPS[LL/RL/RA]V[D/Q] instructions that allow variable element shifting that, combined with a horizontal add, could do this packing pretty quickly. In this case, you could use simple MOVDQU instructions for loading your values, since you could process contiguous uint16_t input values in a single xmm register.
Also, consider reworking your prefetching. Your j in NCOL loop is processing 64B/1 cache line at a time, so you should probably do a single prefetch for ptr + 32 at the beginning of your second loop's body. You might even consider omitting it, since it's a simple forward scan that the hardware prefetcher will detect and automate for you after a very small number of iterations anyway.
I have no experience specifically in SSE. But I would have tried to optimize the code as follows.
// warning. This routine requires streamBuffer to have at least 3 extra spare bytes
// at the end to be used as scratch space. It will write 0's to those bytes.
// for example, streamBuffer needs to be 640+3 bytes of allocated memory if
// 512 10-bit samples are output.
void packSlow1(uint16_t* ptr, uint8_t* streamBuffer, uint32_t NCOL)
{
for(uint32_t j=0;j<NCOL;j+=4*4)
{
uint64_t *dst;
uint64_t src[4][4];
// __m128i s01 = _mm_set_epi64(ptr[0], ptr[1]);
// __m128i s23 = _mm_set_epi64(ptr[2], ptr[3]);
// ---- or ----
// __m128i s0123 = _mm_load_si128(ptr[0])
// __m128i s01 = _?????_(s0123) // some instruction to extract s01 from s0123
// __m128i s23 = _?????_(s0123) // some instruction to extract s23
src[0][0] = ptr[0] & 0x3ff;
src[0][1] = ptr[1] & 0x3ff;
src[0][2] = ptr[2] & 0x3ff;
src[0][3] = ptr[3] & 0x3ff;
src[1][0] = ptr[4] & 0x3ff;
src[1][1] = ptr[5] & 0x3ff;
src[1][2] = ptr[6] & 0x3ff;
src[1][3] = ptr[7] & 0x3ff;
src[2][0] = ptr[8] & 0x3ff;
src[2][1] = ptr[9] & 0x3ff;
src[2][2] = ptr[10] & 0x3ff;
src[2][3] = ptr[11] & 0x3ff;
src[3][0] = ptr[12] & 0x3ff;
src[3][1] = ptr[13] & 0x3ff;
src[3][2] = ptr[14] & 0x3ff;
src[3][3] = ptr[15] & 0x3ff;
// looks like _mm_maskmoveu_si128 can store result efficiently
dst = (uint64_t*)streamBuffer;
dst[0] = src[0][0] | (src[0][1] << 10) | (src[0][2] << 20) | (src[0][3] << 30);
dst = (uint64_t*)(streamBuffer + 5);
dst[0] = src[1][0] | (src[1][1] << 10) | (src[1][2] << 20) | (src[1][3] << 30);
dst = (uint64_t*)(streamBuffer + 10);
dst[0] = src[2][0] | (src[2][1] << 10) | (src[2][2] << 20) | (src[2][3] << 30);
dst = (uint64_t*)(streamBuffer + 15);
dst[0] = src[3][0] | (src[3][1] << 10) | (src[3][2] << 20) | (src[3][3] << 30);
streamBuffer += 5 * 4;
ptr += 4 * 4;
}
}
UPDATE:
Benchmarks:
Ubuntu 12.04, x86_64 GNU/Linux, gcc v4.6.3 (Virtual Box)
Intel Core i7 (Macbook pro)
compiled with -O3
5717633386 (1X): packSlow
3868744491 (1.4X): packSlow1 (version from the post)
4471858853 (1.2X): packFast2 (from Mark Lakata's post)
1820784764 (3.1X): packFast3 (version from the post)
Windows 8.1, x64, VS2012 Express
Intel Core i5 (Asus)
compiled with standard 'Release' options and SSE2 enabled
00413185 (1X) packSlow
00782005 (0.5X) packSlow1
00236639 (1.7X) packFast2
00148906 (2.8X) packFast3
I see completely different results on an Asus notebook with Windows 8.1 and VS Express 2012 (code compiled with -O2). packSlow1 is 2x slower than the original packSlow, while packFast2 is 1.7X (not 2.9X) faster than packSlow. After researching this problem, I understood the reason: the VC compiler was unable to keep all the constants for packFast2 in XMM registers, so it inserted additional memory accesses into the loop (see the generated assembly). Slow memory access explains the performance degradation.
In order to get more stable results I increased the pixels buffer to 256x512 and increased the loop counter from 1000 to 10000000/256.
Here is my version of SSE optimized function.
// warning. This routine requires streamBuffer to have at least 3 extra spare bytes
// at the end to be used as scratch space. It will write 0's to those bytes.
// for example, streamBuffer needs to be 640+3 bytes of allocated memory if
// 512 10-bit samples are output.
void packFast3(uint16_t* ptr, uint8_t* streamBuffer, uint32_t NCOL)
{
const __m128i m0 = _mm_set_epi16(0, 0x3FF, 0, 0x3FF, 0, 0x3FF, 0, 0x3FF);
const __m128i m1 = _mm_set_epi16(0x3FF, 0, 0x3FF, 0, 0x3FF, 0, 0x3FF, 0);
const __m128i m2 = _mm_set_epi32(0, 0xFFFFFFFF, 0, 0xFFFFFFFF);
const __m128i m3 = _mm_set_epi32(0xFFFFFFFF, 0, 0xFFFFFFFF, 0);
const __m128i m4 = _mm_set_epi32(0, 0, 0xFFFFFFFF, 0xFFFFFFFF);
const __m128i m5 = _mm_set_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0, 0);
__m128i s0, t0, r0, x0, x1;
// unrolled and normal loop gives the same result
for(uint32_t j=0;j<NCOL;j+=8)
{
// load 8 samples into s0
s0 = _mm_loadu_si128((__m128i*)ptr); // s0=00070006_00050004_00030002_00010000
// join 16-bit samples into 32-bit words
x0 = _mm_and_si128(s0, m0); // x0=00000006_00000004_00000002_00000000
x1 = _mm_and_si128(s0, m1); // x1=00070000_00050000_00030000_00010000
t0 = _mm_or_si128(x0, _mm_srli_epi32(x1, 6)); // t0=00001c06_00001404_00000c02_00000400
// join 32-bit words into 64-bit dwords
x0 = _mm_and_si128(t0, m2); // x0=00000000_00001404_00000000_00000400
x1 = _mm_and_si128(t0, m3); // x1=00001c06_00000000_00000c02_00000000
t0 = _mm_or_si128(x0, _mm_srli_epi64(x1, 12)); // t0=00000001_c0601404_00000000_c0200400
// join 64-bit dwords
x0 = _mm_and_si128(t0, m4); // x0=00000000_00000000_00000000_c0200400
x1 = _mm_and_si128(t0, m5); // x1=00000001_c0601404_00000000_00000000
r0 = _mm_or_si128(x0, _mm_srli_si128(x1, 3)); // r0=00000000_000001c0_60140400_c0200400
// and store result
_mm_storeu_si128((__m128i*)streamBuffer, r0);
streamBuffer += 10;
ptr += 8;
}
}
I came up with a "better" solution using SIMD, but it doesn't leverage parallelization, just more efficient loads and stores (I think).
I'm posting it here for reference, not necessarily the best answer.
The benchmarks are (in arbitrary ticks)
gcc4.8.1 -O3 VS2012 /O2 Implementation
-----------------------------------------
369 (1X) 3394 (1X) packSlow (original code)
212 (1.7X) 2010 (1.7X) packSlow1 (from @alexander)
147 (2.5X) 1178 (2.9X) packFast2 (below)
Here's the code. It is essentially @alexander's code except using 128-bit registers instead of 64-bit registers, and unrolled 2x instead of 4x.
void packFast2(uint16_t* ptr, uint8_t* streamBuffer, uint32_t NCOL)
{
const __m128i maska = _mm_set_epi16(0x3FF,0x3FF,0x3FF,0x3FF,0x3FF,0x3FF,0x3FF,0x3FF);
const __m128i mask0 = _mm_set_epi16(0,0,0,0,0,0,0,0x3FF);
const __m128i mask1 = _mm_set_epi16(0,0,0,0,0,0,0x3FF,0);
const __m128i mask2 = _mm_set_epi16(0,0,0,0,0,0x3FF,0,0);
const __m128i mask3 = _mm_set_epi16(0,0,0,0,0x3FF,0,0,0);
const __m128i mask4 = _mm_set_epi16(0,0,0,0x3FF,0,0,0,0);
const __m128i mask5 = _mm_set_epi16(0,0,0x3FF,0,0,0,0,0);
const __m128i mask6 = _mm_set_epi16(0,0x3FF,0,0,0,0,0,0);
const __m128i mask7 = _mm_set_epi16(0x3FF,0,0,0,0,0,0,0);
for(uint32_t j=0;j<NCOL;j+=16)
{
__m128i s = _mm_load_si128((__m128i*)ptr); // load 8 16 bit values
__m128i s2 = _mm_load_si128((__m128i*)(ptr+8)); // load 8 16 bit values
__m128i a = _mm_and_si128(s,mask0);
a = _mm_or_si128( a, _mm_srli_epi64 (_mm_and_si128(s, mask1),6));
a = _mm_or_si128( a, _mm_srli_epi64 (_mm_and_si128(s, mask2),12));
a = _mm_or_si128( a, _mm_srli_epi64 (_mm_and_si128(s, mask3),18));
a = _mm_or_si128( a, _mm_srli_si128 (_mm_and_si128(s, mask4),24/8)); // special: shift 24 bits to the right, straddling the two 64-bit halves; luckily this is a whole-byte shift, so one _mm_srli_si128 by 24/8 = 3 bytes does it
a = _mm_or_si128( a, _mm_srli_si128 (_mm_srli_epi64 (_mm_and_si128(s, mask5),6),24/8)); // special. shift net 30 bits. first shift 6 bits, then 3 bytes.
a = _mm_or_si128( a, _mm_srli_si128 (_mm_srli_epi64 (_mm_and_si128(s, mask6),4),32/8)); // special. shift net 36 bits. first shift 4 bits, then 4 bytes (32 bits).
a = _mm_or_si128( a, _mm_srli_epi64 (_mm_and_si128(s, mask7),42));
_mm_storeu_si128((__m128i*)streamBuffer, a);
__m128i a2 = _mm_and_si128(s2,mask0);
a2 = _mm_or_si128( a2, _mm_srli_epi64 (_mm_and_si128(s2, mask1),6));
a2 = _mm_or_si128( a2, _mm_srli_epi64 (_mm_and_si128(s2, mask2),12));
a2 = _mm_or_si128( a2, _mm_srli_epi64 (_mm_and_si128(s2, mask3),18));
a2 = _mm_or_si128( a2, _mm_srli_si128 (_mm_and_si128(s2, mask4),24/8)); // special: shift 24 bits to the right, straddling the two 64-bit halves; luckily this is a whole-byte shift, so one _mm_srli_si128 by 24/8 = 3 bytes does it
a2 = _mm_or_si128( a2, _mm_srli_si128 (_mm_srli_epi64 (_mm_and_si128(s2, mask5),6),24/8)); // special. shift net 30 bits. first shift 6 bits, then 3 bytes.
a2 = _mm_or_si128( a2, _mm_srli_si128 (_mm_srli_epi64 (_mm_and_si128(s2, mask6),4),32/8)); // special. shift net 36 bits. first shift 4 bits, then 4 bytes (32 bits).
a2 = _mm_or_si128( a2, _mm_srli_epi64 (_mm_and_si128(s2, mask7),42));
_mm_storeu_si128((__m128i*)(streamBuffer+10), a2);
streamBuffer += 20 ;
ptr += 16 ;
}
}

Horizontal minimum and maximum using SSE

I have a function using SSE to do a lot of stuff, and the profiler shows me that the code portion I use to compute the horizontal minimum and maximum consumes most of the time.
For instance, I have been using the following implementation for the minimum:
static inline int16_t hMin(__m128i buffer) {
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m1)); // m1..m4 are shuffle-control constants defined elsewhere
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m2));
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m3));
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m4));
return ((int8_t*) ((void *) &buffer))[0];
}
I need to compute the minimum and the maximum of 16 1-byte integers, as you see.
Any good suggestions are highly appreciated :)
Thanks
SSE 4.1 has an instruction that does almost what you want. Its name is PHMINPOSUW, C/C++ intrinsic is _mm_minpos_epu16. It is limited to 16-bit unsigned values and cannot give maximum, but these problems could be easily solved.
If you need to find minimum of non-negative bytes, do nothing. If bytes may be negative, add 128 to each. If you need maximum, subtract each from 127.
Use either _mm_srli_epi16 or _mm_shuffle_epi8, and then _mm_min_epu8 to get 8 pairwise minimum values in the even bytes and zeros in the odd bytes of some XMM register. (These zeros are produced by the shift/shuffle instruction and remain in place after _mm_min_epu8, since the minimum with zero is zero.)
Use _mm_minpos_epu16 to find minimum among these values.
Extract the resulting minimum value with _mm_cvtsi128_si32.
Undo effect of step 1 to get the original byte value.
Here is an example that returns maximum of 16 signed bytes:
static inline int16_t hMax(__m128i buffer)
{
__m128i tmp1 = _mm_sub_epi8(_mm_set1_epi8(127), buffer);
__m128i tmp2 = _mm_min_epu8(tmp1, _mm_srli_epi16(tmp1, 8));
__m128i tmp3 = _mm_minpos_epu16(tmp2);
return (int8_t)(127 - _mm_cvtsi128_si32(tmp3));
}
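Following the same recipe, here is a sketch of the matching minimum of 16 signed bytes (my own variant, not part of the original answer):
static inline int16_t hMin_minpos(__m128i buffer)
{
    __m128i tmp1 = _mm_xor_si128(buffer, _mm_set1_epi8((char)0x80)); // add 128: maps signed order to unsigned order
    __m128i tmp2 = _mm_min_epu8(tmp1, _mm_srli_epi16(tmp1, 8));      // per-16-bit-lane byte minimum
    __m128i tmp3 = _mm_minpos_epu16(tmp2);                           // horizontal minimum (SSE4.1)
    return (int8_t)((_mm_cvtsi128_si32(tmp3) & 0xFF) - 128);         // undo the +128 bias
}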
I suggest two changes:
Replace ((int8_t*) ((void *) &buffer))[0] with _mm_cvtsi128_si32.
Replace _mm_shuffle_epi8 with _mm_shuffle_epi32/_mm_shufflelo_epi16 which have lower latency on recent AMD processors and Intel Atom, and will save you memory load operations:
static inline int16_t hMin(__m128i buffer)
{
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(3, 2, 3, 2)));
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
buffer = _mm_min_epi8(buffer, _mm_shufflelo_epi16(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
buffer = _mm_min_epi8(buffer, _mm_srli_epi16(buffer, 8));
return (int8_t)_mm_cvtsi128_si32(buffer);
}
Here's an implementation without shuffles; shuffle is slow on an AMD Ryzen 7 5000-series CPU for some reason. (mm is the wrapper class's __m128 member, and Vector4 exposes the lanes as named fields.)
float max_elem3() const {
__m128 a = _mm_unpacklo_ps(mm, mm); // x x y y
__m128 b = _mm_unpackhi_ps(mm, mm); // z z w w
__m128 c = _mm_max_ps(a, b); // ..., max(x, z), ..., ...
Vector4 res = _mm_max_ps(mm, c); // ..., max(y, max(x, z)), ..., ...
return res.y;
}
float min_elem3() const {
__m128 a = _mm_unpacklo_ps(mm, mm); // x x y y
__m128 b = _mm_unpackhi_ps(mm, mm); // z z w w
__m128 c = _mm_min_ps(a, b); // ..., min(x, z), ..., ...
Vector4 res = _mm_min_ps(mm, c); // ..., min(y, min(x, z)), ..., ...
return res.y;
}

getting error SIMD operation

I want to compute, for k=0 to k=100:
A[j][k]=((A[j][k]-con*A[r][k])%2);
For that I am storing (con*A[r][k]) in some int temp[5]
and then doing A[j][k]-temp[] with SIMD. What's wrong with the code below? It gives a segmentation fault on the line __m128i m5 = _mm_sub_epi32(*m3,*m4);
while((k+4)<100)
{
__m128i *m3 = (__m128i*)A[j+k];
temp[0]=con*A[r][k];
temp[1]=con*A[r][k+1];
temp[2]=con*A[r][k+2];
temp[3]=con*A[r][k+3];
__m128i *m4 = (__m128i*)temp;
__m128i m5 =_mm_sub_epi32(*m3,*m4);
(temp_ptr)=(int*)&m5;
printf("%ld,%d,%ld\n",A[j][k],con,A[r][k]);
A[j][k] =temp_ptr[0]%2;
A[j][k+1]=temp_ptr[1]%2;
A[j][k+2]=temp_ptr[2]%2;
A[j][k+3]=temp_ptr[3]%2;
k=k+4;
}
Most likely, you didn't take care of alignment. Dereferencing a __m128i* (as with *m3 and *m4 here) requires 16-byte alignment (see this article). Otherwise, your program will crash.
Either it's alignment, or you have wrong indices somewhere and are accessing the wrong memory.
Without the possible values for j, k, and r it's hard to tell why, but most likely you are over-indexing one of your arrays.
If you want to implement:
for (k = 0; k < 100; k += 4)
{
A[j][k] = (A[j][k] - con * A[r][k]) % 2;
}
and you want to see some benefit from SIMD, then you need to do it all in SIMD, i.e. don't mix SIMD and scalar code.
For example (untested):
const __m128i vcon = _mm_set1_epi32(con);
const __m128i vk1 = _mm_set1_epi32(1);
for (k = 0; k < 100; k += 4)
{
__m128i v1 = _mm_loadu_si128((__m128i *)&A[j][k]); // load v1 from A[j][k..k+3] (misaligned)
__m128i v2 = _mm_loadu_si128((__m128i *)&A[r][k]); // load v2 from A[r][k..k+3] (misaligned)
v2 = _mm_mullo_epi32(v2, vcon); // v2 = con * A[r][k..k+3] (SSE4.1)
v1 = _mm_sub_epi32(v1, v2); // v1 = A[j][k..k+3] - con * A[r][k..k+3]
v1 = _mm_and_si128(v1, vk1); // v1 = (A[j][k..k+3] - con * A[r][k..k+3]) % 2, computed as & 1 (always 0 or 1, even for negative differences)
_mm_storeu_si128((__m128i *)&A[j][k], v1); // store v1 back to A[j][k..k+3] (misaligned)
}
Note: if you can guarantee that each row of A is 16-byte aligned then you can change the misaligned loads/stores (_mm_loadu_si128/_mm_storeu_si128) to aligned loads/stores (_mm_load_si128/_mm_store_si128) - this will help performance somewhat, depending on what CPU you are targeting.
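If you do want the aligned variants, one way to guarantee 16-byte-aligned rows is _mm_malloc/_mm_free (a sketch; the helper name and row size are just illustrative):
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   // _mm_malloc, _mm_free

static int32_t *alloc_aligned_row(size_t n) {
    // 16-byte aligned allocation; release it with _mm_free(), not free()
    return (int32_t *)_mm_malloc(n * sizeof(int32_t), 16);
}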