Getting min short value in a __m128i vector with SSE? - c++

This question seems similar to Getting max value in a __m128i vector with SSE? but with shorts and minimum instead of integer + maximum. This is what I came up with:
typedef short int weight;
weight horizontal_min_Vec4i(__m128i x) {
__m128i max1 = _mm_shufflehi_epi16(x, _MM_SHUFFLE(0, 0, 3, 2));
__m128i max1b = _mm_shufflelo_epi16(x, _MM_SHUFFLE(0, 0, 3, 2));
__m128i max2 = _mm_min_epi16(max1, max1b);
//max2 = _mm_min_epi16(max2, x);
max1 = _mm_shufflehi_epi16(max2, _MM_SHUFFLE(0, 0, 0, 1));
max1b = _mm_shufflelo_epi16(max2, _MM_SHUFFLE(0, 0, 0, 1));
__m128i max3 = _mm_min_epi16(max1, max1b);
max2 = _mm_min_epi16(max2, max3);
return min(_mm_extract_epi16(max2, 0), _mm_extract_epi16(max2, 4));
}
The function basically does the same as the answer in https://stackoverflow.com/a/18616825/1500111 for the upper and lower parts of x. So, I know the minimum value is either in the position 0 or 4 of the __m128i variable max2. Although it is much faster than the no SIMD function horizontal_min_Vec4i_Plain(__m128i x) shown below, I am afraid the bottleneck is the _mm_extract_epi16 operation at the last line. Is there a better way to achieve this, for a better speed up? I am using Haswell so I have access to the latest SSE extensions.
weight horizontal_min_Vec4i_Plain(__m128i x) {
weight result[8] __attribute__((aligned(16)));
_mm_store_si128((__m128i *) result, x);
weight myMin = result[0];
for (int l = 1; l < 8; l++) {
if (myMin > result[l]) {
myMin = result[l];
}
}
return myMin;
}

Signed and unsigned comparison are almost the same, except that the range with the top bit set is treated as bigger than the range with the top bit not set in unsigned comparisons, and as smaller in signed comparisons. That means signed and unsigned comparisons can be converted into each other by these rules:
x <s y = (x ^ signbit) <u (y ^ signbit)
x <u y = (x ^ signbit) <s (y ^ signbit)
This property transfers directly to min and max, so:
min_s(x, y) = min_u(x ^ signbit, y ^ signbit) ^ signbit
And then we can use _mm_minpos_epu16 to handle the horizontal minimum, to get, in total, something like
__m128i xs = _mm_xor_si128(x, _mm_set1_epi16(0x8000));
return _mm_extract_epi16(_mm_minpos_epu16(xs), 0) - 0x8000;
The - 0x8000 is ^ 0x8000 and sign-extension (extract zero-extends) rolled into one.

Related

Query points on the vertices of a Hamming cube

I have N points that lie only on the vertices of a cube, of dimension D, where D is something like 3.
A vertex may not contain any point. So every point has coordinates in {0, 1}D. I am only interested in query time, as long as the memory cost is reasonable ( not exponential in N for example :) ).
Given a query that lies on one of the cube's vertices and an input parameter r, find all the vertices (thus points) that have hamming distance <= r with the query.
What's the way to go in a c++ environment?
I am thinking of a kd-tree, but I am not sure and want help, any input, even approximative, would be appreciated! Since hamming distance comes into play, bitwise manipulations should help (e.g. XOR).
There is a nice bithack to go from one bitmask with k bits set to the lexicographically next permutation, which means it's fairly simple to loop through all masks with k bits set. XORing these masks with an initial value gives all the values at hamming distance exactly k away from it.
So for D dimensions, where D is less than 32 (otherwise change the types),
uint32_t limit = (1u << D) - 1;
for (int k = 1; k <= r; k++) {
uint32_t diff = (1u << k) - 1;
while (diff <= limit) {
// v is the input vertex
uint32_t vertex = v ^ diff;
// use it
diff = nextBitPermutation(diff);
}
}
Where nextBitPermutation may be implemented in C++ as something like (if you have __builtin_ctz)
uint32_t nextBitPermutation(uint32_t v) {
// see https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
uint32_t t = v | (v - 1);
return (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
}
Or for MSVC (not tested)
uint32_t nextBitPermutation(uint32_t v) {
// see https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
uint32_t t = v | (v - 1);
unsigned long tzc;
_BitScanForward(&tzc, v); // v != 0 so the return value doesn't matter
return (t + 1) | (((~t & -~t) - 1) >> (tzc + 1));
}
If D is really low, 4 or lower, the old popcnt-with-pshufb works really well and generally everything just lines up well, like this:
uint16_t query(int vertex, int r, int8_t* validmask)
{
// validmask should be array of 16 int8_t's,
// 0 for a vertex that doesn't exist, -1 if it does
__m128i valid = _mm_loadu_si128((__m128i*)validmask);
__m128i t0 = _mm_set1_epi8(vertex);
__m128i r0 = _mm_set1_epi8(r + 1);
__m128i all = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
__m128i popcnt_lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
__m128i dist = _mm_shuffle_epi8(popcnt_lut, _mm_xor_si128(t0, all));
__m128i close_enough = _mm_cmpgt_epi8(r0, dist);
__m128i result = _mm_and_si128(close_enough, valid);
return _mm_movemask_epi8(result);
}
This should be fairly fast; fast compared to the bithack above (nextBitPermutation, which is fairly heavy, is used a lot there) and also compared to looping over all vertices and testing whether they are in range (even with builtin popcnt, that automatically takes at least 16 cycles and the above shouldn't, assuming everything is cached or even permanently in a register). The downside is the result is annoying to work with, since it's a mask of which vertices both exist and are in range of the queried point, not a list of them. It would combine well with doing some processing on data associated with the points though.
This also scales down to D=3 of course, just make none of the points >= 8 valid. D>4 can be done similarly but it takes more code then, and since this is really a brute force solution that is only fast due to parallelism it fundamentally gets slower exponentially in D.

Square root of a OpenCV's grey image using SSE

given a grey cv::Mat (CV_8UC1) I want to return another cv::Mat containing the square root of the elements (CV_32FC1) and I want to do it with SSE2 intrinsics. I am having some problems with the conversion from 8-bit values to 32 float values to perform the square root. I would really appreciate any help. This is my code for now(it does not give correct values):
uchar *source = (uchar *)cv::alignPtr(image.data, 16);
float *sqDataPtr = cv::alignPtr((float *)Squared.data, 16);
for (x = 0; x < (pixels - 16); x += 16) {
__m128i a0 = _mm_load_si128((__m128i *)(source + x));
__m128i first8 = _mm_unpacklo_epi8(a0, _mm_set1_epi8(0));
__m128i last8 = _mm_unpackhi_epi8(a0, _mm_set1_epi8(0));
__m128i first4i = _mm_unpacklo_epi16(first8, _mm_set1_epi16(0));
__m128i second4i = _mm_unpackhi_epi16(first8, _mm_set1_epi16(0));
__m128 first4 = _mm_cvtepi32_ps(first4i);
__m128 second4 = _mm_cvtepi32_ps(second4i);
__m128i third4i = _mm_unpacklo_epi16(last8, _mm_set1_epi16(0));
__m128i fourth4i = _mm_unpackhi_epi16(last8, _mm_set1_epi16(0));
__m128 third4 = _mm_cvtepi32_ps(third4i);
__m128 fourth4 = _mm_cvtepi32_ps(fourth4i);
// Store
_mm_store_ps(sqDataPtr + x, _mm_sqrt_ps(first4));
_mm_store_ps(sqDataPtr + x + 4, _mm_sqrt_ps(second4));
_mm_store_ps(sqDataPtr + x + 8, _mm_sqrt_ps(third4));
_mm_store_ps(sqDataPtr + x + 12, _mm_sqrt_ps(fourth4));
}
The SSE code looks OK, except that you're not processing the last 16 pixels:
for (x = 0; x < (pixels - 16); x += 16)
should be:
for (x = 0; x <= (pixels - 16); x += 16)
Note that if your image width is not a multiple of 16 then you will need to take care of any remaining pixels after the last full vector.
Also note that you are taking the sqrt of values in the range 0..255. It may be that you want normalised value in the range 0..1.0, in which case you'll want to scale the values accordingly.
I have no experience with SSE2, but I think that if performance is the issue you should use look-up table. Creation of look-up table is fast since you have only 256 possible values. Copy 4 bytes from look-up table into destination matrix should be a very efficient operation.

Shift masked bits to the lsb

When you and some data with a mask you get some result which is of the same size as the data/mask.
What I want to do, is to take the masked bits in the result (where there was 1 in the mask) and shift them to the right so they are next to each other and I can perform a CTZ (Count Trailing Zeroes) on them.
I didn't know how to name such a procedure so Google has failed me. The operation should preferably not be a loop solution, this has to be as fast operation as possible.
And here is an incredible image made in MS Paint.
This operation is known as compress right. It is implemented as part of BMI2 as the PEXT instruction, in Intel processors as of Haswell.
Unfortunately, without hardware support is it a quite annoying operation. Of course there is an obvious solution, just moving the bits one by one in a loop, here is the one given by Hackers Delight:
unsigned compress(unsigned x, unsigned m) {
unsigned r, s, b; // Result, shift, mask bit.
r = 0;
s = 0;
do {
b = m & 1;
r = r | ((x & b) << s);
s = s + b;
x = x >> 1;
m = m >> 1;
} while (m != 0);
return r;
}
But there is an other way, also given by Hackers Delight, which does less looping (number of iteration logarithmic in the number of bits) but more per iteration:
unsigned compress(unsigned x, unsigned m) {
unsigned mk, mp, mv, t;
int i;
x = x & m; // Clear irrelevant bits.
mk = ~m << 1; // We will count 0's to right.
for (i = 0; i < 5; i++) {
mp = mk ^ (mk << 1); // Parallel prefix.
mp = mp ^ (mp << 2);
mp = mp ^ (mp << 4);
mp = mp ^ (mp << 8);
mp = mp ^ (mp << 16);
mv = mp & m; // Bits to move.
m = m ^ mv | (mv >> (1 << i)); // Compress m.
t = x & mv;
x = x ^ t | (t >> (1 << i)); // Compress x.
mk = mk & ~mp;
}
return x;
}
Notice that a lot of the values there depend only on m. Since you only have 512 different masks, you could precompute those and simplify the code to something like this (not tested)
unsigned compress(unsigned x, int maskindex) {
unsigned t;
int i;
x = x & masks[maskindex][0];
for (i = 0; i < 5; i++) {
t = x & masks[maskindex][i + 1];
x = x ^ t | (t >> (1 << i));
}
return x;
}
Of course all of these can be turned into "not a loop" by unrolling, the second and third ways are probably more suitable for that. That's a bit of cheat however.
You can use the pack-by-multiplication technique similar to the one described here. This way you don't need any loop and can mix the bits in any order.
For example with the mask 0b10101001 == 0xA9 like above and 8-bit data abcdefgh (with a-h is the 8 bits) you can use the below expression to get 0000aceh
uint8_t compress_maskA9(uint8_t x)
{
const uint8_t mask1 = 0xA9 & 0xF0;
const uint8_t mask2 = 0xA9 & 0x0F;
return (((x & mask1)*0x03000000 >> 28) & 0x0C) | ((x & mask2)*0x50000000 >> 30);
}
In this specific case there are some overlaps of the 4 bits while adding (which incur unexpected carry) during the multiplication step, so I've split them into 2 parts, the first one extracts bit a and c, then e and h will be extracted in the latter part. There are other ways to split the bits as well, like a & h then c & e. You can see the results compared to Harold's function live on ideone
An alternate way with only one multiplication
const uint32_t X = (x << 8) | x;
return (X & 0x8821)*0x12050000 >> 28;
I got this by duplicating the bits so that they're spaced out farther, leaving enough space to avoid the carry. This is often better than splitting into 2 multiplications
If you want the result's bits reversed (i.e. heca0000) you can easily change the magic numbers accordingly
// result: he00 | 00ca;
return (((x & 0x09)*0x88000000 >> 28) & 0x0C) | (((x & 0xA0)*0x04800000) >> 30);
or you can also extract the 3 bits e, c and a at the same time, leaving h separately (as I mentioned above, there are often multiple solutions) and you need only one multiplication
return ((x & 0xA8)*0x12400000 >> 29) | (x & 0x01) << 3; // result: 0eca | h000
But there might be a better alternative like the above second snippet
const uint32_t X = (x << 8) | x;
return (X & 0x2881)*0x80290000 >> 28
Correctness check: http://ideone.com/PYUkty
For a larger number of masks you can precompute the magic numbers correspond to those masks and store them in an array so that you can look them up immediately for use. I calculated those mask by hand but you can do that automatically
Explanation
We have abcdefgh & mask1 = a0c00000. Multiply it with magic1
........................a0c00000
× 00000011000000000000000000000000 (magic1 = 0x03000000)
────────────────────────────────
a0c00000........................
+ a0c00000......................... (the leading "a" bit is outside int's range
──────────────────────────────── so it'll be truncated)
r1 = acc.............................
=> (r1 >> 28) & 0x0C = 0000ac00
Similarly we multiply abcdefgh & mask2 = 0000e00h with magic2
........................0000e00h
× 01010000000000000000000000000000 (magic2 = 0x50000000)
────────────────────────────────
e00h............................
+ 0h..............................
────────────────────────────────
r2 = eh..............................
=> (r2 >> 30) = 000000eh
Combine them together we have the expected result
((r1 >> 28) & 0x0C) | (r2 >> 30) = 0000aceh
And here's the demo for the second snippet
abcdefghabcdefgh
& 1000100000100001 (0x8821)
────────────────────────────────
a000e00000c0000h
× 00010010000001010000000000000000 (0x12050000)
────────────────────────────────
000h
00e00000c0000h
+ 0c0000h
a000e00000c0000h
────────────────────────────────
= acehe0h0c0c00h0h
& 11110000000000000000000000000000
────────────────────────────────
= aceh
For the reversed order case:
abcdefghabcdefgh
& 0010100010000001 (0x2881)
────────────────────────────────
00c0e000a000000h
x 10000000001010010000000000000000 (0x80290000)
────────────────────────────────
000a000000h
00c0e000a000000h
+ 0e000a000000h
h
────────────────────────────────
hecaea00a0h0h00h
& 11110000000000000000000000000000
────────────────────────────────
= heca
Related:
How to create a byte out of 8 bool values (and vice versa)?
Redistribute least significant bits from a 4-byte array to a nibble

performance of log10 function returning an int

Today I needed a cheap log10 function, of which I only used the int part. Assuming the result is floored, so the log10 of 999 would be 2. Would it be beneficial writing a function myself? And if so, which way would be the best to go. Assuming the code would not be optimized.
The alternatives to log10 I've though of;
use a for loop dividing or multiplying by 10;
use a string parser(probably extremely expensive);
using an integer log2() function multiplying by a constant.
Thank you on beforehand:)
The operation can be done in (fast) constant time on any architecture that has a count-leading-zeros or similar instruction (which is most architectures). Here's a C snippet I have sitting around to compute the number of digits in base ten, which is essentially the same task (assumes a gcc-like compiler and 32-bit int):
unsigned int baseTwoDigits(unsigned int x) {
return x ? 32 - __builtin_clz(x) : 0;
}
static unsigned int baseTenDigits(unsigned int x) {
static const unsigned char guess[33] = {
0, 0, 0, 0, 1, 1, 1, 2, 2, 2,
3, 3, 3, 3, 4, 4, 4, 5, 5, 5,
6, 6, 6, 6, 7, 7, 7, 8, 8, 8,
9, 9, 9
};
static const unsigned int tenToThe[] = {
1, 10, 100, 1000, 10000, 100000,
1000000, 10000000, 100000000, 1000000000,
};
unsigned int digits = guess[baseTwoDigits(x)];
return digits + (x >= tenToThe[digits]);
}
GCC and clang compile this down to ~10 instructions on x86. With care, one can make it faster still in assembly.
The key insight is to use the (extremely cheap) base-two logarithm to get a fast estimate of the base-ten logarithm; at that point we only need to compare against a single power of ten to decide if we need to adjust the guess. This is much more efficient than searching through multiple powers of ten to find the right one.
If the inputs are overwhelmingly biased to one- and two-digit numbers, a linear scan is sometimes faster; for all other input distributions, this implementation tends to win quite handily.
One way to do it would be loop with subtracting powers of 10. This powers could be computed and stored in table. Here example in python:
table = [10**i for i in range(1, 10)]
# [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000]
def fast_log10(n):
for i, k in enumerate(table):
if n - k < 0:
return i
Usage example:
>>> fast_log10(1)
0
>>> fast_log10(10)
1
>>> fast_log10(100)
2
>>> fast_log10(999)
2
fast_log10(1000)
3
Also you may use binary search with this table. Then algorithm complexity would be only O(lg(n)), where n is number of digits.
Here example with binary search in C:
long int table[] = {10, 100, 1000, 10000, 1000000};
#define TABLE_LENGHT sizeof(table) / sizeof(long int)
int bisect_log10(long int n, int s, int e) {
int a = (e - s) / 2 + s;
if(s >= e)
return s;
if((table[a] - n) <= 0)
return bisect_log10(n, a + 1, e);
else
return bisect_log10(n, s, a);
}
int fast_log10(long int n){
return bisect_log10(n, 0, TABLE_LENGHT);
}
Note for small numbers this method would slower then upper method.
Full code here.
Well, there's the old standby - the "poor man's log function".
(If you want to handle more than 63 integer digits, change the first "if" to a "while".)
n = 1;
if (v >= 1e32){n += 32; v /= 1e32;}
if (v >= 1e16){n += 16; v /= 1e16;}
if (v >= 1e8){n += 8; v /= 1e8;}
if (v >= 1e4){n += 4; v /= 1e4;}
if (v >= 1e2){n += 2; v /= 1e2;}
if (v >= 1e1){n += 1; v /= 1e1;}
so if you feed in 123456.7, here's how it goes:
n = 1;
if (v >= 1e32) no
if (v >= 1e16) no
if (v >= 1e8) no
if (v >= 1e4) yes, so n = 5, v = 12.34567
if (v >= 1e2) no
if (v >= 1e1) yes, so n = 6, v = 1.234567
so result is n = 6
Here's a variation that uses multiplication, rather than division:
int n = 1;
double d = 1, temp;
temp = d * 1e32; if (v >= temp){n += 32; d = temp;}
temp = d * 1e16; if (v >= temp){n += 16; d = temp;}
temp = d * 1e8; if (v >= temp){n += 8; d = temp;}
temp = d * 1e4; if (v >= temp){n += 4; d = temp;}
temp = d * 1e2; if (v >= temp){n += 2; d = temp;}
temp = d * 1e1; if (v >= temp){n += 1; d = temp;}
and an execution looks like this
v = 123456.7
n = 1
d = 1
temp = 1e32, if (v >= 1e32) no
temp = 1e16, if (v >= 1e16) no
temp = 1e8, if (v >= 1e8) no
temp = 1e4, if (v >= 1e4) yes, so n = 5, d = 1e4;
temp = 1e6, if (v >= 1e6) no
temp = 1e5, if (v >= 1e5) yes, so n = 6, d = 1e5;
If you want to have a faster log function you need to approximate their result. E.g. the exp function can be approximated using a 'short' taylor approximation. You can find example approximations for exp, log, root and power here
edit:
You can find a short performance comparsion here
Because an unsigned < or >= test is done simply by subtracting and checking the carry flag, it is possible to put both arrays (guess and negated tenToThe) in a single 64-bit value, combine both array lookups into one, and use the carry from 32-bit addition to adjust the guess. The high 32 bits of guess[n] provide the value of log10(2^n*2-1), while the low 32 bits contain -10^log10(2^n*2-1).
static unsigned int baseTwoDigits(unsigned int x) {
return x ? 32 - __builtin_clz(x) : 0;
}
unsigned int baseTenDigits(unsigned int x) {
static uint64_t guess[33] = {
/* 1 */ 0, 0, 0,
/* 8 */ (1ull<<32)-10, (1ull<<32)-10, (1ull<<32)-10,
/* 64 */ (2ull<<32)-100, (2ull<<32)-100, (2ull<<32)-100,
/* 512 */ (3ull<<32)-1000, (3ull<<32)-1000, (3ull<<32)-1000,
(3ull<<32)-1000,
/* 8192 */ (4ull<<32)-10000, (4ull<<32)-10000, (4ull<<32)-10000,
/* 65536 */ (5ull<<32)-100000, (5ull<<32)-100000, (5ull<<32)-100000,
/* 524288 */ (6ull<<32)-1000000, (6ull<<32)-1000000, (6ull<<32)-1000000,
(6ull<<32)-1000000,
/* 8388608 */ (7ull<<32)-10000000, (7ull<<32)-10000000,
(7ull<<32)-10000000,
/* 67108864 */ (8ull<<32)-100000000, (8ull<<32)-100000000,
(8ull<<32)-100000000,
/* 536870912 */ (9ull<<32)-1000000000, (9ull<<32)-1000000000,
(9ull<<32)-1000000000,
};
uint64_t adjust = guess[baseTwoDigits(x)];
return (adjust + x) >> 32;
}
Without any specifications, I will just give a general answer:
The log function will be pretty efficient in most languages as it is such a basic function.
The fact that you are only interested in integers could give you some leverage, but probably this is not enough to easily beat the builtin standard solutions.
One of the few things that I can think of to be faster than a builtin function is a table lookup, so if you are only interested in the numbers upto 10000 for instance, you could simply create a table that you could use to lookup any of these values when you need them.
Obviously this solution will not scale well, but it may be just what you need.
Sidenote: If you are importing the data for example, it may actually be faster to look at the string diecty length (rather than first converting the string to a number and than looking at the value of the string). Of course this will require the input to be stored in just the right format, otherwise it won't gain you anything.

Horizontal minimum and maximum using SSE

I have a function using SSE to do a lot of stuff, and the profiler shows me that the code portion I use to compute the horizontal minimum and maximum consumes most of the time.
I have been using the following implementation for the minimum for instance:
static inline int16_t hMin(__m128i buffer) {
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m1));
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m2));
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m3));
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m4));
return ((int8_t*) ((void *) &buffer))[0];
}
I need to compute the minimum and the maximum of 16 1-byte integers, as you see.
Any good suggestions are highly appreciated :)
Thanks
SSE 4.1 has an instruction that does almost what you want. Its name is PHMINPOSUW, C/C++ intrinsic is _mm_minpos_epu16. It is limited to 16-bit unsigned values and cannot give maximum, but these problems could be easily solved.
If you need to find minimum of non-negative bytes, do nothing. If bytes may be negative, add 128 to each. If you need maximum, subtract each from 127.
Use either _mm_srli_pi16 or _mm_shuffle_epi8, and then _mm_min_epu8 to get 8 pairwise minimum values in even bytes and zeros in odd bytes of some XMM register. (These zeros are produced by shift/shuffle instruction and should remain at their places after _mm_min_epu8).
Use _mm_minpos_epu16 to find minimum among these values.
Extract the resulting minimum value with _mm_cvtsi128_si32.
Undo effect of step 1 to get the original byte value.
Here is an example that returns maximum of 16 signed bytes:
static inline int16_t hMax(__m128i buffer)
{
__m128i tmp1 = _mm_sub_epi8(_mm_set1_epi8(127), buffer);
__m128i tmp2 = _mm_min_epu8(tmp1, _mm_srli_epi16(tmp1, 8));
__m128i tmp3 = _mm_minpos_epu16(tmp2);
return (int8_t)(127 - _mm_cvtsi128_si32(tmp3));
}
I suggest two changes:
Replace ((int8_t*) ((void *) &buffer))[0] with _mm_cvtsi128_si32.
Replace _mm_shuffle_epi8 with _mm_shuffle_epi32/_mm_shufflelo_epi16 which have lower latency on recent AMD processors and Intel Atom, and will save you memory load operations:
static inline int16_t hMin(__m128i buffer)
{
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(3, 2, 3, 2)));
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
buffer = _mm_min_epi8(buffer, _mm_shufflelo_epi16(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
buffer = _mm_min_epi8(buffer, _mm_srli_epi16(buffer, 8));
return (int8_t)_mm_cvtsi128_si32(buffer);
}
here's an implementation without shuffle, shuffle is slow on AMD 5000 Ryzen 7 for some reason
float max_elem3() const {
__m128 a = _mm_unpacklo_ps(mm, mm); // x x y y
__m128 b = _mm_unpackhi_ps(mm, mm); // z z w w
__m128 c = _mm_max_ps(a, b); // ..., max(x, z), ..., ...
Vector4 res = _mm_max_ps(mm, c); // ..., max(y, max(x, z)), ..., ...
return res.y;
}
float min_elem3() const {
__m128 a = _mm_unpacklo_ps(mm, mm); // x x y y
__m128 b = _mm_unpackhi_ps(mm, mm); // z z w w
__m128 c = _mm_min_ps(a, b); // ..., min(x, z), ..., ...
Vector4 res = _mm_min_ps(mm, c); // ..., min(y, min(x, z)), ..., ...
return res.y;
}