Find min/max value from a __m128i - c++

I want to find the minimum/maximum value in an array of bytes using SIMD operations. So far I have been able to go through the array and store the minimum/maximum value in a __m128i variable, but that means the value I am looking for is mixed in with others (15 others, to be exact).
I've found these discussions here and here for integers, and this page for floats, but I don't understand how _mm_shuffle* works. So my questions are:
What SIMD operations do I have to perform in order to extract the minimum / maximum byte (or unsigned byte) value from the __m128i variable?
How does _mm_shuffle* work? I don't get it when I look at the (minimal) documentation online. I know it is related to the _MM_SHUFFLE macro, but I don't get the example.

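For reference, _MM_SHUFFLE(d, c, b, a) simply packs four 2-bit source-lane indices into one immediate byte, (d << 6) | (c << 4) | (b << 2) | a, where the last argument a selects which source element lands in destination element 0. A small illustration with _mm_shuffle_epi32:
__m128i v = _mm_setr_epi32(10, 20, 30, 40);                // elements 0..3 = 10, 20, 30, 40
__m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 2, 3, 2)); // picks elements 2,3,2,3 -> {30, 40, 30, 40}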
Here is an example for horizontal max for uint8_t:
#include "tmmintrin.h" // requires SSSE3
__m128i _mm_hmax_epu8(const __m128i v)
{
    __m128i vmax = v;
    vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 1));
    vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 2));
    vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 4));
    vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 8));
    return vmax;
}
The max value will be returned in all elements. If you need the value as a scalar then use _mm_extract_epi8.
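For example (_mm_extract_epi8 needs SSE4.1; with only SSSE3 you could use _mm_cvtsi128_si32 and mask the low byte):
uint8_t m = (uint8_t)_mm_extract_epi8(_mm_hmax_epu8(bytes), 0); // bytes is your __m128i of 16 uint8_t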
It should be fairly obvious how to adapt this for min, and for signed min/max.
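For instance, the unsigned-min variant is the same cascade with _mm_min_epu8 (untested sketch; the signed versions would use _mm_min_epi8/_mm_max_epi8, which require SSE4.1):
__m128i _mm_hmin_epu8(const __m128i v)
{
    __m128i vmin = v;
    vmin = _mm_min_epu8(vmin, _mm_alignr_epi8(vmin, vmin, 1));
    vmin = _mm_min_epu8(vmin, _mm_alignr_epi8(vmin, vmin, 2));
    vmin = _mm_min_epu8(vmin, _mm_alignr_epi8(vmin, vmin, 4));
    vmin = _mm_min_epu8(vmin, _mm_alignr_epi8(vmin, vmin, 8));
    return vmin;
}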

Alternatively, convert to words and use phminposuw (not tested)
int hminu8(__m128i x)
{
    __m128i l = _mm_unpacklo_epi8(x, _mm_setzero_si128());
    __m128i h = _mm_unpackhi_epi8(x, _mm_setzero_si128());
    l = _mm_minpos_epu16(l);
    h = _mm_minpos_epu16(h);
    return _mm_extract_epi16(_mm_min_epu16(l, h), 0);
}
By my quick count, the latency is a bit worse than a min/shuffle cascade, but the throughput is a bit better. The linked answer using phminposuw is probably better, though. Adapted for unsigned bytes (but not tested):
uint8_t hminu8(__m128i x)
{
    x = _mm_min_epu8(x, _mm_srli_epi16(x, 8));
    x = _mm_minpos_epu16(x);
    return _mm_cvtsi128_si32(x);
}
You could use it for max too, but with a bit of overhead: complement the input and result.
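For the unsigned-byte maximum that would look something like this (untested sketch; the XOR complements the bytes so the max becomes a min, and the result is complemented back):
uint8_t hmaxu8(__m128i x)
{
    x = _mm_xor_si128(x, _mm_set1_epi8((char)0xFF)); // complement: max becomes min
    x = _mm_min_epu8(x, _mm_srli_epi16(x, 8));
    x = _mm_minpos_epu16(x);
    return 255 - (_mm_cvtsi128_si32(x) & 0xFF);      // complement the result back
}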

Related

Can I speed up more than _mm256_i32gather_epi32?

I wrote gamma conversion code for 4K video.
/** gamma0
input range : 0 ~ 1,023
output range : 0 ~ ?
*/
v00 = _mm256_unpacklo_epi16(v0, _mm256_setzero_si256());
v01 = _mm256_unpackhi_epi16(v0, _mm256_setzero_si256());
v10 = _mm256_unpacklo_epi16(v1, _mm256_setzero_si256());
v11 = _mm256_unpackhi_epi16(v1, _mm256_setzero_si256());
v20 = _mm256_unpacklo_epi16(v2, _mm256_setzero_si256());
v21 = _mm256_unpackhi_epi16(v2, _mm256_setzero_si256());
v00 = _mm256_i32gather_epi32(csv->gamma0LUT, v00, 4);
v01 = _mm256_i32gather_epi32(csv->gamma0LUT, v01, 4);
v10 = _mm256_i32gather_epi32(csv->gamma0LUTc, v10, 4);
v11 = _mm256_i32gather_epi32(csv->gamma0LUTc, v11, 4);
v20 = _mm256_i32gather_epi32(csv->gamma0LUTc, v20, 4);
v21 = _mm256_i32gather_epi32(csv->gamma0LUTc, v21, 4);
I want to implement a LUT (look-up table) mapping a 10-bit input to a 10- to 13-bit output, but AVX2 only supports 32-bit gathers.
So the indices were unavoidably widened to 32 bits and the lookup implemented with the _mm256_i32gather_epi32 instruction.
This is where the most severe performance bottleneck is; is there any way to improve it?
Since the context of your question is still a bit vague to me, here are just some general ideas you could try (some may be only slightly better, or even worse, than what you have at the moment; all code below is untested):
LUT with 16 bit values using _mm256_i32gather_epi32
Even though it loads 32-bit values, you can still use a scale of 2 as the last argument of _mm256_i32gather_epi32. You should make sure that the 2 bytes before and after your LUT are readable.
static const int16_t LUT[1024+2] = { 0, val0, val1, ..., val1022, val1023, 0};
__m256i high_idx = _mm256_srli_epi32(v, 16);
__m256i low_idx = _mm256_blend_epi16(v, _mm256_setzero_si256(), 0xAA);
__m256i high_val = _mm256_i32gather_epi32((int const*)(LUT+0), high_idx, 2);
__m256i low_val = _mm256_i32gather_epi32((int const*)(LUT+1), low_idx, 2);
__m256i values = _mm256_blend_epi16(low_val, high_val, 0xAA);
Join two values into one LUT-entry
For small-ish LUTs, you could calculate an index from two neighboring indexes as (idx_hi << 10) + idx_low and look up the corresponding tuple directly. However, instead of 2KiB you would have a 4 MiB LUT in your case, which likely hurts caching -- but you only have half the number of gather instructions.
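A rough sketch of that variant (untested; LUT2 is a hypothetical array of 1 << 20 uint32_t entries in which entry (hi << 10) | lo packs both converted values, and each 32-bit lane of v0 holds two 10-bit inputs in its 16-bit halves):
__m256i lo10 = _mm256_and_si256(v0, _mm256_set1_epi32(0x3FF));     // lower 10-bit input of each lane
__m256i hi10 = _mm256_srli_epi32(v0, 16);                          // upper 10-bit input
__m256i idx  = _mm256_or_si256(_mm256_slli_epi32(hi10, 10), lo10); // (idx_hi << 10) + idx_low
__m256i pair = _mm256_i32gather_epi32((int const*)LUT2, idx, 4);   // two 16-bit results per lane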
Polynomial approximation
Mathematically, all continuous functions on a finite interval can be approximated by a polynomial. You could either convert your values to float, evaluate the polynomial, and convert back, or do it directly with fixed-point multiplications (note that _mm256_mulhi_epi16/_mm256_mulhi_epu16 compute (a * b) >> 16, which is convenient if one factor is actually in [0, 1)).
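A fixed-point Horner evaluation of a quadratic might look roughly like this (untested sketch; c0, c1, c2 are hypothetical non-negative Q16 coefficient vectors for the polynomial in t = x/1024, chosen so the intermediate sums stay below 1):
__m256i t = _mm256_slli_epi16(v0, 6);                // map 0..1023 to Q16 in [0, 1)
__m256i y = _mm256_mulhi_epu16(c2, t);               // c2*t
y = _mm256_add_epi16(y, c1);                         // c2*t + c1
y = _mm256_mulhi_epu16(y, t);                        // (c2*t + c1)*t
y = _mm256_add_epi16(y, c0);                         // + c0, a Q16 fraction of full scale
y = _mm256_mulhi_epu16(y, _mm256_set1_epi16(8191));  // rescale to the 13-bit output range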
8 bit, 16 entry LUT with linear interpolation
SSE/AVX2 provides the pshufb instruction, which can be used as an 8-bit LUT with 16 entries (and an implicit 0 entry).
Proof-of-concept implementation:
__m256i idx = _mm256_srli_epi16(v, 6); // shift highest 4 bits to the right
idx = _mm256_mullo_epi16(idx, _mm256_set1_epi16(0x0101)); // duplicate idx, maybe _mm256_shuffle_epi8 is better?
idx = _mm256_sub_epi8(idx, _mm256_set1_epi16(0x0001)); // subtract 1 from lower idx, 0 is mapped to 0xff
__m256i lut_vals = _mm256_shuffle_epi8(LUT, idx); // implicitly: LUT[-1] = 0
// get fractional part of input value:
__m256i dv = _mm256_and_si256(v, _mm256_set1_epi8(0x3f)); // lowest 6 bits
dv = _mm256_mullo_epi16(dv, _mm256_set1_epi16(0xff01)); // dv = [-dv, dv]
dv = _mm256_add_epi8(dv, _mm256_set1_epi16(0x4000)); // dv = [0x40-(v&0x3f), (v&0x3f)];
__m256i res = _mm256_maddubs_epi16(lut_vals, dv); // switch order depending on whether LUT values are (un)signed.
// probably shift res to the right, depending on the scale of your LUT values
You could also combine this with first doing a linear or quadratic approximation and just calculating the difference to your target function.

Why do two consecutive gather instructions perform worse than equivalent elementary ops?

I am upgrading some code from SSE to AVX2. In general I can see that gather instructions are quite useful and benefit performance. However I encountered a case where gather instructions are less efficient than decomposing the gather operations into simpler ones.
In the code below, I have a vector of int32 b, a vector of double xi, and 4 int32 indices packed in a 128-bit register bidx. I need to gather first from vector b, then from vector xi. I.e., in pseudo-code, I need to do:
__m128i i = b[idx];
__m256d x = xi[i];
In the function below, I implement this in two ways using an #ifdef: via gather instructions, yielding a throughput of 290 Miter/sec and via elementary operations, yielding a throughput of 325 Miter/sec.
Can somebody explain what is going on? Thanks
inline void resolve( const __m256d& z, const __m128i& bidx, int32_t j
, const int32_t *b, const double *xi, int32_t* ri )
{
    __m256d x;
    __m128i i;
#if 0 // this code uses two gather instructions in sequence
    i = _mm_i32gather_epi32(b, bidx, 4); // i = b[bidx]
    x = _mm256_i32gather_pd(xi, i, 8);   // x = xi[i]
#else // this code does not use gather instructions
    union {
        __m128i vec;
        int32_t i32[4];
    } u;
    x = _mm256_set_pd
        ( xi[(u.i32[3] = b[_mm_extract_epi32(bidx,3)])]
        , xi[(u.i32[2] = b[_mm_extract_epi32(bidx,2)])]
        , xi[(u.i32[1] = b[_mm_extract_epi32(bidx,1)])]
        , xi[(u.i32[0] = b[_mm_cvtsi128_si32(bidx) ])]
        );
    i = u.vec;
#endif
    // here we use x and i
    __m256 ps256 = _mm256_castpd_ps(_mm256_cmp_pd(z, x, _CMP_LT_OS));
    __m128 lo128 = _mm256_castps256_ps128(ps256);
    __m128 hi128 = _mm256_extractf128_ps(ps256, 1);
    __m128 blend = _mm_shuffle_ps(lo128, hi128, 0 + (2<<2) + (0<<4) + (2<<6));
    __m128i lt = _mm_castps_si128(blend); // this is 0 or -1
    i = _mm_add_epi32(i, lt);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(ri)+j, i);
}
Since your resolve function is marked as inline, I suppose it's called in a high-frequency loop. Then you might also have a look at how the input parameters depend on each other outside the resolve function. The compiler might be able to optimize the inlined code better across loop boundaries when using the scalar code variant.

AVX, Horizontal Sum of Single Precision Complex Numbers?

I have a 256 bit AVX register containing 4 single precision complex numbers stored as real, imaginary, real, imaginary, etc. I'm currently writing the entire 256 bit register back to memory and summing it there, but that seems inefficient.
How can the complex number horizontal sum be performed using AVX (or AVX2) intrinsics? I would accept an answer using assembly if there is not an answer with comparable efficiency using intrinsics.
Edit: To clarify, if the register contains AR, AI, BR, BI, CR, CI, DR, DI, I want to compute the complex number (AR + BR + CR + DR, AI + BI + CI + DI). If the result is in a 256 bit register, I can extract the 2 single precision floating point numbers.
Edit2: Potential solution, though not necessarily optimal...
float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v); // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf = _mm_movehl_ps(shuf, sums); // high half -> low half
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
float sumReal = 0.0;
float sumImaginary = 0.0;
__m256i mask = _mm256_set_epi32 (7, 5, 3, 1, 6, 4, 2, 0);
// Separate real and imaginary.
__m256 permutedSum = _mm256_permutevar8x32_ps(sseSum0, mask);
__m128 realSum = _mm256_extractf128_ps(permutedSum, 0);
__m128 imaginarySum = _mm256_extractf128_ps(permutedSum, 1);
// Horizontally sum real and imaginary.
sumReal = hsum_ps_sse3(realSum);
sumImaginary = hsum_ps_sse3(imaginarySum);
One fairly straightforward solution, which requires only AVX (not AVX2):
__m128 v0 = _mm256_castps256_ps128(v);   // get low 2 complex values
__m128 v1 = _mm256_extractf128_ps(v, 1); // get high 2 complex values
v0 = _mm_add_ps(v0, v1);                 // add high and low
v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(1, 0, 3, 2));
v0 = _mm_add_ps(v0, v1);                 // combine two halves of result
The result will be in v0 as { sum.re, sum.im, sum.re, sum.im }.
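If you then need the two scalars, one way to pull them out of v0 (a small sketch):
float re = _mm_cvtss_f32(v0);                                              // real part of the sum
float im = _mm_cvtss_f32(_mm_shuffle_ps(v0, v0, _MM_SHUFFLE(1, 1, 1, 1))); // imaginary part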

Grayscale bilinear patch extraction - SSE optimization

My program makes an intensive use of small sub-images extracted using bilinear interpolation from larger grayscale images.
I am using the following function for this purpose:
bool extract_patch_bilin(const cv::Point2f &patch_ctr, const cv::Mat_<uchar> &img, cv::Mat_<uchar> &patch)
{
const int hsize = patch.rows/2;
// ...
// Precondition checks: patch is a preallocated square matrix and both patch and image have continuous buffers
// ...
int floorx=(int)floor(patch_ctr.x)-hsize, floory=(int)floor(patch_ctr.y)-hsize;
if(floorx<0 || img.cols-1<floorx+patch.cols || floory<0 || img.rows-1<floory+patch.rows)
return false;
float x=patch_ctr.x-hsize-floorx;
float y=patch_ctr.y-hsize-floory;
float xy = x*y;
float w00=1-x-y+xy, w01=x-xy, w10=y-xy, w11=xy;
int img_stride = img.cols-patch.cols;
uchar* buff_img0 = (uchar*)img.data+img.cols*floory+floorx;
uchar* buff_img1 = buff_img0+img.cols;
uchar* buff_patch = (uchar*)patch.data;
for(int v=0; v<patch.rows; ++v,buff_img0+=img_stride,buff_img1+=img_stride) {
for(int u=0; u<patch.cols; ++u,++buff_patch,++buff_img0,++buff_img1)
buff_patch[0] = cv::saturate_cast<uchar>(buff_img0[0]*w00+buff_img0[1]*w01+buff_img1[0]*w10+buff_img1[1]*w11);
}
return true;
}
Long story short, I am already using parallelization in other parts of the program, and I am considering using SSE to optimize the execution of this function, because I am mostly using 8x8 patches and it seems like a good idea to process bunches of 8 pixels at a time using SSE.
However, I am not sure how to deal with the multiplication by the float interpolation weights (i.e. w00, w01, w10 and w11). These weights are necessarily positive and smaller than 1, hence the multiplication cannot overflow the unsigned char datatype.
Does anyone know how to proceed?
EDIT:
I tried to do this as follows (assuming 16x16 patches), but there is no significant speed-up:
bool extract_patch_bilin_16x16(const cv::Point2f& patch_ctr, const cv::Mat_<uchar> &img, cv::Mat_<uchar> &patch)
{
// ...
// Precondition checks
// ...
const int hsize = patch.rows/2;
int floorx=(int)floor(patch_ctr.x)-hsize, floory=(int)floor(patch_ctr.y)-hsize;
// Check that the full extracted patch is inside the image
if(floorx<0 || img.cols-1<floorx+patch.cols || floory<0 || img.rows-1<floory+patch.rows)
return false;
// Compute the constant bilinear weights
float x=patch_ctr.x-hsize-floorx;
float y=patch_ctr.y-hsize-floory;
float xy = x*y;
float w00=1-x-y+xy, w01=x-xy, w10=y-xy, w11=xy;
// Prepare image resampling loop
int img_stride = img.cols-patch.cols;
uchar* buff_img0 = (uchar*)img.data+img.cols*floory+floorx;
uchar* buff_img1 = buff_img0+img.cols;
uchar* buff_patch = (uchar*)patch.data;
// Precompute weighting variables
const __m128i CONST_0 = _mm_setzero_si128();
__m128i w00x256_32i = _mm_set1_epi32(cvRound(w00*256));
__m128i w01x256_32i = _mm_set1_epi32(cvRound(w01*256));
__m128i w10x256_32i = _mm_set1_epi32(cvRound(w10*256));
__m128i w11x256_32i = _mm_set1_epi32(cvRound(w11*256));
__m128i w00x256_16i = _mm_packs_epi32(w00x256_32i,w00x256_32i);
__m128i w01x256_16i = _mm_packs_epi32(w01x256_32i,w01x256_32i);
__m128i w10x256_16i = _mm_packs_epi32(w10x256_32i,w10x256_32i);
__m128i w11x256_16i = _mm_packs_epi32(w11x256_32i,w11x256_32i);
// Process pixels
int ngroups = patch.rows>>4;
for(int v=0; v<patch.rows; ++v,buff_img0+=img_stride,buff_img1+=img_stride) {
for(int g=0; g<ngroups; ++g,buff_patch+=16,buff_img0+=16,buff_img1+=16) {
////////////////////////////////
// Load the data (16 pixels in one load)
////////////////////////////////
__m128i val00 = _mm_loadu_si128((__m128i*)buff_img0);
__m128i val01 = _mm_loadu_si128((__m128i*)(buff_img0+1));
__m128i val10 = _mm_loadu_si128((__m128i*)buff_img1);
__m128i val11 = _mm_loadu_si128((__m128i*)(buff_img1+1));
////////////////////////////////
// Process the lower 8 values
////////////////////////////////
// Unpack into 16-bits integers
__m128i val00_lo = _mm_unpacklo_epi8(val00,CONST_0);
__m128i val01_lo = _mm_unpacklo_epi8(val01,CONST_0);
__m128i val10_lo = _mm_unpacklo_epi8(val10,CONST_0);
__m128i val11_lo = _mm_unpacklo_epi8(val11,CONST_0);
// Multiply with the integer weights
__m128i w256val00_lo = _mm_mullo_epi16(val00_lo,w00x256_16i);
__m128i w256val01_lo = _mm_mullo_epi16(val01_lo,w01x256_16i);
__m128i w256val10_lo = _mm_mullo_epi16(val10_lo,w10x256_16i);
__m128i w256val11_lo = _mm_mullo_epi16(val11_lo,w11x256_16i);
// Divide by 256 to get the approximate result of the multiplication with floating-point weights
__m128i wval00_lo = _mm_srli_epi16(w256val00_lo,8);
__m128i wval01_lo = _mm_srli_epi16(w256val01_lo,8);
__m128i wval10_lo = _mm_srli_epi16(w256val10_lo,8);
__m128i wval11_lo = _mm_srli_epi16(w256val11_lo,8);
// Add pairwise
__m128i sum0_lo = _mm_add_epi16(wval00_lo,wval01_lo);
__m128i sum1_lo = _mm_add_epi16(wval10_lo,wval11_lo);
__m128i final_lo = _mm_add_epi16(sum0_lo,sum1_lo);
////////////////////////////////
// Process the higher 8 values
////////////////////////////////
// Unpack into 16-bits integers
__m128i val00_hi = _mm_unpackhi_epi8(val00,CONST_0);
__m128i val01_hi = _mm_unpackhi_epi8(val01,CONST_0);
__m128i val10_hi = _mm_unpackhi_epi8(val10,CONST_0);
__m128i val11_hi = _mm_unpackhi_epi8(val11,CONST_0);
// Multiply with the integer weights
__m128i w256val00_hi = _mm_mullo_epi16(val00_hi,w00x256_16i);
__m128i w256val01_hi = _mm_mullo_epi16(val01_hi,w01x256_16i);
__m128i w256val10_hi = _mm_mullo_epi16(val10_hi,w10x256_16i);
__m128i w256val11_hi = _mm_mullo_epi16(val11_hi,w11x256_16i);
// Divide by 256 to get the approximate result of the multiplication with floating-point weights
__m128i wval00_hi = _mm_srli_epi16(w256val00_hi,8);
__m128i wval01_hi = _mm_srli_epi16(w256val01_hi,8);
__m128i wval10_hi = _mm_srli_epi16(w256val10_hi,8);
__m128i wval11_hi = _mm_srli_epi16(w256val11_hi,8);
// Add pairwise
__m128i sum0_hi = _mm_add_epi16(wval00_hi,wval01_hi);
__m128i sum1_hi = _mm_add_epi16(wval10_hi,wval11_hi);
__m128i final_hi = _mm_add_epi16(sum0_hi,sum1_hi);
////////////////////////////////
// Repack all values
////////////////////////////////
__m128i final_val = _mm_packus_epi16(final_lo,final_hi);
_mm_storeu_si128((__m128i*)buff_patch,final_val);
}
}
return true;
}
Any idea what could be done to improve the speed-up?
I would consider sticking to integers: your weights are multiples of 1/64, so working with 8.6 fixed point is enough, and that fits in 16-bit numbers.
Bilinear interpolation is best done as three linear ones (two on Y then one on X; you can reuse the second Y interpolation for the neighboring patch).
To perform a linear interpolation between two values, you will pre-store once and for all the interpolation weights P and Q (8 down to 1 and 0 up to 7), and multiply and add them in pairs, like V0.P[i]+V1.Q[i]. This is efficiently done using the PMADDUBSW instruction. (After appropriate data interleaving, and replication of the values V0 and V1, with PUNPCKLBW and the like.)
In the end, divide by the total weight (PSRLW), rescale to bytes (PACKUSWB). (This step can be performed once only, combining the two interpolations.)
You could think of doubling all weights, so that the final scaling is by 8 bits, and PACKUSWB would suffice, but unfortunately it saturates the values and there is no unsaturated equivalent.
It could be that precomputing all 64 interpolation weights and summing the four bilinear terms is better.
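As a concrete illustration of the PMADDUBSW step (untested sketch, requires SSSE3; v0 and v1 each hold 8 source pixels in their low bytes, and f in [0, 64) is the fractional weight in 1/64 steps):
__m128i pix  = _mm_unpacklo_epi8(v0, v1);                      // interleave (V0, V1) byte pairs
__m128i wgt  = _mm_set1_epi16((int16_t)((f << 8) | (64 - f))); // byte pairs (64-f, f)
__m128i lerp = _mm_maddubs_epi16(pix, wgt);                    // V0*(64-f) + V1*f in each 16-bit lane
lerp = _mm_srli_epi16(lerp, 6);                                // divide by the total weight 64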
UPDATE:
If the goal is to interpolate with fixed coefficients for all pixel quads (actually achieving a subpixel translation), the strategy is different.
You will load a run of 8 (16?) pixels corresponding to the upper-left corners, a run of 8 shifted one pixel to the right (corresponding to the upper-right corners), and similarly for the next row (bottom corners); multiply and add in pairs (PMADDUBSW) the pixel values with the corresponding interpolation weights, and combine the pairs (PADDW). Store the weights with replication.
Another option is to avoid PMADD and perform separate multiplies (PMULLW) and adds (PADDW). This will simplify the reorganization scheme.
After scaling (as above), you end up with a run of 8 interpolated values.
This can work as well for variable interpolation weights, as long as you interpolate exactly one pixel per quad.
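A rough sketch of that scheme for one run of 8 output pixels (untested; row0/row1 point at the upper-left/lower-left source pixels, dst receives the output, and the four integer weights w00, w01, w10, w11 sum to 64):
__m128i top = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i const*)row0),
                                _mm_loadl_epi64((__m128i const*)(row0 + 1)));  // (UL, UR) byte pairs
__m128i bot = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i const*)row1),
                                _mm_loadl_epi64((__m128i const*)(row1 + 1)));  // (LL, LR) byte pairs
__m128i acc = _mm_add_epi16(_mm_maddubs_epi16(top, _mm_set1_epi16((int16_t)((w01 << 8) | w00))),
                            _mm_maddubs_epi16(bot, _mm_set1_epi16((int16_t)((w11 << 8) | w10))));
acc = _mm_srli_epi16(acc, 6);                                // divide by the total weight 64
_mm_storel_epi64((__m128i*)dst, _mm_packus_epi16(acc, acc)); // pack back to 8 bytes and store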

Horizontal minimum and maximum using SSE

I have a function using SSE to do a lot of stuff, and the profiler shows me that the code portion I use to compute the horizontal minimum and maximum consumes most of the time.
I have been using the following implementation for the minimum for instance:
static inline int16_t hMin(__m128i buffer) {
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m1));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m2));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m3));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m4));
    return ((int8_t*) ((void *) &buffer))[0];
}
I need to compute the minimum and the maximum of 16 1-byte integers, as you see.
Any good suggestions are highly appreciated :)
Thanks
SSE 4.1 has an instruction that does almost what you want. Its name is PHMINPOSUW, and the C/C++ intrinsic is _mm_minpos_epu16. It is limited to 16-bit unsigned values and cannot give the maximum, but these problems are easily solved:
1. If you need to find the minimum of non-negative bytes, do nothing. If the bytes may be negative, add 128 to each. If you need the maximum, subtract each from 127.
2. Use either _mm_srli_epi16 or _mm_shuffle_epi8, and then _mm_min_epu8, to get 8 pairwise minimum values in the even bytes and zeros in the odd bytes of some XMM register. (These zeros are produced by the shift/shuffle instruction and should remain in place after _mm_min_epu8.)
3. Use _mm_minpos_epu16 to find the minimum among these values.
4. Extract the resulting minimum value with _mm_cvtsi128_si32.
5. Undo the effect of step 1 to get the original byte value.
Here is an example that returns maximum of 16 signed bytes:
static inline int16_t hMax(__m128i buffer)
{
__m128i tmp1 = _mm_sub_epi8(_mm_set1_epi8(127), buffer);
__m128i tmp2 = _mm_min_epu8(tmp1, _mm_srli_epi16(tmp1, 8));
__m128i tmp3 = _mm_minpos_epu16(tmp2);
return (int8_t)(127 - _mm_cvtsi128_si32(tmp3));
}
I suggest two changes:
Replace ((int8_t*) ((void *) &buffer))[0] with _mm_cvtsi128_si32.
Replace _mm_shuffle_epi8 with _mm_shuffle_epi32/_mm_shufflelo_epi16, which have lower latency on recent AMD processors and Intel Atom, and will save you the memory loads for the shuffle masks:
static inline int16_t hMin(__m128i buffer)
{
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(3, 2, 3, 2)));
buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
buffer = _mm_min_epi8(buffer, _mm_shufflelo_epi16(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
buffer = _mm_min_epi8(buffer, _mm_srli_epi16(buffer, 8));
return (int8_t)_mm_cvtsi128_si32(buffer);
}
Here's an implementation without shuffle; shuffle is slow on an AMD Ryzen 7 5000-series CPU for some reason:
float max_elem3() const {
    __m128 a = _mm_unpacklo_ps(mm, mm); // x x y y
    __m128 b = _mm_unpackhi_ps(mm, mm); // z z w w
    __m128 c = _mm_max_ps(a, b);        // ..., max(x, z), ..., ...
    Vector4 res = _mm_max_ps(mm, c);    // ..., max(y, max(x, z)), ..., ...
    return res.y;
}
float min_elem3() const {
    __m128 a = _mm_unpacklo_ps(mm, mm); // x x y y
    __m128 b = _mm_unpackhi_ps(mm, mm); // z z w w
    __m128 c = _mm_min_ps(a, b);        // ..., min(x, z), ..., ...
    Vector4 res = _mm_min_ps(mm, c);    // ..., min(y, min(x, z)), ..., ...
    return res.y;
}