I want to calculate y = ax + b, where x and y are pixel values (i.e., bytes in the range 0~255), while a and b are floats.
I need to apply this formula to every pixel in an image, and a and b are different for different pixels. Direct calculation in C++ is slow, so I'm interested in how to do this with SSE2 instructions in C++.
After searching, I found that float multiplication and addition in SSE2 are just _mm_mul_ps and _mm_add_ps. But first I need to convert x from a byte to a float (4 bytes).
The question is: after I load the data from the byte source (_mm_load_si128), how can I convert it from bytes to floats?
a and b are different for each pixel? That's going to make it difficult to vectorize, unless there's a pattern or you can generate them in vectors.
Is there any way you can efficiently generate a and b in vectors, either as fixed-point or floating point? If not, inserting 4 FP values, or 8 16bit integers, might be worse than just scalar ops.
Fixed point
If a and b can be reused at all, or generated with fixed-point in the first place, this might be a good use-case for fixed-point math. (i.e. integers that represent value * 2^scale). SSE/AVX don't have a 8b*8b->16b multiply; the smallest elements are words, so you have to unpack bytes to words, but not all the way to 32bit. This means you can process twice as much data per instruction.
There's a _mm_maddubs_epi16 instruction which might be useful if b and a change infrequently enough, or you can easily generate a vector with alternating a*2^4 and b*2^1 bytes. Apparently it's really handy for bilinear interpolation, but it still gets the job done for us with minimal shuffling, if we can prepare an a and b vector.
float a, b;
uint8_t *buf;   // 16B-aligned pixel data, n bytes
const int logascale = 4, logbscale=1;
const int ascale = 1<<logascale;  // fixed point scale for a: 2^4
const int bscale = 1<<logbscale;  // fixed point scale for b: 2^1
const __m128i brescale = _mm_set1_epi8(1<<(logascale-logbscale));  // re-scale b to match a in the 16bit temporary result
for (i=0 ; i+16 <= n; i+=16) {
    //__m128i avec = get_scaled_a(i);
    //__m128i bvec = get_scaled_b(i);
    //__m128i ab_lo = _mm_unpacklo_epi8(avec, bvec);
    //__m128i ab_hi = _mm_unpackhi_epi8(avec, bvec);
    // Mask each value to 8 bits before combining, so sign-extension of a negative a can't clobber the b byte.
    __m128i abvec = _mm_set1_epi16( (int16_t)(((uint8_t)(int8_t)(bscale*b) << 8) | (uint8_t)(int8_t)(ascale*a)) );
    __m128i block = _mm_load_si128((const __m128i*)&buf[i]);  // call this { v[0] .. v[15] }
    __m128i lo = _mm_unpacklo_epi8(block, brescale);  // {v[0], 8, v[1], 8, ...}
    __m128i hi = _mm_unpackhi_epi8(block, brescale);  // {v[8], 8, v[9], 8, ...}
    lo = _mm_maddubs_epi16(lo, abvec);  // first arg is unsigned bytes, 2nd arg is signed bytes
    hi = _mm_maddubs_epi16(hi, abvec);
    // lo = { v[0]*(2^4*a) + 8*(2^1*b), ... }
    lo = _mm_srli_epi16(lo, logascale);  // truncate from scaled fixed-point to integer
    hi = _mm_srli_epi16(hi, logascale);
    // and re-pack. Logical, not arithmetic right shift means sign bits can't be set
    block = _mm_packus_epi16(lo, hi);
    _mm_store_si128((__m128i*)&buf[i], block);
}
// then a scalar cleanup loop
2^4 is an arbitrary choice. It leaves 3 non-sign bits for the integer part of a, and 4 fraction bits. So it effectively rounds a to the nearest 16th, and overflows if a is outside the range -8 to +7 and 15/16ths. 2^6 would give more fractional bits, and allow a from -2 to +1 and 63/64ths.
Since b is being added, not multiplied, its useful range is much larger, and fractional part much less useful. To represent it in 8 bits, rounding it to the nearest half still keeps a little bit of fractional information, but allows it to be [-64 : 63.5] without overflowing.
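To make the scaling concrete, here is a scalar sketch of what one 16-bit lane of the pmaddubsw version computes. The helper name scale_pixel_fixed8 is made up for illustration. It also differs from the loop above in two ways that are my assumptions: it rounds a and b to the nearest fixed-point step instead of truncating, and it clamps negative results to 0.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Scalar model of one pmaddubsw lane (hypothetical helper, not from the answer).
// a is stored with 4 fraction bits, b with 1 fraction bit.
// The caller must keep a in [-8, 7.9375] and b in [-64, 63.5].
inline uint8_t scale_pixel_fixed8(uint8_t x, float a, float b)
{
    const int logascale = 4, logbscale = 1;
    const int8_t a_fix = static_cast<int8_t>(std::lround(a * (1 << logascale)));
    const int8_t b_fix = static_cast<int8_t>(std::lround(b * (1 << logbscale)));
    const int brescale = 1 << (logascale - logbscale);   // 8: brings b up to a's scale

    // x * (16a) + 8 * (2b) = 16 * (a*x + b), the 16-bit temporary
    int sum = x * a_fix + brescale * b_fix;
    // Assumption: clamp negatives to 0 (pmaddubsw itself saturates at 32767).
    sum = std::min(std::max(sum, 0), 32767);
    const int y = sum >> logascale;                      // drop the 4 fraction bits
    return static_cast<uint8_t>(std::min(y, 255));       // same saturation as packuswb
}
```

For example, x = 100 with a = 1.5 and b = 10 gives 100*24 + 8*20 = 2560, and 2560 >> 4 = 160.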
For more precision, 16b fixed-point is a good choice. You can scale a and b up by 2^7 or something, to have 7b of fractional precision and still allow the integer part to be [-256 .. 255]. There's no multiply-and-add instruction for this case, so you'd have to do that separately. Good options for doing the multiply include:
_mm_mulhi_epu16: unsigned 16b*16b->high16 (bits [31:16]). Useful if a can't be negative
_mm_mulhi_epi16: signed 16b*16b->high16 (bits [31:16]).
_mm_mulhrs_epi16: signed 16b*16b->bits [30:15] of the 32b temporary, with rounding. With a good choice of scaling factor for a, this should be nicer. As I understand it, SSSE3 introduced this instruction for exactly this kind of use.
_mm_mullo_epi16: signed 16b*16b->low16 (bits [15:0]). This only allows 8 significant bits for a before the low16 result overflows, so I think all you gain over the _mm_maddubs_epi16 8bit solution is more precision for b.
To use these, you'd get scaled 16b vectors of a and b values, then:
unpack your bytes with zero (or pmovzx byte->word), to get signed words still in the [0..255] range
left shift the words by 7.
multiply by your a vector of 16b words, taking the upper half of each 16*16->32 result. (e.g. mul
right shift here if you wanted different scales for a and b, to get more fractional precision for a
add b to that.
right shift to do the final truncation back from fixed point to [0..255].
With a good choice of fixed-point scale, this should be able to handle a wider range of a and b, as well as more fractional precision, than 8bit fixed point.
If you don't left-shift your bytes after unpacking them to words, a has to be full-range just to get 8bits set in the high16 of the result. This would mean a very limited range of a that you could support without truncating your temporary to less than 8 bits during the multiply. Even _mm_mulhrs_epi16 doesn't leave much room, since it starts at bit 30.
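The steps above could look something like this with plain SSE2, using _mm_mulhi_epi16. This is only a sketch: the function name and the scales (a * 2^13, b * 2^4) are my choices for illustration, and it uses one a and one b for all 16 pixels, where real code would load per-pixel vectors.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Hypothetical helper: y = a*x + b for 16 pixels with 16-bit fixed point.
// a_fix = a * 2^13 (so a must be in [-4, 4)), b_fix = b * 2^4.
static inline __m128i scale16_fixed16(__m128i pixels, int16_t a_fix, int16_t b_fix)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i avec = _mm_set1_epi16(a_fix);
    const __m128i bvec = _mm_set1_epi16(b_fix);

    // unpack bytes with zero: words in [0..255]
    __m128i lo = _mm_unpacklo_epi8(pixels, zero);
    __m128i hi = _mm_unpackhi_epi8(pixels, zero);
    // left shift by 7: x * 2^7 still fits in a non-negative signed word
    lo = _mm_slli_epi16(lo, 7);
    hi = _mm_slli_epi16(hi, 7);
    // high half of the product: (x*2^7 * a*2^13) >> 16 = a*x * 2^4
    lo = _mm_mulhi_epi16(lo, avec);
    hi = _mm_mulhi_epi16(hi, avec);
    // add b at the same 2^4 scale, saturating instead of wrapping
    lo = _mm_adds_epi16(lo, bvec);
    hi = _mm_adds_epi16(hi, bvec);
    // arithmetic shift back to integer, so negative results stay negative
    lo = _mm_srai_epi16(lo, 4);
    hi = _mm_srai_epi16(hi, 4);
    // packuswb clamps to [0..255]
    return _mm_packus_epi16(lo, hi);
}
```

The arithmetic shift means negative results clamp to 0 in the final pack, which is an assumption about what you want for pixels.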
expand bytes to floats
If you can't efficiently generate fixed-point a and b values for every pixel, it may be best to convert your pixels to floats. This takes more unpacking/repacking, so latency and throughput are worse. It's worth looking into generating a and b with fixed point.
For packed-float to work, you still have to efficiently build a vector of a values for 4 adjacent pixels.
This is a good use-case for pmovzx (SSE4.1), because it can go directly from 8b elements to 32b. The other options are SSE2 punpck[l/h]bw/punpck[l/h]wd with multiple steps, or SSSE3 pshufb to emulate pmovzx. (You can do one 16B load and shuffle it 4 different ways to unpack it to four vectors of 32b ints.)
char *buf;
// const __m128i zero = _mm_setzero_si128();
for (i=0 ; i+16 <= n; i+=16) {
    __m128 a = get_a(i);
    __m128 b = get_b(i);
    // IDK why there isn't an intrinsic for using `pmovzx` as a load, because it takes a m32 or m64 operand, not m128. (unlike punpck*)
    __m128i unsigned_dwords = _mm_cvtepu8_epi32( _mm_loadu_si32(buf+i));  // load 4B at once.
    // Current GCC has a bug with _mm_loadu_si32, might want to use _mm_load_ss and _mm_castps_si128 until it's fixed.
    __m128 floats = _mm_cvtepi32_ps(unsigned_dwords);
    floats = _mm_fmadd_ps(floats, a, b);  // with FMA available, this might as well be 256b vectors, even with the inconvenience of the different lane-crossing semantics of pmovzx vs. punpck
    // or without FMA, do this with _mm_mul_ps and _mm_add_ps
    unsigned_dwords = _mm_cvtps_epi32(floats);
    // repeat 3 more times for buf+4, buf+8, and buf+12 (giving dwords0..dwords3), then:
    __m128i packed01 = _mm_packs_epi32(dwords0, dwords1);  // SSE2
    __m128i packed23 = _mm_packs_epi32(dwords2, dwords3);
    // packuswb wants SIGNED input, so do signed saturation on the first step
    // saturate into [0..255] range
    __m128i packedbytes = _mm_packus_epi16(packed01, packed23);  // SSE2
    _mm_store_si128((__m128i*)(buf+i), packedbytes);  // or storeu if buf isn't aligned.
}
// cleanup code to handle the odd up-to-15 leftover bytes, if n%16 != 0
(Re: a load that can be a memory source operand for pmovzxbd, see also Loading 8 chars from memory into an __m256 variable as packed single precision floats re: the problems compilers have with this.) And see also GCC bug 99754 - wrong code for _mm_loadu_si32 - reversed vector elements.
The previous version of this answer went from float->uint8 vectors with packusdw/packuswb, and had a whole section on workarounds for without SSE4.1. None of that masking-the-sign-bit after an unsigned pack is needed if you simply stay in the signed integer domain until the last pack. I assume this is the reason SSE2 only included signed pack from dword to word, but both signed and unsigned pack from word to byte. packusdw is only useful if your final goal is uint16_t, rather than further packing.
The last CPUs without SSE4.1 were Intel Conroe/Merom (first-gen Core2, from before late 2007) and AMD pre-Barcelona (before late 2007). If working-but-slow is acceptable for those CPUs, just write a version for AVX2 and a version for SSE4.1. Or SSSE3 (with 4x pshufb to emulate pmovzxbd of the four 32b elements of a register). pshufb is slow on Conroe, though, so if you care about CPUs without SSE4.1, write a specific version. Actually, Conroe/Merom also has slow xmm punpcklbw and so on (except for q->dq). 4x slow pshufb should still beat 6x slow unpacks. Vectorizing is a lot less of a win on pre-Wolfdale, because of the slow shuffles for unpacking and repacking. The fixed point version, with a lot less unpacking/repacking, will have an even bigger advantage there.
See the edit history for an unfinished attempt at using punpck before I realized how many extra instructions it was going to need. Removed it because this answer is long already, and another code block would be confusing.
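For reference, the plain-SSE2 unpack route mentioned earlier (punpck bytes to words, then words to dwords, then convert) can be sketched like this. It only covers the byte-to-float step, not the multiply-add or the repacking, and the function name is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Hypothetical helper: widen 16 unsigned bytes to 16 floats using only SSE2.
static inline void bytes_to_floats_sse2(const uint8_t *src, float *dst)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));

    // Interleaving with zero is a zero-extension: bytes -> words
    const __m128i w_lo = _mm_unpacklo_epi8(bytes, zero);   // elements 0..7
    const __m128i w_hi = _mm_unpackhi_epi8(bytes, zero);   // elements 8..15

    // words -> dwords, then int32 -> float (values are 0..255, so signed convert is fine)
    _mm_storeu_ps(dst +  0, _mm_cvtepi32_ps(_mm_unpacklo_epi16(w_lo, zero)));
    _mm_storeu_ps(dst +  4, _mm_cvtepi32_ps(_mm_unpackhi_epi16(w_lo, zero)));
    _mm_storeu_ps(dst +  8, _mm_cvtepi32_ps(_mm_unpacklo_epi16(w_hi, zero)));
    _mm_storeu_ps(dst + 12, _mm_cvtepi32_ps(_mm_unpackhi_epi16(w_hi, zero)));
}
```

That's 6 unpacks per 16 pixels, which is the count compared against 4x pshufb above.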
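I guess you're looking for the __m128 _mm_cvtpi8_ps(__m64 a) composite intrinsic. Note that it treats the bytes as signed; for pixel values above 127 you'd want the unsigned variant, _mm_cvtpu8_ps.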
Here is a minimal example:
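#include <xmmintrin.h>
#include <stdio.h>
int main() {
    unsigned char a[8] __attribute__((aligned(32))) = {1,2,3,4};  // the __m64 load reads 8 bytes; only the low 4 are converted
    float b[4] __attribute__((aligned(32)));
    _mm_store_ps(b, _mm_cvtpi8_ps(*(__m64*)a));
    _mm_empty();  // this intrinsic uses MMX registers, so clear the MMX state afterwards
    printf("%f %f, %f, %f\n", b[0], b[1], b[2], b[3]);
    return 0;
}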
I'm trying to make software in which users can move over a wide range (a diameter of at least 1 Mly, with a position precision of at least 0.1 mm). I'm thinking of using a 128-bit fixed-point number to represent position. However, mathematical calculations (e.g. distance, sqrt, division, integration) are not well suited to fixed-point (or integer) types, so I use double or single floating point for the math. (Usually on the result of subtracting two int128 coordinates to get a relative distance, so the value is usually small enough not to lose too much precision, or large differences don't need that much precision.)
So I ran into a problem when implementing fixed128: how can I do fast int128-double conversion with AVX2 SIMD? (AVX512 is not widespread, so I can't use it in this software.)
What I've tried (a bit long, so feel free to skip it):
I've referred to this answer: How to efficiently perform double/int64 conversions with SSE/AVX?
wim's answer showed that to convert int64 to double, it is efficient to split the integer into pieces shorter than 52 bits for use as significands, concatenate exponent bits on the left, and then do FP math to remove the extra exponents.
So I tried to split uint128 (consisting of two uint64s: ilow and ihigh) into three parts:
part1 v_lo: ilow's low 48 bits;
part2 v_mi: ilow's high 16 bits and ihigh's low 16bits;
part3 v_hi: ihigh's high 48 bits;
We can get v_lo and v_hi with almost the same method as wim's "uint64_to_double_fast_precise", but part2 "v_mi" becomes a problem. It adds 4 instructions, which is more than low+high (1+2). (My code follows.)
Maybe there's a faster way using some magical swizzle with permute/shuffle/unpackhi/unpacklo/broadcast/blend or a combination of them? These swizzle intrinsics really swizzled me.
my code for ufixed128-double conversion:
#include <immintrin.h>
#include <cmath>
constexpr auto fix128_frac_bits = 32;
__m256d ufixed128_to_double_fast(const __m256i& ihigh, const __m256i& ilow)
{
    __m256d magic_d_hm = _mm256_set1_pd(pow(2.0, 52 + 48 - fix128_frac_bits) + pow(2.0, 52 + 80 - fix128_frac_bits));
    __m256d magic_d_lo = _mm256_set1_pd(pow(2.0, 52 - fix128_frac_bits));
    __m256i magic_i_lo = _mm256_castpd_si256(magic_d_lo);
    __m256i magic_i_mi = _mm256_castpd_si256(_mm256_set1_pd(pow(2.0, 52 + 48 - fix128_frac_bits)));
    __m256i magic_i_hi = _mm256_castpd_si256(_mm256_set1_pd(pow(2.0, 52 + 80 - fix128_frac_bits)));
    //majik operations
    __m256i v_lo = _mm256_blend_epi16(ilow, magic_i_lo, 0b10001000);  // low 48 bits of ilow under the exponent
    __m256i v_mi = _mm256_slli_epi64(ihigh, 16);
    __m256i losr48 = _mm256_srli_epi64(ilow, 48);
    v_mi = _mm256_xor_si256(v_mi, losr48);
    v_mi = _mm256_blend_epi32(magic_i_mi, v_mi, 0b01010101);  // middle 32 bits under the exponent
    __m256i v_hi = _mm256_srli_epi64(ihigh, 16);  // high 48 bits of ihigh
    v_hi = _mm256_xor_si256(v_hi, magic_i_hi);
    //final fp
    __m256d loresult = _mm256_sub_pd(_mm256_castsi256_pd(v_lo), magic_d_lo);
    __m256d result = _mm256_sub_pd(_mm256_castsi256_pd(v_hi), magic_d_hm);
    result = _mm256_add_pd(result, _mm256_castsi256_pd(v_mi));
    result = _mm256_add_pd(result, loresult);
    return result;
}
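For checking the SIMD routine against known values, a scalar reference along these lines might help. It is only a sketch: the names are hypothetical, it assumes the same 32 fraction bits, and because it rounds each 64-bit half to double separately it can differ from a correctly rounded conversion in the last bit.

```cpp
#include <cmath>
#include <cstdint>

constexpr int ref_frac_bits = 32;   // assumed to mirror fix128_frac_bits

// Hypothetical scalar reference: value = (ihigh * 2^64 + ilow) / 2^ref_frac_bits
inline double ufixed128_to_double_ref(uint64_t ihigh, uint64_t ilow)
{
    // ldexp scales by a power of two exactly (barring overflow/underflow)
    const double hi = std::ldexp(static_cast<double>(ihigh), 64 - ref_frac_bits);
    const double lo = std::ldexp(static_cast<double>(ilow), -ref_frac_bits);
    return hi + lo;
}
```

For example, ihigh = 0 and ilow = 1 << 32 represents exactly 1.0.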
Edit: I've successfully made a signed fixed128_to_double: just add 2.0^(127 - fix128_frac_bits) (as fp64) into the constants 'magic_d_hm' and 'magic_i_hi'.
But I have no idea how to do a fast 'double_to_int128' or 'double_to_uint128'. I can do it faster than a scalar C++ 'static_cast' conversion by using bit operations (mask out the exponent and sign, concatenate the hidden 1, and shift left/right), but that's much slower than those magical ops and uses a lot of registers for constants.
Can anyone help me?
If I'm in a blind alley, and there's a better method than fixed128/double-double to represent the wide range position, please tell me. (Except floating-origin or floating-grid(int64-double):they are unstable for physics, or exposes a lot of complexity to the upper construction, or hard to do AVX acceleration.)
About double-double: I planned to compare performance between fixed128 and double-double after highly optimized them, and decide which to use after that. That's another work I'm doing.
my current codes: https://github.com/Veloctor/Int128
At runtime I have 2 ranges defined by their uint32_t borders a..b and c..d. The first range tends to be much greater than the second: 8 < (b - a) / (d - c) < 64.
Exact limits: a >= 0, b <= 2^31 - 1, c >= 0, d <= 2^20 - 1.
I need a routine that performs linear mapping of an integer from the first range onto the second one: f(uint32_t x) -> round_to_uint32_t((float)(x - a) / (b - a) * (d - c) + c).
When b - a >= d - c it is important to maintain the ratio as close to ideal as possible; otherwise, in cases when an element from [a; b] can be mapped onto more than one integer from [c; d], it is okay to return any of these integers.
Sounds like a simple ratio problem and was already answered in many questions like
Convert a number range to another range, maintaining ratio
but here I need a really really fast solution.
This routine is a pivotal part of a specialized sorting algorithm and will be called at least once for every element of a sorted array.
SIMD solution is also acceptable if it doesn't drop overall performance.
Actual runtime division (FP and integer) is very slow so you definitely want to avoid that. The way you wrote that expression probably compiles to include a division because FP math is not associative (without -ffast-math); the compiler can't turn x / foo * bar into x * (bar/foo) for you, even though that's very good with loop-invariant bar/foo. You do need either floating point or 64-bit integers to avoid overflow in a multiply, but only FP lets you reuse a non-integer loop-invariant division result.
_mm256_fmadd_ps looks like the obvious way to go, with a pre-computed loop-invariant value for the multiplier (d - c) / (b - a). If float rounding isn't a problem for doing it strictly in order (multiply then divide), it's probably ok to do this inexact division first, outside the loop. Like
_mm256_set1_ps((d - c) / (double)(b - a)). Using double for this calculation avoids rounding error during conversion to FP of the division operands.
You're reusing the same a,b,c,d for many x, presumably coming from contiguous memory. You're using the result as part of a memory address so you do eventually need the results back from SIMD into integer registers, unfortunately. (Possibly with AVX512 scatter stores you could avoid that.)
Modern x86 CPUs have 2/clock load throughput so probably your best bet for getting 8x uint32_t back into integer registers is a vector store / integer reload, instead of spending 2 uops per element for ALU shuffle stuff. That has some latency so I'd suggest converting into a tmp buffer of maybe 16 or 32 ints (64 or 128 bytes), i.e. 2x or 4x __m256i before looping through that scalar.
Or maybe alternate converting and storing one vector then looping over the 8 elements of another one that you converted earlier. i.e. software pipelining. Out-of-order execution can hide latency but you're already going to be stretching its latency-hiding capability for cache misses for whatever you're doing with memory.
Depending on your CPU (e.g. Haswell or some Skylake), using 256-bit vector instructions might cap your max turbo slightly lower than it would otherwise. You might consider only doing vectors of 4 at once but then you're spending more uops per element.
If not SIMD, then even scalar C++ fma() is still good, for vfmadd213sd, but using intrinsics is a very convenient way to get rounding (instead of truncation) from float -> int (vcvtps2dq rather than vcvttps2dq).
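A scalar version of that idea might look like the sketch below: compute the scale once in double, then do one FMA and one round-to-nearest conversion per element. The struct and function names are made up for illustration, and it relies on the stated input limits (x - a fits in a non-negative int32_t).

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical container for the loop-invariant values.
struct RangeMap {
    uint32_t a;
    float scale;   // (d - c) / (b - a), divided once in double
    float c;
};

inline RangeMap make_range_map(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return { a, static_cast<float>((d - c) / static_cast<double>(b - a)), static_cast<float>(c) };
}

inline uint32_t map_value(const RangeMap &m, uint32_t x)
{
    // (x - a) * scale + c with a single rounding, then round to nearest
    // (lrint uses the current rounding mode, nearest-even by default)
    const float in = static_cast<float>(static_cast<int32_t>(x - m.a));
    return static_cast<uint32_t>(std::lrint(std::fma(in, m.scale, m.c)));
}
```

Call make_range_map once outside the loop and map_value per element; no division is left in the hot path.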
Note that uint32_t <-> float conversion isn't directly available until AVX512. For scalar you can just convert to/from int64_t with truncation / zero-extension for the unsigned low half.
It's very convenient that (as discussed in comments) your inputs are range-limited so if you interpret them as signed integers they have the same value (signed non-negative). Both x and x-a (and b-a) are known to be positive and <= INT32_MAX i.e 0x7FFFFFFF. (Or at least non-negative. Zero is fine.)
Float Rounding
For SIMD, single-precision float is very good for SIMD throughput. Efficient packed-conversion to/from signed int32_t. But not every int32_t can be exactly represented as a float. Larger values get rounded to the nearest even, nearest multiple of 2^2, 2^3, or more the farther above 2^24 the value is.
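A quick way to see where float stops being exact is a round-trip check like this one (the helper name is just for illustration). The comparison is done in double so that nothing out of int32_t range is ever converted back to an integer.

```cpp
#include <cstdint>

// True if v is exactly representable as a float.
inline bool exact_in_float(int32_t v)
{
    return static_cast<double>(static_cast<float>(v)) == static_cast<double>(v);
}
```

2^24 = 16777216 is exact, 16777217 is not (it rounds to the even neighbour 16777216), and 16777218 is exact again.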
Using SIMD double is possible but requires some shuffling.
I don't think float is usually a problem for the formula as-written with (float)(x-a). If the b-a input range is large, that means both ranges are large and rounding error isn't going to map all possible x values into the same output. Depending on the multiplier, the input rounding error might be worse than the output rounding error, maybe leaving some representable output floats unused for higher x-a values.
But if we want to factor out the -a * (d - c) / (b - a) part and combine it with the +c at the end, then
We potentially have precision loss from catastrophic cancellation in that value to be added.
We need to do (float)x on the raw input value. If a is huge and b-a is small, i.e. a small range near the top of the possible input range, rounding error can map all possible x values to the same float.
To make best use of FMA, we want to do the +c before converting back to integer, which again risks output rounding error if the d-c is a small output range but c is huge. In your case not a problem; with d <= 2^20 - 1 we know that float can exactly represent every output integer value in that c..d range.
If you didn't have the input range constraint, you could range-shift to/from signed before the scaling by using integer (x-a)+0x80000000U on input and ...+c+0x80000000U on output (after rounding to nearest int32_t). But that would introduce huge float rounding error for small uint32_t inputs (close to 0) which get range-shifted to close to INT_MIN.
We don't need to range-shift for the b-a or d-c because the + or - or XOR with 0x80000000U would cancel out in the subtractions.
The const vectors should be hoisted out of a loop by the compiler after this inlines, or you can do that manually.
This requires AVX1 + FMA (e.g. AMD Piledriver or Intel Haswell or later). Untested, sorry I didn't even throw this on Godbolt to see if it compiles.
// fastest but not safe if b-a is small and a > 2^24
static inline
__m256i range_scale_fast_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    // avoid rounding errors when computing the scale factor, but convert double->float on the final result
    double scale_scalar = (d - c) / (double)(b - a);
    const __m256 scale = _mm256_set1_ps(scale_scalar);
    // convert a to double before negating: -a on a uint32_t would wrap around
    const __m256 add = _mm256_set1_ps(-(double)a*scale_scalar + c);
    //    (x-a) * scale + c
    //  = x * scale + (-a*scale + c)   but with different rounding error from doing -a*scale + c
    __m256 in = _mm256_cvtepi32_ps(data);
    __m256 out = _mm256_fmadd_ps(in, scale, add);
    return _mm256_cvtps_epi32(out);  // convert back with round to nearest-even
    // _mm256_cvttps_epi32 truncates, matching C rounding; maybe good for scalar testing
}
Or a safer version, doing the input range-shift with integer: You could easily avoid FMA here if necessary for portability and use an integer add for the output, too. But we know the output range is small enough that float can always exactly represent any integer in it.
static inline
__m256i range_scale_safe_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    // avoid rounding errors when computing the scale factor, but convert double->float on the final result
    const __m256 scale = _mm256_set1_ps((d - c) / (double)(b - a));
    const __m256 cvec = _mm256_set1_ps(c);
    // add can more easily fold a load of a memory operand than sub because it's commutative. Only some compilers will do this for you.
    // (256-bit integer add needs AVX2, not just AVX1.)
    __m256i in_offset = _mm256_add_epi32(data, _mm256_set1_epi32(-a));
    __m256 in_fp = _mm256_cvtepi32_ps(in_offset);
    __m256 out = _mm256_fmadd_ps(in_fp, scale, cvec);  // in*scale + c
    return _mm256_cvtps_epi32(out);
}
Without FMA you could still use vmulps. You might as well convert back to integer before adding c if you're doing that, although vaddps would be safe.
You might use this in a loop like
void foo(uint32_t *arr, ptrdiff_t len, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    if (len < 24) { /* special case */ return; }
    alignas(32) uint32_t tmpbuf[16];
    // peel half of first iteration for software pipelining / loop rotation
    __m256i arrdata = _mm256_loadu_si256((const __m256i*)&arr[0]);
    __m256i outrange = range_scale_safe_fma(arrdata, a, b, c, d);
    _mm256_store_si256((__m256i*)tmpbuf, outrange);
    // could have used an unsigned loop counter
    // since we probably just need an if() special case handler anyway for small len which could give len-23 < 0
    for (ptrdiff_t i = 0 ; i < len-(15+8) ; i+=16 ) {
        // prep next 8 elements
        arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+8]);
        outrange = range_scale_safe_fma(arrdata, a, b, c, d);
        _mm256_store_si256((__m256i*)&tmpbuf[8], outrange);
        // use first 8 elements
        for (int j=0 ; j<8 ; j++) {
            // use tmpbuf[j] which corresponds to arr[i+j]
        }
        // prep 8 more for next iteration
        arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+16]);
        outrange = range_scale_safe_fma(arrdata, a, b, c, d);
        _mm256_store_si256((__m256i*)&tmpbuf[0], outrange);
        // use 2nd 8 elements
        for (int j=8 ; j<16 ; j++) {
            // use tmpbuf[j] which corresponds to arr[i+j]
        }
    }
    // use tmpbuf[0..7]
    // then cleanup: one vector at a time until < 8 or < 4 with 128-bit vectors, then scalar
}
// then cleanup: one vector at a time until < 8 or < 4 with 128-bit vectors, then scalar
These variable-names sound dumb but I couldn't think of anything better.
This software pipelining is an optimization; you can just get it working / try it out with a single vector at a time used right away. (Optimize the reload of the first element from a reload to vmovd using _mm_cvtsi128_si32(_mm256_castsi256_si128(outrange)) if you want.)
Special cases
If there are cases where you know (b - a) is a power of 2, you could bitscan with tzcnt or bsf, then multiply. (There are intrinsics for those, like GNU C __builtin_ctz() to count trailing zeros.)
Or can you ensure that (b - a) is always a power of 2?
Or better, if (b - a) / (d - c) is an exact power of 2 the whole thing can just be sub / right shift / add.
If you can't always ensure that you'd still need the general case sometimes, but maybe possible to do that efficiently.
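As a sketch of the sub / right shift / add case: if (b - a) / (d - c) is exactly 2^shift, the mapping needs no multiply at all. Both helper names are hypothetical, and this version truncates instead of rounding to nearest.

```cpp
#include <cstdint>

// Returns log2(n) if n is a power of two, otherwise -1.
inline int exact_log2(uint32_t n)
{
    if (n == 0 || (n & (n - 1)) != 0) return -1;
    int shift = 0;
    while ((n >>= 1) != 0) ++shift;
    return shift;
}

// Valid only when (b - a) == (d - c) << shift. Truncates the result.
inline uint32_t map_pow2(uint32_t x, uint32_t a, uint32_t c, int shift)
{
    return ((x - a) >> shift) + c;
}
```

You'd compute exact_log2((b - a) / (d - c)) once (after checking the division is exact), and fall back to the general FMA path when it returns -1.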
I have the function below:
void CopyImageBitsWithAlphaRGBA(unsigned char *dest, const unsigned char *src, int w, int stride, int h,
    unsigned char minredmask, unsigned char mingreenmask, unsigned char minbluemask, unsigned char maxredmask, unsigned char maxgreenmask, unsigned char maxbluemask)
{
    auto pend = src + w * h * 4;
    for (auto p = src; p < pend; p += 4, dest += 4)
    {
        dest[0] = p[0]; dest[1] = p[1]; dest[2] = p[2];
        if ((p[0] >= minredmask && p[0] <= maxredmask) || (p[1] >= mingreenmask && p[1] <= maxgreenmask) || (p[2] >= minbluemask && p[2] <= maxbluemask))
            dest[3] = 255;
        else
            dest[3] = 0;
    }
}
What it does is it copies a 32 bit bitmap from one memory block to another, setting the alpha channel to fully transparent when the pixel color falls within a certain color range.
How do I make this use SSE/AVX in VC++ 2017? Right now it's not generating vectorized code. Failing an automatic way of doing it, what functions can I use to do this myself?
Because really, I'd imagine testing if bytes are in a range would be one of the most obviously useful operations possible, but I can't see any built in function to take care of it.
I don't think you're going to get a compiler to auto-vectorize as well as you can do by hand with Intel's intrinsics. (err, as well as I can do by hand anyway :P).
Possibly once we manually vectorize it, we can see how to hand-hold a compiler with scalar code that works that way, but we really need packed-compare into a 0/0xFF with byte elements, and it's hard to write something in C that compilers will auto-vectorize well. The default integer promotions mean that most C expressions actually produce 32-bit results, even when you use uint8_t, and that often tricks compilers into unpacking 8-bit to 32-bit elements, costing a lot of shuffles on top of the automatic factor of 4 throughput loss (fewer elements per register), like in @harold's small tweak to your source.
SSE/AVX (before AVX512) has signed comparisons for SIMD integer, not unsigned. But you can range-shift things to signed -128..127 by subtracting 128. XOR (add-without-carry) is slightly more efficient on some CPUs, so you actually just XOR with 0x80 to flip the high bit. But mathematically you're subtracting 128 from a 0..255 unsigned value, giving a -128..127 signed value.
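As a sketch of that range-shift, an unsigned byte greater-than can be built from the signed pcmpgtb by flipping the high bit of both inputs. The function name is made up; SSE2 has no such intrinsic.

```cpp
#include <emmintrin.h>  // SSE2

// 0xFF in each byte where a > b as unsigned values, else 0x00.
static inline __m128i cmpgt_epu8_sse2(__m128i a, __m128i b)
{
    // XOR with 0x80 maps unsigned 0..255 onto signed -128..127, preserving order
    const __m128i flip = _mm_set1_epi8(static_cast<char>(0x80));
    return _mm_cmpgt_epi8(_mm_xor_si128(a, flip), _mm_xor_si128(b, flip));
}
```

Without the flip, 200 would be treated as -56 and compare less than 100.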
It's even still possible to implement the "unsigned compare trick" of (x-min) < (max-min). (For example, detecting alphabetic ASCII characters). As a bonus, we can bake the range-shift into that subtract. If x<min, it wraps around and becomes a large value greater than max-min. This obviously works for unsigned, but it does in fact work (with a range-shifted max-min) with SSE/AVX2 signed-compare instructions. (A previous version of this answer claimed this trick only worked if max-min < 128, but that's not the case. x-min can't wrap all the way around and become lower than max-min, or get into that range if it started above max).
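In scalar form the trick looks like this (a sketch with hypothetical names, using <= so that both ends of the range are included):

```cpp
#include <cstdint>

// One unsigned compare instead of two: if x < min, x - min wraps to a large value.
inline bool in_range_inclusive(uint8_t x, uint8_t min, uint8_t max)
{
    return static_cast<uint8_t>(x - min) <= static_cast<uint8_t>(max - min);
}

// The ASCII example: force lower case with | 0x20, then one range check.
inline bool is_ascii_alpha(uint8_t ch)
{
    return in_range_inclusive(static_cast<uint8_t>(ch | 0x20), 'a', 'z');
}
```

The casts back to uint8_t matter: integer promotion would otherwise do the subtraction in int, where it can't wrap.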
An earlier version of this answer had code that made the range exclusive, i.e. not including the ends, so even redmin=0 / redmax=255 would exclude pixels with red=0 or red=255. But I solved that by comparing the other way (thanks to ideas from @Nejc's and @chtz's answers).
@chtz's idea of using a saturating add/sub instead of a compare is very cool. If you arrange things so saturation means in-range, it works for an inclusive range. (And you can set the Alpha component to a known value by choosing a min/max that makes all 256 possible inputs in-range). This lets us avoid range-shifting to signed, because unsigned-saturation is available.
We can combine the sub/cmp range-check with the saturation trick to do sub (wraps on out-of-bounds low) / subs (only reaches zero if the first sub didn't wrap). Then we don't need an andnot or or to combine two separate checks on each component; we already have a 0 / non-zero result in one vector.
So it only takes two operations to give us a 32-bit value for the whole pixel that we can check. Iff all 3 RGB components are in-range, that element will have a specific value. (Because we've arranged for the Alpha component to already give a known value, too). If any of the 3 components are out-of-range, it will have some other value.
If you do this the other way, so saturation means out-of-range, then you have an exclusive range in that direction, because you can't choose a limit such that no value reaches 0 or reaches 255. You can always saturate the alpha component to give yourself a known value there, regardless of what it means for the RGB components. An exclusive range would let you abuse this function to be always-false by choosing a range that no pixel could ever match. (Or if there's a third condition, besides per-component min/max, then maybe you want an override).
The obvious thing would be to use a packed-compare instruction with 32-bit element size (_mm256_cmpeq_epi32 / vpcmpeqd) to generate a 0xFF or 0x00 (which we can apply / blend into the original RGB pixel value) for in/out of range.
// AVX2 core idea: wrapping-compare trick with saturation to achieve unsigned compare
__m256i tmp = _mm256_sub_epi8(src, min_values); // wraps to high unsigned if below min
__m256i RGB_inrange = _mm256_subs_epu8(tmp, max_minus_min); // unsigned saturation to 0 means in-range
__m256i new_alpha = _mm256_cmpeq_epi32(RGB_inrange, _mm256_setzero_si256());
// then blend the high byte of each element with RGB from the src vector
__m256i alpha_replaced = _mm256_blendv_epi8(new_alpha, src, _mm256_set1_epi32(0x00FFFFFF)); // alpha from new_alpha, RGB from src
Note that an SSE2 version would only need one MOVDQA instruction to copy src; the same register is the destination for every instruction.
Also note that you could saturate the other direction: add then adds (with (256-max) and (256-(min-max)), I think) to saturate to 0xFF for in-range. This could be useful with AVX512BW if you use zero-masking with a fixed mask (e.g. for alpha) or variable mask (for some other condition) to exclude a component based on some other condition. AVX512BW zero-masking for the sub/subs version would consider components in-range even when they aren't, which could also be useful.
But extending that to AVX512 requires a different approach: AVX512 compares produce a bit-mask (in a mask register), not a vector, so we can't turn around and use the high byte of each 32-bit compare result separately.
Instead of cmpeq_epi32, we can produce the value we want in the high byte of each pixel using carry/borrow from a subtract, which propagates left to right.
0x00000000 - 1 = 0xFFFFFFFF # high byte = 0xFF = new alpha
0x00?????? - 1 = 0x00?????? # high byte = 0x00 = new alpha
Where ?????? has at least one non-zero bit, so it's a 32-bit number >=0 and <=0x00FFFFFF
Remember we choose an alpha range that makes the high byte always zero
i.e. _mm256_sub_epi32(RGB_inrange, _mm256_set1_epi32(1)). We only need the high byte of each 32-bit element to have the alpha value we want, because we use a byte-blend to merge it with the source RGB values. For AVX512, this avoids a VPMOVM2D zmm1, k1 instruction to convert a compare result back into a vector of 0/-1, or (much more expensive) to interleave each mask bit with 3 zeros to use it for a byte-blend.
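The borrow trick is easy to check in scalar code for a single 32-bit element. In this sketch rgb_inrange stands for one element of the RGB_inrange vector, whose high byte is assumed to be zero already; the function name is made up.

```cpp
#include <cstdint>

// High byte of (element - 1): 0xFF only when the whole element was zero.
inline uint8_t alpha_from_borrow(uint32_t rgb_inrange)
{
    return static_cast<uint8_t>((rgb_inrange - 1u) >> 24);
}
```

Any non-zero value up to 0x00FFFFFF absorbs the borrow in its low 24 bits, so the high byte stays 0.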
This sub instead of cmp has a minor advantage even for AVX2: sub_epi32 runs on more ports on Skylake (p0/p1/p5 vs. p0/p1 for pcmpgt/pcmpeq). On all other CPUs, vector integer add/sub run on the same ports as vector integer compare. (Agner Fog's instruction tables).
Also, if you compile _mm256_cmpeq_epi32() with -march=native on a CPU with AVX512, or otherwise enable AVX512 and then compile normal AVX2 intrinsics, some compilers will stupidly use AVX512 compare-into-mask and then expand back to a vector instead of just using the VEX-coded vpcmpeqd. Thus, we use sub instead of cmp even for the _mm256 intrinsics version, because I already spent the time to figure it out and show that it's at least as efficient in the normal case of compiling for regular AVX2. (Although _mm256_setzero_si256() is cheaper than set1(1); vpxor can zero a register cheaply instead of loading a constant, but this setup happens outside the loop.)
#include <immintrin.h>
#ifdef __AVX2__
// inclusive min and max
__m256i setAlphaFromRangeCheck_AVX2(__m256i src, __m256i mins, __m256i max_minus_min)
{
    __m256i tmp = _mm256_sub_epi8(src, mins); // out-of-range wraps to a high unsigned value
    // (x-min) <= (max-min) equivalent to:
    // (x-min) - (max-min) saturates to zero
    __m256i RGB_inrange = _mm256_subs_epu8(tmp, max_minus_min);
    // 0x00000000 for in-range pixels, 0x00?????? (some higher value) otherwise
    // this has minor advantages over compare against zero, see full comments on Godbolt
    __m256i new_alpha = _mm256_sub_epi32(RGB_inrange, _mm256_set1_epi32(1));
    // 0x00000000 - 1 = 0xFFFFFFFF
    // 0x00?????? - 1 = 0x00?????? high byte = new alpha value
    const __m256i RGB_mask = _mm256_set1_epi32(0x00FFFFFF); // blend mask
    // without AVX512, the only byte-granularity blend is a 2-uop variable-blend with a control register
    // On Ryzen, it's only 1c latency, so probably 1 uop that can only run on one port. (1c throughput).
    // For 256-bit, that's 2 uops of course.
    __m256i alpha_replaced = _mm256_blendv_epi8(new_alpha, src, RGB_mask); // RGB from src, 0/FF from new_alpha
    return alpha_replaced;
}
#endif // __AVX2__
Set up vector args for this function and loop over your array with _mm256_load_si256 / _mm256_store_si256. (Or loadu/storeu if you can't guarantee alignment.)
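As a rough illustration of that setup, here is an SSE2 port of the same idea together with a caller loop. This is a sketch, not the code behind the Godbolt link: the names setAlphaFromRangeCheck_SSE2 and setAlphaInRange are made up, pixels are assumed to be 32-bit RGBA with R in the lowest byte, and each max is assumed to be >= its min. SSE2 has no byte blend, so AND/ANDNOT/OR stands in for vpblendvb.

```cpp
#include <emmintrin.h>   // SSE2
#include <cstddef>
#include <cstdint>

static inline __m128i setAlphaFromRangeCheck_SSE2(__m128i src, __m128i mins,
                                                  __m128i max_minus_min) {
    __m128i tmp = _mm_sub_epi8(src, mins);                 // below-min wraps to a large unsigned byte
    __m128i excess = _mm_subs_epu8(tmp, max_minus_min);    // every byte 0 iff the pixel is in range
    __m128i new_alpha = _mm_sub_epi32(excess, _mm_set1_epi32(1)); // high byte: 0xFF or 0x00
    const __m128i rgb_mask = _mm_set1_epi32(0x00FFFFFF);
    return _mm_or_si128(_mm_and_si128(src, rgb_mask),          // RGB from src
                        _mm_andnot_si128(rgb_mask, new_alpha)); // alpha from new_alpha
}

void setAlphaInRange(std::uint32_t* px, std::size_t n,
                     std::uint8_t minR, std::uint8_t minG, std::uint8_t minB,
                     std::uint8_t maxR, std::uint8_t maxG, std::uint8_t maxB) {
    // Alpha lane: min = 0 and (max - min) = 255, so its saturating subtract is always 0.
    const std::uint32_t lo = std::uint32_t(minR) | (std::uint32_t(minG) << 8) |
                             (std::uint32_t(minB) << 16);
    const std::uint32_t span = std::uint32_t(std::uint8_t(maxR - minR)) |
                               (std::uint32_t(std::uint8_t(maxG - minG)) << 8) |
                               (std::uint32_t(std::uint8_t(maxB - minB)) << 16) |
                               0xFF000000u;
    const __m128i mins = _mm_set1_epi32(static_cast<int>(lo));
    const __m128i max_minus_min = _mm_set1_epi32(static_cast<int>(span));

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {                            // 4 pixels per vector
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(px + i));
        v = setAlphaFromRangeCheck_SSE2(v, mins, max_minus_min);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(px + i), v);
    }
    for (; i < n; ++i) {                                    // scalar tail
        const std::uint32_t p = px[i];
        const std::uint32_t r = p & 0xFF, g = (p >> 8) & 0xFF, b = (p >> 16) & 0xFF;
        const bool in = r >= minR && r <= maxR && g >= minG && g <= maxG &&
                        b >= minB && b <= maxB;
        px[i] = (p & 0x00FFFFFFu) | (in ? 0xFF000000u : 0u);
    }
}
```

The constants are built once outside the loop, which is the same hoisting the compilers do for the AVX2 version after inlining.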
This compiles very efficiently (Godbolt Compiler explorer) with gcc, clang, and MSVC. (AVX2 version on Godbolt is good, AVX512 and SSE versions are still a mess, not all the tricks applied to them yet.)
;; MSVC's inner loop from a caller that loops over an array with it:
;; see the Godbolt link
vmovdqu ymm3, YMMWORD PTR [rdx+rax*4]
vpsubb ymm0, ymm3, ymm7
vpsubusb ymm1, ymm0, ymm6
vpsubd ymm2, ymm1, ymm5
vpblendvb ymm3, ymm2, ymm3, ymm4
vmovdqu YMMWORD PTR [rcx+rax*4], ymm3
add eax, 8
cmp eax, r8d
jb SHORT $LL4#
So MSVC managed to hoist the constant setup after inlining. We get similar loops from gcc/clang.
The loop has 4 vector ALU instructions, one of which takes 2 uops. Total 5 vector ALU uops. But total fused-domain uops on Haswell/Skylake = 9 with no unrolling, so with luck this can run at 32 bytes (1 vector) per 2.25 clock cycles (9 uops at 4 fused-domain uops per clock). It could come close to actually achieving that with data hot in L1d or L2 cache, but L3 or memory would be a bottleneck. With unrolling, it could maybe bottleneck on L2 cache bandwidth.
An AVX512 version (also included in the Godbolt link) only needs 1 uop to blend, and could run faster in vectors per cycle, thus more than twice as fast using 512-bit vectors.
This is one possible way to make this function work with SSE instructions. I used SSE instead of AVX because I wanted to keep the answer simple. Once you understand how the solution works, rewriting the function with AVX intrinsics should not be much of a problem though.
EDIT: please note that my approach is very similar to one by PeterCordes, but his code should be faster because he uses AVX. If you want to rewrite the function below with AVX intrinsics, change step value to 8.
#include <emmintrin.h> // SSE2
#include <cstdint>

void CopyImageBitsWithAlphaRGBA(
    unsigned char *dest,
    const unsigned char *src, int w, int stride, int h,
    unsigned char minred, unsigned char mingre, unsigned char minblu,
    unsigned char maxred, unsigned char maxgre, unsigned char maxblu)
{
    // Bias every byte by 128 (XOR 0x80) so that signed compares order unsigned values correctly.
    // The limits are built from unsigned values so no byte sign-extends into its neighbours.
    const uint32_t low = 0x80;  // -128 as a signed byte: alpha is never below it
    const uint32_t high = 0x7f; // 127 as a signed byte: alpha is never above it
    const uint32_t mnr = minred ^ 0x80u;
    const uint32_t mng = mingre ^ 0x80u;
    const uint32_t mnb = minblu ^ 0x80u;
    const int32_t lowest = (int32_t)(mnr | (mng << 8) | (mnb << 16) | (low << 24));
    const uint32_t mxr = maxred ^ 0x80u;
    const uint32_t mxg = maxgre ^ 0x80u;
    const uint32_t mxb = maxblu ^ 0x80u;
    const int32_t highest = (int32_t)(mxr | (mxg << 8) | (mxb << 16) | (high << 24));
    // SSE
    const int step = 4; // pixels per 16-byte vector
    const int sse_width = (w / step) * step;
    for (int y = 0; y < h; ++y)
    {
        for (int x = 0; x < w; x += step)
        {
            if (x == sse_width)
                x = w - step; // leftover pixels: redo the last full vector, overlapping (needs w >= step)
            int ptr_offset = y * stride + x * 4; // x counts pixels, 4 bytes each; stride is in bytes
            const unsigned char* src_ptr = src + ptr_offset;
            unsigned char* dst_ptr = dest + ptr_offset;
            __m128i loaded = _mm_loadu_si128((const __m128i*)src_ptr);
            // subtract 128 from every 8-bit int
            __m128i subtracted = _mm_sub_epi8(loaded, _mm_set1_epi8((char)0x80));
            // greater than top limit?
            __m128i masks_hi = _mm_cmpgt_epi8(subtracted, _mm_set1_epi32(highest));
            // lower than bottom limit?
            __m128i masks_lo = _mm_cmplt_epi8(subtracted, _mm_set1_epi32(lowest));
            // perform OR operation on both masks
            __m128i combined = _mm_or_si128(masks_hi, masks_lo);
            // are 32-bit integers equal to zero?
            __m128i eqzer = _mm_cmpeq_epi32(combined, _mm_setzero_si128());
            __m128i shifted = _mm_slli_epi32(eqzer, 24);
            // EDIT: fixed a bug:
            __m128i alpha_unmasked = _mm_and_si128(loaded, _mm_set1_epi32(0x00ffffff));
            __m128i result = _mm_or_si128(alpha_unmasked, shifted);
            _mm_storeu_si128((__m128i*)dst_ptr, result);
        }
    }
}
EDIT: as @PeterCordes stated in the comments, the code included a bug that is now fixed.
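The function above leans on one trick: SSE2 only has a signed byte compare, so every byte is biased by 128 before comparing. A minimal sketch of that trick on its own, with made-up helper names (cmpgt_epu8_sketch and gt_u8 are not library functions):

```cpp
#include <emmintrin.h>   // SSE2
#include <cstdint>

// Unsigned "a > b" per byte. XOR with 0x80 is the same as subtracting 128
// modulo 256; it maps 0..255 onto -128..127 while preserving the ordering.
static inline __m128i cmpgt_epu8_sketch(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi8(static_cast<char>(0x80));
    return _mm_cmpgt_epi8(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

// Helper for trying it on single values: returns 0xFF (true) or 0x00 (false).
static inline std::uint8_t gt_u8(std::uint8_t a, std::uint8_t b) {
    __m128i r = cmpgt_epu8_sketch(_mm_set1_epi8(static_cast<char>(a)),
                                  _mm_set1_epi8(static_cast<char>(b)));
    return static_cast<std::uint8_t>(_mm_cvtsi128_si32(r) & 0xFF);
}
```

Without the bias, a plain _mm_cmpgt_epi8 would treat 200 as -56 and report it as smaller than 100.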
Based on @PeterCordes's solution, but replacing the shift+compare with a saturating subtract and a saturating add:
// mins_compl shall be [255-minR, 255-minG, 255-minB, 0]
// maxs shall be [maxR, maxG, maxB, 0]
__m256i setAlphaFromRangeCheck(__m256i src, __m256i mins_compl, __m256i maxs)
{
    __m256i in_lo = _mm256_adds_epu8(src, mins_compl); // is 255 iff src+mins_compl>=255, i.e. src>=mins
    __m256i in_hi = _mm256_subs_epu8(src, maxs); // is 0 iff src - maxs <= 0, i.e., src <= maxs
    __m256i inbounds_components = _mm256_andnot_si256(in_hi, in_lo);
    // per-component mask, 0xff, iff (mins<=src && src<=maxs).
    // alpha-channel is always (~src & src) == 0
    // Use a 32-bit element compare to check that all 3 components are in-range
    __m256i RGB_mask = _mm256_set1_epi32(0x00FFFFFF);
    __m256i inbounds = _mm256_cmpeq_epi32(inbounds_components, RGB_mask);
    __m256i new_alpha = _mm256_slli_epi32(inbounds, 24);
    // alternatively _mm256_andnot_si256(RGB_mask, inbounds) ?
    // byte blends (vpblendvb) are at least 2 uops, and Haswell requires port5
    // instead clear alpha and then OR in the new alpha (0 or 0xFF)
    __m256i alphacleared = _mm256_and_si256(src, RGB_mask); // off the critical path
    __m256i new_alpha_applied = _mm256_or_si256(alphacleared, new_alpha);
    return new_alpha_applied;
}
This saves on vpxor (no modification of src required) and one vpand (the alpha-channel is automatically 0 -- I guess that would be possible with Peter's solution as well by choosing the boundaries accordingly).
Godbolt-Link, apparently, neither gcc nor clang think it is worthwhile to re-use RGB_mask for both usages ...
Simple testing with SSE2 variant: https://wandbox.org/permlink/eVzFHljxfTX5HDcq (you can play around with the source and the boundaries)
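For readers without the link at hand, a 128-bit port might look roughly like this. It is a sketch under assumptions, not the code behind the wandbox link: the function names are made up, and pixels are assumed to be 32-bit RGBA with R in the lowest byte.

```cpp
#include <emmintrin.h>   // SSE2
#include <cstdint>

static inline __m128i setAlphaFromRangeCheck_sse2(__m128i src, __m128i mins_compl,
                                                  __m128i maxs) {
    __m128i in_lo = _mm_adds_epu8(src, mins_compl);   // 255 iff src >= min
    __m128i in_hi = _mm_subs_epu8(src, maxs);         // 0 iff src <= max
    __m128i inbounds_components = _mm_andnot_si128(in_hi, in_lo);
    const __m128i RGB_mask = _mm_set1_epi32(0x00FFFFFF);
    __m128i inbounds = _mm_cmpeq_epi32(inbounds_components, RGB_mask);
    __m128i new_alpha = _mm_slli_epi32(inbounds, 24); // 0xFF000000 or 0
    return _mm_or_si128(_mm_and_si128(src, RGB_mask), new_alpha);
}

// Hypothetical single-pixel wrapper that also shows how the two constant
// vectors are built: [255-minR, 255-minG, 255-minB, 0] and [maxR, maxG, maxB, 0].
static inline std::uint32_t setAlphaOnePixel(std::uint32_t px,
        std::uint8_t minR, std::uint8_t minG, std::uint8_t minB,
        std::uint8_t maxR, std::uint8_t maxG, std::uint8_t maxB) {
    const std::uint32_t mins_compl = (255u - minR) | ((255u - minG) << 8) |
                                     ((255u - minB) << 16);
    const std::uint32_t maxs = std::uint32_t(maxR) | (std::uint32_t(maxG) << 8) |
                               (std::uint32_t(maxB) << 16);
    __m128i r = setAlphaFromRangeCheck_sse2(_mm_set1_epi32(static_cast<int>(px)),
                                            _mm_set1_epi32(static_cast<int>(mins_compl)),
                                            _mm_set1_epi32(static_cast<int>(maxs)));
    return static_cast<std::uint32_t>(_mm_cvtsi128_si32(r));
}
```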
I have an array called A that contains 32 unsigned char values.
I want to unpack these values into 4 __m256 variables according to this rule: with indices 0 to 31 over the values of A, the 4 unpacked variables would hold these values:
B_0 = A[0], A[4], A[8], A[12], A[16], A[20], A[24], A[28]
B_1 = A[1], A[5], A[9], A[13], A[17], A[21], A[25], A[29]
B_2 = A[2], A[6], A[10], A[14], A[18], A[22], A[26], A[30]
B_3 = A[3], A[7], A[11], A[15], A[19], A[23], A[27], A[31]
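For checking any SIMD version against this rule, a plain scalar reference can help. A minimal sketch; unpack_reference is a made-up name, and A is assumed to point at 32 readable bytes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar reference: B[k][j] = float(A[4*j + k]) for k in 0..3 and j in 0..7,
// i.e. B[0] holds A[0], A[4], ..., A[28] and so on.
inline std::array<std::array<float, 8>, 4> unpack_reference(const std::uint8_t* A) {
    std::array<std::array<float, 8>, 4> B{};
    for (std::size_t j = 0; j < 8; ++j)
        for (std::size_t k = 0; k < 4; ++k)
            B[k][j] = static_cast<float>(A[4 * j + k]);
    return B;
}
```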
To do that, I have this code:
const auto mask = _mm256_set1_epi32( 0x000000FF );
const auto A_values = _mm256_i32gather_epi32(reinterpret_cast<const int*>(A.data()), A_positions.values_, 4);
// The code below is equivalent to B_0 = static_cast<float>((A_values >> 24) & 0x000000FF)
const auto B_0 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 24), mask));
const auto B_1 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 16), mask));
const auto B_2 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 8), mask));
const auto B_3 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 0), mask));
This works great, but I wonder if there is some faster way to do that, especially regarding the right shift and the AND operator that I use to retrieve the values.
Also, just for clarification, I said that array A was of size 32, but that's not true: this array contains way more values, and I need to access its elements from different positions (but always in blocks of 4 uint8_t). That's why I use _mm256_i32gather_epi32 to retrieve these values. I just restricted the array size in this example for simplicity.
The shift/mask can be combined into a single vpshufb. Of course that means there are shuffle masks to worry about, which have to come from somewhere. If they can stay in registers it's no big deal; if they have to be loaded, that may kill this technique.
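To make the shuffle-mask idea concrete, here is a scalar model of what the control bytes would look like for one 128-bit lane. It does not use the real instruction, and the names Bytes16, make_extract_control and pshufb_model are made up for illustration. With intrinsics, the same control (repeated in both lanes, since vpshufb shuffles within each lane) would be passed to _mm256_shuffle_epi8 and the result fed to _mm256_cvtepi32_ps.

```cpp
#include <array>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Control for one 128-bit lane: move byte k of every 32-bit element into the
// low byte of that element and zero the other three bytes.
// A control byte with bit 7 set makes pshufb write a zero.
inline Bytes16 make_extract_control(unsigned k) {
    Bytes16 c{};
    for (unsigned e = 0; e < 4; ++e) {
        c[4 * e + 0] = static_cast<std::uint8_t>(4 * e + k);
        c[4 * e + 1] = 0x80;
        c[4 * e + 2] = 0x80;
        c[4 * e + 3] = 0x80;
    }
    return c;
}

// Scalar emulation of pshufb on one 16-byte lane.
inline Bytes16 pshufb_model(const Bytes16& src, const Bytes16& control) {
    Bytes16 out{};
    for (unsigned i = 0; i < 16; ++i)
        out[i] = (control[i] & 0x80) ? 0 : src[control[i] & 0x0F];
    return out;
}
```

One shuffle per output vector replaces one shift plus one AND, at the cost of keeping four control vectors live.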
This seems dubious as an optimization on Intel since the shift has a recip.throughput of 0.5 and the AND 0.33, which is better than the 1 that you'd get with a shuffle (Intel processors with two shuffle units did not support AVX2 so they are not relevant, so the shuffle goes to P5). It's still fewer µops, so in the context of other code it may or may not be worth doing, depending on what the bottleneck is. If the rest of the code just uses P01 (typical for FP SIMD), moving µops to P5 is probably a good idea.
On Ryzen it is generally better since vector shifts have a low throughput there. A 256b vpsrad generates 2 µops that both have to go to port 2 (and then there are two more µops for the vpand, but they can go to any of the four ALU ports), while a 256b vpshufb generates 2 µops that can go to ports 1 and 2. On the other hand, gather is so bad on Ryzen that this is all just noise compared to the huge flood of µops from that. You could gather manually but then it's still a lot of µops, and they'll likely go to P12, which makes this technique bad.
In conclusion I can't tell you whether this is actually faster or not, it depends.
What is the best way to check whether an AVX intrinsic __m256 (vector of 8 floats) contains any inf? I tried
__m256 X=_mm256_set1_ps(1.0f/0.0f);
but this compares to true. Note that this method will find nan (which compares to false). So one way is to check for X!=nan && 0*X==nan:
__m256 Y=_mm256_mul_ps(X,_mm256_setzero_ps()); // 0*X=nan if X=inf
However, this appears somewhat lengthy. Is there a faster way?
If you want to check if a vector has any infinities:
#include <immintrin.h>
#include <limits>
bool has_infinity(__m256 x){
    const __m256 SIGN_MASK = _mm256_set1_ps(-0.0f);
    const __m256 INF = _mm256_set1_ps(std::numeric_limits<float>::infinity());
    x = _mm256_andnot_ps(SIGN_MASK, x);      // clear the sign bit: |x|
    x = _mm256_cmp_ps(x, INF, _CMP_EQ_OQ);
    return _mm256_movemask_ps(x) != 0;
}
If you want a vector mask of the values that are infinity:
#include <immintrin.h>
#include <limits>
__m256 is_infinity(__m256 x){
    const __m256 SIGN_MASK = _mm256_set1_ps(-0.0f);
    const __m256 INF = _mm256_set1_ps(std::numeric_limits<float>::infinity());
    x = _mm256_andnot_ps(SIGN_MASK, x);      // clear the sign bit: |x|
    x = _mm256_cmp_ps(x, INF, _CMP_EQ_OQ);
    return x;
}
I think a better solution is to use vptest rather than vmovmskps.
bool has_infinity(const __m256 &x) {
    __m256 s = _mm256_andnot_ps(_mm256_set1_ps(-0.0f), x);
    __m256 cmp = _mm256_cmp_ps(s, _mm256_set1_ps(1.0f/0.0f), _CMP_EQ_OQ); // predicate 0
    __m256i cmpi = _mm256_castps_si256(cmp);
    return !_mm256_testz_si256(cmpi, cmpi);
}
The intrinsic _mm256_castps_si256 is only to make the compiler happy "This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency."
vptest is superior to vmovmskps because it sets the zero flag while vmovmskps does not. With vmovmskps the compiler has to generate test to set the zero flag.
This short one tests whether any of floats in a vector are corrupted (NAN or INFINITY) or not:
int is_corrupted( const __m256 & float_v8 ) {
    __m256 self_sub_v8 = _mm256_sub_ps( float_v8, float_v8 );
    return _mm256_movemask_epi8( _mm256_castps_si256( self_sub_v8 ) );
}
It's 2 AVX2 instructions only without additional constants, and uses a trick -- any normal "self"-subtraction should end as a zero, so a movemask_epi8 later should extract some of its bits indicating whether it's a zero or a NAN/INFINITY. I haven't tested it on different platforms.
Edit: see Peter's important comments on rounding toward negative.
If you don't mind also detecting NaNs, i.e. to check for numbers that aren't finite, see @gox's answer suggesting subtraction from itself (producing +0.0 in the default rounding mode for finite inputs, else NaN) and then using _mm256_movemask_epi8 to take one bit from each byte, including one from the exponent which will be non-zero for NaNs, or zero for 0.0. Testing movemask & 0x77777777 would let you ignore the sign bit so it works even with FP rounding mode = roundTowardNegative where x-x gives -0.0.
If you need to detect infinity specifically, not also NaN
AVX-512DQ+VL has _mm256_fpclass_ps_mask + _kortestz_mask16_u8. But without AVX-512, it might be most efficient to use AVX2 integer stuff on the bit-pattern.
The IEEE binary32 bit-pattern for infinity is an all-ones exponent field and an all-zero mantissa. And the sign bit indicates whether it's + or - infinity. (NaN is the same exponent but a non-zero mantissa) So there are 2 bit-patterns we want to detect, which differ only in the high bit.
We can do this using AVX2 integer shift + cmpeq operations with only one vector constant, with lower latency than vcmpps even accounting for the bypass latency if the input came from an FP math instruction. And potentially a throughput benefit, as vpslld and/or vpcmpeqd can run on different ports than FP math/compare instructions on some CPUs. (Using a bitwise AND, ANDN, or OR to force the sign bit to a known state, clear or set, could further help with bypass latency on some CPUs, and be even better for throughput, able to execute on a wider choice of back-end execution units on more CPUs.)
(https://uops.info/ / https://agner.org/optimize/)
You could do this with integer operations, like left-shift by 1 to remove the sign bit, then _mm256_cmpeq_epi32 against set1_epi32(0xff000000) (the bit pattern for infinity, left-shifted by 1. All bits set in the exponent, all bits clear in the mantissa, otherwise it's a NaN). Then you'd only need one constant, and the lower latency of integer compare should make up for the possible bypass latency.
int has_infinity_avx2(__m256 v)
{
    __m256i bits = _mm256_castps_si256(v);
    bits = _mm256_slli_epi32(bits, 1); // shift out sign bits. Requires AVX2
    bits = _mm256_cmpeq_epi32(bits, _mm256_set1_epi32(0xff000000)); // infinity << 1
    // or cast for _mm256_movemask_ps if you want to std::countr_zero to find out where in terms of elements instead of byte offsets
    return _mm256_movemask_epi8(bits);
}
I had an earlier idea, but it ends up only helping if you want to test for ALL elements being infinite. Oops.
With AVX2, you can test for all elements being infinity with PTEST. I got this idea for using xor to compare for equality from EOF's comment on this question, which I used for my answer there. I thought I was going to be able to make a shorter version of a test-for-any-inf, but of course pxor only works as a test for all 256b being equal.
#include <limits>
bool all_infinity(__m256 x){
    const __m256i SIGN_MASK = _mm256_set1_epi32(0x7FFFFFFF); // -0.0f inverted
    const __m256 INF = _mm256_set1_ps(std::numeric_limits<float>::infinity());
    // XOR as floats, then reinterpret: __m256 and __m256i are distinct types
    __m256i diff = _mm256_castps_si256(_mm256_xor_ps(x, INF)); // other than sign bit, diff will be all-zero only if all the bits match.
    return _mm256_testz_si256(diff, SIGN_MASK); // flags are ready to branch on directly
}
With AVX512, there's a __mmask8 _mm512_fpclass_pd_mask (__m512d a, int imm8). (vfpclasspd). (See Intel's guide). Its output is a mask register, which you can branch on directly. You can test for any/all of +/- zero, +/- inf, Q/S NaN, Denormal, Negative.