SIMD: Accumulate Adjacent Pairs

SIMD: Accumulate Adjacent Pairs - c++

I'm learning how to use SIMD intrinsics and autovectorization. Luckily, I have a useful project I'm working on that seems extremely amenable to SIMD, but is still tricky for a newbie like me.
I'm writing a filter for images that computes the average of 2x2 pixels. I'm doing part of the computation by accumulating the sum of two pixels into a single pixel.
template <typename T, typename U>
inline void accumulate_2x2_x_pass(
T* channel, U* accum,
const size_t sx, const size_t sy,
const size_t osx, const size_t osy,
const size_t yoff, const size_t oyoff
) {
const bool odd_x = (sx & 0x01);
size_t i_idx, o_idx;
// Should be vectorizable somehow...
for (size_t x = 0, ox = 0; x < sx - (size_t)odd_x; x += 2, ox++) {
i_idx = x + yoff;
o_idx = ox + oyoff;
accum[o_idx] += channel[i_idx];
accum[o_idx] += channel[i_idx + 1];
}
if (odd_x) {
// << 1 bc we need to multiply by two on the edge
// to avoid darkening during render
accum[(osx - 1) + oyoff] += (U)(channel[(sx - 1) + yoff]) * 2;
}
}
However, godbolt shows that my loop is not autovectorizable. (https://godbolt.org/z/qZxvof) How would I construct SIMD intrinsics to solve this issue? I have control of the alignment for accum, but not channel.
(I know there's an average intrinsic, but it's not appropriate here because I need to generate multiple mip levels and that command would cause loss of precision for the next level.)
Thanks everyone. :)

The widening case with the narrow type T = uint8_t or uint16_t is probably best implemented with SSSE3 pmaddubsw or SSE2 pmaddwd with a multiplier of 1. (Intrinsics guide) Those instructions single-uop and do exactly the horizontal widening add you need more efficiently than shuffling.
If you can do so without losing precision, do the vertical add between rows first, before widening horizontal add. (e.g. 10, 12, or 14-bit pixel components in [u]int16_t can't overflow). Load and vertical-add have (at least) 2 per clock throughput on most CPUs, vs. 1 per clock for pmadd* only having 2-per-clock throughput on Skylake and later. And it means you only need 1x add + 1x pmadd vs. 2x pmadd + 1x add so it's a significant win even on Skylake. (For the 2nd way, both loads can fold into memory operands for pmadd, if you have AVX. For the add before pmadd way, you'll need a pure load first and then fold the 2nd load into add, so you might not save front-end uops, unless you use indexed addressing modes and they un-laminate.)
And ideally you don't need to += into an accumulator array, and instead can just read 2 rows in parallel and accumulator is write-only, so your loop has only 2 input streams and 1 output stream.
// SSSE3
__m128i hadd_widen8_to_16(__m128i a) {
// uint8_t, int8_t (doesn't matter when multiplier is +1)
return _mm_maddubs_epi16(a, _mm_set_epi8(1));
}
// SSE2
__m128i hadd_widen16_to_32(__m128i a) {
// int16_t, int16_t
return _mm_madd_epi16(a, _mm_set_epi16(1));
}
These port to 256-bit AVX2 directly, because the input and output width is the same. No shuffle needed to fix up in-lane packing.
Yes really, they're both _epi16. Intel can be wildly inconsistent with intrinsic names. asm mnemonics are more consistent and easier to remember what's what. (ubsw = unsigned byte to signed word, except that one of the inputs is signed bytes. pmaddwd is packed multiply add word to dword, same naming scheme as punpcklwd etc.)
The T=U case with uint16_t or uint32_t is a a use-case for SSSE3 _mm_hadd_epi16 or _mm_hadd_epi32. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1.
If you want to work around a shuffle-port bottleneck on Haswell and later, you could consider using qword shifts on the inputs and then shuffling together the result with shufps (_mm_shuffle_ps + some casting). This could possibly be a win on Skylake (with 2 per clock shift throughput), even though it costs more 5 total uops instead of 3. It can run at best 5/3 cycles per vector of output instead of 2 cycles per vector if there's no front-end bottleneck
// UNTESTED
//Only any good with AVX, otherwise the extra movdqa instructions kill this
//Only worth considering for Skylake, not Haswell (1/c shifts) or Sandybridge (2/c shuffle)
__m128i hadd32_emulated(__m128i a, __m128i b) {
__m128i a_shift = _mm_srli_epi64(a, 32);
__m128i b_shift = _mm_srli_epi64(b, 32);
a = _mm_add_epi32(a, a_shift);
b = _mm_add_epi32(b, b_shift);
__m128 combined = _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b), _MM_SHUFFLE(2,0,2,0));
return _mm_castps_si128(combined);
}
For an AVX2 version you'd need a lane-crossing shuffle to fixup a vphadd result. So emulating hadd with shifts might be a bigger win.
// 3x shuffle 1x add uops
__m256i hadd32_avx2(__m256i a, __m256i b) {
__m256i hadd = _mm256_hadd_epi32(a, b); // 2x in-lane hadd
return _mm256_permutex_epi64( hadd, _MM_SHUFFLE(3,1,2,0) );
}
// UNTESTED
// 2x shift, 2x add, 1x blend-immediate (any ALU port), 1x shuffle
__m256i hadd32_emulated_avx2(__m256i a, __m256i b)
{
__m256i a_shift = _mm256_srli_epi64(a, 32); // useful result in the low half of each qword
__m256i b_shift = _mm256_slli_epi64(b, 32); // ... high half of each qword
a = _mm256_add_epi32(a, a_shift);
b = _mm256_add_epi32(b, b_shift);
__m256i blended = _mm256_blend_epi32(a,b, 0b10101010); // alternating low/high results
return _mm256_permutexvar_epi32(_mm256_set_epi32(7,5,3,1, 6,4,2,0), blended);
}
On Haswell and Skylake, hadd32_emulated_avx2 can run at 1 per 2 clocks (saturating all vector ALU ports). The extra add_epi32 to sum into accum[] will slow it down to at best 7/3 cycles per 256-bit vector of results, and you'll need to unroll (or use a compiler that unrolls) to not just bottleneck on the front-end.
hadd32_avx2 can run at 1 per 3 clocks (bottlenecked on port 5 for shuffles). The load + store + extra add_epi32 uops to implement your loop can run in the shadow of that easily.
(https://agner.org/optimize/, and see https://stackoverflow.com/tags/x86/info)

Related

Set the leading zero bits in any size integer C++

I want to set the leading zero bits in any size integer to 1 in standard C++.
eg.
0001 0011 0101 1111 -> 1111 0011 0101 1111
All the algorithms I've found to do this require a rather expensive leading zero count. However, it's odd. There are very fast and easy ways to do other types of bit manipulation such as:
int y = -x & x; //Extracts lowest set bit, 1110 0101 -> 0000 0001
int y = (x + 1) & x; //Will clear the trailing ones, 1110 0101 - > 1110 0100
int y = (x - 1) | x; //Will set the trailing zeros, 0110 0100 - > 0110 0111
So that makes me think there must be a way to set the leading zeros of an integer in one simple line of code consisting of basic bit wise operators. Please tell me there's hope because right now I'm settling for reversing the order of the bits in my integer and then using the fast way of setting trailing zeros and then reversing the integer again to get my leading zeros set to ones. Which is actually significantly faster than using a leading zero count, however still quite slow compared with the other algorithms above.
template<typename T>
inline constexpr void reverse(T& x)
{
T rev = 0;
size_t s = sizeof(T) * CHAR_BIT;
while(s > 0)
{
rev = (rev << 1) | (x & 0x01);
x >>= 1;
s -= 1uz;
}//End while
x = rev;
}
template<typename T>
inline constexpr void set_leading_zeros(T& x)
{
reverse(x);
x = (x - 1) | x;//Set trailing 0s to 1s
reverse(x);
}
Edit
Because some asked: I'm working with MS-DOS running on CPUs ranging from early X86 to a 486DX installed in older CNC machines.
Fun times. :D

The leading zeroes can be set without counting them, while also avoiding reversing the integer. For convenience I won't do it for a generic integer type T, but likely it can be adapted, or you could use template specialization.
First calculate the mask of all the bits that aren't the leading zeroes, by "spreading" the bits downwards:
uint64_t m = x | (x >> 1);
m |= m >> 2;
m |= m >> 4;
m |= m >> 8;
m |= m >> 16;
m |= m >> 32;
Then set all the bits that that mask doesn't cover:
return x | ~m;
Bonus: this automatically works even when x = 0 and when x has all bits set, one of which in a count-leading-zero approach could lead to an overly large shift amount (which one depends on the details, but almost always one of them is troublesome, since there are 65 distinct cases but only 64 valid shift amounts, if we're talking about uint64_t).

You could count leading zeroes using std::countl_zero and create a bitmask that your bitwise OR with the original value:
#include <bit>
#include <climits>
#include <type_traits>
template<class T>
requires std::is_unsigned_v<T>
T leading_ones(T v) {
auto lz = std::countl_zero(v);
return lz ? v | ~T{} << (CHAR_BIT * sizeof v - lz) : v;
}
If you have a std::uint16_t, like
0b0001001101011111
then ~T{} is 0b1111111111111111, CHAR_BIT * sizeof v is 16 and countl_zero(v) is 3. Left shift 0b1111111111111111 16-3 steps:
0b1110000000000000
Bitwise OR with the original:
0b0001001101011111
| 0b1110000000000000
--------------------
= 0b1111001101011111

Your reverse is extremely slow! With an N-bit int you need N iterations to reverse, each at least 6 instructions, then at least 2 instructions to set the trailing bits, and finally N iterations to reverse the value again. OTOH even the simplest leading zero count needs only N iterations, then set the leading bits directly:
template<typename T>
inline constexpr T trivial_ilog2(T x) // Slow, don't use this
{
if (x == 0) return 0;
size_t c{};
while(x)
{
x >>= 1;
c += 1u;
}
return c;
}
template<typename T>
inline constexpr T set_leading_zeros(T x)
{
if (std::make_unsigned_t(x) >> (sizeof(T) * CHAR_BIT - 1)) // top bit is set
return x;
return x | (-T(1) << trivial_ilog2(x));
}
x = set_leading_zeros(x);
There are many other ways to count leading zero/get integer logarithm much faster. One of the methods involves doing in steps of powers of 2 like how to create the mask in harold's answer:
What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?
Find the highest order bit in C
How to efficiently count the highest power of 2 that is less than or equal to a given number?
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogLookup
But since you're targeting a specific target instead of doing something cross-platform and want to squeeze every bit of performance, there are almost no reasons to use pure standard features since these usecases typically need platform-specific code. If intrinsics are available you should use it, for example in modern C++ there's std::countl_zero but each compiler already has intrinsics to do that which will map to the best instruction sequence for that platform, for example _BitScanReverse or __builtin_clz
If intrinsics aren't available of if the performance is still not enough then try a lookup table. For example here's a solution with 256-element log table
static const char LogTable256[256] =
{
#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
-1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
LT(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
};
uint16_t lut_ilog2_16(uint16_t x)
{
uint8_t h = x >> 8;
if (h) return LogTable256[h] + 8;
else return LogTable256[x & 0xFF];
}
In set_leading_zeros just call lut_ilog2_16 like above
The even better solution than a log table is a mask table so that you can get the mask directly instead of calculating 1 << LogTable256[x]
static const char MaskTable256[256] =
{
0xFF, 0xFE, 0xFC...
}
Some other notes:
1uz isn't a valid suffix in C++ prior to C++23
Don't use references for small types that fit in a single integer. That's not necessary and is usually slower when not inlined. Just assign the result back from the function

(Work in progress, power just went out here; posting now to save my work.)
Crusty old x86 CPUs have very slow C++20 std::countl_zero / GNU C __builtin_clz (386 bsr = Bit Scan Reverse actually finds the position of the highest set bit, like 31-clz, and is weird for an input of 0 so you need to branch on that.) For CPUs before Pentium Pro / Pentium II, Harold's answer is your best bet, generating a mask directly instead of a count.
(Before 386, shifting by large counts might be better done with partial register shenanigans like mov al, ah / mov ah, 0 instead of shr ax, 8, since 286 and earlier didn't have a barrel shifter for constant-time shifts. But in C++, that's something for the compiler to figure out. Shift by 16 is free since a 32-bit integer can only be kept in a pair of 16-bit registers on 286 or earlier.)
8086 to 286 - no instruction available.
386: bsf/bsr: 10+3n cycles. Worst-case: 10+3*31 = 103c
486: bsf (16 or 32-bit registers): 6-42 cycles; bsr 7-104 cycles (1 cycle less for 16-bit regs).
P5 Pentium: bsf: 6-42 cycles (6-34 for 16-bit); bsr 7-71 cycles. (or 7-39 for 16-bit). Non-pairable.
Intel P6 and later: bsr/bsr: 1 uop with 1 cycle throughput, 3 cycle latency. (PPro / PII and later).
AMD K7/K8/K10/Bulldozer/Zen: bsf/bsr are slowish for a modern CPU. e.g. K10 3 cycle throughput, 4 cycle latency, 6 / 7 m-ops respectively.
Intel Haswell / AMD K10 : lzcnt introduced (as part of BMI1 for Intel, or with its own feature bit for AMD, before tzcnt and the rest of BMI1).
For an input of 0, they return the operand-size, so they fully implement C++20 std::countl_zero / countr_zero respectively, unlike bsr/bsf. (Which leave the destination unmodified on input=0. AMD documents this, Intel implements it in practice on current CPUs at least, but documents the destination register as "undefined" contents. Perhaps some older Intel CPUs are different, otherwise it's just annoying that they don't document the behaviour so software can take advantage.)
On AMD, they're fast, single uop for lzcnt, with tzcnt taking one more (probably a bit-reverse to feed the lzcnt execution unit), so a nice win vs. bsf/bsr. This is why compilers typically use rep bsf when for countr_zero / __builtin_ctz, so it will run as tzcnt on CPUs that support it, but as bsf on older CPUs. They produce the same results for non-zero inputs, unlike bsr/lzcnt.
On Intel, same fast performance as bsf/bsr, even including the output dependency until Skylake fixed that; it's a true dependency for bsf/bsr, but false dependency for tzcnt/lzcnt and popcnt.
Fast algorithm with a bit-scan building block
But on P6 (Pentium Pro) and later, a bit-scan for the highest set bit is likely to be a useful building block for an even faster strategy than log2(width) shift/or operations, especially for uint64_t on a 64-bit machine. (Or maybe even moreso for uint64_t on a 32-bit machine, where each shift would require shifting bits across the gap.)
Cycle counts from https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html which has instructions timings for 8088 through Pentium. (But not counting the instruction-fetch bottleneck which usually dominates 8086 and especially 8088 performance.)
bsr (index of highest set bit) is fast on modern x86: 1 cycle throughput on P6 and later, not bad on AMD. On even more recent x86, BMI1 lzcnt is 1 cycle on AMD as well, and avoids an output dependency (on Skylake and newer). Also it works for an input of 0 (producing the type width aka operand size), unlike bsr which leaves the destination register unmodified.
I think the best version of this (if BMI2 is available) is one inspired by Ted Lyngmo's answer, but changed to shift left / right instead of generating a mask. ISO C++ doesn't guarantee that >> is an arithmetic right shift on signed integer types, but all sane compilers choose that as their implementation-defined behaviour. (For example, GNU C documents it.)
https://godbolt.org/z/hKohn8W8a has that idea, which indeed is great if we don't need to handle x==0.
Also an idea with BMI2 bzhi, if we're considering what's efficient with BMI2 available. Like x | ~ _bzhi_u32(-1, 32-lz); Unfortunately requires two inversions, the 32-lzcnt and the ~. We have BMI1 andn, but not an equivalent orn. And we can't just use neg because bzhi doesn't mask the count; that's the whole point, it has unique behaviour for 33 different inputs. Will probably post these as an answer tomorrow.
int set_leading_zeros(int x){
int lz = __builtin_clz(x|1); // clamp the lzcount to 31 at most
int tmp = (x<<lz); // shift out leading zeros, leaving a 1 (or 0 if x==0)
tmp |= 1ULL<<(CHAR_BIT * sizeof(tmp) - 1); // set the MSB in case x==0
return tmp>>lz; // sign-extend with an arithmetic right shift.
}
#include <immintrin.h>
uint32_t set_leading_zeros_bmi2(uint32_t x){
int32_t lz = _lzcnt_u32(x); // returns 0 to 32
uint32_t mask = _bzhi_u32(-1, lz); // handles all 33 possible values, producing 0 for lz=32
return x | ~mask;
}
On x86-64 you can
Combined with BMI2 shlx / sarx for single-uop variable-count shifts even on Intel CPUs.
With efficient shifts (BMI2, or non-Intel such as AMD), it's maybe better to do (x << lz) >> lz to sign-extend. Except if lz is the type width; if you need to handle that, generating a mask is probably more efficient.
Unfortunately shl/sar reg, cl costs 3 uops on Sandybridge-family (because of x86 legacy baggage where shifts don't set FLAGS if the count happens to be zero), so you need BMI2 shlx / sarx for it to be better than bsr ecx, dsr / mov tmp, -1 / not ecx / shl tmp, cl / or dst,reg

Efficiently accumulate sign bits in arm neon

I have a loop that does some computations and then stores sign bits into a vector:
uint16x8_t rotate(const uint16_t* x);
void compute(const uint16_t* src, uint16_t* dst)
{
uint16x8_t sign0 = vmovq_n_u16(0);
uint16x8_t sign1 = vmovq_n_u16(0);
for (int i=0; i<16; ++i)
{
uint16x8_t r0 = rotate(src++);
uint16x8_t r1 = rotate(src++);
// pseudo code:
sign0 |= (r0 >> 15) << i;
sign1 |= (r1 >> 15) << i;
}
vst1q_u16(dst+1, sign0);
vst1q_u16(dst+8, sign1);
}
What's the best way to accumulate sign bits in neon that follows that pseudo code?
Here's what I came up with:
r0 = vshrq_n_u16(r0, 15);
r1 = vshrq_n_u16(r1, 15);
sign0 = vsraq_n_u16(vshlq_n_u16(r0, 15), sign0, 1);
sign1 = vsraq_n_u16(vshlq_n_u16(r1, 15), sign1, 1);
Also, note that the "pseudo code" actually works and generates pretty much the same code perf wise. What can be improved here? Note, in actual code there is no function calls in the loop, I trimmed down actual code to make it simple to understand.
Another point: in neon you cannot use a variable for vector shift (e.g. i cannot use used to specify number of shifts).

ARM can do this in one vsri instruction (thanks #Jake'Alquimista'LEE).
Given a new vector where that you want sign bits from, replace the low 15 bits of each element with the accumulator right-shifted by 1.
You should unroll by 2 so the compiler doesn't need a mov instruction to copy the result back into the same register, because vsri is a 2-operand instruction, and the way we need to use it here gives us the result in a different register than the old sign0 accumulator.
sign0 = vsriq_n_u16(r0, sign0, 1);
// insert already-accumulated bits below the new bit we want
After 15 inserts, (or 16 if you start with sign0 = 0 instead of peeling the first iteration and using sign0=r0), all 16 bits (per element) of sign0 will be sign bits from r0 values.
Previous suggestion: AND with a vector constant to isolate the sign bit. It's more efficient than two shifts.
Your idea of accumulating with VSRA to shift the accumulator and add in the new bit is good, so we can keep that and get down to 2 instructions total.
tmp = r0 & 0x8000; // VAND
sign0 = (sign0 >> 1) + tmp; // VSRA
or using neon intrinsics:
uint16x8_t mask80 = vmovq_n_u16(0x8000);
r0 = vandq_u16(r0, mask80); // VAND
sign0 = vsraq_n_u16(r0, sign0, 1); // VSRA
Implement with intrinsics or asm however you like, and write the scalar version the same way to give the compiler a better chance to auto-vectorize.
This does need a vector constant in a register. If you're very tight on registers, then 2 shifts could be better, but 3 shifts total seems likely to bottleneck on shifter throughput unless ARM chips typically spend a lot of real-estate on SIMD barrel shifters.
In that case, maybe use this generic SIMD idea without ARM shift+accumulate or shift+insert
tmp = r0 >> 15; // logical right shift
sign0 += sign0; // add instead of left shifting
sign0 |= tmp; // or add or xor or whatever.
This gives you the bits in the opposite order. If you can produce them in the opposite order, then great.
Otherwise, does ARM have SIMD bit-reverse or only for scalar? (Generate in reverse order and flip them at the end, with some extra work for every vector bitmap, hopefully only one instruction.)
Update: yes, AArch64 has rbit, so you could reverse bits within a byte, then byte-shuffle to put them in the right order. x86 could use a pshufb LUT to bit-reverse within bytes in two 4-bit chunks. This might not come out ahead of doing more work as you accumulate the bits on x86, though.

How to vectorize range check during block copy?

I have the function below:
void CopyImageBitsWithAlphaRGBA(unsigned char *dest, const unsigned char *src, int w, int stride, int h,
unsigned char minredmask, unsigned char mingreenmask, unsigned char minbluemask, unsigned char maxredmask, unsigned char maxgreenmask, unsigned char maxbluemask)
{
auto pend = src + w * h * 4;
for (auto p = src; p < pend; p += 4, dest += 4)
{
dest[0] = p[0]; dest[1] = p[1]; dest[2] = p[2];
if ((p[0] >= minredmask && p[0] <= maxredmask) || (p[1] >= mingreenmask && p[1] <= maxgreenmask) || (p[2] >= minbluemask && p[2] <= maxbluemask))
dest[3] = 255;
else
dest[3] = 0;
}
}
What it does is it copies a 32 bit bitmap from one memory block to another, setting the alpha channel to fully transparent when the pixel color falls within a certain color range.
How do I make this use SSE/AVX in VC++ 2017? Right now it's not generating vectorized code. Failing an automatic way of doing it, what functions can I use to do this myself?
Because really, I'd imagine testing if bytes are in a range would be one of the most obviously useful operations possible, but I can't see any built in function to take care of it.

I don't think you're going to get a compiler to auto-vectorize as well as you can do by hand with Intel's intrinsics. (err, as well as I can do by hand anyway :P).
Possibly once we manually vectorize it, we can see how to hand-hold a compiler with scalar code that works that way, but we really need packed-compare into a 0/0xFF with byte elements, and it's hard to write something in C that compilers will auto-vectorize well. The default integer promotions mean that most C expressions actually produce 32-bit results, even when you use uint8_t, and that often tricks compilers into unpacking 8-bit to 32-bit elements, costing a lot of shuffles on top of the automatic factor of 4 throughput loss (fewer elements per register), like in #harold's small tweak to your source.
SSE/AVX (before AVX512) has signed comparisons for SIMD integer, not unsigned. But you can range-shift things to signed -128..127 by subtracting 128. XOR (add-without-carry) is slightly more efficient on some CPUs, so you actually just XOR with 0x80 to flip the high bit. But mathematically you're subtracting 128 from a 0..255 unsigned value, giving a -128..127 signed value.
It's even still possible to implement the "unsigned compare trick" of (x-min) < (max-min). (For example, detecting alphabetic ASCII characters). As a bonus, we can bake the range-shift into that subtract. If x<min, it wraps around and becomes a large value greater than max-min. This obviously works for unsigned, but it does in fact work (with a range-shifted max-min) with SSE/AVX2 signed-compare instructions. (A previous version of this answer claimed this trick only worked if max-min < 128, but that's not the case. x-min can't wrap all the way around and become lower than max-min, or get into that range if it started above max).
An earlier version of this answer had code that made the range exclusive, i.e. not including the ends, so you even redmin=0 / redmax=255 would exclude pixels with red=0 or red=255. But I solved that by comparing the other way (thanks to ideas from #Nejc's and #chtz's answers).
#chtz's idea of using a saturating add/sub instead of a compare is very cool. If you arrange things so saturation means in-range, it works for an inclusive range. (And you can set the Alpha component to a known value by choosing a min/max that makes all 256 possible inputs in-range). This lets us avoid range-shifting to signed, because unsigned-saturation is available
We can combine the sub/cmp range-check with the saturation trick to do sub (wraps on out-of-bounds low) / subs (only reaches zero if the first sub didn't wrap). Then we don't need an andnot or or to combine two separate checks on each component; we already have a 0 / non-zero result in one vector.
So it only takes two operations to give us a 32-bit value for the whole pixel that we can check. Iff all 3 RGB components are in-range, that element will have a specific value. (Because we've arranged for the Alpha component to already give a known value, too). If any of the 3 components are out-of-range, it will have some other value.
If you do this the other way, so saturation means out-of-range, then you have an exclusive range in that direction, because you can't choose a limit such that no value reaches 0 or reaches 255. You can always saturate the alpha component to give yourself a known value there, regardless of what it means for the RGB components. An exclusive range would let you abuse this function to be always-false by choosing a range that no pixel could ever match. (Or if there's a third condition, besides per-component min/max, then maybe you want an override).
The obvious thing would be to use a packed-compare instruction with 32-bit element size (_mm256_cmpeq_epi32 / vpcmpeqd) to generate a 0xFF or 0x00 (which we can apply / blend into the original RGB pixel value) for in/out of range.
// AVX2 core idea: wrapping-compare trick with saturation to achieve unsigned compare
__m256i tmp = _mm256_sub_epi8(src, min_values); // wraps to high unsigned if below min
__m256i RGB_inrange = _mm256_subs_epu8(tmp, max_minus_min); // unsigned saturation to 0 means in-range
__m256i new_alpha = _mm256_cmpeq_epi32(RGB_inrange, _mm256_setzero_si256());
// then blend the high byte of each element with RGB from the src vector
__m256i alpha_replaced = _mm256_blendv_epi8(new_alpha, src, _mm256_set1_epi32(0x00FFFFFF)); // alpha from new_alpha, RGB from src
Note that an SSE2 version would only need one MOVDQA instructions to copy src; the same register is the destination for every instruction.
Also note that you could saturate the other direction: add then adds (with (256-max) and (256-(min-max)), I think) to saturate to 0xFF for in-range. This could be useful with AVX512BW if you use zero-masking with a fixed mask (e.g. for alpha) or variable mask (for some other condition) to exclude a component based on some other condition. AVX512BW zero-masking for the sub/subs version would consider components in-range even when they aren't, which could also be useful.
But extending that to AVX512 requires a different approach: AVX512 compares produce a bit-mask (in a mask register), not a vector, so we can't turn around and use the high byte of each 32-bit compare result separately.
Instead of cmpeq_epi32, we can produce the value we want in the high byte of each pixel using carry/borrow from a subtract, which propagates left to right.
0x00000000 - 1 = 0xFFFFFFFF # high byte = 0xFF = new alpha
0x00?????? - 1 = 0x00?????? # high byte = 0x00 = new alpha
Where ?????? has at least one non-zero bit, so it's a 32-bit number >=0 and <=0x00FFFFFFFF
Remember we choose an alpha range that makes the high byte always zero
i.e. _mm256_sub_epi32(RGB_inrange, _mm_set1_epi32(1)). We only need the high byte of each 32-bit element to have the alpha value we want, because we use a byte-blend to merge it with the source RGB values. For AVX512, this avoids a VPMOVM2D zmm1, k1 instruction to convert a compare result back into a vector of 0/-1, or (much more expensive) to interleave each mask bit with 3 zeros to use it for a byte-blend.
This sub instead of cmp has a minor advantage even for AVX2: sub_epi32 runs on more ports on Skylake (p0/p1/p5 vs. p0/p1 for pcmpgt/pcmpeq). On all other CPUs, vector integer add/sub run on the same ports as vector integer compare. (Agner Fog's instruction tables).
Also, if you compile _mm256_cmpeq_epi32() with -march=native on a CPU with AVX512, or otherwise enable AVX512 and then compile normal AVX2 intrinsics, some compilers will stupidly use AVX512 compare-into-mask and then expand back to a vector instead of just using the VEX-coded vpcmpeqd. Thus, we use sub instead of cmp even for the _mm256 intrinsics version, because I already spent the time to figure it out and show that it's at least as efficient in the normal case of compiling for regular AVX2. (Although _mm256_setzero_si256() is cheaper than set1(1); vpxor can zero a register cheaply instead of loading a constant, but this setup happens outside the loop.)
#include <immintrin.h>
#ifdef __AVX2__
// inclusive min and max
__m256i setAlphaFromRangeCheck_AVX2(__m256i src, __m256i mins, __m256i max_minus_min)
{
__m256i tmp = _mm256_sub_epi8(src, mins); // out-of-range wraps to a high signed value
// (x-min) <= (max-min) equivalent to:
// (x-min) - (max-min) saturates to zero
__m256i RGB_inrange = _mm256_subs_epu8(tmp, max_minus_min);
// 0x00000000 for in-range pixels, 0x00?????? (some higher value) otherwise
// this has minor advantages over compare against zero, see full comments on Godbolt
__m256i new_alpha = _mm256_sub_epi32(RGB_inrange, _mm256_set1_epi32(1));
// 0x00000000 - 1 = 0xFFFFFFFF
// 0x00?????? - 1 = 0x00?????? high byte = new alpha value
const __m256i RGB_mask = _mm256_set1_epi32(0x00FFFFFF); // blend mask
// without AVX512, the only byte-granularity blend is a 2-uop variable-blend with a control register
// On Ryzen, it's only 1c latency, so probably 1 uop that can only run on one port. (1c throughput).
// For 256-bit, that's 2 uops of course.
__m256i alpha_replaced = _mm256_blendv_epi8(new_alpha, src, RGB_mask); // RGB from src, 0/FF from new_alpha
return alpha_replaced;
}
#endif // __AVX2__
Set up vector args for this function and loop over your array with _mm256_load_si256 / _mm256_store_si256. (Or loadu/storeu if you can't guarantee alignment.)
This compiles very efficiently (Godbolt Compiler explorer) with gcc, clang, and MSVC. (AVX2 version on Godbolt is good, AVX512 and SSE versions are still a mess, not all the tricks applied to them yet.)
;; MSVC's inner loop from a caller that loops over an array with it:
;; see the Godbolt link
$LL4#:
vmovdqu ymm3, YMMWORD PTR [rdx+rax*4]
vpsubb ymm0, ymm3, ymm7
vpsubusb ymm1, ymm0, ymm6
vpsubd ymm2, ymm1, ymm5
vpblendvb ymm3, ymm2, ymm3, ymm4
vmovdqu YMMWORD PTR [rcx+rax*4], ymm3
add eax, 8
cmp eax, r8d
jb SHORT $LL4#
So MSVC managed to hoist the constant setup after inlining. We get similar loops from gcc/clang.
The loop has 4 vector ALU instructions, one of which takes 2 uops. Total 5 vector ALU uops. But total fused-domain uops on Haswell/Skylake = 9 with no unrolling, so with luck this can run at 32 bytes (1 vector) per 2.25 clock cycles. It could come close to actually achieving that with data hot in L1d or L2 cache, but L3 or memory would be a bottleneck. With unrolling, it could maybe bottlenck on L2 cache bandwidth.
An AVX512 version (also included in the Godbolt link), only needs 1 uop to blend, and could run faster in vectors per cycle, thus more than twice as fast using 512-byte vectors.

This is one possible way to make this function work with SSE instructions. I used SSE instead of AVX because I wanted to keep the answer simple. Once you understand how the solution works, rewriting the function with AVX intrinsics should not be much of a problem though.
EDIT: please note that my approach is very similar to one by PeterCordes, but his code should be faster because he uses AVX. If you want to rewrite the function below with AVX intrinsics, change step value to 8.
void CopyImageBitsWithAlphaRGBA(
unsigned char *dest,
const unsigned char *src, int w, int stride, int h,
unsigned char minred, unsigned char mingre, unsigned char minblu,
unsigned char maxred, unsigned char maxgre, unsigned char maxblu)
{
char low = 0x80; // -128
char high = 0x7f; // 127
char mnr = *(char*)(&minred) - low;
char mng = *(char*)(&mingre) - low;
char mnb = *(char*)(&minblu) - low;
int32_t lowest = mnr | (mng << 8) | (mnb << 16) | (low << 24);
char mxr = *(char*)(&maxred) - low;
char mxg = *(char*)(&maxgre) - low;
char mxb = *(char*)(&maxblu) - low;
int32_t highest = mxr | (mxg << 8) | (mxb << 16) | (high << 24);
// SSE
int step = 4;
int sse_width = (w / step)*step;
for (int y = 0; y < h; ++y)
{
for (int x = 0; x < w; x += step)
{
if (x == sse_width)
{
x = w - step;
}
int ptr_offset = y * stride + x;
const unsigned char* src_ptr = src + ptr_offset;
unsigned char* dst_ptr = dest + ptr_offset;
__m128i loaded = _mm_loadu_si128((__m128i*)src_ptr);
// subtract 128 from every 8-bit int
__m128i subtracted = _mm_sub_epi8(loaded, _mm_set1_epi8(low));
// greater than top limit?
__m128i masks_hi = _mm_cmpgt_epi8(subtracted, _mm_set1_epi32(highest));
// lower that bottom limit?
__m128i masks_lo = _mm_cmplt_epi8(subtracted, _mm_set1_epi32(lowest));
// perform OR operation on both masks
__m128i combined = _mm_or_si128(masks_hi, masks_lo);
// are 32-bit integers equal to zero?
__m128i eqzer = _mm_cmpeq_epi32(combined, _mm_setzero_si128());
__m128i shifted = _mm_slli_epi32(eqzer, 24);
// EDIT: fixed a bug:
__m128 alpha_unmasked = _mm_and_si128(loaded, _mm_set1_epi32(0x00ffffff));
__m128i combined = _mm_or_si128(alpha_unmasked, shifted);
_mm_storeu_si128((__m128i*)dst_ptr, combined);
}
}
}
EDIT: as #PeterCordes stated in the comments, the code included a bug that is now fixed.

Based on #PeterCordes solution, but replacing the shift+compare by saturated subtract and adding:
// mins_compl shall be [255-minR, 255-minG, 255-minB, 0]
// maxs shall be [maxR, maxG, maxB, 0]
__m256i setAlphaFromRangeCheck(__m256i src, __m256i mins_compl, __m256i maxs)
{
__m256i in_lo = _mm256_adds_epu8(src, mins_compl); // is 255 iff src+mins_coml>=255, i.e. src>=mins
__m256i in_hi = _mm256_subs_epu8(src, maxs); // is 0 iff src - maxs <= 0, i.e., src <= maxs
__m256i inbounds_components = _mm256_andnot_si256(in_hi, in_lo);
// per-component mask, 0xff, iff (mins<=src && src<=maxs).
// alpha-channel is always (~src & src) == 0
// Use a 32-bit element compare to check that all 3 components are in-range
__m256i RGB_mask = _mm256_set1_epi32(0x00FFFFFF);
__m256i inbounds = _mm256_cmpeq_epi32(inbounds_components, RGB_mask);
__m256i new_alpha = _mm256_slli_epi32(inbounds, 24);
// alternatively _mm256_andnot_si256(RGB_mask, inbounds) ?
// byte blends (vpblendvb) are at least 2 uops, and Haswell requires port5
// instead clear alpha and then OR in the new alpha (0 or 0xFF)
__m256i alphacleared = _mm256_and_si256(src, RGB_mask); // off the critical path
__m256i new_alpha_applied = _mm256_or_si256(alphacleared, new_alpha);
return new_alpha_applied;
}
This saves on vpxor (no modification of src required) and one vpand (the alpha-channel is automatically 0 -- I guess that would be possible with Peter's solution as well by choosing the boundaries accordingly).
Godbolt-Link, apparently, neither gcc nor clang think it is worthwhile to re-use RGB_mask for both usages ...
Simple testing with SSE2 variant: https://wandbox.org/permlink/eVzFHljxfTX5HDcq (you can play around with the source and the boundaries)

Conditionally return only some results from a SIMD calculation, instead of all [duplicate]

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
I've seen in SSE where it was done like this:
(From:https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)
__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
// Move 4 sign bits of mask to 4-bit integer value.
int mask = _mm_movemask_ps(mask);
// Select shuffle control data
__m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask]);
// Permute to move valid values to front of SIMD register
__m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
return packed;
}
This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large(256 entries, each 32 bytes, or 8k).
I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.
I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..
Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?
Here is an illustration of the Left Packing Problem from the above document:
Thanks

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)
We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.
We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.
Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).
AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)
For integer vectors with 32-bit or wider elements: Either 1) _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
For 16-bit elements, you might be able to adapt this with 128-bit vectors.
For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.
The algorithm:
Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.
Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.
By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)
We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.
See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.
#include <stdint.h>
#include <immintrin.h>
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101); // unpack each bit to a byte
expanded_mask *= 0xFF; // mask |= mask<<1 | mask<<2 | ... | mask<<7;
// ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte
const uint64_t identity_indices = 0x0706050403020100; // the identity shuffle for vpermps, packed to one index per byte
uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);
__m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
__m256i shufmask = _mm256_cvtepu8_epi32(bytevec);
return _mm256_permutevar8x32_ps(src, shufmask);
}
This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).
# clang 3.7.1 -std=gnu++14 -O3 -march=haswell
mov eax, edi # just to zero extend: goes away when inlining
movabs rcx, 72340172838076673 # The constants are hoisted after inlining into a loop
pdep rax, rax, rcx # ABC -> 0000000A0000000B....
imul rax, rax, 255 # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
movabs rcx, 506097522914230528
pext rax, rcx, rax
vmovq xmm1, rax
vpmovzxbd ymm1, xmm1 # 3c latency since this is lane-crossing
vpermps ymm0, ymm1, ymm0
ret
(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)
So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.
This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.
e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.
clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).
gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.
If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.
You can unpack LUT entries with the same integer sequence I used, but #Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).

See my other answer for AVX2+BMI2 with no LUT.
Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:
VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double, and 32 or 64bit integer elements (vpcompressq), but not byte or word (16bit)). It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.
The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".
To figure out how far to advance your pointer for the next vector, popcnt the mask.
Let's say you want to filter out everything but values >= 0 from an array:
#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
const float *endp = src+len;
float *dst_start = dst;
do {
__m512 sv = _mm512_loadu_ps(src);
__mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ); // true for src >= 0.0, false for unordered and src < 0.0
_mm512_mask_compressstoreu_ps(dst, keep, sv); // clang is missing this intrinsic, which can't be emulated with a separate store
src += 16;
dst += _mm_popcnt_u64(keep); // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
} while (src < endp);
return dst - dst_start;
}
This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):
# Output from gcc6.1, with -O3 -march=haswell -mavx512f. Same with other gcc versions
lea rcx, [rsi+rdx*4] # endp
mov rax, rdi
vpxord zmm1, zmm1, zmm1 # vpxor xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
vmovups zmm0, ZMMWORD PTR [rsi]
add rsi, 64
vcmpps k1, zmm0, zmm1, 29 # AVX512 compares have mask regs as a destination
kmovw edx, k1 # There are some insns to add/or/and mask regs, but not popcnt
movzx edx, dx # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
vcompressps ZMMWORD PTR [rax]{k1}, zmm0
popcnt rdx, rdx
## movsx rdx, edx # with _popcnt_u32, gcc is dumb. No casting can get gcc to do anything but sign-extend. You'd expect (unsigned) would mov to zero-extend, but no.
lea rax, [rax+rdx*4] # dst += ...
cmp rcx, rsi
ja .L2
sub rax, rdi
sar rax, 2 # address math -> element count
ret
Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake
In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).
#ZachB reports in comments that in practice, that a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)
I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.
Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.
It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.
A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.
This is likely slower than vcompressps with unaligned stores on current Intel CPUs.

If you are targeting AMD Zen this method may be preferred, due to the very slow pdepand pext on ryzen (18 cycles each).
I came up with this method, which uses a compressed LUT, which is 768(+1 padding) bytes, instead of 8k. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane, then masked to the lower 3 bits, which provides a 0-7 LUT.
Here is the intrinsics version, along with code to build LUT.
//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
__m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT
// __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
//now shift it right to get 3 bits at bottom
//__m256i shufmask = _mm256_srli_epi32(m, 29);
//Simplified version suggested by wim
//shift each lane so desired 3 bits are a bottom
//There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the first 3 bits so this is ok
__m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
return shufmask;
}
u32 get_nth_bits(int a) {
u32 out = 0;
int c = 0;
for (int i = 0; i < 8; ++i) {
auto set = (a >> i) & 1;
if (set) {
out |= (i << (c * 3));
c++;
}
}
return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];
void BuildPackMask() {
for (int i = 0; i < 256; ++i) {
*reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
}
}
Here is the assembly generated by MSVC:
lea ecx, DWORD PTR [rcx+rcx*2]
lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
vpbroadcastd ymm0, DWORD PTR [rcx+rax]
vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm#00000015000000120000000f0000000c00000009000000060000000300000000

Will add more information to a great answer from #PeterCordes : https://stackoverflow.com/a/36951611/5021064.
I did the implementations of std::remove from C++ standard for integer types with it. The algorithm, once you can do compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then benchmarks.
I ended up with two meaningful variations on the proposed solution:
__m128i registers, any element type, using _mm_shuffle_epi8 instruction
__m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32
When the types are smaller then 4 bytes for 256 bit register, I split them in two 128 bit registers and compress/store each one separately.
Link to compiler explorer where you can see complete assembly (there is a using type and width (in elements per pack) in the bottom, which you can plug in to get different variations) : https://gcc.godbolt.org/z/yQFR2t
NOTE: my code is in C++17 and is using a custom simd wrappers, so I do not know how readable it is. If you want to read my code -> most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.
Implementations of #PeterCordes answer for both cases
Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.
Mask for _mm_shuffle_epi8
Write an index for each byte into a half byte: 0xfedcba9876543210
Get pairs of indexes into 8 shorts packed into __m128i
Spread them out using x << 4 | x & 0x0f0f
Example of spreading the indexes. Let's say 7th and 6th elements are picked.
It means that the corresponding short would be: 0x00fe. After << 4 and | we'd get 0x0ffe. And then we clear out the second f.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of `_mm_movemask_epi8`,
// `uint16_t` - there are at most 16 bits with values for __m128i.
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;
const std::uint8_t offset =
static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes =
_pext_u64(0xfedcba9876543210, mmask_expanded); // Do the #PeterCordes answer
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte); // From bytes to shorts over the whole register
const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4); // x << 4
const __m128i combined = _mm_or_si128(shift_by_4, as_16bit); // | x
const __m128i filter = _mm_set1_epi16(0x0f0f); // 0x0f0f
const __m128i res = _mm_and_si128(combined, filter); // & 0x0f0f
return {res, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
auto res = _compress_mask::mask128(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Mask for _mm256_permutevar8x32_epi32
This is almost one for one #PeterCordes solution - the only difference is _pdep_u64 bit (he suggests this as a note).
The mask that I chose is 0x5555'5555'5555'5555. The idea is - I have 32 bits of mmask, 4 bits for each of 8 integers. I have 64 bits that I want to get => I need to convert each bit of 32 bits into 2 => therefore 0101b = 5.The multiplier also changes from 0xff to 3 because I will get 0x55 for each integer, not 1.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;
const std::uint8_t offset = static_cast<std::uint8_t(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded); // Do the #PeterCordes answer
// Every index was one byte => we need to make them into 4 bytes
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0000|compressed indexes
const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte); // spread them out
return {expanded, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
static_assert(sizeof(T) >= 4); // You cannot permute shorts/chars with this.
auto res = _compress_mask::mask256_epi32(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Benchmarks
Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, build from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark
Controlling for code alignment:
If you are not familiar with the concept, read this or watch this
All functions in the benchmark's binary are aligned to 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide in the beginning of the function (before entering the loop). The main numbers I show is min per each measurement. I think this works since the algorithm is inlined. I'm also validated by the fact that I get very different results. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline
Benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because of SIMD depends on the size of the data and not a number of elements. The element count can be derived from an element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since time it takes for non simd code depends mostly on the element count, the wins should be bigger for chars.
Plots: x - percentage of zeroes, y - time in nanoseconds. padding : min indicates that this is minimum among all alignments.
40 bytes worth of data, 40 chars
For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower when using 128 bit registers over non-simd code. So, for example, compiler should be careful doing this.
1000 bytes worth of data, 1000 chars
Apparently the non-simd version is dominated by branch prediction: when we get small amount of zeroes we get a smaller speed up: for no 0s - about 3 times, for 5% zeroes - about 5-6 times speed up. For when the branch predictor can't help the non-simd version - there is about a 27 times speed up. It's an interesting property of simd code that it's performance tends to be much less dependent on of data. Using 128 vs 256 register shows practically no difference, since most of the work is still split into 2 128 registers.
1000 bytes worth of data, 500 shorts
Similar results for shorts except with a much smaller gain - up to 2 times.
I don't know why shorts do that much better than chars for non-simd code: I'd expect shorts to be two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.
1000 bytes worth of data, 250 ints
For a 1000 only 256 bit version makes sense - 20-30% win excluding no 0s to remove what's so ever (perfect branch prediction, no removing for non-simd code).
10'000 bytes worth of data, 10'000 chars
The same order of magnitude wins as as for a 1000 chars: from 2-6 times faster when branch predictor is helpful to 27 times when it's not.
Same plots, only simd versions:
Here we can see about a 10% win from using 256 bit registers and splitting them in 2 128 bit ones: about 10% faster. In size it grows from 88 to 129 instructions, which is not a lot, so might make sense depending on your use-case. For base-line - non-simd version is 79 instructions (as far as I know - these are smaller then SIMD ones though).
10'000 bytes worth of data, 5'000 shorts
From 20% to 9 times win, depending on the data distributions. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars and the same win for 256 bit one of about 10%.
10'000 bytes worth of data, 2'500 ints
Seems to make a lot of sense to use 256 bit registers, this version is about 2 times faster compared to 128 bit registers. When comparing with non-simd code - from a 20% win with a perfect branch prediction to 3.5 - 4 times as soon as it's not.
Conclusion: when you have a sufficient amount of data (at least 1000 bytes) this can be a very worthwhile optimisation for a modern processor without AVX-512
PS:
On percentage of elements to remove
On one hand it's uncommon to filter half of your elements. On the other hand a similar algorithm can be used in partition during sorting => that is actually expected to have ~50% branch selection.
Code alignment impact
The question is: how much worth it is, if the code happens to be poorly aligned
(generally speaking - there is very little one can do about it).
I'm only showing for 10'000 bytes.
The plots have two lines for min and for max for each percentage point (meaning - it's not one best/worst code alignment - it's the best code alignment for a given percentage).
Code alignment impact - non-simd
Chars:
From 15-20% for poor branch prediction to 2-3 times when branch prediction helped a lot. (branch predictor is known to be affected by code alignment).
Shorts:
For some reason - the 0 percent is not affected at all. It can be explained by std::remove first doing linear search to find the first element to remove. Apparently linear search for shorts is not affected.
Other then that - from 10% to 1.6-1.8 times worth
Ints:
Same as for shorts - no 0s is not affected. As soon as we go into remove part it goes from 1.3 times to 5 times worth then the best case alignment.
Code alignment impact - simd versions
Not showing shorts and ints 128, since it's almost the same assembly as for chars
Chars - 128 bit register
About 1.2 times slower
Chars - 256 bit register
About 1.1 - 1.24 times slower
Ints - 256 bit register
1.25 - 1.35 times slower
We can see that for simd version of the algorithm, code alignment has significantly less impact compared to non-simd version. I suspect that this is due to practically not having branches.

In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.
Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.
static inline __m128 LeftPack_SSE2(__m128 val, int mask) {
switch(mask) {
case 0:
case 1: return val;
case 2: return _mm_shuffle_ps(val,val,0x01);
case 3: return val;
case 4: return _mm_shuffle_ps(val,val,0x02);
case 5: return _mm_shuffle_ps(val,val,0x08);
case 6: return _mm_shuffle_ps(val,val,0x09);
case 7: return val;
case 8: return _mm_shuffle_ps(val,val,0x03);
case 9: return _mm_shuffle_ps(val,val,0x0c);
case 10: return _mm_shuffle_ps(val,val,0x0d);
case 11: return _mm_shuffle_ps(val,val,0x34);
case 12: return _mm_shuffle_ps(val,val,0x0e);
case 13: return _mm_shuffle_ps(val,val,0x38);
case 14: return _mm_shuffle_ps(val,val,0x39);
case 15: return val;
}
}
__m128 foo(__m128 val, __m128 maskv) {
int mask = _mm_movemask_ps(maskv);
return LeftPack_SSE2(val, mask);
}

This is perhaps a bit late though I recently ran into this exact problem and found an alternative solution which used a strictly AVX implementation. If you don't care if unpacked elements are swapped with the last elements of each vector, this could work as well. The following is an AVX version:
inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);
__m128 v = val;
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask0);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask1);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask2);
return v;
}
Essentially, each element in val is shifted once to the left using the bitfield, 0xF9 for blending with it's unshifted variant. Next, both shifted and unshifted versions are blended against the input mask (which has the first non-zero element broadcast across the remaining elements 3 and 4). Repeat this process two more times, broadcasting the second and third elements of mask to its subsequent elements on each iteration and this should provide an AVX version of the _pdep_u32() BMI2 instruction.
If you don't have AVX, you can easily swap out each _mm_permute_ps() with _mm_shuffle_ps() for an SSE4.1-compatible version.
And if you're using double-precision, here's an additional version for AVX2:
inline __m256 left_pack(__m256d val, __m256i mask) noexcept
{
const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);
__m256d v = val;
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask0);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask1);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask2);
return v;
}
Additionally _mm_popcount_u32(_mm_movemask_ps(val)) can be used to determine the number of elements which remained after the left-packing.

Fastest way to downcast an array short to char

I have to process roughly 2000, 100 element arrays every second. The arrays come to me as shorts, w/ the data in the upper bits and need to be shifted and cast to chars. Is this as efficient as I can get, or is there a faster way to perform this operation? (I have to skip 2 of the values)
for(int i = 0; i < 48; i++)
{
a[i] = (char)(b[i] >> 8);
a[i+48] = (char)(b[i+50] >> 8);
}

Even if shift and bitwise operation are fast, you can try to process the short array as a char pointer as other advised in comments. It is allowed per standard and for common architectures does what is expected - left the endianness problem.
So you could try to first determine your endianness:
bool isBigEndian() {
short i = 1; // sets only lowest order bit
char *ix = reinterpret_cast<char *>(&i);
return (*ix == 0); // will be 1 if little endian
}
Your loop now becomes:
int shft = isBigEndian()? 0 : 1;
char * pb = reinterpret_cast<char *>(b);
for(int i = 0; i < 48; i++)
{
a[i] = pt[2 * i + shft];
a[i+48] = pt[2 * i + 50 + shft];
}
But as always for low level optimisation, this has to be benchmarked with the compiler and compiler options that will be used in production code.

You could put a wrapper class around these arrays, so code that accesses elements of the wrapper in order actually accesses every other byte of the underlying memory.
This will probably defeat auto-vectorization, though. Other than that, having all the code that would read a actually read b and increment its pointers by two instead of one shouldn't change the cost at all.
The two skipped elements are a problem, though. Having your operator[] do if (i>=48) i+=2 might kill this idea. memmove will often be much faster than storing one byte at a time, so you could consider using memmove to make a contiguous array of shorts that you can index even though it seems silly to copy without storing in a better format.
The trick will be to write a wrapper that completely optimizes away to no extra instructions in loops over your arrays. This is possible on x86, where scaled indexing is available in normal effective-addresses in asm instructions, so if the compiler understands what's going on, it can make code that's just as efficient.
Having arrays of shorts does take twice as much memory, so cache effects could matter.
It all depends on what you need to do with the byte arrays.
If you do need to convert, use SIMD
For x86 targets, you can get a big speedup with SIMD vectors instead of looping one char at a time. For other compile targets you care about, you can write similar special versions. I assume ARM NEON has similar shuffling capability, for example.
When writing a platform-specific version, you also get to make all the endian and unaligned-access assumptions that are true on that platform.
#ifdef __SSE2__ // will be true for all x86-64 builds and most i386 builds
#include <immintrin.h>
static __m128i pack2(const short *p) {
__m128i lo = _mm_loadu_si128((__m128i*)p);
__m128i hi = _mm_loadu_si128((__m128i*)(p + 8));
lo = _mm_srli_epi16(lo, 8); // logical shift, not arithmetic, because we need the high byte to be zero
hi = _mm_srli_epi16(hi, 8);
return _mm_packus_epi16(lo, hi); // treats input as signed, saturates to unsigned 0x0 .. 0xff range
}
#endif // SSE2
void conv(char *a, const short *b) {
#ifdef __SSE2__
for(int i = 0; i < 48; i+=16) {
__m128i low = pack2(b+i);
_mm_storeu_si128((__m128i *)(a+i), low);
__m128i high = pack2(b+i + 50);
_mm_storeu_si128((__m128i *)(a+i + 48), high);
}
#else
/******* Fallback C version *******/
for(int i = 0; i < 48; i++) {
a[i] = (char)(b[i] >> 8);
a[i+48] = (char)(b[i+50] >> 8);
}
#endif
}
As you can see on the Godbolt Compiler Explorer, gcc fully unrolls the loop since it's only a few iterations when storing 16B at a time.
This should perform ok, but on pre-Skylake will bottleneck on shifting both vectors of shorts before the store. Haswell can only sustain one psrli per clock. (Skylake can sustain one per 0.5c when the shift-count is an immediate. See Agner Fog's guide and insn tables, links at the x86 tag wiki.)
You might get better results from loading from (__m128i*)(1 + (char*)p) so the bytes we want are already in the low half of each 16bit element. We'd still have to mask off the high half of each element with _mm_and_si128 instead of shifting, but PAND can run on any vector execution port, so it has three per clock throughput.
More importantly, with AVX it can be combined with an unaligned load. e.g. vpand xmm0, xmm5, [rsi], where xmm5 is a mask of _mm_set1_epi16(0x00ff), and [rsi] holds 2*i + 1 + (char*)b. fused-domain uop throughput is probably going to be an issue, like is common for code with a lot of loads/stores as well as computation.
Unaligned accesses are slightly slower than aligned accesses, but at least half your vector accesses will be unaligned anyway (since skipping two shorts means skipping 4B). On Intel SnB-family CPUs, I don't think it's slower to have loads that are split across a cache-line boundary in a 15:1 split compared to a 12:4 split. (The no-split case is definitely faster, though.) If b is 16B-aligned, then it'll be worth testing the mask version against the shift version.
I didn't write up complete code for this version, because you'll end up reading one byte past the end of b unless you take special precautions. This is fine if you make sure b has padding of some sort so it doesn't go right to the end of a memory page.
AVX2
With AVX2, vpackuswb ymm operates in two separate lanes. IDK if there's anything to gain from doing the load and mask (or shift) on 256b vectors and then using a vextracti128 and 128b pack on the two halves of the 256b vector.
Or maybe do a 256b pack between two vectors and then a vpermq (_mm256_permute4x64_epi64) to sort things out:
lo = _mm256_loadu(b..); // { b[15..8] | b[7..0] }
hi = // { b[31..24] | b[23..16] }
// mask or shift
__m256i packed = _mm256_packus_epi16(lo, hi); // [ a31..24 a15..8 | a23..16 a7..0 ]
packed = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
Of course, use any portable optimizations you can in the C version. e.g. Serge Ballesta's suggestion of just copying the desired bytes after figuring out their location from the endianness of the machine. (Preferably at compile time by checking GNU C's __BYTE_ORDER__ macro.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js