I want to set the leading zero bits in any size integer to 1 in standard C++.
eg.
0001 0011 0101 1111 -> 1111 0011 0101 1111
All the algorithms I've found to do this require a rather expensive leading zero count, which is odd: there are very fast and easy ways to do other types of bit manipulation, such as:
int y = -x & x; //Extracts lowest set bit, 1110 0101 -> 0000 0001
int y = (x + 1) & x; //Will clear the trailing ones, 1110 0101 -> 1110 0100
int y = (x - 1) | x; //Will set the trailing zeros, 0110 0100 -> 0110 0111
So that makes me think there must be a way to set the leading zeros of an integer in one simple line of code consisting of basic bitwise operators. Please tell me there's hope, because right now I'm settling for reversing the order of the bits in my integer, using the fast way of setting trailing zeros, and then reversing the integer again to get its leading zeros set to ones. That is actually significantly faster than using a leading zero count, but still quite slow compared with the other one-liners above.
template<typename T>
inline constexpr void reverse(T& x)
{
    T rev = 0;
    size_t s = sizeof(T) * CHAR_BIT;

    while(s > 0)
    {
        rev = (rev << 1) | (x & 0x01);
        x >>= 1;
        s -= 1uz;
    }//End while

    x = rev;
}
template<typename T>
inline constexpr void set_leading_zeros(T& x)
{
    reverse(x);
    x = (x - 1) | x; //Set trailing 0s to 1s
    reverse(x);
}
Edit
Because some asked: I'm working with MS-DOS running on CPUs ranging from early X86 to a 486DX installed in older CNC machines.
Fun times. :D
The leading zeroes can be set without counting them, while also avoiding reversing the integer. For convenience I won't do it for a generic integer type T, but likely it can be adapted, or you could use template specialization.
First calculate the mask of all the bits that aren't the leading zeroes, by "spreading" the bits downwards:
uint64_t m = x | (x >> 1);
m |= m >> 2;
m |= m >> 4;
m |= m >> 8;
m |= m >> 16;
m |= m >> 32;
Then set all the bits that that mask doesn't cover:
return x | ~m;
Bonus: this automatically works both when x = 0 and when x has all bits set; in a count-leading-zeros approach, one of those cases could lead to an overly large shift amount (which one depends on the details, but almost always one of them is troublesome, since there are 65 distinct cases but only 64 valid shift amounts, if we're talking about uint64_t).
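For reference, here is the same idea as a self-contained function for uint64_t (the wrapper and its name are mine):

#include <cstdint>

uint64_t set_leading_zeros(uint64_t x)
{
    uint64_t m = x | (x >> 1); // spread the highest set bit downwards
    m |= m >> 2;
    m |= m >> 4;
    m |= m >> 8;
    m |= m >> 16;
    m |= m >> 32;
    return x | ~m;             // set every bit the mask doesn't cover
}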
You could count leading zeroes using std::countl_zero and create a bitmask that you bitwise OR with the original value:
#include <bit>
#include <climits>
#include <type_traits>
template<class T>
    requires std::is_unsigned_v<T>
T leading_ones(T v) {
    auto lz = std::countl_zero(v);
    return lz ? v | ~T{} << (CHAR_BIT * sizeof v - lz) : v;
}
If you have a std::uint16_t, like
0b0001001101011111
then ~T{} is 0b1111111111111111, CHAR_BIT * sizeof v is 16 and countl_zero(v) is 3. Left shift 0b1111111111111111 16-3 steps:
0b1110000000000000
Bitwise OR with the original:
0b0001001101011111
| 0b1110000000000000
--------------------
= 0b1111001101011111
Your reverse is extremely slow! With an N-bit int you need N iterations to reverse, each at least 6 instructions, then at least 2 instructions to set the trailing bits, and finally N iterations to reverse the value again. OTOH even the simplest leading zero count needs only N iterations, then set the leading bits directly:
template<typename T>
inline constexpr T trivial_ilog2(T x) // Slow, don't use this
{
    // Note: returns the bit count (floor(log2(x)) + 1), not the bit position
    if (x == 0) return 0;
    size_t c{};
    while(x)
    {
        x >>= 1;
        c += 1u;
    }
    return c;
}
template<typename T>
inline constexpr T set_leading_zeros(T x)
{
    if (std::make_unsigned_t<T>(x) >> (sizeof(T) * CHAR_BIT - 1)) // top bit is set
        return x;
    return x | (-T(1) << trivial_ilog2(x));
}
x = set_leading_zeros(x);
There are many other ways to count leading zeros / get the integer logarithm much faster. One method does it in steps of powers of 2, like the mask creation in harold's answer:
What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?
Find the highest order bit in C
How to efficiently count the highest power of 2 that is less than or equal to a given number?
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogLookup
But since you're targeting a specific platform instead of doing something cross-platform, and you want to squeeze out every bit of performance, there are almost no reasons to use pure standard features: these use cases typically need platform-specific code. If intrinsics are available you should use them; for example, in modern C++ there's std::countl_zero, but each compiler also has intrinsics that map to the best instruction sequence for its platform, such as _BitScanReverse or __builtin_clz.
If intrinsics aren't available, or if the performance is still not enough, then try a lookup table. For example, here's a solution with a 256-element log table:
static const char LogTable256[256] =
{
#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
    -1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
    LT(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
    LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
};
uint16_t lut_ilog2_16(uint16_t x)
{
    uint8_t h = x >> 8;
    if (h) return LogTable256[h] + 8;
    else return LogTable256[x & 0xFF];
}
In set_leading_zeros, just call lut_ilog2_16 like above.
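Note the off-by-one when doing so: lut_ilog2_16 returns the bit position, while trivial_ilog2 above returns the bit count, so the shift amount needs a +1. A possible 16-bit version (a sketch, with the zero and top-bit cases handled explicitly; the function name is mine):

uint16_t set_leading_zeros_16(uint16_t x)
{
    if (x & 0x8000u) return x;       // top bit set: there are no leading zeros
    if (x == 0)      return 0xFFFFu; // every bit is a leading zero
    return x | (uint16_t)(0xFFFFu << (lut_ilog2_16(x) + 1));
}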
An even better solution than a log table is a mask table, so that you can get the mask directly instead of calculating it from LogTable256[x]:
static const char MaskTable256[256] =
{
    0xFF, 0xFE, 0xFC...
};
Some other notes:
1uz isn't a valid suffix in C++ prior to C++23
Don't use references for small types that fit in a single integer. That's not necessary and is usually slower when not inlined. Just assign the result back from the function
Crusty old x86 CPUs have very slow C++20 std::countl_zero / GNU C __builtin_clz (386 bsr = Bit Scan Reverse actually finds the position of the highest set bit, like 31-clz, and is weird for an input of 0 so you need to branch on that.) For CPUs before Pentium Pro / Pentium II, Harold's answer is your best bet, generating a mask directly instead of a count.
(Before 386, shifting by large counts might be better done with partial register shenanigans like mov al, ah / mov ah, 0 instead of shr ax, 8, since 286 and earlier didn't have a barrel shifter for constant-time shifts. But in C++, that's something for the compiler to figure out. Shift by 16 is free since a 32-bit integer can only be kept in a pair of 16-bit registers on 286 or earlier.)
8086 to 286 - no instruction available.
386: bsf/bsr: 10+3n cycles. Worst-case: 10+3*31 = 103c
486: bsf (16 or 32-bit registers): 6-42 cycles; bsr 7-104 cycles (1 cycle less for 16-bit regs).
P5 Pentium: bsf: 6-42 cycles (6-34 for 16-bit); bsr 7-71 cycles. (or 7-39 for 16-bit). Non-pairable.
Intel P6 and later: bsf/bsr: 1 uop with 1 cycle throughput, 3 cycle latency. (PPro / PII and later).
AMD K7/K8/K10/Bulldozer/Zen: bsf/bsr are slowish for a modern CPU. e.g. K10 3 cycle throughput, 4 cycle latency, 6 / 7 m-ops respectively.
Intel Haswell / AMD K10 : lzcnt introduced (as part of BMI1 for Intel, or with its own feature bit for AMD, before tzcnt and the rest of BMI1).
For an input of 0, they return the operand-size, so they fully implement C++20 std::countl_zero / countr_zero respectively, unlike bsr/bsf. (Which leave the destination unmodified on input=0. AMD documents this, Intel implements it in practice on current CPUs at least, but documents the destination register as "undefined" contents. Perhaps some older Intel CPUs are different, otherwise it's just annoying that they don't document the behaviour so software can take advantage.)
On AMD, they're fast: a single uop for lzcnt, with tzcnt taking one more (probably a bit-reverse to feed the lzcnt execution unit), so a nice win vs. bsf/bsr. This is why compilers typically use rep bsf for countr_zero / __builtin_ctz, so it runs as tzcnt on CPUs that support it, but as bsf on older CPUs. They produce the same results for non-zero inputs, unlike bsr/lzcnt.
On Intel, same fast performance as bsf/bsr, even including the output dependency until Skylake fixed that; it's a true dependency for bsf/bsr, but false dependency for tzcnt/lzcnt and popcnt.
Fast algorithm with a bit-scan building block
But on P6 (Pentium Pro) and later, a bit-scan for the highest set bit is likely to be a useful building block for an even faster strategy than log2(width) shift/or operations, especially for uint64_t on a 64-bit machine. (Or maybe even moreso for uint64_t on a 32-bit machine, where each shift would require shifting bits across the gap.)
Cycle counts from https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html which has instructions timings for 8088 through Pentium. (But not counting the instruction-fetch bottleneck which usually dominates 8086 and especially 8088 performance.)
bsr (index of highest set bit) is fast on modern x86: 1 cycle throughput on P6 and later, not bad on AMD. On even more recent x86, BMI1 lzcnt is 1 cycle on AMD as well, and avoids an output dependency (on Skylake and newer). Also it works for an input of 0 (producing the type width aka operand size), unlike bsr which leaves the destination register unmodified.
I think the best version of this (if BMI2 is available) is one inspired by Ted Lyngmo's answer, but changed to shift left / right instead of generating a mask. ISO C++ doesn't guarantee that >> is an arithmetic right shift on signed integer types, but all sane compilers choose that as their implementation-defined behaviour. (For example, GNU C documents it.)
https://godbolt.org/z/hKohn8W8a has that idea, which indeed is great if we don't need to handle x==0.
Also an idea with BMI2 bzhi, if we're considering what's efficient with BMI2 available: like x | ~_bzhi_u32(-1, 32-lz). Unfortunately it requires two inversions, the 32-lzcnt and the ~. We have BMI1 andn, but not an equivalent orn. And we can't just use neg because bzhi doesn't mask the count; that's the whole point: it has unique behaviour for all 33 different inputs.
int set_leading_zeros(int x){
    int lz = __builtin_clz(x|1);               // clamp the lzcount to 31 at most
    int tmp = (x<<lz);                         // shift out leading zeros, leaving a 1 (or 0 if x==0)
    tmp |= 1ULL<<(CHAR_BIT * sizeof(tmp) - 1); // set the MSB in case x==0
    return tmp>>lz;                            // sign-extend with an arithmetic right shift.
}
#include <immintrin.h>
uint32_t set_leading_zeros_bmi2(uint32_t x){
    uint32_t lz = _lzcnt_u32(x);            // returns 0 to 32
    uint32_t mask = _bzhi_u32(-1, 32 - lz); // handles all 33 possible values, producing 0 for lz=32
    return x | ~mask;
}
On x86-64 this combines with BMI2 shlx / sarx for single-uop variable-count shifts even on Intel CPUs.
With efficient shifts (BMI2, or non-Intel such as AMD), it's maybe better to do (x << lz) >> lz to sign-extend. Except if lz is the type width; if you need to handle that, generating a mask is probably more efficient.
Unfortunately shl/sar reg, cl costs 3 uops on Sandybridge-family (because of x86 legacy baggage where shifts don't set FLAGS if the count happens to be zero), so you need BMI2 shlx / sarx for it to be better than bsr ecx, src / mov tmp, -1 / not ecx / shl tmp, cl / or dst, tmp.
I have the function below:
void CopyImageBitsWithAlphaRGBA(unsigned char *dest, const unsigned char *src, int w, int stride, int h,
    unsigned char minredmask, unsigned char mingreenmask, unsigned char minbluemask,
    unsigned char maxredmask, unsigned char maxgreenmask, unsigned char maxbluemask)
{
    auto pend = src + w * h * 4;
    for (auto p = src; p < pend; p += 4, dest += 4)
    {
        dest[0] = p[0]; dest[1] = p[1]; dest[2] = p[2];
        if ((p[0] >= minredmask && p[0] <= maxredmask) || (p[1] >= mingreenmask && p[1] <= maxgreenmask) || (p[2] >= minbluemask && p[2] <= maxbluemask))
            dest[3] = 255;
        else
            dest[3] = 0;
    }
}
What it does is it copies a 32 bit bitmap from one memory block to another, setting the alpha channel to fully transparent when the pixel color falls within a certain color range.
How do I make this use SSE/AVX in VC++ 2017? Right now it's not generating vectorized code. Failing an automatic way of doing it, what functions can I use to do this myself?
Because really, I'd imagine testing if bytes are in a range would be one of the most obviously useful operations possible, but I can't see any built in function to take care of it.
I don't think you're going to get a compiler to auto-vectorize as well as you can do by hand with Intel's intrinsics. (err, as well as I can do by hand anyway :P).
Possibly once we manually vectorize it, we can see how to hand-hold a compiler with scalar code that works that way, but we really need packed-compare into a 0/0xFF with byte elements, and it's hard to write something in C that compilers will auto-vectorize well. The default integer promotions mean that most C expressions actually produce 32-bit results, even when you use uint8_t, and that often tricks compilers into unpacking 8-bit to 32-bit elements, costing a lot of shuffles on top of the automatic factor of 4 throughput loss (fewer elements per register), like in @harold's small tweak to your source.
SSE/AVX (before AVX512) has signed comparisons for SIMD integer, not unsigned. But you can range-shift things to signed -128..127 by subtracting 128. XOR (add-without-carry) is slightly more efficient on some CPUs, so you actually just XOR with 0x80 to flip the high bit. But mathematically you're subtracting 128 from a 0..255 unsigned value, giving a -128..127 signed value.
It's even still possible to implement the "unsigned compare trick" of (x-min) < (max-min). (For example, detecting alphabetic ASCII characters). As a bonus, we can bake the range-shift into that subtract. If x<min, it wraps around and becomes a large value greater than max-min. This obviously works for unsigned, but it does in fact work (with a range-shifted max-min) with SSE/AVX2 signed-compare instructions. (A previous version of this answer claimed this trick only worked if max-min < 128, but that's not the case. x-min can't wrap all the way around and become lower than max-min, or get into that range if it started above max).
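As a scalar illustration of that trick with inclusive bounds (all arithmetic wraps mod 256):

// in-range iff (uint8_t)(x - lo) <= (uint8_t)(hi - lo)
bool in_range(uint8_t x, uint8_t lo, uint8_t hi) {
    return (uint8_t)(x - lo) <= (uint8_t)(hi - lo);
}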
An earlier version of this answer had code that made the range exclusive, i.e. not including the ends, so even redmin=0 / redmax=255 would exclude pixels with red=0 or red=255. But I solved that by comparing the other way (thanks to ideas from @Nejc's and @chtz's answers).
@chtz's idea of using a saturating add/sub instead of a compare is very cool. If you arrange things so saturation means in-range, it works for an inclusive range. (And you can set the Alpha component to a known value by choosing a min/max that makes all 256 possible inputs in-range.) This lets us avoid range-shifting to signed, because unsigned saturation is available.
We can combine the sub/cmp range-check with the saturation trick to do sub (wraps on out-of-bounds low) / subs (only reaches zero if the first sub didn't wrap). Then we don't need an andnot or or to combine two separate checks on each component; we already have a 0 / non-zero result in one vector.
So it only takes two operations to give us a 32-bit value for the whole pixel that we can check. Iff all 3 RGB components are in-range, that element will have a specific value. (Because we've arranged for the Alpha component to already give a known value, too). If any of the 3 components are out-of-range, it will have some other value.
If you do this the other way, so saturation means out-of-range, then you have an exclusive range in that direction, because you can't choose a limit such that no value reaches 0 or reaches 255. You can always saturate the alpha component to give yourself a known value there, regardless of what it means for the RGB components. An exclusive range would let you abuse this function to be always-false by choosing a range that no pixel could ever match. (Or if there's a third condition, besides per-component min/max, then maybe you want an override).
The obvious thing would be to use a packed-compare instruction with 32-bit element size (_mm256_cmpeq_epi32 / vpcmpeqd) to generate a 0xFF or 0x00 (which we can apply / blend into the original RGB pixel value) for in/out of range.
// AVX2 core idea: wrapping-compare trick with saturation to achieve unsigned compare
__m256i tmp = _mm256_sub_epi8(src, min_values); // wraps to high unsigned if below min
__m256i RGB_inrange = _mm256_subs_epu8(tmp, max_minus_min); // unsigned saturation to 0 means in-range
__m256i new_alpha = _mm256_cmpeq_epi32(RGB_inrange, _mm256_setzero_si256());
// then blend the high byte of each element with RGB from the src vector
__m256i alpha_replaced = _mm256_blendv_epi8(new_alpha, src, _mm256_set1_epi32(0x00FFFFFF)); // alpha from new_alpha, RGB from src
Note that an SSE2 version would only need one MOVDQA instruction to copy src; the same register is the destination for every instruction.
Also note that you could saturate the other direction: add then adds (with (256-max) and (256-(min-max)), I think) to saturate to 0xFF for in-range. This could be useful with AVX512BW if you use zero-masking with a fixed mask (e.g. for alpha) or variable mask (for some other condition) to exclude a component based on some other condition. AVX512BW zero-masking for the sub/subs version would consider components in-range even when they aren't, which could also be useful.
But extending that to AVX512 requires a different approach: AVX512 compares produce a bit-mask (in a mask register), not a vector, so we can't turn around and use the high byte of each 32-bit compare result separately.
Instead of cmpeq_epi32, we can produce the value we want in the high byte of each pixel using carry/borrow from a subtract, which propagates left to right.
0x00000000 - 1 = 0xFFFFFFFF # high byte = 0xFF = new alpha
0x00?????? - 1 = 0x00?????? # high byte = 0x00 = new alpha
Where ?????? has at least one non-zero bit, so it's a 32-bit number >=0 and <=0x00FFFFFFFF
Remember we choose an alpha range that makes the high byte always zero
i.e. _mm256_sub_epi32(RGB_inrange, _mm256_set1_epi32(1)). We only need the high byte of each 32-bit element to have the alpha value we want, because we use a byte-blend to merge it with the source RGB values. For AVX512, this avoids a VPMOVM2D zmm1, k1 instruction to convert a compare result back into a vector of 0/-1, or (much more expensive) to interleave each mask bit with 3 zeros to use it for a byte-blend.
This sub instead of cmp has a minor advantage even for AVX2: sub_epi32 runs on more ports on Skylake (p0/p1/p5 vs. p0/p1 for pcmpgt/pcmpeq). On all other CPUs, vector integer add/sub run on the same ports as vector integer compare. (Agner Fog's instruction tables).
Also, if you compile _mm256_cmpeq_epi32() with -march=native on a CPU with AVX512, or otherwise enable AVX512 and then compile normal AVX2 intrinsics, some compilers will stupidly use AVX512 compare-into-mask and then expand back to a vector instead of just using the VEX-coded vpcmpeqd. Thus, we use sub instead of cmp even for the _mm256 intrinsics version, because I already spent the time to figure it out and show that it's at least as efficient in the normal case of compiling for regular AVX2. (Although _mm256_setzero_si256() is cheaper than set1(1); vpxor can zero a register cheaply instead of loading a constant, but this setup happens outside the loop.)
#include <immintrin.h>

#ifdef __AVX2__
// inclusive min and max
__m256i setAlphaFromRangeCheck_AVX2(__m256i src, __m256i mins, __m256i max_minus_min)
{
    __m256i tmp = _mm256_sub_epi8(src, mins);   // out-of-range wraps to a high unsigned value
    // (x-min) <= (max-min)  equivalent to:
    // (x-min) - (max-min) saturates to zero
    __m256i RGB_inrange = _mm256_subs_epu8(tmp, max_minus_min);
    // 0x00000000 for in-range pixels, 0x00?????? (some higher value) otherwise
    // this has minor advantages over compare against zero, see full comments on Godbolt

    __m256i new_alpha = _mm256_sub_epi32(RGB_inrange, _mm256_set1_epi32(1));
    // 0x00000000 - 1 = 0xFFFFFFFF
    // 0x00?????? - 1 = 0x00??????  high byte = new alpha value

    const __m256i RGB_mask = _mm256_set1_epi32(0x00FFFFFF); // blend mask
    // without AVX512, the only byte-granularity blend is a 2-uop variable-blend with a control register
    // On Ryzen, it's only 1c latency, so probably 1 uop that can only run on one port. (1c throughput).
    // For 256-bit, that's 2 uops of course.
    __m256i alpha_replaced = _mm256_blendv_epi8(new_alpha, src, RGB_mask); // RGB from src, 0/FF from new_alpha
    return alpha_replaced;
}
#endif // __AVX2__
Set up vector args for this function and loop over your array with _mm256_load_si256 / _mm256_store_si256. (Or loadu/storeu if you can't guarantee alignment.)
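A minimal driver might look like this (a sketch: it assumes the pixel count is a multiple of 8, so a real version needs a scalar or overlapped tail, and the mins / max_minus_min vectors are built from the per-channel limits):

#include <immintrin.h>
#include <cstddef>

void CopyImageBitsWithAlphaRGBA_AVX2(unsigned char *dest, const unsigned char *src,
                                     size_t npixels, __m256i mins, __m256i max_minus_min)
{
    for (size_t i = 0; i < npixels; i += 8) { // 8 RGBA pixels per 256-bit vector
        __m256i v = _mm256_loadu_si256((const __m256i*)(src + i * 4));
        v = setAlphaFromRangeCheck_AVX2(v, mins, max_minus_min);
        _mm256_storeu_si256((__m256i*)(dest + i * 4), v);
    }
}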
This compiles very efficiently (Godbolt Compiler explorer) with gcc, clang, and MSVC. (AVX2 version on Godbolt is good, AVX512 and SSE versions are still a mess, not all the tricks applied to them yet.)
;; MSVC's inner loop from a caller that loops over an array with it:
;; see the Godbolt link
$LL4@:
    vmovdqu   ymm3, YMMWORD PTR [rdx+rax*4]
    vpsubb    ymm0, ymm3, ymm7
    vpsubusb  ymm1, ymm0, ymm6
    vpsubd    ymm2, ymm1, ymm5
    vpblendvb ymm3, ymm2, ymm3, ymm4
    vmovdqu   YMMWORD PTR [rcx+rax*4], ymm3
    add       eax, 8
    cmp       eax, r8d
    jb        SHORT $LL4@
So MSVC managed to hoist the constant setup after inlining. We get similar loops from gcc/clang.
The loop has 4 vector ALU instructions, one of which takes 2 uops. Total 5 vector ALU uops. But total fused-domain uops on Haswell/Skylake = 9 with no unrolling, so with luck this can run at 32 bytes (1 vector) per 2.25 clock cycles. It could come close to actually achieving that with data hot in L1d or L2 cache, but L3 or memory would be a bottleneck. With unrolling, it could maybe bottleneck on L2 cache bandwidth.
An AVX512 version (also included in the Godbolt link) only needs 1 uop to blend, and could run faster in vectors per cycle, thus more than twice as fast using 512-bit vectors.
This is one possible way to make this function work with SSE instructions. I used SSE instead of AVX because I wanted to keep the answer simple. Once you understand how the solution works, rewriting the function with AVX intrinsics should not be much of a problem though.
EDIT: please note that my approach is very similar to the one by @PeterCordes, but his code should be faster because he uses AVX. If you want to rewrite the function below with AVX intrinsics, change the step value to 8.
void CopyImageBitsWithAlphaRGBA(
    unsigned char *dest,
    const unsigned char *src, int w, int stride, int h,
    unsigned char minred, unsigned char mingre, unsigned char minblu,
    unsigned char maxred, unsigned char maxgre, unsigned char maxblu)
{
    char low = 0x80;  // -128
    char high = 0x7f; // 127

    char mnr = *(char*)(&minred) - low;
    char mng = *(char*)(&mingre) - low;
    char mnb = *(char*)(&minblu) - low;
    int32_t lowest = mnr | (mng << 8) | (mnb << 16) | (low << 24);

    char mxr = *(char*)(&maxred) - low;
    char mxg = *(char*)(&maxgre) - low;
    char mxb = *(char*)(&maxblu) - low;
    int32_t highest = mxr | (mxg << 8) | (mxb << 16) | (high << 24);

    // SSE
    int step = 4;
    int sse_width = (w / step) * step;

    for (int y = 0; y < h; ++y)
    {
        for (int x = 0; x < w; x += step)
        {
            if (x == sse_width)
            {
                x = w - step;
            }

            int ptr_offset = y * stride + x;
            const unsigned char* src_ptr = src + ptr_offset;
            unsigned char* dst_ptr = dest + ptr_offset;

            __m128i loaded = _mm_loadu_si128((__m128i*)src_ptr);

            // subtract 128 from every 8-bit int
            __m128i subtracted = _mm_sub_epi8(loaded, _mm_set1_epi8(low));

            // greater than top limit?
            __m128i masks_hi = _mm_cmpgt_epi8(subtracted, _mm_set1_epi32(highest));

            // lower than bottom limit?
            __m128i masks_lo = _mm_cmplt_epi8(subtracted, _mm_set1_epi32(lowest));

            // perform OR operation on both masks
            __m128i combined = _mm_or_si128(masks_hi, masks_lo);

            // are 32-bit integers equal to zero?
            __m128i eqzer = _mm_cmpeq_epi32(combined, _mm_setzero_si128());

            __m128i shifted = _mm_slli_epi32(eqzer, 24);

            // EDIT: fixed a bug:
            __m128i alpha_unmasked = _mm_and_si128(loaded, _mm_set1_epi32(0x00ffffff));
            __m128i result = _mm_or_si128(alpha_unmasked, shifted);

            _mm_storeu_si128((__m128i*)dst_ptr, result);
        }
    }
}
EDIT: as @PeterCordes stated in the comments, the code included a bug that is now fixed.
Based on @PeterCordes' solution, but replacing the shift + compare with a saturated add and subtract:
// mins_compl shall be [255-minR, 255-minG, 255-minB, 0]
// maxs shall be [maxR, maxG, maxB, 0]
__m256i setAlphaFromRangeCheck(__m256i src, __m256i mins_compl, __m256i maxs)
{
    __m256i in_lo = _mm256_adds_epu8(src, mins_compl); // is 255 iff src+mins_compl>=255, i.e. src>=mins
    __m256i in_hi = _mm256_subs_epu8(src, maxs);       // is 0 iff src - maxs <= 0, i.e., src <= maxs

    __m256i inbounds_components = _mm256_andnot_si256(in_hi, in_lo);
    // per-component mask, 0xff iff (mins<=src && src<=maxs)
    // the alpha-channel is always (~src & src) == 0

    // Use a 32-bit element compare to check that all 3 components are in-range
    __m256i RGB_mask = _mm256_set1_epi32(0x00FFFFFF);
    __m256i inbounds = _mm256_cmpeq_epi32(inbounds_components, RGB_mask);

    __m256i new_alpha = _mm256_slli_epi32(inbounds, 24);
    // alternatively _mm256_andnot_si256(RGB_mask, inbounds) ?

    // byte blends (vpblendvb) are at least 2 uops, and Haswell requires port5
    // instead clear alpha and then OR in the new alpha (0 or 0xFF)
    __m256i alphacleared = _mm256_and_si256(src, RGB_mask); // off the critical path
    __m256i new_alpha_applied = _mm256_or_si256(alphacleared, new_alpha);
    return new_alpha_applied;
}
This saves on vpxor (no modification of src required) and one vpand (the alpha-channel is automatically 0 -- I guess that would be possible with Peter's solution as well by choosing the boundaries accordingly).
Godbolt link; apparently, neither gcc nor clang thinks it worthwhile to re-use RGB_mask for both usages ...
Simple testing with SSE2 variant: https://wandbox.org/permlink/eVzFHljxfTX5HDcq (you can play around with the source and the boundaries)
I have to process roughly 2000 100-element arrays every second. The arrays come to me as shorts, with the data in the upper bits, and need to be shifted and cast to chars. Is this as efficient as I can get, or is there a faster way to perform this operation? (I have to skip 2 of the values.)
for(int i = 0; i < 48; i++)
{
    a[i] = (char)(b[i] >> 8);
    a[i+48] = (char)(b[i+50] >> 8);
}
Even if shift and bitwise operations are fast, you can try to process the short array through a char pointer, as others advised in the comments. This is allowed by the standard and, on common architectures, does what is expected, leaving only the endianness problem.
So you could try to first determine your endianness:
bool isBigEndian() {
    short i = 1; // sets only the lowest-order bit
    char *ix = reinterpret_cast<char *>(&i);
    return (*ix == 0); // *ix is 1 on a little-endian machine
}
Your loop now becomes:
int shft = isBigEndian() ? 0 : 1;
char * pb = reinterpret_cast<char *>(b);
for(int i = 0; i < 48; i++)
{
    a[i] = pb[2 * i + shft];
    a[i+48] = pb[2 * i + 100 + shft]; // b[i+50] starts at byte offset 2*(i+50)
}
But as always for low level optimisation, this has to be benchmarked with the compiler and compiler options that will be used in production code.
You could put a wrapper class around these arrays, so code that accesses elements of the wrapper in order actually accesses every other byte of the underlying memory.
This will probably defeat auto-vectorization, though. Other than that, having all the code that would read a actually read b and increment its pointers by two instead of one shouldn't change the cost at all.
The two skipped elements are a problem, though. Having your operator[] do if (i>=48) i+=2 might kill this idea. memmove will often be much faster than storing one byte at a time, so you could consider using memmove to make a contiguous array of shorts that you can index even though it seems silly to copy without storing in a better format.
The trick will be to write a wrapper that completely optimizes away to no extra instructions in loops over your arrays. This is possible on x86, where scaled indexing is available in normal effective-addresses in asm instructions, so if the compiler understands what's going on, it can make code that's just as efficient.
Having arrays of shorts does take twice as much memory, so cache effects could matter.
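A sketch of such a wrapper (the names are mine; the branch in operator[] is the part most likely to defeat vectorization):

struct HighByteView {
    const short* b;               // underlying array of shorts
    char operator[](int i) const {
        if (i >= 48) i += 2;      // skip the two unused elements
        return (char)(b[i] >> 8); // the data lives in the upper 8 bits
    }
};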
It all depends on what you need to do with the byte arrays.
If you do need to convert, use SIMD
For x86 targets, you can get a big speedup with SIMD vectors instead of looping one char at a time. For other compile targets you care about, you can write similar special versions. I assume ARM NEON has similar shuffling capability, for example.
When writing a platform-specific version, you also get to make all the endian and unaligned-access assumptions that are true on that platform.
#ifdef __SSE2__  // will be true for all x86-64 builds and most i386 builds
#include <immintrin.h>
static __m128i pack2(const short *p) {
    __m128i lo = _mm_loadu_si128((__m128i*)p);
    __m128i hi = _mm_loadu_si128((__m128i*)(p + 8));
    lo = _mm_srli_epi16(lo, 8);      // logical shift, not arithmetic, because we need the high byte to be zero
    hi = _mm_srli_epi16(hi, 8);
    return _mm_packus_epi16(lo, hi); // treats input as signed, saturates to unsigned 0x0 .. 0xff range
}
#endif // SSE2

void conv(char *a, const short *b) {
#ifdef __SSE2__
    for(int i = 0; i < 48; i+=16) {
        __m128i low = pack2(b+i);
        _mm_storeu_si128((__m128i *)(a+i), low);
        __m128i high = pack2(b+i + 50);
        _mm_storeu_si128((__m128i *)(a+i + 48), high);
    }
#else
    /******* Fallback C version *******/
    for(int i = 0; i < 48; i++) {
        a[i] = (char)(b[i] >> 8);
        a[i+48] = (char)(b[i+50] >> 8);
    }
#endif
}
As you can see on the Godbolt Compiler Explorer, gcc fully unrolls the loop since it's only a few iterations when storing 16B at a time.
This should perform ok, but on pre-Skylake will bottleneck on shifting both vectors of shorts before the store. Haswell can only sustain one psrli per clock. (Skylake can sustain one per 0.5c when the shift-count is an immediate. See Agner Fog's guide and insn tables, links at the x86 tag wiki.)
You might get better results from loading from (__m128i*)(1 + (char*)p) so the bytes we want are already in the low half of each 16bit element. We'd still have to mask off the high half of each element with _mm_and_si128 instead of shifting, but PAND can run on any vector execution port, so it has three per clock throughput.
More importantly, with AVX it can be combined with an unaligned load. e.g. vpand xmm0, xmm5, [rsi], where xmm5 is a mask of _mm_set1_epi16(0x00ff), and [rsi] holds 2*i + 1 + (char*)b. fused-domain uop throughput is probably going to be an issue, like is common for code with a lot of loads/stores as well as computation.
Unaligned accesses are slightly slower than aligned accesses, but at least half your vector accesses will be unaligned anyway (since skipping two shorts means skipping 4B). On Intel SnB-family CPUs, I don't think it's slower to have loads that are split across a cache-line boundary in a 15:1 split compared to a 12:4 split. (The no-split case is definitely faster, though.) If b is 16B-aligned, then it'll be worth testing the mask version against the shift version.
I didn't write up complete code for this version, because you'll end up reading one byte past the end of b unless you take special precautions. This is fine if you make sure b has padding of some sort so it doesn't go right to the end of a memory page.
AVX2
With AVX2, vpackuswb ymm operates in two separate lanes. IDK if there's anything to gain from doing the load and mask (or shift) on 256b vectors and then using a vextracti128 and 128b pack on the two halves of the 256b vector.
Or maybe do a 256b pack between two vectors and then a vpermq (_mm256_permute4x64_epi64) to sort things out:
lo = _mm256_loadu(b..); // { b[15..8] | b[7..0] }
hi = // { b[31..24] | b[23..16] }
// mask or shift
__m256i packed = _mm256_packus_epi16(lo, hi); // [ a31..24 a15..8 | a23..16 a7..0 ]
packed = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
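A compilable version of that idea might look like this (the function name is mine; the shift variant is shown, but the mask variant works the same way):

#include <immintrin.h>

static inline __m256i pack_high_bytes(__m256i lo, __m256i hi) {
    lo = _mm256_srli_epi16(lo, 8);                // keep the high byte of each short
    hi = _mm256_srli_epi16(hi, 8);
    __m256i packed = _mm256_packus_epi16(lo, hi); // packs within each 128-bit lane
    return _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0)); // fix up lane order
}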
Of course, use any portable optimizations you can in the C version, e.g. Serge Ballesta's suggestion of just copying the desired bytes after figuring out their location from the endianness of the machine. (Preferably at compile time by checking GNU C's __BYTE_ORDER__ macro.)
I have binary matrices in C++ that I represent with a vector of 8-bit values.
For example, the following matrix:
1 0 1 0 1 0 1
0 1 1 0 0 1 1
0 0 0 1 1 1 1
is represented as:
const uint8_t matrix[] = {
0b01010101,
0b00110011,
0b00001111,
};
The reason why I'm doing it this way is because then computing the product of such a matrix and an 8-bit vector becomes really simple and efficient (just one bitwise AND and a parity computation per row), which is much better than calculating each bit individually.
I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without having to manually calculate each bit.
Just to clarify, for the above example, I'd like to get the following result from the transposition:
const uint8_t transposed[] = {
0b00000000,
0b00000100,
0b00000010,
0b00000110,
0b00000001,
0b00000101,
0b00000011,
0b00000111,
};
NOTE: I would prefer an algorithm that can calculate this with arbitrary-sized matrices but am also interested in algorithms that can only handle certain sizes.
I've spent more time looking for a solution, and I've found some good ones.
The SSE2 way
On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions it is possible to process a 16×8 matrix.
This solution is inspired by this blog post by mischasan and is vastly superior to every suggestion I've got so far to this question.
The idea is simple:
#include <emmintrin.h>
Pack 16 uint8_t variables into an __m128i
Use _mm_movemask_epi8 to get the MSBs of each byte, producing an uint16_t
Use _mm_slli_epi64 to shift the 128-bit register by one
Repeat until you've got all 8 uint16_ts
A generic 32-bit solution
Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just find the NEON equivalents, but Cortex-M CPUs (contrary to Cortex-A) don't have SIMD capabilities, so NEON isn't too useful for me at the moment.
NOTE: Because the Cortex-M doesn't have native 64-bit arithmetic, I could not use the ideas in any answers that suggest treating an 8x8 block as a uint64_t. Most microcontrollers with a Cortex-M CPU also don't have much memory, so I prefer to do all this without a lookup table.
After some thinking, the same algorithm can be implemented using plain 32-bit arithmetic and some clever coding. This way, I can work with 4×8 blocks at a time. It was suggested by a colleague, and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number to multiply with such that the MSB of each byte ends up next to the others in the upper 32 bits of the result.
Pack 4 uint8_ts in a 32-bit variable
Mask the 1st bit of each byte (using 0x80808080)
Multiply it with 0x02040810
Take the 4 LSBs of the upper 32 bits of the multiplication
Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply with the magic number, shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants so shifting them does not incur any performance penalty whatsoever. There's some trouble with the last series of 4 bits, because then one LSB is lost, so in that case I needed to shift the input left by 8 bits and use the same method as the first series of 4-bits.
If you do this with two 4×8 blocks, then you can get an 8x8 block done and arrange the resulting bits so that everything goes into the right place.
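A sketch of the core step, assuming row 0 is packed in the low byte and N = 0..6 (as noted above, the last bit needs special handling, because the left-shifted magic constant would lose its top term):

#include <cstdint>

// Extract bit (7 - n) of each of the 4 packed bytes; the result's low
// 4 bits hold them, with row 0 (the low byte) in bit 0.
static inline uint32_t gather_column(uint32_t rows, unsigned n)
{
    uint32_t mask  = 0x80808080u >> n; // one bit per byte
    uint32_t magic = 0x02040810u << n; // lines the selected bits up at bits 32..35
    return (uint32_t)(((uint64_t)(rows & mask) * magic) >> 32) & 0x0Fu;
}

On a Cortex-M3/M4 the upper 32 bits of the product come from a single UMULL, so each column costs only an AND, a multiply and a shift.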
My suggestion is that you don't do the transposition; rather, you add one bit of information to your matrix data, indicating whether the matrix is transposed or not.
Now, if you want to multiply a transposed matrix with a vector, it is the same as multiplying the vector by the matrix from the left (and then transposing). This is easy: just some XOR operations on your 8-bit numbers.
This however makes some other operations complicated (e.g. adding two matrices). But in the comment you say that multiplication is exactly what you want to optimize.
Here is the text of Jay Foad's email to me regarding fast Boolean matrix transpose:
The heart of the Boolean transpose algorithm is a function I'll call transpose8x8 which transposes an 8x8 Boolean matrix packed in a 64-bit word (in row major order from MSB to LSB). To transpose any rectangular matrix whose width and height are multiples of 8, break it down into 8x8 blocks, transpose each one individually and store them at the appropriate place in the output. To load an 8x8 block you have to load 8 individual bytes and shift and OR them into a 64-bit word. Same kinda thing for storing.
A plain C implementation of transpose8x8 relies on the fact that all the bits on any diagonal line parallel to the leading diagonal move the same distance up/down and left/right. For example, all the bits just above the leading diagonal have to move one place left and one place down, i.e. 7 bits to the right in the packed 64-bit word. This leads to an algorithm like this:
transpose8x8(word) {
    return
        (word & 0x0100000000000000) >> 49 // top right corner
      | (word & 0x0201000000000000) >> 42
      | ...
      | (word & 0x4020100804020100) >> 7  // just above diagonal
      | (word & 0x8040201008040201)       // leading diagonal
      | (word & 0x0080402010080402) << 7  // just below diagonal
      | ...
      | (word & 0x0000000000008040) << 42
      | (word & 0x0000000000000080) << 49; // bottom left corner
}
This runs about 10x faster than the previous implementation, which copied each bit individually from the source byte in memory and merged it into the destination byte in memory.
Alternatively, if you have PDEP and PEXT instructions you can implement a perfect shuffle, and use that to do the transpose as mentioned in Hacker's Delight. This is significantly faster (but I don't have timings handy):
shuffle(word) { // outer perfect shuffle
    return pdep(word >> 32, 0xaaaaaaaaaaaaaaaa) | pdep(word, 0x5555555555555555);
}
transpose8x8(word) { return shuffle(shuffle(shuffle(word))); }
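On x86 with BMI2, that pseudocode maps directly onto _pdep_u64; a compilable sketch:

#include <immintrin.h> // _pdep_u64 (BMI2)
#include <cstdint>

static inline uint64_t outer_shuffle(uint64_t w) {
    return _pdep_u64(w >> 32, 0xAAAAAAAAAAAAAAAAull)  // high half -> odd bit positions
         | _pdep_u64(w,       0x5555555555555555ull); // low half  -> even bit positions
}
static inline uint64_t transpose8x8_pdep(uint64_t w) {
    return outer_shuffle(outer_shuffle(outer_shuffle(w)));
}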
POWER's vgbbd instruction effectively implements the whole of transpose8x8 in a single instruction (and since it's a 128-bit vector instruction it does it twice, independently, on the low 64 bits and the high 64 bits). This gave about 15% speed-up over the plain C implementation. (Only 15% because, although the bit twiddling is much faster, the overall run time is now dominated by the time it takes to load 8 bytes and assemble them into the argument to transpose8x8, and to take the result and store it as 8 separate bytes.)
My suggestion would be to use a lookup table to speed up the processing.
Another thing to note is that with the current definition of your matrix, the maximum size will be 8x8 bits. This fits into a uint64_t, so we can use this to our advantage, especially on a 64-bit platform.
I have worked out a simple example using a lookup table, which you can find below and run using the http://www.tutorialspoint.com/compile_cpp11_online.php online compiler.
Example code
#include <iostream>
#include <bitset>
#include <stdint.h>
#include <assert.h>

using std::cout;
using std::endl;
using std::bitset;

/* Static lookup table */
static uint64_t lut[256];

/* Helper function to print array */
template<int N>
void print_arr(const uint8_t (&arr)[N]){
    for(int i=0; i < N; ++i){
        cout << bitset<8>(arr[i]) << endl;
    }
}

/* Transpose function */
template<int N>
void transpose_bitmatrix(const uint8_t (&matrix)[N], uint8_t (&transposed)[8]){
    assert(N <= 8);

    uint64_t value = 0;
    for(int i=0; i < N; ++i){
        value = (value << 1) + lut[matrix[i]];
    }

    /* Ensure safe copy to prevent misalignment issues */
    /* Can be removed if input array can be treated as uint64_t directly */
    for(int i=0; i < 8; ++i){
        transposed[i] = (value >> (i * 8)) & 0xFF;
    }
}

/* Calculate lookup table */
void calculate_lut(void){
    /* For all byte values */
    for(uint64_t i = 0; i < 256; ++i){
        auto b = std::bitset<8>(i);
        auto v = std::bitset<64>(0);

        /* For all bits in current byte */
        for(int bit=0; bit < 8; ++bit){
            if(b.test(bit)){
                v.set((7 - bit) * 8);
            }
        }
        lut[i] = v.to_ullong();
    }
}

int main()
{
    calculate_lut();

    const uint8_t matrix[] = {
        0b01010101,
        0b00110011,
        0b00001111,
    };

    uint8_t transposed[8];
    transpose_bitmatrix(matrix, transposed);

    print_arr(transposed);
    return 0;
}
How it works
Your 3x8 matrix will be transposed to an 8x3 matrix, represented in an 8x8 array.
The issue is that you want to convert bits from your "horizontal" representation to a vertical one, divided over several bytes.
As I mentioned above, we can take advantage of the fact that the output (8x8) will always fit into a uint64_t. We will use this to our advantage, because now we can use a uint64_t to write the 8-byte array, but we can also use it to add, xor, etc., because we can perform basic arithmetic operations on a 64-bit integer.
Each entry in your 3x8 matrix (input) is 8 bits wide; to optimize processing we first generate a 256-entry lookup table (one entry per byte value). The entry itself is a uint64_t and will contain a rotated version of the bits.
example:
byte = 0b01001111 = 0x4F
lut[0x4F] = 0x0101010100000100 = (uint8_t[]){ 0, 1, 0, 0, 1, 1, 1, 1 }
Now for the calculation:
For the calculations we use the uint64_t, but keep in mind that under water it represents a uint8_t[8] array. We simply shift the current value (starting at 0), look up the next byte, and add it to the current value.
The 'magic' here is that each byte of the uint64_t in the lookup table is either 1 or 0, so it only sets the least significant bit of each byte. Shifting the uint64_t shifts each byte, and as long as we make sure we do not do this more than 8 times, we can do operations on each byte individually.
Issues
As someone noted in the comments, Transpose(Transpose(M)) != M here, so if you need that, you need some additional work.
Performance can be improved by mapping directly to uint64_t's instead of uint8_t[8] arrays, since that omits the "safe copy" used to prevent alignment issues.
I have added a new answer instead of editing my original one to make this more visible (no comment rights, unfortunately).
In your own answer you add an additional requirement not present in the first one: it has to work on ARM Cortex-M.
I did come up with an alternative solution for ARM in my original answer but omitted it, as it was not part of the question and seemed off topic (mostly because of the C++ tag).
ARM Specific solution Cortex-M:
Some or most Cortex-M 3/4 CPUs have a bit-banding region, which can be used for exactly what you need: it expands bits into 32-bit fields, and the region can be used to perform atomic bit operations.
If you put your array in a bit-banded region, it will have an 'exploded' mirror in the bit-band region where you can just use move operations on the bits themselves. If you make a loop, the compiler will surely be able to unroll it and optimize it to just move operations.
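For reference, a common formulation of the ARMv7-M bit-band alias for a bit in the SRAM region (0x20000000-based; addresses outside the bit-bandable ranges won't work):

// word-sized alias of bit `bit` of the byte at `addr` (SRAM bit-band region)
#define BITBAND_SRAM(addr, bit) \
    (*(volatile uint32_t*)(0x22000000u + \
        (((uint32_t)(addr) - 0x20000000u) * 32u) + ((bit) * 4u)))

Writing 0 or 1 to that alias sets the corresponding bit atomically, and reading it yields 0 or 1.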
If you really want to, you can even setup a DMA controller to process an entire batch of transpose operations with a bit of effort and offload it entirely from the cpu :)
Perhaps this might still help you.
This is a bit late, but I just stumbled across this interchange today.
If you look at Hacker's Delight, 2nd Edition, there are several algorithms for efficiently transposing Boolean arrays, starting on page 141.
They are quite efficient: a colleague of mine obtained a speedup of about 10X compared to naive coding, on an X86.
Here's what I posted on github (mischasan/sse2/ssebmx.src):
Changing INP() and OUT() to use induction vars saves an IMUL each.
AVX256 does it twice as fast.
AVX512 is not an option, because there is no _mm512_movemask_epi8().
#include <stdint.h>
#include <emmintrin.h>

#define INP(x,y) inp[(x)*ncols/8 + (y)/8]
#define OUT(x,y) out[(y)*nrows/8 + (x)/8]

void ssebmx(char const *inp, char *out, int nrows, int ncols)
{
    int rr, cc, i, h;
    union { __m128i x; uint8_t b[16]; } tmp;

    // Do the main body in [16 x 8] blocks:
    for (rr = 0; rr <= nrows - 16; rr += 16)
        for (cc = 0; cc < ncols; cc += 8) {
            for (i = 0; i < 16; ++i)
                tmp.b[i] = INP(rr + i, cc);
            for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
                *(uint16_t*)&OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
        }

    if (rr == nrows) return;

    // The remainder is a row of [8 x 16]* [8 x 8]?
    // Do the [8 x 16] blocks:
    for (cc = 0; cc <= ncols - 16; cc += 16) {
        for (i = 8; i--;)
            tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc),
            tmp.b[i + 8] = h >> 8;
        for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
            OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x),
            OUT(rr, cc + i + 8) = h >> 8;
    }

    if (cc == ncols) return;

    // Do the remaining [8 x 8] block:
    for (i = 8; i--;)
        tmp.b[i] = INP(rr + i, cc);
    for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
        OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
HTH.
Inspired by Robert's answer, polynomial multiplication in ARM NEON can be utilised to scatter the bits:
inline poly8x16_t mull_lo(poly8x16_t a) {
    auto b = vget_low_p8(a);
    return vreinterpretq_p8_p16(vmull_p8(b, b));
}
inline poly8x16_t mull_hi(poly8x16_t a) {
    auto b = vget_high_p8(a);
    return vreinterpretq_p8_p16(vmull_p8(b, b));
}
auto a = mull_lo(word);
auto b = mull_lo(a), c = mull_hi(a);
auto d = mull_lo(b), e = mull_hi(b);
auto f = mull_lo(c), g = mull_hi(c);
Then the vsli can be used to combine the bits pairwise.
auto ab = vsli_p8(vget_high_p8(d), vget_low_p8(d), 1);
auto cd = vsli_p8(vget_high_p8(e), vget_low_p8(e), 1);
auto ef = vsli_p8(vget_high_p8(f), vget_low_p8(f), 1);
auto gh = vsli_p8(vget_high_p8(g), vget_low_p8(g), 1);
auto abcd = vsli_p8(ab, cd, 2);
auto efgh = vsli_p8(ef, gh, 2);
return vsli_p8(abcd, efgh, 4);
Clang optimizes this code to avoid vmull2 instructions, making heavy use of ext q0,q0,8 instead of vget_high_p8.
An iterative approach would possibly be not only faster, but would also use fewer registers, and it SIMDifies for 2x or more throughput.
// transpose bits in 2x2 blocks, first 4 rows
// x = a b|c d|e f|g h a i|c k|e m|g o | byte 0
// i j|k l|m n|o p b j|d l|f n|h p | byte 1
// q r|s t|u v|w x q A|s C|u E|w G | byte 2
// A B|C D|E F|G H        r B|t D|v F|x H   | byte 3 ...
// ----------------------
auto a = (x & 0x00aa00aa00aa00aaull);
auto b = (x & 0x5500550055005500ull);
auto c = (x & 0xaa55aa55aa55aa55ull) | (a << 7) | (b >> 7);
// transpose 2x2 blocks (first 4 rows shown)
// aa bb cc dd aa ii cc kk
// ee ff gg hh -> ee mm gg oo
// ii jj kk ll bb jj dd ll
// mm nn oo pp ff nn hh pp
auto d = (c & 0x0000cccc0000ccccull);
auto e = (c & 0x3333000033330000ull);
auto f = (c & 0xcccc3333cccc3333ull) | (d << 14) | (e >> 14);
// Final transpose of 4x4 bit blocks
auto g = (f & 0x00000000f0f0f0f0ull);
auto h = (f & 0x0f0f0f0f00000000ull);
x = (f & 0xf0f0f0f00f0f0f0full) | (g << 28) | (h >> 28);
In ARM each step can now be composed with 3 instructions:
auto tmp = vrev16_u8(x);
tmp = vshl_u8(tmp, plus_minus_1); // 0xff01ff01ff01ff01ull
x = vbsl_u8(mask_1, x, tmp); // 0xaa55aa55aa55aa55ull
tmp = vrev32_u16(x);
tmp = vshl_u16(tmp, plus_minus_2); // 0xfefe0202fefe0202ull
x = vbsl_u8(mask_2, x, tmp); // 0xcccc3333cccc3333ull
tmp = vrev64_u32(x);
tmp = vshl_u32(tmp, plus_minus_4); // 0xfcfcfcfc04040404ull
x = vbsl_u8(mask_4, x, tmp); // 0xf0f0f0f00f0f0f0full
I'm working on an x86 or x86_64 machine. I have an array unsigned int a[32] all of whose elements have value either 0 or 1. I want to set the single variable unsigned int b so that (b >> i) & 1 == a[i] will hold for all 32 elements of a. I'm working with GCC on Linux (shouldn't matter much I guess).
What's the fastest way to do this in C?
The fastest way on recent x86 processors is probably to make use of the MOVMSKB family of instructions which extract the MSBs of a SIMD word and pack them into a normal integer register.
I fear SIMD intrinsics are not really my thing but something along these lines ought to work if you've got an AVX2 equipped processor:
#include <immintrin.h>
#include <stdint.h>

uint32_t bitpack(const bool array[32]) {
    __m256i tmp = _mm256_loadu_si256((const __m256i *) array);
    tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());
    return _mm256_movemask_epi8(tmp);
}
Assuming sizeof(bool) == 1. For older SSE2 systems you will have to string together a pair of 128-bit operations instead. Aligning the array on a 32-byte boundary should save another cycle or so.
If sizeof(bool) == 1 then you can pack 8 bools at a time into 8 bits (more with 128-bit multiplications) using the technique discussed here, on a computer with fast multiplication, like this:
#include <stdint.h>

inline int pack8b(bool* a)
{
    uint64_t t = *((uint64_t*)a);
    return (0x8040201008040201*t >> 56) & 0xFF;
}

int pack32b(bool* a)
{
    return (pack8b(a + 0) << 24) | (pack8b(a + 8) << 16) |
           (pack8b(a + 16) << 8) | (pack8b(a + 24) << 0);
}
Explanation:
Suppose the bools a[0] to a[7] have their least significant bits named a-h respectively. Treating those 8 consecutive bools as one 64-bit word and loading them, we'll get the bits in reversed order on a little-endian machine. Now we'll do a multiplication (here dots are zero bits):
| a7 || a6 || a5 || a4 || a3 || a2 || a1 || a0 |
.......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
────────────────────────────────────────────────────────────────
↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
↑..d.....↑.c......↑b.......a
↑.c......↑b.......a
↑b.......a
a
────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The arrows are added so it's easier to see the positions of the set bits in the magic number. At this point the 8 least significant bits have been put in the top byte; we'll just need to mask the remaining bits out.
So by using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001 or 0x8040201008040201 we have the above code
Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it, like shifting only once instead of shifting by 56 bits each time.
Sorry, I overlooked the question, saw doynax's bool array, and misread "32 0/1 values" as 32 bools. Of course the same technique can also be used to pack multiple uint32_t or uint16_t values (or other distributions of bits) at the same time, but it's a lot less efficient than packing bytes.
On newer x86 CPUs with BMI2 the PEXT instruction can be used. The pack8b function above can be replaced with
_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);
And to pack 2 uint32_t as the question requires use
_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);
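For completeness, a sketch of a full 32-bool pack with PEXT; unlike pack8b above, this produces a[0] in the least significant bit, matching the (b >> i) & 1 == a[i] requirement (memcpy would be safer than the pointer casts with respect to alignment and strict aliasing):

#include <immintrin.h>
#include <stdint.h>

uint32_t pack32b_pext(const bool* a)
{
    const uint64_t m = 0x0101010101010101ULL; // one bit per byte
    uint64_t q0 = _pext_u64(*(const uint64_t*)(a +  0), m);
    uint64_t q1 = _pext_u64(*(const uint64_t*)(a +  8), m);
    uint64_t q2 = _pext_u64(*(const uint64_t*)(a + 16), m);
    uint64_t q3 = _pext_u64(*(const uint64_t*)(a + 24), m);
    return (uint32_t)(q0 | (q1 << 8) | (q2 << 16) | (q3 << 24));
}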
Other answers contain an obvious loop implementation.
Here's a first variant:
unsigned int result=0;
for(unsigned i = 0; i < 32; ++i)
    result = (result<<1) + a[i];
On modern x86 CPUs, I think shifts of any distance in a register take constant time, and this solution won't be better. Your CPU might not be so nice; this code minimizes the cost of long-distance shifts: it does 32 1-bit shifts, which every CPU can do (you can always add result to itself to get the same effect). The obvious loop implementation shown by others does about 500 (the sum 0+1+...+31 = 496) 1-bit shifts, by virtue of shifting a distance equal to the loop index. (See @Jongware's measurements of differences in comments; apparently long shifts on x86 are not unit time).
Let us try something more radical.
Assume you can pack m booleans into an int somehow (trivially you can do this for m==1), and that you have two instance variables i1 and i2 containing such m packed bits.
Then the following code packs m*2 booleans into an int (the + works because the low m bits of i1<<m are zero):
(i1<<m)+i2
Using this we can pack 2^n bits as follows:
unsigned int a2[16],a4[8],a8[4],a16[2], a32[1]; // each "aN" will hold N bits of the answer

a2[0]=(a1[0]<<1)+a1[1]; // the original bits are a1[k]; can be scalar variables or ints
a2[1]=(a1[2]<<1)+a1[3]; // yes, you can use "|" instead of "+"
...
a2[15]=(a1[30]<<1)+a1[31];

a4[0]=(a2[0]<<2)+a2[1];
a4[1]=(a2[2]<<2)+a2[3];
...
a4[7]=(a2[14]<<2)+a2[15];

a8[0]=(a4[0]<<4)+a4[1];
a8[1]=(a4[2]<<4)+a4[3];
a8[2]=(a4[4]<<4)+a4[5];
a8[3]=(a4[6]<<4)+a4[7];

a16[0]=(a8[0]<<8)+a8[1];
a16[1]=(a8[2]<<8)+a8[3];

a32[0]=(a16[0]<<16)+a16[1];
Assuming our friendly compiler resolves an[k] into a (scalar) direct memory access (if not, you can simply replace the variable an[k] with an_k), the above code does (abstractly) 63 fetches, 31 writes, 31 shifts and 31 adds. (There's an obvious extension to 64 bits).
On modern x86 CPUs, I think shifts of any distance in a register take constant time. If not, this code minimizes the cost of long-distance shifts; it in effect does 64 1-bit shifts.
On an x64 machine, other than the fetches of the original booleans a1[k], I'd expect all the rest of the scalars to be schedulable by the compiler to fit in the registers, thus 32 memory fetches, 31 shifts and 31 adds. It's pretty hard to avoid the fetches (if the original booleans are scattered around), and the shifts/adds match the obvious simple loop. But there is no loop, so we avoid 32 increment/compare/index operations.
If the starting booleans are really in array, with each bit occupying the bottom bit of and otherwise zeroed byte:
bool a1[32];
then we can abuse our knowledge of memory layout to fetch several at a time:
a4[0]=((unsigned int*)a1)[0]; // picks up 4 bools in one fetch
a4[1]=((unsigned int*)a1)[1];
...
a4[7]=((unsigned int*)a1)[7];

a8[0]=(a4[0]<<1)+a4[1];
a8[1]=(a4[2]<<1)+a4[3];
a8[2]=(a4[4]<<1)+a4[5];
a8[3]=(a4[6]<<1)+a4[7];

a16[0]=(a8[0]<<2)+a8[1];
a16[1]=(a8[2]<<2)+a8[3];

a32[0]=(a16[0]<<4)+a16[1];
Here our cost is 8 fetches of (sets of 4) booleans, 7 shifts and 7 adds. Again, no loop overhead. (Again there is an obvious generalization to 64 bits).
To get faster than this, you probably have to drop into assembler and use some of the many wonderful and weird instructions available there (the vector registers probably have scatter/gather ops that might work nicely).
As always, these solutions need to be performance tested.
I would probably go for this:
#include <cstdio>

unsigned a[32] =
{
      1, 0, 0, 1, 1, 1, 0 ,0, 1, 0, 0, 0, 1, 1, 0, 0
    , 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1
};

int main()
{
    unsigned b = 0;

    for(unsigned i = 0; i < sizeof(a) / sizeof(*a); ++i)
        b |= a[i] << i;

    printf("b: %u\n", b);
}
Compiler optimization may well unroll that but just in case you can always try:
int main()
{
    unsigned b = 0;

    b |= a[0];
    b |= a[1] << 1;
    b |= a[2] << 2;
    b |= a[3] << 3;
    // ... etc
    b |= a[31] << 31;

    printf("b: %u\n", b);
}
To determine what the fastest way is, time all of the various suggestions. Here is one that may well end up as "the" fastest (using standard C, no processor-dependent SSE or the likes):
unsigned int bits[32][2] = {
    {0,0x80000000},{0,0x40000000},{0,0x20000000},{0,0x10000000},
    {0,0x8000000},{0,0x4000000},{0,0x2000000},{0,0x1000000},
    {0,0x800000},{0,0x400000},{0,0x200000},{0,0x100000},
    {0,0x80000},{0,0x40000},{0,0x20000},{0,0x10000},
    {0,0x8000},{0,0x4000},{0,0x2000},{0,0x1000},
    {0,0x800},{0,0x400},{0,0x200},{0,0x100},
    {0,0x80},{0,0x40},{0,0x20},{0,0x10},
    {0,8},{0,4},{0,2},{0,1}
};

unsigned int b = 0;
for (int i = 0; i < 32; i++)
    b |= bits[i][a[i]];
The first value in the array is to be the leftmost bit: the highest possible value.
Testing a proof-of-concept with some rough timings shows this is indeed not magnitudes better than the straightforward loop with b |= (a[i]<<(31-i)):
Ira                   3618 ticks
naive, unrolled       5620 ticks
Ira, 1-shifted       10044 ticks
Galik                10265 ticks
Jongware, using adds 12536 ticks
Jongware             12682 ticks
naive                13373 ticks
(Relative timings, with the same compiler options.)
(The 'adds' routine is mine with indexing replaced with a pointer-to and an explicit add for both indexed arrays. It is 10% slower, meaning my compiler is efficiently optimizing indexed access. Good to know.)
unsigned b=0;
for(int i=31; i>=0; --i){
    b<<=1;
    b|=a[i];
}
Your problem is a good opportunity to use -->, also called the downto operator:
unsigned int a[32];
unsigned int b = 0;
for (unsigned int i = 32; i --> 0;) {
    b += b + a[i];
}
The advantage of using --> is it works with both signed and unsigned loop index types.
This approach is portable and readable, it might not produce the fastest code, but clang does unroll the loop and produce decent performance, see https://godbolt.org/g/6xgwLJ