'memcpy'-like function that supports offsets by individual bits?

'memcpy'-like function that supports offsets by individual bits? - c++

I was thinking about solving this, but it's looking to be quite a task. If I take this one by myself, I'll likely write it several different ways and pick the best, so I thought I'd ask this question to see if there's a good library that solves this already or if anyone has thoughts/advice.
void OffsetMemCpy(u8* pDest, u8* pSrc, u8 srcBitOffset, size size)
{
// Or something along these lines. srcBitOffset is 0-7, so the pSrc buffer
// needs to be up to one byte longer than it would need to be in memcpy.
// Maybe explicitly providing the end of the buffer is best.
// Also note that pSrc has NO alignment assumptions at all.
}
My application is time critical so I want to nail this with minimal overhead. This is the source of the difficulty/complexity. In my case, the blocks are likely to be quite small, perhaps 4-12 bytes, so big-scale memcpy stuff (e.g. prefetch) isn't that important. The best result would be the one that benches fastest for constant 'size' input, between 4 and 12, for randomly unaligned src buffers.
Memory should be moved in word sized blocks whenever possible
Alignment of these word sized blocks is important. pSrc is unaligned, so we may need to read a few bytes off the front until it is aligned.
Anyone have, or know of, a similar implemented thing? Or does anyone want to take a stab at writing this, getting it to be as clean and efficient as possible?
Edit: It seems people are voting this "close" for "too broad". A few narrowing details would be AMD64 is the preferred architecture, so lets assume that. This means little endian etc. The implementation would hopefully fit well within the size of an answer so I don't think this is too broad. I'm asking for answers that are a single implementation at a time, even though there are a few approaches.

I would start with a simple implementation such as this:
inline void OffsetMemCpy(uint8_t* pDest, const uint8_t* pSrc, const uint8_t srcBitOffset, const size_t size)
{
if (srcBitOffset == 0)
{
for (size_t i = 0; i < size; ++i)
{
pDest[i] = pSrc[i];
}
}
else if (size > 0)
{
uint8_t v0 = pSrc[0];
for (size_t i = 0; i < size; ++i)
{
uint8_t v1 = pSrc[i + 1];
pDest[i] = (v0 << srcBitOffset) | (v1 >> (CHAR_BIT - srcBitOffset));
v0 = v1;
}
}
}
(warning: untested code!).
Once this is working then profile it in your application - you may find it's plenty fast enough for your needs and thereby avoid the pitfalls of premature optimisation. If not then you have a useful baseline reference implementation for further optimisation work.
Be aware that for small copies the overhead of testing for alignment and word-sized copies etc may well outweigh any benefits, so a simple byte by byte loop such as the above may well be close to optimal.
Note also that optimisations may well be architecture-dependent - micro-optimisations which give a benefit on one CPU may well be counter-productive on another.

I think that trivial byte-by-byte solution (see #PaulR's answer) is the best approach for small blocks, unless you can satisfy the following additional constraints:
Input buffer is allocated with some padding, i.e. accessing some bytes after the last one does not crash.
Output buffer is also allocated with some padding, and it does not matter if a few bytes after the desired result location are overwritten. If it does matter, that you'll need to do more stuff to preserve those after-the-end bytes.
Input and output ranges involved do not overlap (including a few more padding bytes after the end), just like in memcpy.
If you can, then it is possible to increase granularity of the algorithm. It is very easy to change #PaulR's answer to use uint64_t words instead of uint8_t bytes everywhere. As a result, it would work faster.
We can use SSE to further increase word size. Since in SSE there is no way to shift the whole register by bits, we have to do two shifts for 64-bit integers, then glue results together. Gluing is done by _mm_shuffle_epi8 from SSSE3, which allows to shuffle bytes in XMM register in arbitrary way. For shifting we use _mm_srl_epi64, because that's the only way to shift 64-bit integers by non-immediate number of bits. I have added restrict keyword from C (as macro) to the pointer arguments, because if they are aliased, the algorithm will not work anyway.
Here is the code:
void OffsetMemCpy_stgatilov(uint8_t *RESTRICT pDest, const uint8_t *RESTRICT pSrc, const uint8_t srcBitOffset, const size_t size) {
__m128i bits = (sizeof(size_t) == 8 ? _mm_cvtsi64_si128(srcBitOffset) : _mm_cvtsi32_si128(srcBitOffset));
const uint8_t *pEnd = pSrc + size;
while (pSrc < pEnd) {
__m128i input = _mm_loadu_si128((__m128i*)pSrc);
__m128i reg = _mm_shuffle_epi8(input, _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 12, 13, 14));
__m128i shifted = _mm_srl_epi64(reg, bits);
__m128i comp = _mm_shuffle_epi8(shifted, _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, -1, -1));
_mm_storeu_si128((__m128i*)pDest, comp);
pSrc += 14; pDest += 14;
}
}
It processes 14 bytes per iteration. Each iteration is rather simple, also there is some code before the loop. Here is the assembly code of the whole function body as generated by MSVC2013 x64:
movzx eax, r8b
movd xmm3, rax
lea rax, QWORD PTR [rdx+r9]
cmp rdx, rax
jae SHORT $LN1#OffsetMemC
movdqa xmm1, XMMWORD PTR __xmm#0e0d0c0b0a0908070706050403020100
movdqa xmm2, XMMWORD PTR __xmm#ffff0e0d0c0b0a090806050403020100
sub rcx, rdx
npad 11
$LL2#OffsetMemC:
movdqu xmm0, XMMWORD PTR [rdx]
add rdx, 14
pshufb xmm0, xmm1
psrlq xmm0, xmm3
pshufb xmm0, xmm2
movdqu XMMWORD PTR [rcx+rdx-14], xmm0
cmp rdx, rax
jb SHORT $LL2#OffsetMemC
$LN1#OffsetMemC:
ret 0
IACA says the whole function takes 4.5 cycles throughput and 13 cycles latency on Ivy Bridge, given that the loop is executed once and no issues with caches/branches/decoding happen. In benchmark however, 7.5 cycles are spent on one such call on average.
Here are brief results of throughput benchmark on Ivy Bridge 3.4 Ghz (see more results in the code):
(billions of calls per second)
size = 4:
0.132 (Paul R)
0.248 (Paul R x64)
0.45 (stgatilov)
size = 8:
0.0782 (Paul R)
0.249 (Paul R x64)
0.45 (stgatilov)
size = 12:
0.0559 (Paul R)
0.191 (Paul R x64)
0.453 (stgatilov)
Note however, that in real world performance can be drastically different from benchmark results.
Full code with benchmarking and more verbose results are here.

Related

How to concatenate the low 3 elements from two 256-bit vectors in a 512-bit vector, and insert a scalar?

Even though it is odd and non-canonical, I would like to concatenate two __m256d and a double in a single __m512d. Specifically, I have
__m256d a = _mm256_set_pd(1, 2, 3, 0);
__m256d b = _mm256_set_pd(4, 5, 6, 0);
double c = 7;
At the end, I would like to have
__m512d d {1, 2, 3, 4, 5, 6, 7, 0}
Is there a fast way of doing this with Intel intrinsics?

See this link for an example: https://community.intel.com/t5/Intel-ISA-Extensions/How-to-convert-two-m256d-to-one-m512d-using-intrinsics/m-p/1062934
__m256d a;
__m256d b;
__m512d c = _mm512_insertf64x4(_mm512_castpd256_pd512(a), b, 1);

I suspect you meant that your a and b vectors had a 0.0 as the high element, not the low element. So probably you meant _mm256_setr_pd, not set. I'm going to assume that for the rest of the question.
You said your data comes from loads of 3-element arrays written by an external library. (You actually said "3-dimensional" arrays, but that would imply arr[][][].) So maybe you used masked loads to get the 4th element zeroed without reading past the end. Ideally you could make an output buffer for your external library like alignas(64) double out[8] and pass out+0 and out+3 as pointers so the data you want would already be contiguous, no shuffle needed. Just at worst a store-forwarding stall (extra latency but minimal throughput cost) if you reload it very soon after a narrower store.
But if that won't work, you can still do a masked load to get the b elements lined up where they belong, setting up for just a blend or merge-masking, using the unaligned-load hardware to essentially shuffle as part of the load, as mentioned in this answer. This has to be a 64-byte load, so it will definitely be a cache-line split unless the first element happens to start 3 doubles into a cache line.
Since you need to load from before the start of the actual array b comes from, you can use masking to make sure that's safe (fault suppression), although if it would have faulted without masking it can be quite slow. Instead of actually blending, it can be a merge-masked load into a.
__m512d merge_load(const double *array_A, const double *array_B, double c){
//alignas(32) double arraya[3];
//double arrayb[3];
__m256d a = _mm256_maskz_loadu_pd(0b0111, array_A); // zero masking load of 3 doubles at the bottom of a YMM. If your array actually has a 0.0 after, you don't need a mask, that's better.
// zero-extending to 512 is free because we just loaded, not like a function arg.
__m512d ab = _mm512_mask_loadu_pd(_mm512_zextpd256_pd512(a), 0b00111000, array_B-3);
// relies on the compiler to optimize away the useless zero-extension in _mm_set_sd
// clang is fine, GCC isn't. GCC insanely does vmovsd xmm1, xmm0,xmm0 merging into itself, not even a movq zero-extension
__m512d d = _mm512_mask_broadcastsd_pd(ab, 0b01000000, _mm_set_sd(c));
return d;
}
This might be slower than narrower loads that don't stall, followed by a couple shuffles. A cache-line split and a store-forwarding stall, if the data was stored recently, might be too much latency for out-of-order exec to hide, depending on surrounding code. And store-forwarding stalls don't pipeline with other SF stalls on Intel, although they do with other loads including fast-path store forwarding. A cache-line split load also has some throughput cost in terms of load-port cycles and split buffers.
Another option would be a masked 256-bit load to merge bc and set up for a vpermt2pd shuffle. Or an AVX2 vblendpd if you can load from array_B - 1 without masking for fault-suppression.
Clang compiles this as-written, with no instructions for the _mm_set_sd: only the low element of that __m128d is read, so it's pointless to actually zero-extend. (Unlike GCC, which wastes an instruction to do nothing useful, not actually even zero-extending, just merging the same value into garbage with vmovsd xmm1, xmm0, xmm0. Unlike movq, and unlike the load form, reg-reg movsd is a merge.)
This needs 3 separate mask constants, which take multiple uops each to set up. (e.g. mov-immediate + kmov. Or on Zen4, GCC chooses to load from memory. On Intel CPUs, kmov k, mem is still 2 uops, just like loading into an integer register. Hopefully it's cheaper on AMD.) A shuffle like vpermt2pd would work with a vector constant instead of a mask for the final step.
# clang15 -O3 -march=skylake-avx512
merge_load(double const*, double const*, double):
mov al, 7
kmovd k1, eax
vmovupd ymm1 {k1} {z}, ymmword ptr [rdi] # maskz_loadu
mov al, 56
kmovd k1, eax
vmovupd zmm1 {k1}, zmmword ptr [rsi - 24] # mask_loadu (merge-masking)
mov al, 64
kmovd k1, eax
vbroadcastsd zmm1 {k1}, xmm0 # mask_broadcastsd merge
vmovapd zmm0, zmm1
ret
If it picked 3 separate k1..7 registers, the masks could stick around, except across function calls that don't inline. I indented that work separately. But if you're calling an external function, there are no call-preserved k mask or YMM or ZMM regs, so probably a 64-byte vector constant is best if you have to reload every time; it can stay hot in cache, so just the uop count and port pressure matters.
Another strategy, if your a and b vectors are zero-extended to 512-bit, is one shuffle and one cheap vorpd that can run on any port. Unfortunately there's no vblendpd z,z,z, imm, which would be ideal, using an immediate control instead of a mask register. But there's only the AVX1 forms for XMM and YMM. (And legacy SSE4.1)
Fully avoiding vector and mask constants would be more expensive. You can vbroadcastsd y,x on c + AVX1 vblendpd ymm..., imm to create bc = {b0, b1, b2, c}, but then what? vinsertf64x4 can get everything into one register, but vpermpd zmm, zmm, imm8 does the same shuffle within two 256-bit halves. There aren't enough immediate bits in an imm8 for an 8-element shuffle. So maybe valignq (with itself) to rotate bc into place and vorpd instead of a blend to combine.
But a good tradeoff appears to be:
// Efficient if a is already zero-extended to 512
__m512d merge_low3_manual(__m256d a, __m256d b, double c)
{
const __m512i d_perm_idx = _mm512_setr_epi64(3, 3, 3, 0, 1, 2, 8, 3);
// 0 0 0 b0 b1 b2 c 0 (low element on the left)
__m512d bc = _mm512_permutex2var_pd(_mm512_castpd256_pd512(b), d_perm_idx, _mm512_castpd128_pd512(_mm_set_sd(c)));
//return _mm512_blend_pd(_mm512_castpd256_pd512(a), bc, 0b11111000); // there is no vblendpd z,z,z, imm8
return _mm512_or_pd(_mm512_zextpd256_pd512(a), bc); // a zero-extends for free if the compiler can see where it was created.
}
Silly clang, defeating mov-elimination by picking the same register for vmovapd ymm0, ymm0 when it can't optimize away zero-extension.
merge_low3_manual(double __vector(4), double __vector(4), double):
vmovapd zmm3, zmmword ptr [rip + .LCPI0_0] # zmm3 = [3,3,3,0,1,2,8,3]
vpermi2pd zmm3, zmm1, zmm2 # produce bc
vmovapd ymm0, ymm0 # zero extension hopefully goes away on inlining
vorpd zmm0, zmm3, zmm0
ret
So for real, this is probably just the 2 instructions (assuming your vector inputs have already been loaded as __m256d without bothering to do any combining as part of that. Which is probably fair, merge-masking would take extra instructions to set up masks. But if you need masks for fault-suppression, use merge-masking.)
These are all untested, so I may have the wrong shuffle constants.

How about this:
// Note: _mm256_set_pd lists elements from top to bottom, meaning in this case
// the lowest elements are zero.
__m256d a = _mm256_set_pd(1, 2, 3, 0);
__m256d b = _mm256_set_pd(4, 5, 6, 0);
double c = 7;
// Expected result is:
// __m512d d {1, 2, 3, 4, 5, 6, 7, 0}
// Note: Here the elements are listed from bottom to top, meaning that the
// last (i.e. upper) element is zero.
// Insert c into the lower element of b.
// We rely on that c is already in an xmm register and all upper elements are
// likely zero, so _mm_set_sd and _mm256_zextpd128_pd256 are likely optimized
// away. We also rely on that the lowest element of b is zero.
// If the lowest element of b is not zero, use _mm256_blend_pd here instead.
// If the order of elements in b is different, use _mm256_permutex2var_pd or,
// as suggested by Peter Cordes in the comments, _mm512_insertf64x2
// or _mm256_mask_broadcastsd_pd.
__m256d bc = _mm256_or_pd(b, _mm256_zextpd128_pd256(_mm_set_sd(c)));
// Merge and reorder elements of a and bc.
// We rely on that the lowest element of a is zero, which we move to the top
// element of d. If that is not the case and you still want zero in the top
// element of d, you can use _mm512_maskz_permutex2var_pd here with a mask
// of 0b01111111.
const __m512i d_perm_idx = _mm512_setr_epi64(3, 2, 1, 11, 10, 9, 8, 0);
__m512d d = _mm512_permutex2var_pd(
_mm512_castpd256_pd512(a), d_perm_idx, _mm512_castpd256_pd512(bc));
In case if your input vectors are actually reversed from your code snippet (i.e. the zeros are actually in the top elements), here is the updated code:
// Note the 'r' in _mm256_setr_pd
__m256d a = _mm256_setr_pd(1, 2, 3, 0);
__m256d b = _mm256_setr_pd(4, 5, 6, 0);
double c = 7;
// Insert c into the upper half of b.
__m512d bc = _mm512_insertf64x2(_mm512_castpd256_pd512(b), _mm_set_sd(c), 2);
// Merge and reorder elements of a and bc.
const __m512i d_perm_idx = _mm512_setr_epi64(0, 1, 2, 8, 9, 10, 12, 3);
__m512d d = _mm512_permutex2var_pd(_mm512_castpd256_pd512(a), d_perm_idx, bc);

C++ std::countr_zero() in SIMD 128/256/512 (find position of least significant 1 bit in 128/256/512-bit number) [duplicate]

I have a huge memory block (bit-vector) with size N bits within one memory page, consider N on average is 5000, i.e. 5k bits to store some flags information.
At a certain points in time (super-frequent - critical) I need to find the first bit set in this whole big bit-vector. Now I do it per-64-word, i.e. with help of __builtin_ctzll). But when N grows and search algorithm cannot be improved, there can be some possibility to scale this search through the expansion of memory access width. This is the main problem in a few words
There is a single assembly instruction called BSF that gives the position of the highest set bit (GCC's __builtin_ctzll()).
So in x86-64 arch I can find the highest bit set cheaply in 64-bit words.
But what about scaling through memory width?
E.g. is there a way to do it efficiently with 128 / 256 / 512 -bit registers?
Basically I'm interested in some C API function to achieve this, but also want to know what this method is based on.
UPD: As for CPU, I'm interested for this optimization to support the following CPU lineups:
Intel Xeon E3-12XX, Intel Xeon E5-22XX/26XX/E56XX, Intel Core i3-5XX/4XXX/8XXX, Intel Core i5-7XX, Intel Celeron G18XX/G49XX (optional for Intel Atom N2600, Intel Celeron N2807, Cortex-A53/72)
P.S. In mentioned algorithm before the final bit scan I need to sum k (in average 20-40) N-bit vectors with CPU AND (the AND result is just a preparatory stage for the bit-scan). This is also desirable to do with memory width scaling (i.e. more efficiently than per 64bit-word AND)
Read also: Find first set

This answer is in a different vein, but if you know in advance that you're going to be maintaining a collection of B bits and need to be able to efficiently set and clear bits while also figuring out which bit is the first bit set, you may want to use a data structure like a van Emde Boas tree or a y-fast trie. These data structures are designed to store integers in a small range, so instead of setting or clearing individual bits, you could add or remove the index of the bit you want to set/clear. They're quite fast - you can add or remove items in time O(log log B), and they let you find the smallest item in time O(1). Figure that if B ≈ 50000, then log log B is about 4.
I'm aware this doesn't directly address how to find the highest bit set in a huge bitvector. If your setup is such that you have to work with bitvectors, the other answers might be more helpful. But if you have the option to reframe the problem in a way that doesn't involve bitvector searching, these other data structures might be a better fit.

The best way to find the first set bit within a whole vector (AFAIK) involves finding the first non-zero SIMD element (e.g. a byte or dword), then using a bit-scan on that. (__builtin_ctz / bsf / tzcnt / ffs-1) . As such, ctz(vector) is not itself a useful building block for searching an array, only for after the loop.
Instead you want to loop over the array searching for a non-zero vector, using a whole-vector check involving SSE4.1 ptest xmm0,xmm0 / jz .loop (3 uops), or with SSE2 pcmpeqd v, zero / pmovmskb / cmp eax, 0xffff / je .loop (3 uops after cmp/jcc macro-fusion). https://uops.info/
Once you do find a non-zero vector, pcmpeqb / movmskps / bsf on that to find a dword index, then load that dword and bsf it. Add the start-bit position (CHAR_BIT*4*dword_idx) to the bsf bit-position within that element. This is a fairly long dependency chain for latency, including an integer L1d load latency. But since you just loaded the vector, at least you can be fairly confident you'll hit in cache when you load it again with integer. (If the vector was generated on the fly, then probably still best to store / reload it and let store-forwarding work, instead of trying to generate a shuffle control for vpermilps/movd or SSSE3 pshufb/movd/movzx ecx, al.)
The loop problem is very much like strlen or memchr, except we're rejecting a single value (0) and looking for anything else. Still, we can take inspiration from hand-optimized asm strlen / memchr implementations like glibc's, for example loading multiple vectors and doing one check to see if any of them have what they're looking for. (For strlen, combine with pminub to get a 0 if any element is 0. For pcmpeqb compare results, OR for memchr). For our purposes, the reduction operation we want is OR - any non-zero input will make the output non-zero, and bitwise boolean ops can run on any vector ALU port.
(If the expected first-bit-position isn't very high, it's not worth being too aggressive with this: if the first set bit is in the first vector, sorting things out between 2 vectors you've loaded will be slower. 5000 bits is only 625 bytes, or 19.5 AVX2 __m256i vectors. And the first set bit is probably not always right at the end)
AVX2 version:
This checks pairs of 32-byte vectors (i.e. whole cache lines) for non-zero, and if found then sorts that out into one 64-bit bitmap for a single CTZ operation. That extra shift/OR costs latency in the critical path, but the hope is that we get to the first 1 bit sooner.
Combining 2 vectors down to one with OR means it's not super useful to know which element of the OR result was non-zero. We basically redo the work inside the if. That's the price we pay for keeping the amount of uops low for the actual search part.
(The if body ends with a return, so in the asm it's actually like an if()break, or actually an if()goto out of the loop since it goes to a difference place than the not-found return -1 from falling through out of the loop.)
// untested, especially the pointer end condition, but compiles to asm that looks good
// Assumes len is a multiple of 64 bytes
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
// aliasing-safe: p can point to any C data type
int bitscan_avx2(const char *p, size_t len /* in bytes */)
{
//assert(len % 64 == 0);
//optimal if p is 64-byte aligned, so we're checking single cache-lines
const char *p_init = p;
const char *endp = p + len - 64;
do {
__m256i v1 = _mm256_loadu_si256((const __m256i*)p);
__m256i v2 = _mm256_loadu_si256((const __m256i*)(p+32));
__m256i or = _mm256_or_si256(v1,v2);
if (!_mm256_testz_si256(or, or)){ // find the first non-zero cache line
__m256i v1z = _mm256_cmpeq_epi32(v1, _mm256_setzero_si256());
__m256i v2z = _mm256_cmpeq_epi32(v2, _mm256_setzero_si256());
uint32_t zero_map = _mm256_movemask_ps(_mm256_castsi256_ps(v1z));
zero_map |= _mm256_movemask_ps(_mm256_castsi256_ps(v2z)) << 8;
unsigned idx = __builtin_ctz(~zero_map); // Use ctzll for GCC, because GCC is dumb and won't optimize away a movsx
uint32_t nonzero_chunk;
memcpy(&nonzero_chunk, p+4*idx, sizeof(nonzero_chunk)); // aliasing / alignment-safe load
return (p-p_init + 4*idx)*8 + __builtin_ctz(nonzero_chunk);
}
p += 64;
}while(p < endp);
return -1;
}
On Godbolt with clang 12 -O3 -march=haswell:
bitscan_avx2:
lea rax, [rdi + rsi]
add rax, -64 # endp
xor ecx, ecx
.LBB0_1: # =>This Inner Loop Header: Depth=1
vmovdqu ymm1, ymmword ptr [rdi] # do {
vmovdqu ymm0, ymmword ptr [rdi + 32]
vpor ymm2, ymm0, ymm1
vptest ymm2, ymm2
jne .LBB0_2 # if() goto out of the inner loop
add ecx, 512 # bit-counter incremented in the loop, for (p-p_init) * 8
add rdi, 64
cmp rdi, rax
jb .LBB0_1 # }while(p<endp)
mov eax, -1 # not-found return path
vzeroupper
ret
.LBB0_2:
vpxor xmm2, xmm2, xmm2
vpcmpeqd ymm1, ymm1, ymm2
vmovmskps eax, ymm1
vpcmpeqd ymm0, ymm0, ymm2
vmovmskps edx, ymm0
shl edx, 8
or edx, eax # mov ah,dl would be interesting, but compilers won't do it.
not edx # one_positions = ~zero_positions
xor eax, eax # break false dependency
tzcnt eax, edx # dword_idx
xor edx, edx
tzcnt edx, dword ptr [rdi + 4*rax] # p[dword_idx]
shl eax, 5 # dword_idx * 4 * CHAR_BIT
add eax, edx
add eax, ecx
vzeroupper
ret
This is probably not optimal for all CPUs, e.g. maybe we could use a memory-source vpcmpeqd for at least one of the inputs, and not cost any extra front-end uops, only back-end. As long as compilers keep using pointer-increments, not indexed addressing modes that would un-laminate. That would reduce the amount of work needed after the branch (which probably mispredicts).
To still use vptest, you might have to take advantage of the CF result from the CF = (~dst & src == 0) operation against a vector of all-ones, so we could check that all elements matched (i.e. the input was all zeros). Unfortunately, Can PTEST be used to test if two registers are both zero or some other condition? - no, I don't think we can usefully use vptest without a vpor.
Clang decided not to actually subtract pointers after the loop, instead to do more work in the search loop. :/ The loop is 9 uops (after macro-fusion of cmp/jb), so unfortunately it can only run a bit less than 1 iteration per 2 cycles. So it's only managing less than half of L1d cache bandwidth.
But apparently a single array isn't your real problem.
Without AVX
16-byte vectors mean we don't have to deal with the "in-lane" behaviour of AVX2 shuffles. So instead of OR, we can combine with packssdw or packsswb. Any set bits in the high half of a pack input will signed-saturate the result to 0x80 or 0x7f. (So signed saturation is key, not unsigned packuswb which will saturate signed-negative inputs to 0.)
However, shuffles only run on port 5 on Intel CPUs, so beware of throughput limits. ptest on Skylake for example is 2 uops, p5 and p0, so using packsswb + ptest + jz would limit to one iteration per 2 clocks. But pcmpeqd + pmovmskb don't.
Unfortunately, using pcmpeq on each input separately before packing / combining would cost more uops. But would reduce the amount of work left for the cleanup, and if the loop-exit usually involves a branch mispredict, that might reduce overall latency.
2x pcmpeqd => packssdw => pmovmskb => not => bsf would give you a number you have to multiply by 2 to use as a byte offset to get to the non-zero dword. e.g. memcpy(&tmp_u32, p + (2*idx), sizeof(tmp_u32));. i.e. bsf eax, [rdi + rdx*2].
With AVX-512:
You mentioned 512-bit vectors, but none of the CPUs you listed support AVX-512. Even if so, you might want to avoid 512-bit vectors because SIMD instructions lowering CPU frequency, unless your program spends a lot of time doing this, and your data is hot in L1d cache so you can truly benefit instead of still bottlenecking on L2 cache bandwidth. But even with 256-bit vectors, AVX-512 has new instructions that are useful for this:
integer compares (vpcmpb/w/d/q) have a choice of predicate, so you can do not-equal instead of having to invert later with NOT. Or even test-into-register vptestmd so you don't need a zeroed vector to compare against.
compare-into-mask is sort of like pcmpeq + movmsk, except the result is in a k register, still need a kmovq rax, k0 before you can tzcnt.
kortest - set FLAGS according to the OR of two mask registers being non-zero. So the search loop could do vpcmpd k0, ymm0, [rdi] / vpcmpd k1, ymm0, [rdi+32] / kortestw k0, k1
ANDing multiple input arrays
You mention your real problem is that you have up-to-20 arrays of bits, and you want to intersect them with AND and find the first set bit in the intersection.
You may want to do this in blocks of a few vectors, optimistically hoping that there will be a set bit somewhere early.
AND groups of 4 or 8 inputs, accumulating across results with OR so you can tell if there were any 1s in this block of maybe 4 vectors from each input. (If there weren't any 1 bits, do another block of 4 vectors, 64 or 128 bytes while you still have the pointers loaded, because the intersection would definitely be empty if you moved on to the other inputs now). Tuning these chunk sizes depends on how sparse your 1s are, e.g. maybe always work in chunks of 6 or 8 vectors. Power-of-2 numbers are nice, though, because you can pad your allocations out to a multiple of 64 or 128 bytes so you don't have to worry about stopping early.)
(For odd numbers of inputs, maybe pass the same pointer twice to a function expecting 4 inputs, instead of dispatching to special versions of the loop for every possible number.)
L1d cache is 8-way associative (before Ice Lake with 12-way), and a limited number of integer/pointer registers can make it a bad idea to try to read too many streams at once. You probably don't want a level of indirection that makes the compiler loop over an actual array in memory of pointers either.

You may try this function, your compiler should optimize this code for your CPU. It's not super perfect, but it should be relatively quick and mostly portable.
PS length should be divisible by 8 for max speed
#include <stdio.h>
#include <stdint.h>
/* Returns the index position of the most significant bit; starting with index 0. */
/* Return value is between 0 and 64 times length. */
/* When return value is exact 64 times length, no significant bit was found, aka bf is 0. */
uint32_t offset_fsb(const uint64_t *bf, const register uint16_t length){
register uint16_t i = 0;
uint16_t remainder = length % 8;
switch(remainder){
case 0 : /* 512bit compare */
while(i < length){
if(bf[i] | bf[i+1] | bf[i+2] | bf[i+3] | bf[i+4] | bf[i+5] | bf[i+6] | bf[i+7]) break;
i += 8;
}
/* fall through */
case 4 : /* 256bit compare */
while(i < length){
if(bf[i] | bf[i+1] | bf[i+2] | bf[i+3]) break;
i += 4;
}
/* fall through */
case 6 : /* 128bit compare */
/* fall through */
case 2 : /* 128bit compare */
while(i < length){
if(bf[i] | bf[i+1]) break;
i += 2;
}
/* fall through */
default : /* 64bit compare */
while(i < length){
if(bf[i]) break;
i++;
}
}
register uint32_t offset_fsb = i * 64;
/* Check the last uint64_t if the last uint64_t is not 0. */
if(bf[i]){
register uint64_t s = bf[i];
offset_fsb += 63;
while(s >>= 1) offset_fsb--;
}
return offset_fsb;
}
int main(int argc, char *argv[]){
uint64_t test[16];
test[0] = 0;
test[1] = 0;
test[2] = 0;
test[3] = 0;
test[4] = 0;
test[5] = 0;
test[6] = 0;
test[7] = 0;
test[8] = 0;
test[9] = 0;
test[10] = 0;
test[11] = 0;
test[12] = 0;
test[13] = 0;
test[14] = 0;
test[15] = 1;
printf("offset_fsb = %d\n", offset_fsb(test, 16));
return 0;
}

Conditionally return only some results from a SIMD calculation, instead of all [duplicate]

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
I've seen in SSE where it was done like this:
(From:https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)
__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
// Move 4 sign bits of mask to 4-bit integer value.
int mask = _mm_movemask_ps(mask);
// Select shuffle control data
__m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask]);
// Permute to move valid values to front of SIMD register
__m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
return packed;
}
This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large(256 entries, each 32 bytes, or 8k).
I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.
I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..
Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?
Here is an illustration of the Left Packing Problem from the above document:
Thanks

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)
We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.
We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.
Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).
AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)
For integer vectors with 32-bit or wider elements: Either 1) _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
For 16-bit elements, you might be able to adapt this with 128-bit vectors.
For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.
The algorithm:
Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.
Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.
By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)
We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.
See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.
#include <stdint.h>
#include <immintrin.h>
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101); // unpack each bit to a byte
expanded_mask *= 0xFF; // mask |= mask<<1 | mask<<2 | ... | mask<<7;
// ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte
const uint64_t identity_indices = 0x0706050403020100; // the identity shuffle for vpermps, packed to one index per byte
uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);
__m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
__m256i shufmask = _mm256_cvtepu8_epi32(bytevec);
return _mm256_permutevar8x32_ps(src, shufmask);
}
This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).
# clang 3.7.1 -std=gnu++14 -O3 -march=haswell
mov eax, edi # just to zero extend: goes away when inlining
movabs rcx, 72340172838076673 # The constants are hoisted after inlining into a loop
pdep rax, rax, rcx # ABC -> 0000000A0000000B....
imul rax, rax, 255 # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
movabs rcx, 506097522914230528
pext rax, rcx, rax
vmovq xmm1, rax
vpmovzxbd ymm1, xmm1 # 3c latency since this is lane-crossing
vpermps ymm0, ymm1, ymm0
ret
(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)
So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.
This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.
e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.
clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).
gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.
If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.
You can unpack LUT entries with the same integer sequence I used, but #Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).

See my other answer for AVX2+BMI2 with no LUT.
Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:
VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double, and 32 or 64bit integer elements (vpcompressq), but not byte or word (16bit)). It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.
The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".
To figure out how far to advance your pointer for the next vector, popcnt the mask.
Let's say you want to filter out everything but values >= 0 from an array:
#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
const float *endp = src+len;
float *dst_start = dst;
do {
__m512 sv = _mm512_loadu_ps(src);
__mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ); // true for src >= 0.0, false for unordered and src < 0.0
_mm512_mask_compressstoreu_ps(dst, keep, sv); // clang is missing this intrinsic, which can't be emulated with a separate store
src += 16;
dst += _mm_popcnt_u64(keep); // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
} while (src < endp);
return dst - dst_start;
}
This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):
# Output from gcc6.1, with -O3 -march=haswell -mavx512f. Same with other gcc versions
lea rcx, [rsi+rdx*4] # endp
mov rax, rdi
vpxord zmm1, zmm1, zmm1 # vpxor xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
vmovups zmm0, ZMMWORD PTR [rsi]
add rsi, 64
vcmpps k1, zmm0, zmm1, 29 # AVX512 compares have mask regs as a destination
kmovw edx, k1 # There are some insns to add/or/and mask regs, but not popcnt
movzx edx, dx # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
vcompressps ZMMWORD PTR [rax]{k1}, zmm0
popcnt rdx, rdx
## movsx rdx, edx # with _popcnt_u32, gcc is dumb. No casting can get gcc to do anything but sign-extend. You'd expect (unsigned) would mov to zero-extend, but no.
lea rax, [rax+rdx*4] # dst += ...
cmp rcx, rsi
ja .L2
sub rax, rdi
sar rax, 2 # address math -> element count
ret
Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake
In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).
#ZachB reports in comments that in practice, that a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)
I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.
Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.
It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.
A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.
This is likely slower than vcompressps with unaligned stores on current Intel CPUs.

If you are targeting AMD Zen this method may be preferred, due to the very slow pdepand pext on ryzen (18 cycles each).
I came up with this method, which uses a compressed LUT, which is 768(+1 padding) bytes, instead of 8k. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane, then masked to the lower 3 bits, which provides a 0-7 LUT.
Here is the intrinsics version, along with code to build LUT.
//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
__m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT
// __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
//now shift it right to get 3 bits at bottom
//__m256i shufmask = _mm256_srli_epi32(m, 29);
//Simplified version suggested by wim
//shift each lane so desired 3 bits are a bottom
//There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the first 3 bits so this is ok
__m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
return shufmask;
}
u32 get_nth_bits(int a) {
u32 out = 0;
int c = 0;
for (int i = 0; i < 8; ++i) {
auto set = (a >> i) & 1;
if (set) {
out |= (i << (c * 3));
c++;
}
}
return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];
void BuildPackMask() {
for (int i = 0; i < 256; ++i) {
*reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
}
}
Here is the assembly generated by MSVC:
lea ecx, DWORD PTR [rcx+rcx*2]
lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
vpbroadcastd ymm0, DWORD PTR [rcx+rax]
vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm#00000015000000120000000f0000000c00000009000000060000000300000000

Will add more information to a great answer from #PeterCordes : https://stackoverflow.com/a/36951611/5021064.
I did the implementations of std::remove from C++ standard for integer types with it. The algorithm, once you can do compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then benchmarks.
I ended up with two meaningful variations on the proposed solution:
__m128i registers, any element type, using _mm_shuffle_epi8 instruction
__m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32
When the types are smaller then 4 bytes for 256 bit register, I split them in two 128 bit registers and compress/store each one separately.
Link to compiler explorer where you can see complete assembly (there is a using type and width (in elements per pack) in the bottom, which you can plug in to get different variations) : https://gcc.godbolt.org/z/yQFR2t
NOTE: my code is in C++17 and is using a custom simd wrappers, so I do not know how readable it is. If you want to read my code -> most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.
Implementations of #PeterCordes answer for both cases
Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.
Mask for _mm_shuffle_epi8
Write an index for each byte into a half byte: 0xfedcba9876543210
Get pairs of indexes into 8 shorts packed into __m128i
Spread them out using x << 4 | x & 0x0f0f
Example of spreading the indexes. Let's say 7th and 6th elements are picked.
It means that the corresponding short would be: 0x00fe. After << 4 and | we'd get 0x0ffe. And then we clear out the second f.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of `_mm_movemask_epi8`,
// `uint16_t` - there are at most 16 bits with values for __m128i.
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;
const std::uint8_t offset =
static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes =
_pext_u64(0xfedcba9876543210, mmask_expanded); // Do the #PeterCordes answer
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte); // From bytes to shorts over the whole register
const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4); // x << 4
const __m128i combined = _mm_or_si128(shift_by_4, as_16bit); // | x
const __m128i filter = _mm_set1_epi16(0x0f0f); // 0x0f0f
const __m128i res = _mm_and_si128(combined, filter); // & 0x0f0f
return {res, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
auto res = _compress_mask::mask128(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Mask for _mm256_permutevar8x32_epi32
This is almost one for one #PeterCordes solution - the only difference is _pdep_u64 bit (he suggests this as a note).
The mask that I chose is 0x5555'5555'5555'5555. The idea is - I have 32 bits of mmask, 4 bits for each of 8 integers. I have 64 bits that I want to get => I need to convert each bit of 32 bits into 2 => therefore 0101b = 5.The multiplier also changes from 0xff to 3 because I will get 0x55 for each integer, not 1.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;
const std::uint8_t offset = static_cast<std::uint8_t(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded); // Do the #PeterCordes answer
// Every index was one byte => we need to make them into 4 bytes
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0000|compressed indexes
const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte); // spread them out
return {expanded, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
static_assert(sizeof(T) >= 4); // You cannot permute shorts/chars with this.
auto res = _compress_mask::mask256_epi32(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Benchmarks
Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, build from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark
Controlling for code alignment:
If you are not familiar with the concept, read this or watch this
All functions in the benchmark's binary are aligned to 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide in the beginning of the function (before entering the loop). The main numbers I show is min per each measurement. I think this works since the algorithm is inlined. I'm also validated by the fact that I get very different results. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline
Benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because of SIMD depends on the size of the data and not a number of elements. The element count can be derived from an element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since time it takes for non simd code depends mostly on the element count, the wins should be bigger for chars.
Plots: x - percentage of zeroes, y - time in nanoseconds. padding : min indicates that this is minimum among all alignments.
40 bytes worth of data, 40 chars
For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower when using 128 bit registers over non-simd code. So, for example, compiler should be careful doing this.
1000 bytes worth of data, 1000 chars
Apparently the non-simd version is dominated by branch prediction: when we get small amount of zeroes we get a smaller speed up: for no 0s - about 3 times, for 5% zeroes - about 5-6 times speed up. For when the branch predictor can't help the non-simd version - there is about a 27 times speed up. It's an interesting property of simd code that it's performance tends to be much less dependent on of data. Using 128 vs 256 register shows practically no difference, since most of the work is still split into 2 128 registers.
1000 bytes worth of data, 500 shorts
Similar results for shorts except with a much smaller gain - up to 2 times.
I don't know why shorts do that much better than chars for non-simd code: I'd expect shorts to be two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.
1000 bytes worth of data, 250 ints
For a 1000 only 256 bit version makes sense - 20-30% win excluding no 0s to remove what's so ever (perfect branch prediction, no removing for non-simd code).
10'000 bytes worth of data, 10'000 chars
The same order of magnitude wins as as for a 1000 chars: from 2-6 times faster when branch predictor is helpful to 27 times when it's not.
Same plots, only simd versions:
Here we can see about a 10% win from using 256 bit registers and splitting them in 2 128 bit ones: about 10% faster. In size it grows from 88 to 129 instructions, which is not a lot, so might make sense depending on your use-case. For base-line - non-simd version is 79 instructions (as far as I know - these are smaller then SIMD ones though).
10'000 bytes worth of data, 5'000 shorts
From 20% to 9 times win, depending on the data distributions. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars and the same win for 256 bit one of about 10%.
10'000 bytes worth of data, 2'500 ints
Seems to make a lot of sense to use 256 bit registers, this version is about 2 times faster compared to 128 bit registers. When comparing with non-simd code - from a 20% win with a perfect branch prediction to 3.5 - 4 times as soon as it's not.
Conclusion: when you have a sufficient amount of data (at least 1000 bytes) this can be a very worthwhile optimisation for a modern processor without AVX-512
PS:
On percentage of elements to remove
On one hand it's uncommon to filter half of your elements. On the other hand a similar algorithm can be used in partition during sorting => that is actually expected to have ~50% branch selection.
Code alignment impact
The question is: how much worth it is, if the code happens to be poorly aligned
(generally speaking - there is very little one can do about it).
I'm only showing for 10'000 bytes.
The plots have two lines for min and for max for each percentage point (meaning - it's not one best/worst code alignment - it's the best code alignment for a given percentage).
Code alignment impact - non-simd
Chars:
From 15-20% for poor branch prediction to 2-3 times when branch prediction helped a lot. (branch predictor is known to be affected by code alignment).
Shorts:
For some reason - the 0 percent is not affected at all. It can be explained by std::remove first doing linear search to find the first element to remove. Apparently linear search for shorts is not affected.
Other then that - from 10% to 1.6-1.8 times worth
Ints:
Same as for shorts - no 0s is not affected. As soon as we go into remove part it goes from 1.3 times to 5 times worth then the best case alignment.
Code alignment impact - simd versions
Not showing shorts and ints 128, since it's almost the same assembly as for chars
Chars - 128 bit register
About 1.2 times slower
Chars - 256 bit register
About 1.1 - 1.24 times slower
Ints - 256 bit register
1.25 - 1.35 times slower
We can see that for simd version of the algorithm, code alignment has significantly less impact compared to non-simd version. I suspect that this is due to practically not having branches.

In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.
Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.
static inline __m128 LeftPack_SSE2(__m128 val, int mask) {
switch(mask) {
case 0:
case 1: return val;
case 2: return _mm_shuffle_ps(val,val,0x01);
case 3: return val;
case 4: return _mm_shuffle_ps(val,val,0x02);
case 5: return _mm_shuffle_ps(val,val,0x08);
case 6: return _mm_shuffle_ps(val,val,0x09);
case 7: return val;
case 8: return _mm_shuffle_ps(val,val,0x03);
case 9: return _mm_shuffle_ps(val,val,0x0c);
case 10: return _mm_shuffle_ps(val,val,0x0d);
case 11: return _mm_shuffle_ps(val,val,0x34);
case 12: return _mm_shuffle_ps(val,val,0x0e);
case 13: return _mm_shuffle_ps(val,val,0x38);
case 14: return _mm_shuffle_ps(val,val,0x39);
case 15: return val;
}
}
__m128 foo(__m128 val, __m128 maskv) {
int mask = _mm_movemask_ps(maskv);
return LeftPack_SSE2(val, mask);
}

This is perhaps a bit late though I recently ran into this exact problem and found an alternative solution which used a strictly AVX implementation. If you don't care if unpacked elements are swapped with the last elements of each vector, this could work as well. The following is an AVX version:
inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);
__m128 v = val;
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask0);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask1);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask2);
return v;
}
Essentially, each element in val is shifted once to the left using the bitfield, 0xF9 for blending with it's unshifted variant. Next, both shifted and unshifted versions are blended against the input mask (which has the first non-zero element broadcast across the remaining elements 3 and 4). Repeat this process two more times, broadcasting the second and third elements of mask to its subsequent elements on each iteration and this should provide an AVX version of the _pdep_u32() BMI2 instruction.
If you don't have AVX, you can easily swap out each _mm_permute_ps() with _mm_shuffle_ps() for an SSE4.1-compatible version.
And if you're using double-precision, here's an additional version for AVX2:
inline __m256 left_pack(__m256d val, __m256i mask) noexcept
{
const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);
__m256d v = val;
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask0);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask1);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask2);
return v;
}
Additionally _mm_popcount_u32(_mm_movemask_ps(val)) can be used to determine the number of elements which remained after the left-packing.

Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement identical brute force approach for testing the Collatz conjecture. The assembly solution was assembled with:
nasm -felf64 p14.asm && gcc p14.o -o p14
The C++ was compiled with:
g++ p14.cpp -o p14
Assembly, p14.asm:
section .data
fmt db "%d", 10, 0
global main
extern printf
section .text
main:
mov rcx, 1000000
xor rdi, rdi ; max i
xor rsi, rsi ; i
l1:
dec rcx
xor r10, r10 ; count
mov rax, rcx
l2:
test rax, 1
jpe even
mov rbx, 3
mul rbx
inc rax
jmp c1
even:
mov rbx, 2
xor rdx, rdx
div rbx
c1:
inc r10
cmp rax, 1
jne l2
cmp rdi, r10
cmovl rdi, r10
cmovl rsi, rcx
cmp rcx, 2
jne l1
mov rdi, fmt
xor rax, rax
call printf
ret
C++, p14.cpp:
#include <iostream>
int sequence(long n) {
int count = 1;
while (n != 1) {
if (n % 2 == 0)
n /= 2;
else
n = 3*n + 1;
++count;
}
return count;
}
int main() {
int max = 0, maxi;
for (int i = 999999; i > 0; --i) {
int s = sequence(i);
if (s > max) {
max = s;
maxi = i;
}
}
std::cout << maxi << std::endl;
}
I know about the compiler optimizations to improve speed and everything, but I don’t see many ways to further optimize my assembly solution (speaking programmatically, not mathematically).
The C++ code uses modulus every term and division every other term, while the assembly code only uses a single division every other term.
But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.
Execution times
My system: 64-bit Linux on 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).
g++ (unoptimized): avg 1272 ms.
g++ -O3: avg 578 ms.
asm (div) (original): avg 2650 ms.
asm (shr): avg 679 ms.
#johnfound asm (assembled with NASM): avg 501 ms.
#hidefromkgb asm: avg 200 ms.
#hidefromkgb asm, optimized by #Peter Cordes: avg 145 ms.
#Veedrac C++: avg 81 ms with -O3, 305 ms with -O0.

If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).
See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the x86 tag wiki for more perf links.
See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).
Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is in a similar vein.
even:
mov rbx, 2
xor rdx, rdx
div rbx
On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early). High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.
shr rax, 1 does the same unsigned division: It's 1 uop, with 1c latency, and can run 2 per clock cycle.
For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.
As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shifts instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all)
GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link).
gcc -Os (optimize for size) does use IDIV for non-power-of-2 division,
unfortunately even in cases where the multiplicative inverse code is only slightly larger but much faster.
Helping the compiler
(summary for this case: use uint64_t n)
First of all, it's only interesting to look at optimized compiler output. (-O3).
-O0 speed is basically meaningless.
Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place: Writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in #Veedrac's code below.)
This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on #johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.
My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.
Hand-written asm is a black-box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory which adds overhead.)
In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic-shift "round" differently for negative inputs, see the SAR insn set ref manual entry). (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed-overflow is undefined behaviour, so it should have been able to.)
You should have used uint64_t n, so it can just SHR. And so it's portable to systems where long is only 32-bit (e.g. x86-64 Windows).
BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:
# from gcc5.4 -O3 plus my comments
# edx= count=1
# rax= uint64_t n
.L9: # do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
mov rdi, rax
shr rdi # rdi = n>>1;
test al, 1 # set flags based on n%2 (aka n&1)
mov rax, rcx
cmove rax, rdi # n= (n%2) ? 3*n+1 : n/2;
add edx, 1 # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
cmp/branch to update max and maxi, and then do the next n
The inner loop is branchless, and the critical path of the loop-carried dependency chain is:
3-component LEA (3 cycles)
cmov (2 cycles on Haswell, 1c on Broadwell or later).
Total: 5 cycle per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).
The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.
Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency). The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.
The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.
Beating the compiler
GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false-dependencies for partial-flag-modifying instructions.
It could also save all the MOV instructions, and the TEST: SHR sets CF= the bit shifted out, so we can use cmovc instead of test / cmovz.
### Hand-optimized version of what gcc does
.L9: #do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
shr rax, 1 # n>>=1; CF = n&1 = n%2
cmovc rax, rcx # n= (n&1) ? 3*n+1 : n/2;
inc edx # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
See #johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)
Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency (and Ice Lake with updated microcode). The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See #johnfound's answer, where he has timing results from an AMD CPU.
Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict). It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).
LEA's latency depends on the addressing mode, on Intel SnB-family CPUs. 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much).
So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.
Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)
Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.
Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.
The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.
You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth with 256b vector (4x uint64_t).
I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.
Untested idea for manual vectorization
# starting with YMM0 = [ n_d, n_c, n_b, n_a ] (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1): increment vector
# ymm5 = all-zeros: count vector
.inner_loop:
vpaddq ymm1, ymm0, xmm0
vpaddq ymm1, ymm1, xmm0
vpaddq ymm1, ymm1, set1_epi64(1) # ymm1= 3*n + 1. Maybe could do this more efficiently?
vpsllq ymm3, ymm0, 63 # shift bit 1 to the sign bit
vpsrlq ymm0, ymm0, 1 # n /= 2
# FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
vpblendvpd ymm0, ymm0, ymm1, ymm3 # variable blend controlled by the sign bit of each 64-bit element. I might have the source operands backwards, I always have to look this up.
# ymm0 = updated n in each element.
vpcmpeqq ymm1, ymm0, set1_epi64(1)
vpandn ymm4, ymm1, ymm4 # zero out elements of ymm4 where the compare was true
vpaddq ymm5, ymm5, ymm4 # count++ in elements where n has never been == 1
vptest ymm4, ymm4
jnz .inner_loop
# Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero
vextracti128 ymm0, ymm5, 1
vpmaxq .... crap this doesn't exist
# Actually just delay doing a horizontal max until the very very end. But you need some way to record max and maxi.
You can and should implement this with intrinsics instead of hand-written asm.
Algorithmic / implementation improvement:
Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer)
#EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing; no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar ns in parallel in different integer registers, though.
So the loop might look like this:
goto loop_entry; // C++ structured like the asm, for illustration only
do {
n = n*3 + 1;
loop_entry:
shift = _tzcnt_u64(n);
n >>= shift;
count += shift;
} while(n != 1);
This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed / power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).
It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. #Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).
We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine-code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF).
TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.
They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false-dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x's assumes no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)
Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for #Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake where Intel removed the false dep for LZCNT/TZCNT they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.
Improvements to ideas / code from other answers:
#hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this more even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.
I put tidied / improved C (which guides the compiler to produce better asm), and tested+working faster asm (in comments below the C) up on Godbolt: see the link in #hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)
Also improved the output-printing to convert to a string and make one write() instead of writing one char at a time. This minimizes impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.
#Veedrac's code
I got a minor speedup from right-shifting as much as we know needs doing, and checking to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.
code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.
See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium)

Claiming that the C++ compiler can produce more optimal code than a competent assembly language programmer is a very bad mistake. And especially in this case. The human always can make the code better than the compiler can, and this particular situation is a good illustration of this claim.
The timing difference you're seeing is because the assembly code in the question is very far from optimal in the inner loops.
(The below code is 32-bit, but can be easily converted to 64-bit)
For example, the sequence function can be optimized to only 5 instructions:
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
The whole code looks like:
include "%lib%/freshlib.inc"
#BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"
start:
InitializeAll
mov ecx, 999999
xor edi, edi ; max
xor ebx, ebx ; max i
.main_loop:
xor esi, esi
mov eax, ecx
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
cmp edi, esi
cmovb edi, esi
cmovb ebx, ecx
dec ecx
jnz .main_loop
OutputValue "Max sequence: ", edi, 10, -1
OutputValue "Max index: ", ebx, 10, -1
FinalizeAll
stdcall TerminateAll, 0
In order to compile this code, FreshLib is needed.
In my tests, (1 GHz AMD A4-1200 processor), the above code is approximately four times faster than the C++ code from the question (when compiled with -O0: 430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.
The output of both programs is the same: max sequence = 525 on i = 837799.

For more performance: A simple change is observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you could save a few if statements and write:
while (n % 2 == 0) n /= 2;
if (n > 1) for (;;) {
n = (3*n + 1) / 2;
if (n % 2 == 0) {
do n /= 2; while (n % 2 == 0);
if (n == 1) break;
}
}
Here's a big win: If you look at the lowest 8 bits of n, all the steps until you divided by 2 eight times are completely determined by those eight bits. For example, if the last eight bits are 0x01, that is in binary your number is ???? 0000 0001 then the next steps are:
3n+1 -> ???? 0000 0100
/ 2 -> ???? ?000 0010
/ 2 -> ???? ??00 0001
3n+1 -> ???? ??00 0100
/ 2 -> ???? ???0 0010
/ 2 -> ???? ???? 0001
3n+1 -> ???? ???? 0100
/ 2 -> ???? ???? ?010
/ 2 -> ???? ???? ??01
3n+1 -> ???? ???? ??00
/ 2 -> ???? ???? ???0
/ 2 -> ???? ???? ????
So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar will happen for all combinations. So you can make a loop with a big switch statement:
k = n / 256;
m = n % 256;
switch (m) {
case 0: n = 1 * k + 0; break;
case 1: n = 81 * k + 1; break;
case 2: n = 81 * k + 1; break;
...
case 155: n = 729 * k + 425; break;
...
}
Run the loop until n ≤ 128, because at that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue the "normal" loop - or have a table prepared that tells you how many more steps are need to reach 1.
PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like
static const unsigned int multipliers [256] = { ... }
static const unsigned int adders [256] = { ... }
while (n > 128) {
size_t lastBits = n % 256;
n = (n >> 8) * multipliers [lastBits] + adders [lastBits];
}
In practice, you would measure whether processing the last 9, 10, 11, 12 bits of n at a time would be faster. For each bit, the number of entries in the table would double, and I excect a slowdown when the tables don't fit into L1 cache anymore.
PPS. If you need the number of operations: In each iteration we do exactly eight divisions by two, and a variable number of (3n + 1) operations, so an obvious method to count the operations would be another array. But we can actually calculate the number of steps (based on number of iterations of the loop).
We could redefine the problem slightly: Replace n with (3n + 1) / 2 if odd, and replace n with n / 2 if even. Then every iteration will do exactly 8 steps, but you could consider that cheating :-) So assume there were r operations n <- 3n+1 and s operations n <- n/2. The result will be quite exactly n' = n * 3^r / 2^s, because n <- 3n+1 means n <- 3n * (1 + 1/3n). Taking the logarithm we find r = (s + log2 (n' / n)) / log2 (3).
If we do the loop until n ≤ 1,000,000 and have a precomputed table how many iterations are needed from any start point n ≤ 1,000,000 then calculating r as above, rounded to the nearest integer, will give the right result unless s is truly large.

On a rather unrelated note: more performance hacks!
[the first «conjecture» has been finally debunked by #ShreevatsaR; removed]
When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):
[even] [odd]
[odd] [even]
[even] [even]
To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.
Let`s prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.
Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2-bit long and its bits are ba from most- to least-significant, then a = 1, and the following holds:
(N << 1) + N + 1: (N >> 1) + N + 1:
b10 b1
b1 b
+ 1 + 1
---- ---
bBb0 bBb
where B = !b. Right-shifting the first result gives us exactly what we want.
Q.E.D.: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.
As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.
The resulting algorithm looks like this:
uint64_t sequence(uint64_t size, uint64_t *path) {
uint64_t n, i, c, maxi = 0, maxc = 0;
for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
c = 2;
while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
c += 2;
if (n == 2)
c++;
if (c > maxc) {
maxi = i;
maxc = c;
}
}
*path = maxc;
return maxi;
}
int main() {
uint64_t maxi, maxc;
maxi = sequence(1000000, &maxc);
printf("%llu, %llu\n", maxi, maxc);
return 0;
}
Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.
[EDIT:]
Let`s translate this into assembly!
MOV RCX, 1000000;
DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;
#main:
XOR RSI, RSI;
LEA RDI, [RCX + 1];
#loop:
ADD RSI, 2;
LEA RDX, [RDI + RDI*2 + 2];
SHR RDX, 1;
SHRD RDI, RDI, 2; ror rdi,2 would do the same thing
CMOVL RDI, RDX; Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
CMOVS RDI, RDX;
CMP RDI, 2;
JA #loop;
LEA RDX, [RSI + 1];
CMOVE RSI, RDX;
CMP RAX, RSI;
CMOVB RAX, RSI;
CMOVB RBX, RCX;
SUB RCX, 2;
JA #main;
MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;
#itoa:
XOR RDX, RDX;
DIV RCX;
ADD RDX, '0';
PUSH RDX;
TEST RAX, RAX;
JNE #itoa;
PUSH RCX;
LEA RAX, [RBX + 1];
TEST RBX, RBX;
MOV RBX, RDI;
JNE #itoa;
POP RCX;
INC RDI;
MOV RDX, RDI;
#outp:
MOV RSI, RSP;
MOV RAX, RDI;
SYSCALL;
POP RAX;
TEST RAX, RAX;
JNE #outp;
LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;
Use these commands to compile:
nasm -f elf64 file.asm
ld -o file file.o
See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)

C++ programs are translated to assembly programs during the generation of machine code from the source code. It would be virtually wrong to say assembly is slower than C++. Moreover, the binary code generated differs from compiler to compiler. So a smart C++ compiler may produce binary code more optimal and efficient than a dumb assembler's code.
However I believe your profiling methodology has certain flaws. The following are general guidelines for profiling:
Make sure your system is in its normal/idle state. Stop all running processes (applications) that you started or that use CPU intensively (or poll over the network).
Your datasize must be greater in size.
Your test must run for something more than 5-10 seconds.
Do not rely on just one sample. Perform your test N times. Collect results and calculate the mean or median of the result.

From comments:
But, this code never stops (because of integer overflow) !?! Yves Daoust
For many numbers it will not overflow.
If it will overflow - for one of those unlucky initial seeds, the overflown number will very likely converge toward 1 without another overflow.
Still this poses interesting question, is there some overflow-cyclic seed number?
Any simple final converging series starts with power of two value (obvious enough?).
2^64 will overflow to zero, which is undefined infinite loop according to algorithm (ends only with 1), but the most optimal solution in answer will finish due to shr rax producing ZF=1.
Can we produce 2^64? If the starting number is 0x5555555555555555, it's odd number, next number is then 3n+1, which is 0xFFFFFFFFFFFFFFFF + 1 = 0. Theoretically in undefined state of algorithm, but the optimized answer of johnfound will recover by exiting on ZF=1. The cmp rax,1 of Peter Cordes will end in infinite loop (QED variant 1, "cheapo" through undefined 0 number).
How about some more complex number, which will create cycle without 0?
Frankly, I'm not sure, my Math theory is too hazy to get any serious idea, how to deal with it in serious way. But intuitively I would say the series will converge to 1 for every number : 0 < number, as the 3n+1 formula will slowly turn every non-2 prime factor of original number (or intermediate) into some power of 2, sooner or later. So we don't need to worry about infinite loop for original series, only overflow can hamper us.
So I just put few numbers into sheet and took a look on 8 bit truncated numbers.
There are three values overflowing to 0: 227, 170 and 85 (85 going directly to 0, other two progressing toward 85).
But there's no value creating cyclic overflow seed.
Funnily enough I did a check, which is the first number to suffer from 8 bit truncation, and already 27 is affected! It does reach value 9232 in proper non-truncated series (first truncated value is 322 in 12th step), and the maximum value reached for any of the 2-255 input numbers in non-truncated way is 13120 (for the 255 itself), maximum number of steps to converge to 1 is about 128 (+-2, not sure if "1" is to count, etc...).
Interestingly enough (for me) the number 9232 is maximum for many other source numbers, what's so special about it? :-O 9232 = 0x2410 ... hmmm.. no idea.
Unfortunately I can't get any deep grasp of this series, why does it converge and what are the implications of truncating them to k bits, but with cmp number,1 terminating condition it's certainly possible to put the algorithm into infinite loop with particular input value ending as 0 after truncation.
But the value 27 overflowing for 8 bit case is sort of alerting, this looks like if you count the number of steps to reach value 1, you will get wrong result for majority of numbers from the total k-bit set of integers. For the 8 bit integers the 146 numbers out of 256 have affected series by truncation (some of them may still hit the correct number of steps by accident maybe, I'm too lazy to check).

You did not post the code generated by the compiler, so there' some guesswork here, but even without having seen it, one can say that this:
test rax, 1
jpe even
... has a 50% chance of mispredicting the branch, and that will come expensive.
The compiler almost certainly does both computations (which costs neglegibly more since the div/mod is quite long latency, so the multiply-add is "free") and follows up with a CMOV. Which, of course, has a zero percent chance of being mispredicted.

For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization
(https://en.wikipedia.org/wiki/Memoization). You could also look into dynamic programming solutions for other time/memory trade-offs.
Example python implementation:
import sys
inner_loop = 0
def collatz_sequence(N, cache):
global inner_loop
l = [ ]
stop = False
n = N
tails = [ ]
while not stop:
inner_loop += 1
tmp = n
l.append(n)
if n <= 1:
stop = True
elif n in cache:
stop = True
elif n % 2:
n = 3*n + 1
else:
n = n // 2
tails.append((tmp, len(l)))
for key, offset in tails:
if not key in cache:
cache[key] = l[offset:]
return l
def gen_sequence(l, cache):
for elem in l:
yield elem
if elem in cache:
yield from gen_sequence(cache[elem], cache)
raise StopIteration
if __name__ == "__main__":
le_cache = {}
for n in range(1, 4711, 5):
l = collatz_sequence(n, le_cache)
print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))
print("inner_loop = {}".format(inner_loop))

As a generic answer, not specifically directed at this task: In many cases, you can significantly speed up any program by making improvements at a high level. Like calculating data once instead of multiple times, avoiding unnecessary work completely, using caches in the best way, and so on. These things are much easier to do in a high level language.
Writing assembler code, it is possible to improve on what an optimising compiler does, but it is hard work. And once it's done, your code is much harder to modify, so it is much more difficult to add algorithmic improvements. Sometimes the processor has functionality that you cannot use from a high level language, inline assembly is often useful in these cases and still lets you use a high level language.
In the Euler problems, most of the time you succeed by building something, finding why it is slow, building something better, finding why it is slow, and so on and so on. That is very, very hard using assembler. A better algorithm at half the possible speed will usually beat a worse algorithm at full speed, and getting the full speed in assembler isn't trivial.

Even without looking at assembly, the most obvious reason is that /= 2 is probably optimized as >>=1 and many processors have a very quick shift operation. But even if a processor doesn't have a shift operation, the integer division is faster than floating point division.
Edit: your milage may vary on the "integer division is faster than floating point division" statement above. The comments below reveal that the modern processors have prioritized optimizing fp division over integer division. So if someone were looking for the most likely reason for the speedup which this thread's question asks about, then compiler optimizing /=2 as >>=1 would be the best 1st place to look.
On an unrelated note, if n is odd, the expression n*3+1 will always be even. So there is no need to check. You can change that branch to
{
n = (n*3+1) >> 1;
count += 2;
}
So the whole statement would then be
if (n & 1)
{
n = (n*3 + 1) >> 1;
count += 2;
}
else
{
n >>= 1;
++count;
}

The simple answer:
doing a MOV RBX, 3 and MUL RBX is expensive; just ADD RBX, RBX twice
ADD 1 is probably faster than INC here
MOV 2 and DIV is very expensive; just shift right
64-bit code is usually noticeably slower than 32-bit code and the alignment issues are more complicated; with small programs like this you have to pack them so you are doing parallel computation to have any chance of being faster than 32-bit code
If you generate the assembly listing for your C++ program, you can see how it differs from your assembly.

Fastest way to unpack bits into single precision floats

This is platform specific question. Speed is crucial.
What is the fastest way to unpack a byte into an array of 8 single precision floats so that zeroes map into zeroes and ones map into ones?
I ended up using 8 bit masks and 7 bit shifts to unpack into 8 int32's and then an AVX instruction to convert int32's into floats.
My platform is Windows 64 bit running on AVX (but no AVX2) capable CPU. Compiler: Visual Studio 2013.
Thanks.

Wouldn't preprocessing be faster? 2^8 possibilities is pretty much, but then again, just split it into two parts, and it's only 2^4 = 16 variables.
Make array consiting of 16 "values", where each value is array filled with 4 floats with right values. Then your cost would be only 2 * (copy data from preprocessed array to new array).
I'm not too deep into assembly, but two copy's should be faster then some loops etc.
unsigned char myByte; // input byte (pattern to create floats)
float preprocessingArrays[16][4] = {
{ 0.0f, 0.0f, 0.0f, 0.0f }, // 0000
// ...
{ 1.0f, 1.0f, 1.0f, 1.0f } // 1111
};
float result[8];
std::memcpy(&result[0], &preprocessingArrays[myByte >> 4][0], 16);
std::memcpy(&result[4], &preprocessingArrays[myByte & 15][0], 16);
// 16 = platform-specific -> floats should be 32bits -> 4bytes * 4 floats = 16
This is written from hand, but as you can see mine loop would consists of two memcpys, one bitshift and one binary AND operation (or only one, but bigger, memcpy, if you want to make preprocessing for 2^8 values).
For C(++) only code i think this would beat loops etc. but assembler code might be faster, i'm not that sure. Maybe you could perform memcpy operation using assembler and in one go read whole 4 floats and then write it in another one call. AVX seems to support up to 16 256bits registers, so it might be possible to just calculate from which register (of 16 possible values) copy value where and this would be very fast.
Also not to write so much code yourself, just make simple program which would print preprocessing values for you, copy it and paste into original program :)

Loops, conditions and going through an actual array in memory are of course not the vector way. So here's an other idea, though it's a bit annoying in only AVX. Since without AVX2 you can do almost nothing with an ymm register (nothing useful anyway), just use two xmm registers and then in the end vinsertf128 the high part to form the whole thing. Mixing like this is OK as long as the operations on xmm registers use VEX encoded instructions (so 'v' goes in front of everything, even when it may seem unnecessary).
Anyway, the idea is to put a copy of the byte in every dword, AND with the right bit per lane and compare to form masks. In the end we can do a single bitwise AND to turn the masks into 0f or 1f.
So, first get that byte everywhere, let's say it's in eax, doesn't really matter:
vmovd xmm0, eax
vpshufd xmm0, xmm0, 0
Extract the right bits:
vpand xmm0, xmm0, [low_mask]
vpand xmm1, xmm0, [high_mask]
The masks are 1, 2, 4, 8 and 16, 32, 64, 128 (this is in memory order, if you use _mm_set_epi32 they have to be the other way around)
Compare to form the masks:
vpxor xmm2, xmm2, xmm2
vpcmpgtd xmm0, xmm0, xmm2
vpcmpgtd xmm1, xmm1, xmm2
Merge:
vinsertf128 ymm0, ymm0, xmm1, 1
Turn into 0f or 1f:
vandps ymm0, ymm0, [ones]
ones is just 1f duplicated 8 times.
I don't know if this is faster, but it's worth a try. Also, none of this was tested.
I tried to convert it to intrinsics, but I have no idea what I'm doing (and it's not tested). Also, be careful that it compiles with VEX prefixes, or it'll cause expensive mode-switching.
// broadcast
__m128i low = _mm_set1_epi32(mask);
__m128i high = _mm_set1_epi32(mask);
// extract bits
low = _mm_and_si128(low, _mm_set_epi32(8, 4, 2, 1));
high = _mm_and_si128(high, _mm_set_epi32(128, 64, 32, 16));
// form masks
low = _mm_cmpgt_epi32(low, _mm_setzero_si128());
high = _mm_cmpgt_epi32(high, _mm_setzero_si128());
// stupid no-op casts
__m256 low2 = _mm256_castps128_ps256(_mm_castsi128_ps(low));
__m128 high2 = _mm_castsi128_ps(high);
// merge
__m256 total = _mm256_insertf128_ps(low2, high2, 1);
// convert to 0f or 1f
total = _mm256_and_ps(total, _mm256_set1_ps(1.0f));
With GCC at least, that generates OK code. It uses vbroadcastss for the set1 (instead of the vpshufd's that I used), I'm not sure how good that idea is (it means it has to bounce that int through memory).
With AVX2 it can be much simpler:
__m256i x = _mm256_set1_epi32(mask);
x = _mm256_and_si256(x, _mm256_set_epi32(128, 64, 32, 16, 8, 4, 2, 1));
x = _mm256_cmpgt_epi32(x, _mm256_setzero_si256());
x = _mm256_and_si256(x, _mm256_set1_epi32(0x3F800000));
return _mm256_castsi256_ps(x);

void byteToFloat(const uint8_t byteIn,
float *const restrict floatOut)
{
floatOut[0]=(byteIn&0x01)?1.0f:0.0f;
floatOut[1]=(byteIn&0x02)?1.0f:0.0f;
floatOut[2]=(byteIn&0x04)?1.0f:0.0f;
floatOut[3]=(byteIn&0x08)?1.0f:0.0f;
floatOut[4]=(byteIn&0x10)?1.0f:0.0f;
floatOut[5]=(byteIn&0x20)?1.0f:0.0f;
floatOut[6]=(byteIn&0x40)?1.0f:0.0f;
floatOut[7]=(byteIn&0x80)?1.0f:0.0f;
}
In x86-64 architectures from both Intel and AMD, branch predication
may be performed through the use of conditional move operations
(cmove): a source operand is conditionally moved to the destination
operand depending on the value of a flag register.
http://en.wikipedia.org/wiki/Branch_predication

Indexing, as #RippeR suggests, is my first guess too.
My second guess is something like this:
switch(theChar){
break; case 0: result[0] = 0; ... result[7] = 0;
break; case 1: result[0] = 0; ... result[7] = 1;
...
break; case 255: result[0] = 1; ... result[7] = 1;
}
It's wordy code, but you could get the preprocessor to help you write it.
The reason this might be faster is the switch should turn into a jump table, and the moves should optimize pretty well.
ADDED: if you're wondering how the preprocessor could help, here's something:
#define FOO(x,i) result[i] = !!((x) & (1<<(i)))
#define BAR(x) break; case x: FOO(x,0);FOO(x,1); ... FOO(x,7)
switch(theChar){
BAR(0);
BAR(1);
...
BAR(255);
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js