How to shift elements of an array using intel intrinsic

How to shift elements of an array using intel intrinsic - c++

I have an array of size 16 which is aligned to 64 byte boundary which I was trying to shift left by 1 index using intel intrinsics.
int history[16] __attribute__((aligned(64)))
for (std::size_t i = 0; i < 15; i++) {
history[i] = history[i + 1];
}
history[15] = 0;
This is the initial loop on which I want to use 512 bit wide vector instructions. Any way to to it with low latency intrinsics.

You have 2 good options for a single-uop lane-crossing shuffle, that you can use between a 512-bit load and store to shuffle the whole cache line. (vpsrldq would do 4 separate 128-bit right shifts so that's unfortunately not what you want.)
vpermd would need a vector control operand, and zero-masking to "shift" in a zero. So the compiler would need extra instructions to load the control vector, and to kmov a constant into a mask register.
valignd is a 32-bit granularity fully lane-crossing version of SSSE3 / AVX2 vpalignr. But it doesn't have any of that horrible AVX / AVX2 "in lane" behaviour where it does multiple separate 128-bit shuffles so it's actually usable to shift a whole 256 or 512-bit vector left or right by a constant number of dwords. You need either zero-masking or a zeroed vector to shift in zeros from. A zeroed vector is as cheap as a NOP to create on Intel CPUs.
(perf numbers from https://www.uops.info/table.html - valignd is 1 uop for port 5 on Skylake-AVX512, same as vpermd or even vpermt2d which could similarly grab a zero from another register.)
#include <immintrin.h>
alignas(16) int history[16]; // C++ has had portable syntax for alignment since C++11
// assumes aligned pointer input
void shift64_right_4bytes(int *arr) {
__m512i v = _mm512_load_si512(arr); // AVX512 load intrinsics conveniently take void*, not __m512i*
v = _mm512_alignr_epi32( _mm512_setzero_si512(), v, 1 ); // v = (0:v) >> 32bits
_mm512_store_si512(arr, v);
}
Compiles to this asm (Godbolt):
# GCC10.2 -O3 -march=skylake-avx512
shift64_right_4bytes(int*):
vpxor xmm0, xmm0, xmm0
valignd zmm0, zmm0, ZMMWORD PTR [rdi], 1
vmovdqa64 ZMMWORD PTR [rdi], zmm0
vzeroupper
ret
Obviously the vpxor-zeroing and vzeroupper overhead could be hoisted/sunk out of loops after inlining, if you had an outer loop around that loop you showed.
So the real ALU work is just 1 uop for port 5. Of course, if you wrote this array with narrower stores very recently, you could get a store-forwarding stall. Could still be worth it, just extra latency to load, doesn't actually stall the whole pipeline or out-of-order execution of independent work.
If the rest of your code doesn't use 512-bit vectors, you might want to avoid them here (SIMD instructions lowering CPU frequency)
2x 256-bit loads that overlap by one int might be good, then store them. i.e. a 15-byte memmove with the same strategy that glibc's memcpy / memmove uses for small copies. Then store a zero at the end.
// only needs AVX1
// With 64-byte aligned history, no load or store crosses a cache-line boundary
void shift64_right_4bytes_256b(int *history) {
__m256i v0 = _mm256_loadu_si256((const __m256i*)(history+1));
__m256i v1 = _mm256_load_si256((const __m256i*)(history+8));
_mm256_store_si256((__m256i*)history, v0);
_mm256_storeu_si256((__m256i*)(history+7), v1); // overlap by 1 dword
history[15] = 0;
}
Or maybe valignd ymm for the high half, to shift a zero into the vector instead of a separate scalar store. (That would require AVX512VL instead of just AVX1 for this version, but that's fine on AXV512 CPUs.)
Partly depends on how you want to reload it, and whether the surrounding code does a lot of stores. (Back-end pressure on the store execution units and store buffer).
Or if it was originally stored with 2x 256-bit aligned stores, then the unaligned load could hit a store-forwarding stall which you could avoid by using valignd to shift a dword between the high and low halves, as well as to shift a zero into the high half.

Related

C++ std::countr_zero() in SIMD 128/256/512 (find position of least significant 1 bit in 128/256/512-bit number) [duplicate]

I have a huge memory block (bit-vector) with size N bits within one memory page, consider N on average is 5000, i.e. 5k bits to store some flags information.
At a certain points in time (super-frequent - critical) I need to find the first bit set in this whole big bit-vector. Now I do it per-64-word, i.e. with help of __builtin_ctzll). But when N grows and search algorithm cannot be improved, there can be some possibility to scale this search through the expansion of memory access width. This is the main problem in a few words
There is a single assembly instruction called BSF that gives the position of the highest set bit (GCC's __builtin_ctzll()).
So in x86-64 arch I can find the highest bit set cheaply in 64-bit words.
But what about scaling through memory width?
E.g. is there a way to do it efficiently with 128 / 256 / 512 -bit registers?
Basically I'm interested in some C API function to achieve this, but also want to know what this method is based on.
UPD: As for CPU, I'm interested for this optimization to support the following CPU lineups:
Intel Xeon E3-12XX, Intel Xeon E5-22XX/26XX/E56XX, Intel Core i3-5XX/4XXX/8XXX, Intel Core i5-7XX, Intel Celeron G18XX/G49XX (optional for Intel Atom N2600, Intel Celeron N2807, Cortex-A53/72)
P.S. In mentioned algorithm before the final bit scan I need to sum k (in average 20-40) N-bit vectors with CPU AND (the AND result is just a preparatory stage for the bit-scan). This is also desirable to do with memory width scaling (i.e. more efficiently than per 64bit-word AND)
Read also: Find first set

This answer is in a different vein, but if you know in advance that you're going to be maintaining a collection of B bits and need to be able to efficiently set and clear bits while also figuring out which bit is the first bit set, you may want to use a data structure like a van Emde Boas tree or a y-fast trie. These data structures are designed to store integers in a small range, so instead of setting or clearing individual bits, you could add or remove the index of the bit you want to set/clear. They're quite fast - you can add or remove items in time O(log log B), and they let you find the smallest item in time O(1). Figure that if B ≈ 50000, then log log B is about 4.
I'm aware this doesn't directly address how to find the highest bit set in a huge bitvector. If your setup is such that you have to work with bitvectors, the other answers might be more helpful. But if you have the option to reframe the problem in a way that doesn't involve bitvector searching, these other data structures might be a better fit.

The best way to find the first set bit within a whole vector (AFAIK) involves finding the first non-zero SIMD element (e.g. a byte or dword), then using a bit-scan on that. (__builtin_ctz / bsf / tzcnt / ffs-1) . As such, ctz(vector) is not itself a useful building block for searching an array, only for after the loop.
Instead you want to loop over the array searching for a non-zero vector, using a whole-vector check involving SSE4.1 ptest xmm0,xmm0 / jz .loop (3 uops), or with SSE2 pcmpeqd v, zero / pmovmskb / cmp eax, 0xffff / je .loop (3 uops after cmp/jcc macro-fusion). https://uops.info/
Once you do find a non-zero vector, pcmpeqb / movmskps / bsf on that to find a dword index, then load that dword and bsf it. Add the start-bit position (CHAR_BIT*4*dword_idx) to the bsf bit-position within that element. This is a fairly long dependency chain for latency, including an integer L1d load latency. But since you just loaded the vector, at least you can be fairly confident you'll hit in cache when you load it again with integer. (If the vector was generated on the fly, then probably still best to store / reload it and let store-forwarding work, instead of trying to generate a shuffle control for vpermilps/movd or SSSE3 pshufb/movd/movzx ecx, al.)
The loop problem is very much like strlen or memchr, except we're rejecting a single value (0) and looking for anything else. Still, we can take inspiration from hand-optimized asm strlen / memchr implementations like glibc's, for example loading multiple vectors and doing one check to see if any of them have what they're looking for. (For strlen, combine with pminub to get a 0 if any element is 0. For pcmpeqb compare results, OR for memchr). For our purposes, the reduction operation we want is OR - any non-zero input will make the output non-zero, and bitwise boolean ops can run on any vector ALU port.
(If the expected first-bit-position isn't very high, it's not worth being too aggressive with this: if the first set bit is in the first vector, sorting things out between 2 vectors you've loaded will be slower. 5000 bits is only 625 bytes, or 19.5 AVX2 __m256i vectors. And the first set bit is probably not always right at the end)
AVX2 version:
This checks pairs of 32-byte vectors (i.e. whole cache lines) for non-zero, and if found then sorts that out into one 64-bit bitmap for a single CTZ operation. That extra shift/OR costs latency in the critical path, but the hope is that we get to the first 1 bit sooner.
Combining 2 vectors down to one with OR means it's not super useful to know which element of the OR result was non-zero. We basically redo the work inside the if. That's the price we pay for keeping the amount of uops low for the actual search part.
(The if body ends with a return, so in the asm it's actually like an if()break, or actually an if()goto out of the loop since it goes to a difference place than the not-found return -1 from falling through out of the loop.)
// untested, especially the pointer end condition, but compiles to asm that looks good
// Assumes len is a multiple of 64 bytes
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
// aliasing-safe: p can point to any C data type
int bitscan_avx2(const char *p, size_t len /* in bytes */)
{
//assert(len % 64 == 0);
//optimal if p is 64-byte aligned, so we're checking single cache-lines
const char *p_init = p;
const char *endp = p + len - 64;
do {
__m256i v1 = _mm256_loadu_si256((const __m256i*)p);
__m256i v2 = _mm256_loadu_si256((const __m256i*)(p+32));
__m256i or = _mm256_or_si256(v1,v2);
if (!_mm256_testz_si256(or, or)){ // find the first non-zero cache line
__m256i v1z = _mm256_cmpeq_epi32(v1, _mm256_setzero_si256());
__m256i v2z = _mm256_cmpeq_epi32(v2, _mm256_setzero_si256());
uint32_t zero_map = _mm256_movemask_ps(_mm256_castsi256_ps(v1z));
zero_map |= _mm256_movemask_ps(_mm256_castsi256_ps(v2z)) << 8;
unsigned idx = __builtin_ctz(~zero_map); // Use ctzll for GCC, because GCC is dumb and won't optimize away a movsx
uint32_t nonzero_chunk;
memcpy(&nonzero_chunk, p+4*idx, sizeof(nonzero_chunk)); // aliasing / alignment-safe load
return (p-p_init + 4*idx)*8 + __builtin_ctz(nonzero_chunk);
}
p += 64;
}while(p < endp);
return -1;
}
On Godbolt with clang 12 -O3 -march=haswell:
bitscan_avx2:
lea rax, [rdi + rsi]
add rax, -64 # endp
xor ecx, ecx
.LBB0_1: # =>This Inner Loop Header: Depth=1
vmovdqu ymm1, ymmword ptr [rdi] # do {
vmovdqu ymm0, ymmword ptr [rdi + 32]
vpor ymm2, ymm0, ymm1
vptest ymm2, ymm2
jne .LBB0_2 # if() goto out of the inner loop
add ecx, 512 # bit-counter incremented in the loop, for (p-p_init) * 8
add rdi, 64
cmp rdi, rax
jb .LBB0_1 # }while(p<endp)
mov eax, -1 # not-found return path
vzeroupper
ret
.LBB0_2:
vpxor xmm2, xmm2, xmm2
vpcmpeqd ymm1, ymm1, ymm2
vmovmskps eax, ymm1
vpcmpeqd ymm0, ymm0, ymm2
vmovmskps edx, ymm0
shl edx, 8
or edx, eax # mov ah,dl would be interesting, but compilers won't do it.
not edx # one_positions = ~zero_positions
xor eax, eax # break false dependency
tzcnt eax, edx # dword_idx
xor edx, edx
tzcnt edx, dword ptr [rdi + 4*rax] # p[dword_idx]
shl eax, 5 # dword_idx * 4 * CHAR_BIT
add eax, edx
add eax, ecx
vzeroupper
ret
This is probably not optimal for all CPUs, e.g. maybe we could use a memory-source vpcmpeqd for at least one of the inputs, and not cost any extra front-end uops, only back-end. As long as compilers keep using pointer-increments, not indexed addressing modes that would un-laminate. That would reduce the amount of work needed after the branch (which probably mispredicts).
To still use vptest, you might have to take advantage of the CF result from the CF = (~dst & src == 0) operation against a vector of all-ones, so we could check that all elements matched (i.e. the input was all zeros). Unfortunately, Can PTEST be used to test if two registers are both zero or some other condition? - no, I don't think we can usefully use vptest without a vpor.
Clang decided not to actually subtract pointers after the loop, instead to do more work in the search loop. :/ The loop is 9 uops (after macro-fusion of cmp/jb), so unfortunately it can only run a bit less than 1 iteration per 2 cycles. So it's only managing less than half of L1d cache bandwidth.
But apparently a single array isn't your real problem.
Without AVX
16-byte vectors mean we don't have to deal with the "in-lane" behaviour of AVX2 shuffles. So instead of OR, we can combine with packssdw or packsswb. Any set bits in the high half of a pack input will signed-saturate the result to 0x80 or 0x7f. (So signed saturation is key, not unsigned packuswb which will saturate signed-negative inputs to 0.)
However, shuffles only run on port 5 on Intel CPUs, so beware of throughput limits. ptest on Skylake for example is 2 uops, p5 and p0, so using packsswb + ptest + jz would limit to one iteration per 2 clocks. But pcmpeqd + pmovmskb don't.
Unfortunately, using pcmpeq on each input separately before packing / combining would cost more uops. But would reduce the amount of work left for the cleanup, and if the loop-exit usually involves a branch mispredict, that might reduce overall latency.
2x pcmpeqd => packssdw => pmovmskb => not => bsf would give you a number you have to multiply by 2 to use as a byte offset to get to the non-zero dword. e.g. memcpy(&tmp_u32, p + (2*idx), sizeof(tmp_u32));. i.e. bsf eax, [rdi + rdx*2].
With AVX-512:
You mentioned 512-bit vectors, but none of the CPUs you listed support AVX-512. Even if so, you might want to avoid 512-bit vectors because SIMD instructions lowering CPU frequency, unless your program spends a lot of time doing this, and your data is hot in L1d cache so you can truly benefit instead of still bottlenecking on L2 cache bandwidth. But even with 256-bit vectors, AVX-512 has new instructions that are useful for this:
integer compares (vpcmpb/w/d/q) have a choice of predicate, so you can do not-equal instead of having to invert later with NOT. Or even test-into-register vptestmd so you don't need a zeroed vector to compare against.
compare-into-mask is sort of like pcmpeq + movmsk, except the result is in a k register, still need a kmovq rax, k0 before you can tzcnt.
kortest - set FLAGS according to the OR of two mask registers being non-zero. So the search loop could do vpcmpd k0, ymm0, [rdi] / vpcmpd k1, ymm0, [rdi+32] / kortestw k0, k1
ANDing multiple input arrays
You mention your real problem is that you have up-to-20 arrays of bits, and you want to intersect them with AND and find the first set bit in the intersection.
You may want to do this in blocks of a few vectors, optimistically hoping that there will be a set bit somewhere early.
AND groups of 4 or 8 inputs, accumulating across results with OR so you can tell if there were any 1s in this block of maybe 4 vectors from each input. (If there weren't any 1 bits, do another block of 4 vectors, 64 or 128 bytes while you still have the pointers loaded, because the intersection would definitely be empty if you moved on to the other inputs now). Tuning these chunk sizes depends on how sparse your 1s are, e.g. maybe always work in chunks of 6 or 8 vectors. Power-of-2 numbers are nice, though, because you can pad your allocations out to a multiple of 64 or 128 bytes so you don't have to worry about stopping early.)
(For odd numbers of inputs, maybe pass the same pointer twice to a function expecting 4 inputs, instead of dispatching to special versions of the loop for every possible number.)
L1d cache is 8-way associative (before Ice Lake with 12-way), and a limited number of integer/pointer registers can make it a bad idea to try to read too many streams at once. You probably don't want a level of indirection that makes the compiler loop over an actual array in memory of pointers either.

You may try this function, your compiler should optimize this code for your CPU. It's not super perfect, but it should be relatively quick and mostly portable.
PS length should be divisible by 8 for max speed
#include <stdio.h>
#include <stdint.h>
/* Returns the index position of the most significant bit; starting with index 0. */
/* Return value is between 0 and 64 times length. */
/* When return value is exact 64 times length, no significant bit was found, aka bf is 0. */
uint32_t offset_fsb(const uint64_t *bf, const register uint16_t length){
register uint16_t i = 0;
uint16_t remainder = length % 8;
switch(remainder){
case 0 : /* 512bit compare */
while(i < length){
if(bf[i] | bf[i+1] | bf[i+2] | bf[i+3] | bf[i+4] | bf[i+5] | bf[i+6] | bf[i+7]) break;
i += 8;
}
/* fall through */
case 4 : /* 256bit compare */
while(i < length){
if(bf[i] | bf[i+1] | bf[i+2] | bf[i+3]) break;
i += 4;
}
/* fall through */
case 6 : /* 128bit compare */
/* fall through */
case 2 : /* 128bit compare */
while(i < length){
if(bf[i] | bf[i+1]) break;
i += 2;
}
/* fall through */
default : /* 64bit compare */
while(i < length){
if(bf[i]) break;
i++;
}
}
register uint32_t offset_fsb = i * 64;
/* Check the last uint64_t if the last uint64_t is not 0. */
if(bf[i]){
register uint64_t s = bf[i];
offset_fsb += 63;
while(s >>= 1) offset_fsb--;
}
return offset_fsb;
}
int main(int argc, char *argv[]){
uint64_t test[16];
test[0] = 0;
test[1] = 0;
test[2] = 0;
test[3] = 0;
test[4] = 0;
test[5] = 0;
test[6] = 0;
test[7] = 0;
test[8] = 0;
test[9] = 0;
test[10] = 0;
test[11] = 0;
test[12] = 0;
test[13] = 0;
test[14] = 0;
test[15] = 1;
printf("offset_fsb = %d\n", offset_fsb(test, 16));
return 0;
}

Why does not AVX further improve the performance compared with SSE2?

I am new to the field of SSE2 and AVX. I write the following code to test the performance of both SSE2 and AVX.
#include <cmath>
#include <iostream>
#include <chrono>
#include <emmintrin.h>
#include <immintrin.h>
void normal_res(float* __restrict__ a, float* __restrict__ b, float* __restrict__ c, unsigned long N) {
for (unsigned long n = 0; n < N; n++) {
c[n] = sqrt(a[n]) + sqrt(b[n]);
}
}
void normal(float* a, float* b, float* c, unsigned long N) {
for (unsigned long n = 0; n < N; n++) {
c[n] = sqrt(a[n]) + sqrt(b[n]);
}
}
void sse(float* a, float* b, float* c, unsigned long N) {
__m128* a_ptr = (__m128*)a;
__m128* b_ptr = (__m128*)b;
for (unsigned long n = 0; n < N; n+=4, a_ptr++, b_ptr++) {
__m128 asqrt = _mm_sqrt_ps(*a_ptr);
__m128 bsqrt = _mm_sqrt_ps(*b_ptr);
__m128 add_result = _mm_add_ps(asqrt, bsqrt);
_mm_store_ps(&c[n], add_result);
}
}
void avx(float* a, float* b, float* c, unsigned long N) {
__m256* a_ptr = (__m256*)a;
__m256* b_ptr = (__m256*)b;
for (unsigned long n = 0; n < N; n+=8, a_ptr++, b_ptr++) {
__m256 asqrt = _mm256_sqrt_ps(*a_ptr);
__m256 bsqrt = _mm256_sqrt_ps(*b_ptr);
__m256 add_result = _mm256_add_ps(asqrt, bsqrt);
_mm256_store_ps(&c[n], add_result);
}
}
int main(int argc, char** argv) {
unsigned long N = 1 << 30;
auto *a = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
auto *b = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
auto *c = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
std::chrono::time_point<std::chrono::system_clock> start, end;
for (unsigned long i = 0; i < N; ++i) {
a[i] = 3141592.65358;
b[i] = 1234567.65358;
}
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
normal(a, b, c, N);
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "normal elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
normal_res(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "normal restrict elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
sse(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "sse elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
avx(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "avx elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
return 0;
}
I compile my program by using g++ complier as the following.
g++ -msse -msse2 -mavx -mavx512f -O2
The results are as the following. It seems that there is no further improvement when I use more advanced 256 bits vectors.
normal elapsed time: 10.5311
normal restrict elapsed time: 8.00338
sse elapsed time: 0.995806
avx elapsed time: 0.973302
I have two questions.
Why does not AVX give me further improvement? Is it because the memory bandwidth?
According to my experiment, the SSE2 perform 10 times faster than the naive version. Why is that? I expect the SSE2 can only be 4 times faster based on its 128 bits vectors with respect to single precision floating points. Thanks a lot.

There are several issues here....
Memory bandwidth is very likely to be important for these array sizes -- more notes below.
Throughput for SSE and AVX square root instructions may not be what you expect on your processor -- more notes below.
The first test ("normal") may be slower than expected because the output array is instantiated (i.e., virtual to physical mappings are created) during the timed part of the test. (Just fill c with zeros in the loop that initializes a and b to fix this.)
Memory Bandwidth Notes:
With N = 1<<30 and float variables, each array is 4GiB.
Each test reads two arrays and writes to a third array. This third array must also be read from memory before being overwritten -- this is called a "write allocate" or a "read for ownership".
So you are reading 12 GiB and writing 4 GiB in each test. The SSE and AVX tests therefore correspond to ~16 GB/s of DRAM bandwidth, which is near the high end of the range typically seen for single-threaded operation on recent processors.
Instruction Throughput Notes:
The best reference for instruction latency and throughput on x86 processors is "instruction_tables.pdf" from https://www.agner.org/optimize/
Agner defines "reciprocal throughput" as the average number of cycles per retired instruction when the processor is given a workload of independent instructions of the same type.
As an example, for an Intel Skylake core, the throughput of SSE and AVX SQRT is the same:
SQRTPS (xmm) 1/throughput = 3 --> 1 instruction every 3 cycles
VSQRTPS (ymm) 1/throughput = 6 --> 1 instruction every 6 cycles
Execution time for the square roots is expected to be (1<<31) square roots / 4 square roots per SSE SQRT instruction * 3 cycles per SSE SQRT instruction / 3 GHz = 0.54 seconds (randomly assuming a processor frequency).
Expected throughput for the "normal" and "normal_res" cases depends on the specifics of the generated assembly code.

Scalar being 10x instead of 4x slower:
You're getting page faults in c[] inside the scalar timed region because that's the first time you're writing it. If you did tests in a different order, whichever one was first would pay that large penalty. That part is a duplicate of this mistake: Why is iterating though `std::vector` faster than iterating though `std::array`? See also Idiomatic way of performance evaluation?
normal pays this cost in its first of the 5 passes over the array. Smaller arrays and a larger repeat count would amortize this even more, but better to memset or otherwise fill your destination first to pre-fault it ahead of the timed region.
normal_res is also scalar but is writing into an already-dirtied c[]. Scalar is 8x slower than SSE instead of the expected 4x.
You used sqrt(double) instead of sqrtf(float) or std::sqrt(float). On Skylake-X, this perfectly accounts for an extra factor of 2 throughput. Look at the compiler's asm output on the Godbolt compiler explorer (GCC 7.4 assuming the same system as your last question). I used -mavx512f (which implies -mavx and -msse), and no tuning options, to hopefully get about the same code-gen you did. main doesn't inline normal_res, so we can just look at the stand-alone definition for it.
normal_res(float*, float*, float*, unsigned long):
...
vpxord zmm2, zmm2, zmm2 # uh oh, 512-bit instruction reduces turbo clocks for the next several microseconds. Silly compiler
# more recent gcc would just use `vpxor xmm0,xmm0,xmm0`
...
.L5: # main loop
vxorpd xmm0, xmm0, xmm0
vcvtss2sd xmm0, xmm0, DWORD PTR [rdi+rbx*4] # convert to double
vucomisd xmm2, xmm0
vsqrtsd xmm1, xmm1, xmm0 # scalar double sqrt
ja .L16
.L3:
vxorpd xmm0, xmm0, xmm0
vcvtss2sd xmm0, xmm0, DWORD PTR [rsi+rbx*4]
vucomisd xmm2, xmm0
vsqrtsd xmm3, xmm3, xmm0 # scalar double sqrt
ja .L17
.L4:
vaddsd xmm1, xmm1, xmm3 # scalar double add
vxorps xmm4, xmm4, xmm4
vcvtsd2ss xmm4, xmm4, xmm1 # could have just converted in-place without zeroing another destination to avoid a false dependency :/
vmovss DWORD PTR [rdx+rbx*4], xmm4
add rbx, 1
cmp rcx, rbx
jne .L5
The vpxord zmm only reduces turbo clock for a few milliseconds (I think) at the start of each call to normal and normal_res. It doesn't keep using 512-bit operations so clock speed can jump back up again later. This might partially account for it not being exactly 8x.
The compare / ja is because you didn't use -fno-math-errno so GCC still calls actual sqrt for inputs < 0 to get errno set. It's doing if (!(0 <= tmp)) goto fallback, jumping on 0 > tmp or unordered. "Fortunately" sqrt is slow enough that it's still the only bottleneck. Out-of-order exec of the conversion and compare/branching means the SQRT unit is still kept busy ~100% of the time.
vsqrtsd throughput (6 cycles) is 2x slower than vsqrtss throughput (3 cycles) on Skylake-X, so using double costs a factor of 2 in scalar throughput.
Scalar sqrt on Skylake-X has the same throughput as the corresponding 128-bit ps / pd SIMD version. So 6 cycles per 1 number as a double vs. 3 cycles per 4 floats as a ps vector fully explains the 8x factor.
The extra 8x vs. 10x slowdown for normal was just from page faults.
SSE vs. AVX sqrt throughput
128-bit sqrtps is sufficient to get the full throughput of the SIMD div/sqrt unit; assuming this is a Skylake-server like your last question, it's 256 bits wide but not fully pipelined. The CPU can alternate sending a 128-bit vector into the low or high half to take advantage of the full hardware width even when you're only using 128-bit vectors. See Floating point division vs floating point multiplication (FP div and sqrt run on the same execution unit.)
See also instruction latency/throughput numbers on https://uops.info/, or on https://agner.org/optimize/.
The add/sub/mul/fma are all 512-bits wide and fully pipelined; use that (e.g. to evaluate a 6th order polynomial or something) if you want something that can scale with vector width. div/sqrt is a special case.
You'd expect a benefit from using 256-bit vectors for SQRT only if you had a bottleneck on the front-end (4/clock instruction / uop throughput), or if you were doing a bunch of add/sub/mul/fma work with the vectors as well.
256-bit isn't worse, but it doesn't help when the only computation bottleneck is on the div/sqrt unit's throughput.
See John McCalpin's answer for more details about write-only costing about the same as a read+write, because of RFOs.
With so little computation per memory access, you're probably close to bottlenecking on memory bandwidth again / still. Even if the FP SQRT hardware was wider / faster, you might not in practice have your code run any faster. Instead you'd just have the core spend more time doing nothing while waiting for data to arrive from memory.
It seems you are getting exactly the expected speedup from 128-bit vectors (2x * 4x = 8x), so apparently the __m128 version is not bottlenecked on memory bandwidth either.
2x sqrt per 4 memory accesses is about the same as the a[i] = sqrt(a[i]) (1x sqrt per load + store) you were doing in the code you posted in chat, but you didn't give any numbers for that. That one avoided the page-fault problem because it was rewriting an array in-place after initializing it.
In general rewriting an array in-place is a good idea if you for some reason keep insisting on trying to get a 4x / 8x / 16x SIMD speedup using these insanely huge arrays that won't even fit in L3 cache.
Memory access is pipelined, and overlaps with computation (assuming sequential access so prefetchers can be pulling it in continuously without having to compute the next address): faster computation doesn't speed up overall progress. Cache lines arrive from memory at some fixed maximum bandwidth, with ~12 cache line transfers in flight at once (12 LFBs in Skylake). Or L2 "superqueue" can track more cache lines than that (maybe 16?), so L2 prefetch is reading ahead of where the CPU core is stalled.
As long as your computation can keep up with that rate, making it faster will just leave more cycles of doing nothing before the next cache line arrives.
(The store buffer writing back to L1d and then evicting dirty lines is also happening, but the basic idea of core waiting for memory still works.)
You could think of it like stop-and-go traffic in a car: a gap opens ahead of your car. Closing that gap faster doesn't gain you any average speed, it just means you have to stop faster.
If you want to see the benefit of AVX and AVX512 over SSE, you'll need smaller arrays (and a higher repeat-count). Or you'll need lots of ALU work per vector, like a polynomial.
In many real-world problems, the same data is used repeatedly so caches work. And it's possible to break up your problem into doing multiple things to one block of data while it's hot in cache (or even while loaded in registers), to increase the computational intensity enough to take advantage of the compute vs. memory balance of modern CPUs.

why is it faster to print number in binary using arithmetic instead of _bittest

The purpose of the next two code section is to print number in binary.
The first one does this by two instructions (_bittest), while the second does it by pure arithmetic instructions which is three instructions.
the first code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>
long num = 78002;
int main()
{
unsigned char bits[32];
long nBit;
LARGE_INTEGER a, b, f;
QueryPerformanceCounter(&a);
for (size_t i = 0; i < 100000000; i++)
{
for (nBit = 0; nBit < 31; nBit++)
{
bits[nBit] = _bittest(&num, nBit);
}
}
QueryPerformanceCounter(&b);
QueryPerformanceFrequency(&f);
printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
printf_s("Binary representation:\n");
while (nBit--)
{
if (bits[nBit])
printf_s("1");
else
printf_s("0");
}
return 0;
}
the inner loop is compile to the instructions bt and setb
The second code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>
long num = 78002;
int main()
{
unsigned char bits[32];
long nBit;
LARGE_INTEGER a, b, f;
QueryPerformanceCounter(&a);
for (size_t i = 0; i < 100000000; i++)
{
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++)
{
bits[nBit] = (num&curBit) >> nBit;
curBit <<= 1;
}
}
QueryPerformanceCounter(&b);
QueryPerformanceFrequency(&f);
printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
printf_s("Binary representation:\n");
while (nBit--)
{
if (bits[nBit])
printf_s("1");
else
printf_s("0");
}
return 0;
}
The inner loop compile to and add(as shift left) and sar.
the second code section run three time faster then the first one.
Why three cpu instructions run faster then two?

Not answer (Bo did), but the second inner loop version can be simplified a bit:
long numCopy = num;
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = numCopy & 1;
numCopy >>= 1;
}
Has subtle difference (1 instruction less) with gcc 7.2 targetting 32b.
(I'm assuming 32b target, as you convert long into 32 bit array, which makes sense only on 32b target ... and I assume x86, as it includes <windows.h>, so it's clearly for obsolete OS target, although I think windows now have even 64b version? (I don't care.))
Answer:
Why three cpu instructions run faster then two?
Because the count of instructions only correlates with performance (usually fewer is better), but the modern x86 CPU is much more complex machine, translating the actual x86 instructions into micro-code before execution, transforming that further by things like out-of-order-execution and register renaming (to break false dependency chains), and then it executes the resulting microcode, with different units of CPU capable to execute only some micro-ops, so in ideal case you may get 2-3 micro-ops executed in parallel by the 2-3 units in single cycle, and in worst case you may be executing an complete micro-code loop implementing some complex x86 instruction taking several cycles to finish, blocking most of the CPU units.
Another factor is availability of data from memory and memory writes, a single cache miss, when the data must be fetched from higher level cache, or even memory itself, creates tens-to-hundreds cycles stall. Having compact data structures favouring predictable access patterns and not exhausting all cache-lines is paramount for exploiting maximum CPU performance.
If you are at stage "why 3 instructions are faster than 2 instructions", you pretty much can start with any x86 optimization article/book, and keep reading for few months or year(s), it's quite complex topic.
You may want to check this answer https://gamedev.stackexchange.com/q/27196 for further reading...

I'm assuming you're using x86-64 MSVC CL19 (or something that makes similar code).
_bittest is slower because MSVC does a horrible job and keeps the value in memory and bt [mem], reg is much slower than bt reg,reg. This is a compiler missed-optimization. It happens even if you make num a local variable instead of a global, even when the initializer is still constant!
I included some perf analysis for Intel Sandybridge-family CPUs because they're common; you didn't say and yes it matters: bt [mem], reg has one per 3 cycle throughput on Ryzen, one per 5 cycle throughput on Haswell. And other perf characteristics differ...
(For just looking at the asm, it's usually a good idea to make a function with args to get code the compiler can't do constant-propagation on. It can't in this case because it doesn't know if anything modifies num before main runs, because it's not static.)
Your instruction-counting didn't include the whole loop so your counts are wrong, but more importantly you didn't consider the different costs of different instructions. (See Agner Fog's instruction tables and optimization manual.)
This is your whole inner loop with the _bittest intrinsic, with uop counts for Haswell / Skylake:
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = _bittest(&num, nBit);
//bits[nBit] = (bool)(num & (1UL << nBit)); // much more efficient
}
Asm output from MSVC CL19 -Ox on the Godbolt compiler explorer
$LL7#main:
bt DWORD PTR num, ebx ; 10 uops (microcoded), one per 5 cycle throughput
lea rcx, QWORD PTR [rcx+1] ; 1 uop
setb al ; 1 uop
inc ebx ; 1 uop
mov BYTE PTR [rcx-1], al ; 1 uop (micro-fused store-address and store-data)
cmp ebx, 31
jb SHORT $LL7#main ; 1 uop (macro-fused with cmp)
That's 15 fused-domain uops, so it can issue (at 4 per clock) in 3.75 cycles. But that's not the bottleneck: Agner Fog's testing found that bt [mem], reg has a throughput of one per 5 clock cycles.
IDK why it's 3x slower than your other loop. Maybe the other ALU instructions compete for the same port as the bt, or the data dependency it's part of causes a problem, or just being a micro-coded instruction is a problem, or maybe the outer loop is less efficient?
Anyway, using bt [mem], reg instead of bt reg, reg is a major missed optimization. This loop would have been faster than your other loop with a 1 uop, 1c latency, 2-per-clock throughput bt r9d, ebx.
The inner loop compile to and add(as shift left) and sar.
Huh? Those are the instructions MSVC associates with the curBit <<= 1; source line (even though that line is fully implemented by the add self,self, and the variable-count arithmetic right shift is part of a different line.)
But the whole loop is this clunky mess:
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = (num&curBit) >> nBit;
curBit <<= 1;
}
$LL18#main: # MSVC CL19 -Ox
mov ecx, ebx ; 1 uop
lea r8, QWORD PTR [r8+1] ; 1 uop pointer-increment for bits
mov eax, r9d ; 1 uop. r9d holds num
inc ebx ; 1 uop
and eax, edx ; 1 uop
# MSVC says all the rest of these instructions are from curBit <<= 1; but they're obviously not.
add edx, edx ; 1 uop
sar eax, cl ; 3 uops (variable-count shifts suck)
mov BYTE PTR [r8-1], al ; 1 uop (micro-fused)
cmp ebx, 31
jb SHORT $LL18#main ; 1 uop (macro-fused with cmp)
So this is 11 fused-domain uops, and takes 2.75 clock cycles per iteration to issue from the front-end.
I don't see any loop-carried dep chains longer than that front-end bottleneck, so it probably runs about that fast.
Copying ebx to ecx every iteration instead of just using ecx as the loop counter (nBit) is an obvious missed optimization. The shift-count is needed in cl for a variable-count shift (unless you enable BMI2 instructions, if MSVC can even do that.)
There are major missed optimizations here (in the "fast" version), so you should probably write your source differently do hand-hold your compiler into making less bad code. It implements this fairly literally instead of transforming it into something the CPU can do efficiently, or using bt reg,reg / setc
How to do this fast in asm or with intrinsics
Use SSE2 / AVX. Get the right byte (containing the corresponding bit) into each byte element of a vector, and PANDN (to invert your vector) with a mask that has the right bit for that element. PCMPEQB against zero. That gives you 0 / -1. To get ASCII digits, use _mm_sub_epi8(set1('0'), mask) to subtract 0 or -1 (add 0 or 1) to ASCII '0', conditionally turning it into '1'.
The first steps of this (getting a vector of 0/-1 from a bitmask) is How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.
Fastest way to unpack 32 bits to a 32 byte SIMD vector (has a 128b version). Without SSSE3 (pshufb), I think punpcklbw / punpcklwd (and maybe pshufd) is what you need to repeat each byte of num 8 times and make two 16-byte vectors.
is there an inverse instruction to the movemask instruction in intel avx2?.
In scalar code, this is one way that runs at 1 bit->byte per clock. There are probably ways to do better without using SSE2 (storing multiple bytes at once to get around the 1 store per clock bottleneck that exists on all current CPUs), but why bother? Just use SSE2.
mov eax, [num]
lea rdi, [rsp + xxx] ; bits[]
.loop:
shr eax, 1 ; constant-count shift is efficient (1 uop). CF = last bit shifted out
setc [rdi] ; 2 uops, but just as efficient as setc reg / mov [mem], reg
shr eax, 1
setc [rdi+1]
add rdi, 2
cmp end_pointer ; compare against another register instead of a separate counter.
jb .loop
Unrolled by two to avoid bottlenecking on the front-end, so this can run at 1 bit per clock.

The difference is that the code _bittest(&num, nBit); uses a pointer to num, which makes the compiler store it in memory. And the memory access makes the code a lot slower.
bits[nBit] = _bittest(&num, nBit);
00007FF6D25110A0 bt dword ptr [num (07FF6D2513034h)],ebx ; <-----
00007FF6D25110A7 lea rcx,[rcx+1]
00007FF6D25110AB setb al
00007FF6D25110AE inc ebx
00007FF6D25110B0 mov byte ptr [rcx-1],al
The other version stores all the variables in registers, and uses very fast register shifts and adds. No memory accesses.

Conditionally return only some results from a SIMD calculation, instead of all [duplicate]

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
I've seen in SSE where it was done like this:
(From:https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)
__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
// Move 4 sign bits of mask to 4-bit integer value.
int mask = _mm_movemask_ps(mask);
// Select shuffle control data
__m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask]);
// Permute to move valid values to front of SIMD register
__m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
return packed;
}
This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large(256 entries, each 32 bytes, or 8k).
I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.
I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..
Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?
Here is an illustration of the Left Packing Problem from the above document:
Thanks

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)
We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.
We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.
Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).
AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)
For integer vectors with 32-bit or wider elements: Either 1) _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
For 16-bit elements, you might be able to adapt this with 128-bit vectors.
For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.
The algorithm:
Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.
Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.
By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)
We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.
See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.
#include <stdint.h>
#include <immintrin.h>
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101); // unpack each bit to a byte
expanded_mask *= 0xFF; // mask |= mask<<1 | mask<<2 | ... | mask<<7;
// ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte
const uint64_t identity_indices = 0x0706050403020100; // the identity shuffle for vpermps, packed to one index per byte
uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);
__m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
__m256i shufmask = _mm256_cvtepu8_epi32(bytevec);
return _mm256_permutevar8x32_ps(src, shufmask);
}
This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).
# clang 3.7.1 -std=gnu++14 -O3 -march=haswell
mov eax, edi # just to zero extend: goes away when inlining
movabs rcx, 72340172838076673 # The constants are hoisted after inlining into a loop
pdep rax, rax, rcx # ABC -> 0000000A0000000B....
imul rax, rax, 255 # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
movabs rcx, 506097522914230528
pext rax, rcx, rax
vmovq xmm1, rax
vpmovzxbd ymm1, xmm1 # 3c latency since this is lane-crossing
vpermps ymm0, ymm1, ymm0
ret
(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)
So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.
This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.
e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.
clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).
gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.
If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.
You can unpack LUT entries with the same integer sequence I used, but #Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).

See my other answer for AVX2+BMI2 with no LUT.
Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:
VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double, and 32 or 64bit integer elements (vpcompressq), but not byte or word (16bit)). It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.
The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".
To figure out how far to advance your pointer for the next vector, popcnt the mask.
Let's say you want to filter out everything but values >= 0 from an array:
#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
const float *endp = src+len;
float *dst_start = dst;
do {
__m512 sv = _mm512_loadu_ps(src);
__mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ); // true for src >= 0.0, false for unordered and src < 0.0
_mm512_mask_compressstoreu_ps(dst, keep, sv); // clang is missing this intrinsic, which can't be emulated with a separate store
src += 16;
dst += _mm_popcnt_u64(keep); // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
} while (src < endp);
return dst - dst_start;
}
This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):
# Output from gcc6.1, with -O3 -march=haswell -mavx512f. Same with other gcc versions
lea rcx, [rsi+rdx*4] # endp
mov rax, rdi
vpxord zmm1, zmm1, zmm1 # vpxor xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
vmovups zmm0, ZMMWORD PTR [rsi]
add rsi, 64
vcmpps k1, zmm0, zmm1, 29 # AVX512 compares have mask regs as a destination
kmovw edx, k1 # There are some insns to add/or/and mask regs, but not popcnt
movzx edx, dx # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
vcompressps ZMMWORD PTR [rax]{k1}, zmm0
popcnt rdx, rdx
## movsx rdx, edx # with _popcnt_u32, gcc is dumb. No casting can get gcc to do anything but sign-extend. You'd expect (unsigned) would mov to zero-extend, but no.
lea rax, [rax+rdx*4] # dst += ...
cmp rcx, rsi
ja .L2
sub rax, rdi
sar rax, 2 # address math -> element count
ret
Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake
In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).
#ZachB reports in comments that in practice, that a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)
I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.
Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.
It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.
A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.
This is likely slower than vcompressps with unaligned stores on current Intel CPUs.

If you are targeting AMD Zen this method may be preferred, due to the very slow pdepand pext on ryzen (18 cycles each).
I came up with this method, which uses a compressed LUT, which is 768(+1 padding) bytes, instead of 8k. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane, then masked to the lower 3 bits, which provides a 0-7 LUT.
Here is the intrinsics version, along with code to build LUT.
//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
__m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT
// __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
//now shift it right to get 3 bits at bottom
//__m256i shufmask = _mm256_srli_epi32(m, 29);
//Simplified version suggested by wim
//shift each lane so desired 3 bits are a bottom
//There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the first 3 bits so this is ok
__m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
return shufmask;
}
u32 get_nth_bits(int a) {
u32 out = 0;
int c = 0;
for (int i = 0; i < 8; ++i) {
auto set = (a >> i) & 1;
if (set) {
out |= (i << (c * 3));
c++;
}
}
return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];
void BuildPackMask() {
for (int i = 0; i < 256; ++i) {
*reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
}
}
Here is the assembly generated by MSVC:
lea ecx, DWORD PTR [rcx+rcx*2]
lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
vpbroadcastd ymm0, DWORD PTR [rcx+rax]
vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm#00000015000000120000000f0000000c00000009000000060000000300000000

Will add more information to a great answer from #PeterCordes : https://stackoverflow.com/a/36951611/5021064.
I did the implementations of std::remove from C++ standard for integer types with it. The algorithm, once you can do compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then benchmarks.
I ended up with two meaningful variations on the proposed solution:
__m128i registers, any element type, using _mm_shuffle_epi8 instruction
__m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32
When the types are smaller then 4 bytes for 256 bit register, I split them in two 128 bit registers and compress/store each one separately.
Link to compiler explorer where you can see complete assembly (there is a using type and width (in elements per pack) in the bottom, which you can plug in to get different variations) : https://gcc.godbolt.org/z/yQFR2t
NOTE: my code is in C++17 and is using a custom simd wrappers, so I do not know how readable it is. If you want to read my code -> most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.
Implementations of #PeterCordes answer for both cases
Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.
Mask for _mm_shuffle_epi8
Write an index for each byte into a half byte: 0xfedcba9876543210
Get pairs of indexes into 8 shorts packed into __m128i
Spread them out using x << 4 | x & 0x0f0f
Example of spreading the indexes. Let's say 7th and 6th elements are picked.
It means that the corresponding short would be: 0x00fe. After << 4 and | we'd get 0x0ffe. And then we clear out the second f.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of `_mm_movemask_epi8`,
// `uint16_t` - there are at most 16 bits with values for __m128i.
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;
const std::uint8_t offset =
static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes =
_pext_u64(0xfedcba9876543210, mmask_expanded); // Do the #PeterCordes answer
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte); // From bytes to shorts over the whole register
const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4); // x << 4
const __m128i combined = _mm_or_si128(shift_by_4, as_16bit); // | x
const __m128i filter = _mm_set1_epi16(0x0f0f); // 0x0f0f
const __m128i res = _mm_and_si128(combined, filter); // & 0x0f0f
return {res, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
auto res = _compress_mask::mask128(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Mask for _mm256_permutevar8x32_epi32
This is almost one for one #PeterCordes solution - the only difference is _pdep_u64 bit (he suggests this as a note).
The mask that I chose is 0x5555'5555'5555'5555. The idea is - I have 32 bits of mmask, 4 bits for each of 8 integers. I have 64 bits that I want to get => I need to convert each bit of 32 bits into 2 => therefore 0101b = 5.The multiplier also changes from 0xff to 3 because I will get 0x55 for each integer, not 1.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;
const std::uint8_t offset = static_cast<std::uint8_t(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded); // Do the #PeterCordes answer
// Every index was one byte => we need to make them into 4 bytes
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0000|compressed indexes
const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte); // spread them out
return {expanded, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
static_assert(sizeof(T) >= 4); // You cannot permute shorts/chars with this.
auto res = _compress_mask::mask256_epi32(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Benchmarks
Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, build from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark
Controlling for code alignment:
If you are not familiar with the concept, read this or watch this
All functions in the benchmark's binary are aligned to 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide in the beginning of the function (before entering the loop). The main numbers I show is min per each measurement. I think this works since the algorithm is inlined. I'm also validated by the fact that I get very different results. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline
Benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because of SIMD depends on the size of the data and not a number of elements. The element count can be derived from an element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since time it takes for non simd code depends mostly on the element count, the wins should be bigger for chars.
Plots: x - percentage of zeroes, y - time in nanoseconds. padding : min indicates that this is minimum among all alignments.
40 bytes worth of data, 40 chars
For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower when using 128 bit registers over non-simd code. So, for example, compiler should be careful doing this.
1000 bytes worth of data, 1000 chars
Apparently the non-simd version is dominated by branch prediction: when we get small amount of zeroes we get a smaller speed up: for no 0s - about 3 times, for 5% zeroes - about 5-6 times speed up. For when the branch predictor can't help the non-simd version - there is about a 27 times speed up. It's an interesting property of simd code that it's performance tends to be much less dependent on of data. Using 128 vs 256 register shows practically no difference, since most of the work is still split into 2 128 registers.
1000 bytes worth of data, 500 shorts
Similar results for shorts except with a much smaller gain - up to 2 times.
I don't know why shorts do that much better than chars for non-simd code: I'd expect shorts to be two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.
1000 bytes worth of data, 250 ints
For a 1000 only 256 bit version makes sense - 20-30% win excluding no 0s to remove what's so ever (perfect branch prediction, no removing for non-simd code).
10'000 bytes worth of data, 10'000 chars
The same order of magnitude wins as as for a 1000 chars: from 2-6 times faster when branch predictor is helpful to 27 times when it's not.
Same plots, only simd versions:
Here we can see about a 10% win from using 256 bit registers and splitting them in 2 128 bit ones: about 10% faster. In size it grows from 88 to 129 instructions, which is not a lot, so might make sense depending on your use-case. For base-line - non-simd version is 79 instructions (as far as I know - these are smaller then SIMD ones though).
10'000 bytes worth of data, 5'000 shorts
From 20% to 9 times win, depending on the data distributions. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars and the same win for 256 bit one of about 10%.
10'000 bytes worth of data, 2'500 ints
Seems to make a lot of sense to use 256 bit registers, this version is about 2 times faster compared to 128 bit registers. When comparing with non-simd code - from a 20% win with a perfect branch prediction to 3.5 - 4 times as soon as it's not.
Conclusion: when you have a sufficient amount of data (at least 1000 bytes) this can be a very worthwhile optimisation for a modern processor without AVX-512
PS:
On percentage of elements to remove
On one hand it's uncommon to filter half of your elements. On the other hand a similar algorithm can be used in partition during sorting => that is actually expected to have ~50% branch selection.
Code alignment impact
The question is: how much worth it is, if the code happens to be poorly aligned
(generally speaking - there is very little one can do about it).
I'm only showing for 10'000 bytes.
The plots have two lines for min and for max for each percentage point (meaning - it's not one best/worst code alignment - it's the best code alignment for a given percentage).
Code alignment impact - non-simd
Chars:
From 15-20% for poor branch prediction to 2-3 times when branch prediction helped a lot. (branch predictor is known to be affected by code alignment).
Shorts:
For some reason - the 0 percent is not affected at all. It can be explained by std::remove first doing linear search to find the first element to remove. Apparently linear search for shorts is not affected.
Other then that - from 10% to 1.6-1.8 times worth
Ints:
Same as for shorts - no 0s is not affected. As soon as we go into remove part it goes from 1.3 times to 5 times worth then the best case alignment.
Code alignment impact - simd versions
Not showing shorts and ints 128, since it's almost the same assembly as for chars
Chars - 128 bit register
About 1.2 times slower
Chars - 256 bit register
About 1.1 - 1.24 times slower
Ints - 256 bit register
1.25 - 1.35 times slower
We can see that for simd version of the algorithm, code alignment has significantly less impact compared to non-simd version. I suspect that this is due to practically not having branches.

In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.
Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.
static inline __m128 LeftPack_SSE2(__m128 val, int mask) {
switch(mask) {
case 0:
case 1: return val;
case 2: return _mm_shuffle_ps(val,val,0x01);
case 3: return val;
case 4: return _mm_shuffle_ps(val,val,0x02);
case 5: return _mm_shuffle_ps(val,val,0x08);
case 6: return _mm_shuffle_ps(val,val,0x09);
case 7: return val;
case 8: return _mm_shuffle_ps(val,val,0x03);
case 9: return _mm_shuffle_ps(val,val,0x0c);
case 10: return _mm_shuffle_ps(val,val,0x0d);
case 11: return _mm_shuffle_ps(val,val,0x34);
case 12: return _mm_shuffle_ps(val,val,0x0e);
case 13: return _mm_shuffle_ps(val,val,0x38);
case 14: return _mm_shuffle_ps(val,val,0x39);
case 15: return val;
}
}
__m128 foo(__m128 val, __m128 maskv) {
int mask = _mm_movemask_ps(maskv);
return LeftPack_SSE2(val, mask);
}

This is perhaps a bit late though I recently ran into this exact problem and found an alternative solution which used a strictly AVX implementation. If you don't care if unpacked elements are swapped with the last elements of each vector, this could work as well. The following is an AVX version:
inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);
__m128 v = val;
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask0);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask1);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask2);
return v;
}
Essentially, each element in val is shifted once to the left using the bitfield, 0xF9 for blending with it's unshifted variant. Next, both shifted and unshifted versions are blended against the input mask (which has the first non-zero element broadcast across the remaining elements 3 and 4). Repeat this process two more times, broadcasting the second and third elements of mask to its subsequent elements on each iteration and this should provide an AVX version of the _pdep_u32() BMI2 instruction.
If you don't have AVX, you can easily swap out each _mm_permute_ps() with _mm_shuffle_ps() for an SSE4.1-compatible version.
And if you're using double-precision, here's an additional version for AVX2:
inline __m256 left_pack(__m256d val, __m256i mask) noexcept
{
const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);
__m256d v = val;
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask0);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask1);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask2);
return v;
}
Additionally _mm_popcount_u32(_mm_movemask_ps(val)) can be used to determine the number of elements which remained after the left-packing.

Fastest way to downcast an array short to char

I have to process roughly 2000, 100 element arrays every second. The arrays come to me as shorts, w/ the data in the upper bits and need to be shifted and cast to chars. Is this as efficient as I can get, or is there a faster way to perform this operation? (I have to skip 2 of the values)
for(int i = 0; i < 48; i++)
{
a[i] = (char)(b[i] >> 8);
a[i+48] = (char)(b[i+50] >> 8);
}

Even if shift and bitwise operation are fast, you can try to process the short array as a char pointer as other advised in comments. It is allowed per standard and for common architectures does what is expected - left the endianness problem.
So you could try to first determine your endianness:
bool isBigEndian() {
short i = 1; // sets only lowest order bit
char *ix = reinterpret_cast<char *>(&i);
return (*ix == 0); // will be 1 if little endian
}
Your loop now becomes:
int shft = isBigEndian()? 0 : 1;
char * pb = reinterpret_cast<char *>(b);
for(int i = 0; i < 48; i++)
{
a[i] = pt[2 * i + shft];
a[i+48] = pt[2 * i + 50 + shft];
}
But as always for low level optimisation, this has to be benchmarked with the compiler and compiler options that will be used in production code.

You could put a wrapper class around these arrays, so code that accesses elements of the wrapper in order actually accesses every other byte of the underlying memory.
This will probably defeat auto-vectorization, though. Other than that, having all the code that would read a actually read b and increment its pointers by two instead of one shouldn't change the cost at all.
The two skipped elements are a problem, though. Having your operator[] do if (i>=48) i+=2 might kill this idea. memmove will often be much faster than storing one byte at a time, so you could consider using memmove to make a contiguous array of shorts that you can index even though it seems silly to copy without storing in a better format.
The trick will be to write a wrapper that completely optimizes away to no extra instructions in loops over your arrays. This is possible on x86, where scaled indexing is available in normal effective-addresses in asm instructions, so if the compiler understands what's going on, it can make code that's just as efficient.
Having arrays of shorts does take twice as much memory, so cache effects could matter.
It all depends on what you need to do with the byte arrays.
If you do need to convert, use SIMD
For x86 targets, you can get a big speedup with SIMD vectors instead of looping one char at a time. For other compile targets you care about, you can write similar special versions. I assume ARM NEON has similar shuffling capability, for example.
When writing a platform-specific version, you also get to make all the endian and unaligned-access assumptions that are true on that platform.
#ifdef __SSE2__ // will be true for all x86-64 builds and most i386 builds
#include <immintrin.h>
static __m128i pack2(const short *p) {
__m128i lo = _mm_loadu_si128((__m128i*)p);
__m128i hi = _mm_loadu_si128((__m128i*)(p + 8));
lo = _mm_srli_epi16(lo, 8); // logical shift, not arithmetic, because we need the high byte to be zero
hi = _mm_srli_epi16(hi, 8);
return _mm_packus_epi16(lo, hi); // treats input as signed, saturates to unsigned 0x0 .. 0xff range
}
#endif // SSE2
void conv(char *a, const short *b) {
#ifdef __SSE2__
for(int i = 0; i < 48; i+=16) {
__m128i low = pack2(b+i);
_mm_storeu_si128((__m128i *)(a+i), low);
__m128i high = pack2(b+i + 50);
_mm_storeu_si128((__m128i *)(a+i + 48), high);
}
#else
/******* Fallback C version *******/
for(int i = 0; i < 48; i++) {
a[i] = (char)(b[i] >> 8);
a[i+48] = (char)(b[i+50] >> 8);
}
#endif
}
As you can see on the Godbolt Compiler Explorer, gcc fully unrolls the loop since it's only a few iterations when storing 16B at a time.
This should perform ok, but on pre-Skylake will bottleneck on shifting both vectors of shorts before the store. Haswell can only sustain one psrli per clock. (Skylake can sustain one per 0.5c when the shift-count is an immediate. See Agner Fog's guide and insn tables, links at the x86 tag wiki.)
You might get better results from loading from (__m128i*)(1 + (char*)p) so the bytes we want are already in the low half of each 16bit element. We'd still have to mask off the high half of each element with _mm_and_si128 instead of shifting, but PAND can run on any vector execution port, so it has three per clock throughput.
More importantly, with AVX it can be combined with an unaligned load. e.g. vpand xmm0, xmm5, [rsi], where xmm5 is a mask of _mm_set1_epi16(0x00ff), and [rsi] holds 2*i + 1 + (char*)b. fused-domain uop throughput is probably going to be an issue, like is common for code with a lot of loads/stores as well as computation.
Unaligned accesses are slightly slower than aligned accesses, but at least half your vector accesses will be unaligned anyway (since skipping two shorts means skipping 4B). On Intel SnB-family CPUs, I don't think it's slower to have loads that are split across a cache-line boundary in a 15:1 split compared to a 12:4 split. (The no-split case is definitely faster, though.) If b is 16B-aligned, then it'll be worth testing the mask version against the shift version.
I didn't write up complete code for this version, because you'll end up reading one byte past the end of b unless you take special precautions. This is fine if you make sure b has padding of some sort so it doesn't go right to the end of a memory page.
AVX2
With AVX2, vpackuswb ymm operates in two separate lanes. IDK if there's anything to gain from doing the load and mask (or shift) on 256b vectors and then using a vextracti128 and 128b pack on the two halves of the 256b vector.
Or maybe do a 256b pack between two vectors and then a vpermq (_mm256_permute4x64_epi64) to sort things out:
lo = _mm256_loadu(b..); // { b[15..8] | b[7..0] }
hi = // { b[31..24] | b[23..16] }
// mask or shift
__m256i packed = _mm256_packus_epi16(lo, hi); // [ a31..24 a15..8 | a23..16 a7..0 ]
packed = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
Of course, use any portable optimizations you can in the C version. e.g. Serge Ballesta's suggestion of just copying the desired bytes after figuring out their location from the endianness of the machine. (Preferably at compile time by checking GNU C's __BYTE_ORDER__ macro.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js