Get sum of values stored in __m256d with SSE/AVX - c++

Is there a way to get sum of values stored in __m256d variable? I have this code.
acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec));
//acc in this point contains {2.0, 8.0, 18.0, 32.0}
acc = _mm256_hadd_pd(acc, acc);
result[i] = ((double*)&acc)[0] + ((double*)&acc)[2];
This code works, but I want to replace it with SSE/AVX instruction.

It appears that you're doing a horizontal sum for every element of an output array. (Perhaps as part of a matmul?) This is usually sub-optimal; try to vectorize over the 2nd-from-inner loop so you can produce result[i + 0..3] in a vector and not need a horizontal sum at all.
For a dot-product of an array larger than one vector, sum vertically (into multiple accumulators), only hsumming once at the end.
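To make the "sum vertically, hsum once" idea concrete, here is a minimal sketch. The function name dot_avx and the choice of two accumulators are my own assumptions for illustration, and the target attribute is a GCC/clang-specific convenience so the snippet builds without -mavx (normally you'd just compile with -march=native or similar).

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical helper (not from the question): dot product of two double arrays.
// The target attribute (GCC/clang only) enables AVX for just this function.
__attribute__((target("avx")))
double dot_avx(const double *a, const double *b, std::size_t n) {
    // Two independent accumulators hide some of the FP add latency.
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        acc0 = _mm256_add_pd(acc0, _mm256_mul_pd(_mm256_loadu_pd(a + i),
                                                 _mm256_loadu_pd(b + i)));
        acc1 = _mm256_add_pd(acc1, _mm256_mul_pd(_mm256_loadu_pd(a + i + 4),
                                                 _mm256_loadu_pd(b + i + 4)));
    }
    // Vertical add of the accumulators, then one horizontal sum at the very end.
    __m256d acc = _mm256_add_pd(acc0, acc1);
    __m128d lo = _mm256_castpd256_pd128(acc);
    __m128d hi = _mm256_extractf128_pd(acc, 1);
    lo = _mm_add_pd(lo, hi);
    __m128d high64 = _mm_unpackhi_pd(lo, lo);
    double sum = _mm_cvtsd_f64(_mm_add_sd(lo, high64));
    // Scalar cleanup for the last n % 8 elements.
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

Note that this adds the products in a different order than a plain scalar loop, so results can differ in the last bits for general inputs.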
For horizontal reductions in general, see Fastest way to do horizontal SSE vector sum (or other reduction) - extract the high half and add to the low half. Repeat until you're down to 1 element.
If you're using this inside an inner loop, you definitely don't want to be using hadd(same,same). That costs 2 shuffle uops instead of 1, unless your compiler saves you from yourself. (And gcc/clang don't.) hadd is good for code-size but pretty much nothing else when you only have 1 vector. It can be useful and efficient with two different inputs.
For AVX, this means the only 256-bit operation we need is an extract, which is fast on AMD and Intel. Then the rest is all 128-bit:
#include <immintrin.h>
inline
double hsum_double_avx(__m256d v) {
__m128d vlow = _mm256_castpd256_pd128(v);
__m128d vhigh = _mm256_extractf128_pd(v, 1); // high 128
vlow = _mm_add_pd(vlow, vhigh); // reduce down to 128
__m128d high64 = _mm_unpackhi_pd(vlow, vlow);
return _mm_cvtsd_f64(_mm_add_sd(vlow, high64)); // reduce to scalar
}
If you wanted the result broadcast to every element of a __m256d, you'd use vshufpd and vperm2f128 to swap high/low halves (if tuning for Intel). And use 256-bit FP add the whole time. If you cared about early Ryzen at all, you might reduce to 128, use _mm_shuffle_pd to swap, then vinsertf128 to get a 256-bit vector. Or with AVX2, vbroadcastsd on the final result of this. But that would be slower on Intel than staying 256-bit the whole time while still avoiding vhaddpd.
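A rough sketch of that "stay 256-bit the whole time" variant, for the tuning-for-Intel case. The function name and the pointer-based interface are assumptions made so the snippet is self-contained; the target attribute is GCC/clang-specific and only there so it builds without -mavx.

```cpp
#include <immintrin.h>

// Hypothetical helper: writes in[0]+in[1]+in[2]+in[3] to all four elements of out.
__attribute__((target("avx")))
void hsum_broadcast_avx(const double *in, double *out) {
    __m256d v = _mm256_loadu_pd(in);
    // vperm2f128: swap the high and low 128-bit halves
    __m256d swapped_lanes = _mm256_permute2f128_pd(v, v, 0x01);
    __m256d s = _mm256_add_pd(v, swapped_lanes);          // {v0+v2, v1+v3, v0+v2, v1+v3}
    // vshufpd: swap the two doubles within each 128-bit lane
    __m256d swapped_pairs = _mm256_shuffle_pd(s, s, 0x5);
    s = _mm256_add_pd(s, swapped_pairs);                  // total in every element
    _mm256_storeu_pd(out, s);
}
```

In real code you'd keep the result in a __m256d instead of storing it; the store is only here to give the sketch a plain interface.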
Compiled with gcc7.3 -O3 -march=haswell on the Godbolt compiler explorer:
vmovapd xmm1, xmm0 # silly compiler, vextract to xmm1 instead
vextractf128 xmm0, ymm0, 0x1
vaddpd xmm0, xmm1, xmm0
vunpckhpd xmm1, xmm0, xmm0 # no wasted code bytes on an immediate for vpermilpd or vshufpd or anything
vaddsd xmm0, xmm0, xmm1 # scalar means we never raise FP exceptions for results we don't use
vzeroupper
ret
After inlining (which you definitely want it to), vzeroupper sinks to the bottom of the whole function, and hopefully the vmovapd optimizes away, with vextractf128 into a different register instead of destroying xmm0 which holds the _mm256_castpd256_pd128 result.
On first-gen Ryzen (Zen 1 / 1+), according to Agner Fog's instruction tables, vextractf128 is 1 uop with 1c latency, and 0.33c throughput.
@PaulR's version is unfortunately terrible on AMD before Zen 2; it's like something you might find in an Intel library or compiler output as a "cripple AMD" function. (I don't think Paul did that on purpose, I'm just pointing out how ignoring AMD CPUs can lead to code that runs slower on them.)
On Zen 1, vperm2f128 is 8 uops, 3c latency, and one per 3c throughput. vhaddpd ymm is 8 uops (vs. the 6 you might expect), 7c latency, one per 3c throughput. Agner says it's a "mixed domain" instruction. And 256-bit ops always take at least 2 uops.
# Paul's version # Ryzen # Skylake
vhaddpd ymm0, ymm0, ymm0 # 8 uops # 3 uops
vperm2f128 ymm1, ymm0, ymm0, 49 # 8 uops # 1 uop
vaddpd ymm0, ymm0, ymm1 # 2 uops # 1 uop
# total uops: # 18 # 5
vs.
# my version with vmovapd optimized out: extract to a different reg
vextractf128 xmm1, ymm0, 0x1 # 1 uop # 1 uop
vaddpd xmm0, xmm1, xmm0 # 1 uop # 1 uop
vunpckhpd xmm1, xmm0, xmm0 # 1 uop # 1 uop
vaddsd xmm0, xmm0, xmm1 # 1 uop # 1 uop
# total uops: # 4 # 4
Total uop throughput is often the bottleneck in code with a mix of loads, stores, and ALU, so I expect the 4-uop version is likely to be at least a little better on Intel, as well as much better on AMD. It should also make slightly less heat, and thus allow slightly higher turbo / use less battery power. (But hopefully this hsum is a small enough part of your total loop that this is negligible!)
The latency is not worse, either, so there's really no reason to use an inefficient hadd / vperm2f128 version.
Zen 2 and later have 256-bit wide vector registers and execution units (including shuffle). They don't have to split lane-crossing shuffles into many uops, but conversely vextractf128 is no longer about as cheap as vmovdqa xmm. Zen 2 is a lot closer to Intel's cost model for 256-bit vectors.

You can do it like this:
acc = _mm256_hadd_pd(acc, acc); // horizontal add top lane and bottom lane
acc = _mm256_add_pd(acc, _mm256_permute2f128_pd(acc, acc, 0x31)); // add lanes
result[i] = _mm256_cvtsd_f64(acc); // extract double
Note: if this is in a "hot" (i.e. performance-critical) part of your code (especially if running on an AMD CPU) then you might instead want to look at Peter Cordes's answer regarding more efficient implementations.

In gcc and clang SIMD types are built-in vector types. E.g.:
// from avxintrin.h
typedef double __m256d __attribute__((__vector_size__(32), __aligned__(32)));
These built-in vectors support indexing, so you can write it conveniently and leave it up to the compiler to make good code:
double hsum_double_avx2(__m256d v) {
return v[0] + v[1] + v[2] + v[3];
}
clang-14 -O3 -march=znver3 -ffast-math generates the same assembly as it does for Peter Cordes's intrinsics:
# clang -O3 -ffast-math
hsum_double_avx2:
vextractf128 xmm1, ymm0, 1
vaddpd xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
vaddsd xmm0, xmm0, xmm1
vzeroupper
ret
Unfortunately gcc does much worse: it generates sub-optimal instructions, not taking advantage of the freedom to re-associate the 3 + operations, and uses vhaddpd xmm to do the v[0] + v[1] part, which costs 4 uops on Zen 3. (Or 3 uops on Intel CPUs, 2 shuffles + an add.)
-ffast-math is of course necessary for the compiler to be able to do a good job, unless you write it as (v[0]+v[2]) + (v[1]+v[3]). With that, clang still makes the same asm with -O3 -march=icelake-server without -ffast-math.
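For reference, the explicitly re-associated form could look like the sketch below. The function name and the pointer interface are my assumptions, indexing a __m256d with [] is a GNU C extension (GCC/clang, not MSVC), and the target attribute is only there so it builds without -mavx. I'm not claiming any particular asm output for it.

```cpp
#include <immintrin.h>

// Hypothetical helper: horizontal sum with the pairing spelled out, so the
// compiler doesn't need -ffast-math to add the high half to the low half first.
__attribute__((target("avx")))
double hsum_pairs(const double *p) {
    __m256d v = _mm256_loadu_pd(p);
    // v[i] indexing relies on the GNU vector extension.
    return (v[0] + v[2]) + (v[1] + v[3]);
}
```

The pairing is not just cosmetic: FP addition isn't associative, so for an input like {1e16, 1, -1e16, 1} this form returns 2, while strict left-to-right evaluation of v[0] + v[1] + v[2] + v[3] loses one of the 1s to rounding.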
Ideally, I want to write plain code as I did above and let the compiler use a CPU-specific cost model to emit optimal instructions in the right order for that specific CPU.
One reason is that a labour-intensive hand-coded optimal version for Haswell may well be suboptimal for Zen 3. For this problem specifically, that's not really the case: starting by narrowing to 128-bit with vextractf128 + vaddpd is optimal everywhere. There are minor variations in shuffle throughput on different CPUs; for example Ice Lake and later Intel can run vshufps on port 1 or 5, but some shuffles like vpermilps/pd or vunpckhpd still only on port 5. Zen 3 (like Zen 2 and 4) has good throughput for either of those shuffles so clang's asm happens to be good there. But it's unfortunate that clang -march=icelake-server still uses vpermilpd.
A frequent use-case nowadays is computing in the cloud with diverse CPU models and generations, compiling the code on that host with -march=native -mtune=native for best performance.
In theory, if compilers were smarter, this would optimize short sequences like this to ideal asm, as well as making generally good choices for heuristics like inlining and unrolling. It's usually the best choice for a binary that will run on only one machine, but as GCC demonstrates here, the results are often far from optimal. Fortunately modern AMD and Intel aren't too different most of the time, having different throughputs for some instructions but usually being single-uop for the same instructions.

Related

Use intrinsics to load bytes as 32-bit SIMD elements [duplicate]

I am optimizing an algorithm for Gaussian blur on an image and I want to replace the usage of a float buffer[8] in the code below with an __m256 intrinsic variable. What series of instructions is best suited for this task?
// unsigned char *new_image is loaded with data
...
float buffer[8];
buffer[x ] = new_image[x];
buffer[x + 1] = new_image[x + 1];
buffer[x + 2] = new_image[x + 2];
buffer[x + 3] = new_image[x + 3];
buffer[x + 4] = new_image[x + 4];
buffer[x + 5] = new_image[x + 5];
buffer[x + 6] = new_image[x + 6];
buffer[x + 7] = new_image[x + 7];
// buffer is then used for further operations
...
//What I want instead in pseudocode:
__m256 b = [float(new_image[x+7]), float(new_image[x+6]), ... , float(new_image[x])];
If you're using AVX2, you can use PMOVZX to zero-extend your chars into 32-bit integers in a 256b register. From there, conversion to float can happen in-place.
; rsi = new_image
VPMOVZXBD ymm0, [rsi] ; or SX to sign-extend (Byte to DWord)
VCVTDQ2PS ymm0, ymm0 ; convert to packed float
This is a good strategy even if you want to do this for multiple vectors, but even better might be a 128-bit broadcast load to feed vpmovzxbd ymm,xmm and vpshufb ymm (_mm256_shuffle_epi8) for the high 64 bits, because Intel SnB-family CPUs don't micro-fuse a vpmovzx ymm,mem, only vpmovzx xmm,mem (https://agner.org/optimize/). Broadcast loads are single uop with no ALU port required, running purely in a load port. So this is 3 total uops to bcast-load + vpmovzx + vpshufb.
(TODO: write an intrinsics version of that. It also sidesteps the problem of missed optimizations for _mm_loadl_epi64 -> _mm256_cvtepu8_epi32.)
Of course this requires a shuffle control vector in another register, so it's only worth it if you can use that multiple times.
vpshufb is usable because the data needed for each lane is there from the broadcast, and the high bit of the shuffle-control will zero the corresponding element.
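Here's one way the broadcast-load + vpmovzxbd + vpshufb idea might look with intrinsics. Treat it as a sketch: the function name and store-to-array interface are assumptions, it needs 16 readable bytes at src, and whether the compiler actually emits a vbroadcasti128 load is up to the compiler. The target attribute is GCC/clang-specific and just lets it build without -mavx2.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical helper: converts 16 bytes at src into 16 floats at dst.
__attribute__((target("avx2")))
void bytes16_to_floats(const std::uint8_t *src, float *dst) {
    // Both 128-bit lanes hold the same 16 source bytes (ideally a vbroadcasti128 load).
    __m256i bcast = _mm256_broadcastsi128_si256(
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(src)));

    // Low 8 bytes: vpmovzxbd ymm, xmm zero-extends bytes 0..7 to dwords.
    __m256i lo = _mm256_cvtepu8_epi32(_mm256_castsi256_si128(bcast));

    // High 8 bytes: in-lane vpshufb. Each dword takes one source byte;
    // control bytes with the high bit set (-1) produce zero.
    const __m256i ctrl = _mm256_setr_epi8(
         8, -1, -1, -1,   9, -1, -1, -1,  10, -1, -1, -1,  11, -1, -1, -1,
        12, -1, -1, -1,  13, -1, -1, -1,  14, -1, -1, -1,  15, -1, -1, -1);
    __m256i hi = _mm256_shuffle_epi8(bcast, ctrl);

    _mm256_storeu_ps(dst,     _mm256_cvtepi32_ps(lo));
    _mm256_storeu_ps(dst + 8, _mm256_cvtepi32_ps(hi));
}
```

In a loop, ctrl would be hoisted into a register once, which is the "use it multiple times" condition mentioned above.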
This broadcast + shuffle strategy might be good on Ryzen; Agner Fog doesn't list uop counts for vpmovsx/zx ymm on it.
Do not do something like a 128-bit or 256-bit load and then shuffle that to feed further vpmovzx instructions. Total shuffle throughput will probably already be a bottleneck because vpmovzx is a shuffle. Intel Haswell/Skylake (the most common AVX2 uarches) have 1-per-clock shuffles but 2-per-clock loads. Using extra shuffle instructions instead of folding separate memory operands into vpmovzxbd is terrible. Only if you can reduce total uop count like I suggested with broadcast-load + vpmovzxbd + vpshufb is it a win.
My answer on Scaling byte pixel values (y=ax+b) with SSE2 (as floats)? may be relevant for converting back to uint8_t. The pack-back-to-bytes afterward part is semi-tricky if doing it with AVX2 packssdw/packuswb, because they work in-lane, unlike vpmovzx.
With only AVX1, not AVX2, you should do:
VPMOVZXBD xmm0, [rsi]
VPMOVZXBD xmm1, [rsi+4]
VINSERTF128 ymm0, ymm0, xmm1, 1 ; put the 2nd load of data into the high128 of ymm0
VCVTDQ2PS ymm0, ymm0 ; convert to packed float. Yes, works without AVX2
You of course never need an array of float, just __m256 vectors.
GCC / MSVC missed optimizations for VPMOVZXBD ymm,[mem] with intrinsics
GCC and MSVC are bad at folding a _mm_loadl_epi64 into a memory operand for vpmovzx*. (But at least there is a load intrinsic of the right width, unlike for pmovzxbq xmm, word [mem].)
We get a vmovq load and then a separate vpmovzx with an XMM input. (With ICC and clang3.6+ we get safe + optimal code from using _mm_loadl_epi64, like from gcc9+)
But gcc8.3 and earlier can fold a _mm_loadu_si128 16-byte load intrinsic into an 8-byte memory operand. This gives optimal asm at -O3 on GCC, but is unsafe at -O0 where it compiles to an actual vmovdqu load that touches more data than we actually need, and could go off the end of a page.
Two gcc bugs submitted because of this answer:
SSE/AVX movq load (_mm_cvtsi64_si128) not being folded into pmovzx (fixed for gcc9, but the fix breaks load folding for a 128-bit load so the workaround hack for old GCC makes gcc9 do worse.)
No intrinsic for x86 MOVQ m64, %xmm in 32bit mode. (TODO: report this for clang/LLVM as well?)
There's no intrinsic to use SSE4.1 pmovsx / pmovzx as a load, only with a __m128i source operand. But the asm instructions only read the amount of data they actually use, not a 16-byte __m128i memory source operand. Unlike punpck*, you can use this on the last 8B of a page without faulting. (And on unaligned addresses even with the non-AVX version).
So here's the evil solution I've come up with. Don't use this, #ifdef __OPTIMIZE__ is Bad, making it possible to create bugs that only happen in the debug build or only in the optimized build!
#if !defined(__OPTIMIZE__)
// Making your code compile differently with/without optimization is a TERRIBLE idea
// great way to create Heisenbugs that disappear when you try to debug them.
// Even if you *plan* to always use -Og for debugging, instead of -O0, this is still evil
#define USE_MOVQ
#endif
__m256 load_bytes_to_m256(uint8_t *p)
{
#ifdef USE_MOVQ // compiles to an actual movq then movzx ymm, xmm with gcc8.3 -O3
__m128i small_load = _mm_loadl_epi64( (const __m128i*)p);
#else // USE_LOADU // compiles to a 128b load with gcc -O0, potentially segfaulting
__m128i small_load = _mm_loadu_si128( (const __m128i*)p );
#endif
__m256i intvec = _mm256_cvtepu8_epi32( small_load );
//__m256i intvec = _mm256_cvtepu8_epi32( *(__m128i*)p ); // compiles to an aligned load with -O0
return _mm256_cvtepi32_ps(intvec);
}
With USE_MOVQ enabled, gcc -O3 (v5.3.0) emits the following (so does MSVC):
load_bytes_to_m256(unsigned char*):
vmovq xmm0, QWORD PTR [rdi]
vpmovzxbd ymm0, xmm0
vcvtdq2ps ymm0, ymm0
ret
The stupid vmovq is what we want to avoid. If you let it use the unsafe loadu_si128 version, it will make good optimized code.
GCC9, clang, and ICC emit:
load_bytes_to_m256(unsigned char*):
vpmovzxbd ymm0, qword ptr [rdi] # ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
vcvtdq2ps ymm0, ymm0
ret
Writing the AVX1-only version with intrinsics is left as an un-fun exercise for the reader. You asked for "instructions", not "intrinsics", and this is one place where there's a gap in the intrinsics. Having to use _mm_cvtsi64_si128 to avoid potentially loading from out-of-bounds addresses is stupid, IMO. I want to be able to think of intrinsics in terms of the instructions they map to, with the load/store intrinsics as informing the compiler about alignment guarantees or lack thereof. Having to use the intrinsic for an instruction I don't want is pretty dumb.
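For what it's worth, one possible shape for that AVX1-only version is sketched below. The function name and the store-to-array interface are assumptions, the memcpy is just a portable way to express a 4-byte load, and I'm not promising any compiler folds those loads into vpmovzxbd memory operands. The target attribute is GCC/clang-specific so it builds without -mavx.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Hypothetical helper: converts 8 bytes at p to 8 floats at dst using only
// SSE4.1 + AVX1 instructions (no 256-bit integer ops).
__attribute__((target("avx")))
void load_bytes_to_floats_avx1(const std::uint8_t *p, float *dst) {
    std::int32_t lo4, hi4;
    std::memcpy(&lo4, p, 4);       // 4-byte loads that read exactly the bytes we use
    std::memcpy(&hi4, p + 4, 4);
    __m128i lo = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(lo4));   // vpmovzxbd xmm
    __m128i hi = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(hi4));   // vpmovzxbd xmm
    // vinsertf128: put the second group into the high 128 bits
    __m256i both = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
    _mm256_storeu_ps(dst, _mm256_cvtepi32_ps(both));          // vcvtdq2ps ymm
}
```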
Also note that if you're looking in the Intel insn ref manual, there are two separate entries for movq:
movd/movq, the version that can have an integer register as a src/dest operand (66 REX.W 0F 6E (or VEX.128.66.0F.W1 6E) for (V)MOVQ xmm, r/m64). That's where you'll find the intrinsic that can accept a 64-bit integer, _mm_cvtsi64_si128. (Some compilers don't define it in 32-bit mode.)
movq: the version that can have two xmm registers as operands. This one is an extension of the MMXreg -> MMXreg instruction, which can also load/store, like MOVDQU. Its opcode is F3 0F 7E (VEX.128.F3.0F.WIG 7E) for MOVQ xmm, xmm/m64.
The asm ISA ref manual only lists the __m128i _mm_move_epi64(__m128i a) intrinsic for zeroing the high 64b of a vector while copying it. But the intrinsics guide does list _mm_loadl_epi64(__m128i const* mem_addr) which has a stupid prototype (pointer to a 16-byte __m128i type when it really only loads 8 bytes). It is available on all 4 of the major x86 compilers, and should actually be safe. Note that the __m128i* is just passed to this opaque intrinsic, not actually dereferenced.
The more sane _mm_loadu_si64 (void const* mem_addr) is also listed, but gcc is missing that one.

Load and duplicate 4 single precision float numbers into a packed __m256 variable with fewest instructions

I have a float array containing A,B,C,D 4 float numbers and I wish to load them into a __m256 variable like AABBCCDD. What's the best way to do this?
I know using _mm256_set_ps() is always an option but it seems slow with 8 CPU instructions. Thanks.
If your data was the result of another vector calculation (and in a __m128), you'd want AVX2 vpermps (_mm256_permutevar8x32_ps) with a control vector of _mm256_set_epi32(3,3, 2,2, 1,1, 0,0).
vpermps ymm is 1 uop on Intel, but 2 uops on Zen2 (with 2 cycle throughput). And 3 uops on Zen1 with one per 4 clock throughput. (https://uops.info/)
If it was the result of separate scalar calculations, you might want to shuffle them together with _mm_set_ps(d,d, c,c) (1x vshufps) to set up for a vinsertf128.
But with data in memory, I think your best bet is a 128-bit broadcast-load, then an in-lane shuffle. It only requires AVX1, and on modern CPUs it's 1 load + 1 shuffle uop on Zen2 and Haswell and later. It's also efficient on Zen1: the only lane-crossing shuffle being the 128-bit broadcast-load.
Using an in-lane shuffle is lower-latency than lane-crossing on both Intel and Zen2 (256-bit shuffle execution units). This still requires a 32-byte shuffle control vector constant, but if you need to do this frequently it will typically / hopefully stay hot in cache.
__m256 duplicate4floats(void *p) {
__m256 v = _mm256_broadcast_ps((const __m128 *) p); // vbroadcastf128
v = _mm256_permutevar_ps(v, _mm256_set_epi32(3,3, 2,2, 1,1, 0,0)); // vpermilps
return v;
}
Modern CPUs handle broadcast-loads right in the load port, no shuffle uop needed. (Sandybridge does need a port 5 shuffle uop for vbroadcastf128, unlike narrower broadcasts, but Haswell and later are purely port 2/3. But SnB doesn't support AVX2 so a lane-crossing shuffle with granularity less than 128-bit wasn't an option.)
So even if AVX2 is available, I think AVX1 instructions are more efficient here. On Zen1, vbroadcastf128 is 2 uops, vs. 1 for a 128-bit vmovups, but vpermps (lane-crossing) is 3 uops vs. 2 for vpermilps.
Unfortunately, clang pessimizes this into a vmovups load and a vpermps ymm, but GCC compiles it as written. (Godbolt)
If you wanted to avoid using a shuffle-control vector constant, vpmovzxdq ymm, [mem] (2 uops on Intel) could get the elements set up for vmovsldup (1 uop in-lane shuffle). Or broadcast-load and vunpckl/hps then blend?
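A sketch of the vpmovzxdq + vmovsldup variant, in case it helps. The function name and the store-to-array interface are my assumptions; note that unlike the broadcast version this one needs AVX2 for the 256-bit vpmovzxdq. The target attribute is GCC/clang-specific so it builds without -mavx2.

```cpp
#include <immintrin.h>

// Hypothetical helper: reads A,B,C,D from p and writes A,A,B,B,C,C,D,D to dst,
// without needing a shuffle-control constant.
__attribute__((target("avx2")))
void duplicate4floats_nocontrol(const float *p, float *dst) {
    // vpmovzxdq: each 32-bit element lands in the low half of a 64-bit slot,
    // giving [A,0, B,0, C,0, D,0] when viewed as floats.
    __m256i widened = _mm256_cvtepu32_epi64(
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(p)));
    // vmovsldup: copy each even-indexed float over its odd neighbour.
    __m256 v = _mm256_moveldup_ps(_mm256_castsi256_ps(widened));
    _mm256_storeu_ps(dst, v);
}
```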
"I know using _mm256_set_ps() is always an option but it seems slow with 8 CPU instructions."
Get a better compiler, then! (Or remember to enable optimization.)
__m256 duplicate4floats_naive(const float *p) {
return _mm256_set_ps(p[3],p[3], p[2], p[2], p[1],p[1], p[0],p[0]);
}
compiles with gcc (https://godbolt.org/z/dMzh3fezE) into
duplicate4floats_naive(float const*):
vmovups xmm1, XMMWORD PTR [rdi]
vpermilps xmm0, xmm1, 80
vpermilps xmm1, xmm1, 250
vinsertf128 ymm0, ymm0, xmm1, 0x1
ret
So 3 shuffle uops, not great. And it could have used vshufps instead of vpermilps to save code-size and let it run on more ports on Ice Lake. But still vastly better than 8 instructions.
clang's shuffle optimizer makes the same asm as with my optimized intrinsics, because that's how clang is. It's pretty decent optimization, just not quite optimal.
duplicate4floats_naive(float const*):
vmovups xmm0, xmmword ptr [rdi]
vmovaps ymm1, ymmword ptr [rip + .LCPI1_0] # ymm1 = [0,0,1,1,2,2,3,3]
vpermps ymm0, ymm1, ymm0
ret
_mm_load_ps -> _mm256_castps128_ps256 -> _mm256_permute_ps

Intel store instructions on deliberately overlapping memory regions

I have to store the lower 3 doubles in YMM register into an unaligned double array of size 3 (that is, cannot write the 4th element). But being a bit naughty, I'm wondering if the AVX intrinsic _mm256_storeu2_m128d can do the trick. I had
reg = _mm256_permute4x64_pd(reg, 0b10010100); // [0 1 1 2]
_mm256_storeu2_m128d(vec, vec + 1, reg);
and compiling by clang gives
vmovupd xmmword ptr [rsi + 8], xmm1 # reg in ymm1 after perm
vextractf128 xmmword ptr [rsi], ymm0, 1
If storeu2 had semantics like memcpy then it most definitely triggers undefined behavior. But with the generated instructions, would this be free of race conditions (or other potential problems)?
Other ways to store YMM into size 3 arrays are welcomed as well.
There isn't really a formal spec for Intel's intrinsics, AFAIK, other than what Intel has published as documentation. e.g. their intrinsics guide. Also examples from their whitepapers and so on; e.g. examples that need to work are one way GCC/clang know they have to define __m128 with __attribute__((may_alias)).
It's all within one thread, fully synchronous, so definitely no "race condition". In your case it doesn't even matter which order the stores happen in (assuming they don't overlap with the __m256d reg object itself! That would be the equivalent of an overlapping memcpy problem.) What you're doing might be like two indeterminately sequenced memcpy to overlapping destinations: they definitely happen in one order or the other, and the compiler could pick either.
The observable difference for order of stores is performance: if you want to do a SIMD reload very soon after, then store forwarding will work better if the 16-byte reload takes its data from one 16-byte store, not the overlap of two stores.
In general overlapping stores are fine for performance, though; the store buffer will absorb them. It means one of them is unaligned, though, and crossing a cache-line boundary would be more expensive.
However, that's all moot: Intel's intrinsics guide does list an "operation" section for that compound intrinsic:
Operation
MEM[loaddr+127:loaddr] := a[127:0]
MEM[hiaddr+127:hiaddr] := a[255:128]
So it's strictly defined as low address store first (the second arg; I think you got this backwards).
And all of that is also moot because there's a more efficient way:
Your way costs 1 lane-crossing shuffle + vmovups + vextractf128 [mem], ymm, 1. Depending on how it compiles, neither store can start until after the shuffle. (Although it looks like clang might have avoided that problem).
On Intel CPUs, vextractf128 [mem], ymm, imm costs 2 uops for the front-end, not micro-fused into one. (Also 2 uops on Zen for some reason.)
On AMD CPUs before Zen 2, lane-crossing shuffles are more than 1 uop, so _mm256_permute4x64_pd is more expensive than necessary.
You just want to store the low lane of the input vector, and the low element of the high lane. The cheapest shuffle is vextractf128 xmm, ymm, 1 - 1 uop / 1c latency on Zen (which splits YMM vectors into two 128-bit halves anyway). It's as cheap as any other lane-crossing shuffle on Intel.
The asm you want the compiler to make is probably this, which only requires AVX1. AVX2 doesn't have any useful instructions for this.
vextractf128 xmm1, ymm0, 1 ; single uop everywhere
vmovupd [rdi], xmm0 ; single uop everywhere
vmovsd [rdi+2*8], xmm1 ; single uop everywhere
So you want something like this, which should compile efficiently.
_mm_storeu_pd(vec, _mm256_castpd256_pd128(reg)); // low half; unaligned store (vmovupd), since vec isn't guaranteed 16-byte aligned
__m128d hi = _mm256_extractf128_pd(reg, 1);
_mm_store_sd(vec+2, hi);
// or vec[2] = _mm_cvtsd_f64(hi);
vmovlps (_mm_storel_pi) would also work, but with AVX VEX encoding it doesn't save any code size, and would require even more casting to keep compilers happy.
There's unfortunately no vpextractq [mem], ymm, only with an XMM source so that doesn't help.
Masked store:
As discussed in comments, yes you could do vmaskmovps but it's unfortunately not as efficient as we might like on all CPUs. Until AVX512 makes masked loads/stores first-class citizens, it may be best to shuffle and do 2 stores. Or pad your array / struct so you can at least temporarily step on later stuff.
Zen has 2-uop vmaskmovpd ymm loads, but very expensive vmaskmovpd stores (42 uops, 1 per 11 cycles for YMM). Or Zen+ and Zen2 are 18 or 19 uops, 6 cycle throughput. If you care at all about Zen, avoid vmaskmov.
On Intel Broadwell and earlier, vmaskmov stores are 4 uops according to Agner Fog's testing, so that's 1 more fused-domain uop than we get from shuffle + movups + movsd. But still, Haswell and later do manage 1/clock throughput so if that's a bottleneck then it beats the 2-cycle throughput of 2 stores. SnB/IvB of course take 2 cycles for a 256-bit store, even without masking.
On Skylake, vmaskmov mem, ymm, ymm is only 3 uops. (Agner Fog lists 4, but his spreadsheets are hand-edited and have been wrong before; I think it's safe to assume uops.info's automated testing is right.) That makes sense: Skylake-client is basically the same core as Skylake-AVX512, just without actually enabling AVX512. So they could implement vmaskmovpd by decoding it into a test into a mask register (1 uop) + masked store (2 more uops without micro-fusion).
So if you only care about Skylake and later, and can amortize the cost of loading a mask into a vector register (reusable for loads and stores), vmaskmovpd is actually pretty good. Same front-end cost but cheaper in the back-end: only 1 each store-address and store-data uops, instead of 2 separate stores. Note the 1/clock throughput on Haswell and later vs. the 2-cycle throughput for doing 2 separate stores.
vmaskmovpd might even store-forward efficiently to a masked reload; I think Intel mentioned something about this in their optimization manual.
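If you do go the masked-store route, a minimal sketch might look like this. The function name and the load-from-array interface are assumptions made to keep it self-contained; in real code reg would already be in a register and the mask would be hoisted out of any loop. The target attribute is GCC/clang-specific so it builds without -mavx.

```cpp
#include <immintrin.h>

// Hypothetical helper: stores only the low 3 doubles of a 4-double vector.
// dst[3] is never written.
__attribute__((target("avx")))
void store3_masked(double *dst, const double *src4) {
    __m256d reg = _mm256_loadu_pd(src4);
    // The high bit of each 64-bit mask element selects whether that element is stored.
    const __m256i mask = _mm256_setr_epi64x(-1, -1, -1, 0);
    _mm256_maskstore_pd(dst, mask, reg);   // vmaskmovpd [mem], ymm, ymm
}
```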

Performance of single/double precision SpMV on CPUs

The sparse matrix-vector product is a memory bound operation due to a very low arithmetic intensity. Since a float storage format would require 4+4=8 bytes per non-zero compared to 4+8=12 bytes for doubles (value and column index), one should be able to expect about 33% faster execution when switching to floats. I constructed a benchmark which assembles a 1000000x1000000 matrix with 200 non-zeros per row, and then take the minimum from 20 multiplications. Source code on github here.
The results are roughly what I expected. When I run the benchmark on my Intel Core i7-2620M, I see something like a 30% faster execution. The small difference can be seen in the bandwidth drop from about 19.0 GB/s (doubles) to about 18.0 GB/s (floats) out of the 21.3 GB/s in the spec.
Now, since the data for the matrix is almost 1000 times larger than that for the vectors, one would expect the faster performance to also be obtained when only the matrix is in single precision and the vectors remain doubles. I went on to try this, and then made sure to use the lower precision for the computations. When I run it, however, effective bandwidth usage suddenly drops to about 14.4 GB/s, giving only a 12% faster execution than the full double version. How can one understand this?
I'm using Ubuntu 14.04 with GCC 4.9.3.
Run times:
// double(mat)-double(vec)
Wall time: 0.127577 s
Bandwidth: 18.968 GB/s
Compute: 3.12736 Gflop/s
// float(mat)-float(vec)
Wall time: 0.089386 s
Bandwidth: 18.0333 GB/s
Compute: 4.46356 Gflop/s
// float(mat)-double(vec)
Wall time: 0.112134 s
Bandwidth: 14.4463 GB/s
Compute: 3.55807 Gflop/s
Update
See the answer by Peter Cordes below. In short, dependencies between loop iterations from the float-to-double conversion are responsible for the overhead. By unrolling the loop (see unroll-loop branch at github), full bandwidth usage is regained for both the float-double and the float-float versions!
New run times:
// float(mat)-float(vec)
Wall time: 0.084455 s
Bandwidth: 19.0861 GB/s
Compute: 4.72417 Gflop/s
// float(mat)-double(vec)
Wall time: 0.0865598 s
Bandwidth: 18.7145 GB/s
Compute: 4.6093 Gflop/s
The double-float loop that has to convert on the fly can't issue quite as fast. With some loop unrolling, gcc would probably do a better job.
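As a rough idea of what manual unrolling could look like for one row (this is not the code from the question's repository; the function name, the CSR-style val/col arrays, and the unroll factor of 4 are all assumptions for illustration):

```cpp
#include <cstddef>

// Hypothetical helper: dot product of one sparse row (float values, int column
// indices) with a dense double vector x, unrolled by 4.
double row_dot_unrolled(const float *val, const int *col,
                        const double *x, std::size_t nnz) {
    // Separate accumulators keep the four add chains independent.
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t k = 0;
    for (; k + 4 <= nnz; k += 4) {
        // Each float is converted to double on the fly before the multiply.
        s0 += static_cast<double>(val[k])     * x[col[k]];
        s1 += static_cast<double>(val[k + 1]) * x[col[k + 1]];
        s2 += static_cast<double>(val[k + 2]) * x[col[k + 2]];
        s3 += static_cast<double>(val[k + 3]) * x[col[k + 3]];
    }
    double sum = (s0 + s1) + (s2 + s3);
    // Scalar cleanup for the last nnz % 4 elements.
    for (; k < nnz; ++k) sum += static_cast<double>(val[k]) * x[col[k]];
    return sum;
}
```

The unrolling amortizes the loop-counter and branch uops over four elements; using several accumulators also changes the summation order, so results can differ slightly from the rolled loop.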
Your i7-2620M is a dual-core with hyperthreading. Hyperthreading doesn't help when the bottleneck is CPU uop throughput, rather than branch mispredicts, cache misses, or even just long latency chains. Saturating your memory bandwidth with just scalar operations isn't easy.
Here is the asm output for your code from the Godbolt Compiler Explorer. (gcc 5.3 makes about the same inner loop, BTW, so you're not losing out on much in this case by using an old gcc version.)
double-double version inner loop (gcc 4.9.3 -O3 -march=sandybridge -fopenmp):
## inner loop of <double,double>mult() with fused-domain uop counts
.L7:
mov edx, eax # 1 uop
add eax, 1 # 1 uop
mov ecx, DWORD PTR [r9+rdx*4] # 1 uop
vmovsd xmm0, QWORD PTR [r10+rdx*8] # 1 uop
vmulsd xmm0, xmm0, QWORD PTR [r8+rcx*8] # 2 uops
vaddsd xmm1, xmm1, xmm0 # 1 uop
cmp eax, esi # (macro-fused)
jne .L7 # 1 uop
total: 8 fused-domain uops, can issue at one iter per two clocks. It can also execute that fast: Three of the uops are loads, and SnB can do 4 loads per 2 clocks. 5 ALU uops are left (since SnB can't eliminate reg-reg moves in the rename stage, that was introduced with IvB). Anyway, there are no obvious bottlenecks on a single execution port. SnB's three ALU ports could handle up to six ALU uops per two cycles.
There's no micro-fusion because of using two-register addressing modes.
double-float version inner loop:
## inner loop of <double,float>mult() with fused-domain uop counts
.L7:
mov edx, eax # 1 uop
vxorpd xmm0, xmm0, xmm0 # 1 uop (no execution unit needed).
add eax, 1 # 1 uop
vcvtss2sd xmm0, xmm0, DWORD PTR [r9+rdx*4] # 2 uops
mov edx, DWORD PTR [r8+rdx*4] # 1 uop
vmulsd xmm0, xmm0, QWORD PTR [rsi+rdx*8] # 2 uops
vaddsd xmm1, xmm1, xmm0 # 1 uop
cmp eax, ecx # (macro-fused)
jne .L7 # 1 uop
gcc uses the xorpd to break the loop-carried dependency chain. cvtss2sd has a false dependency on the old value of xmm0, because it's badly designed and doesn't zero the top half of the register. (movsd when used as a load does zero, but not when used as a reg-reg move. In that case, use movaps unless you want merging.)
So, 10 fused-domain uops: can issue at one iteration per three clocks. I assume this is the only bottleneck, since it's just one extra ALU uop that needs an execution port. (SnB handles zeroing-idioms in the rename stage, so xorpd doesn't need one). cvtss2sd is a 2 uop instruction that apparently can't micro-fuse even if gcc used a one-register addressing mode. It has a throughput of one per clock. (On Haswell, it's a 2 uop instruction when used with a register src and dest, and on Skylake the throughput is reduced to one per 2 clocks, according to Agner Fog's testing.) That still wouldn't be a bottleneck for this loop on Skylake, though. It's still 10 fused-domain uops on Haswell / Skylake, and that's still the bottleneck.
-funroll-loops should help with gcc 4.9.3
gcc does a moderately good job, with code like:
mov edx, DWORD PTR [rsi+r14*4] # D.56355, *_40
lea r14d, [rax+2] # D.56355,
vcvtss2sd xmm6, xmm4, DWORD PTR [r8+r14*4] # D.56358, D.56358, *_36
vmulsd xmm2, xmm1, QWORD PTR [rcx+rdx*8] # D.56358, D.56358, *_45
vaddsd xmm14, xmm0, xmm13 # tmp, tmp, D.56358
vxorpd xmm1, xmm1, xmm1 # D.56358
mov edx, DWORD PTR [rsi+r14*4] # D.56355, *_40
lea r14d, [rax+3] # D.56355,
vcvtss2sd xmm10, xmm9, DWORD PTR [r8+r14*4] # D.56358, D.56358, *_36
vmulsd xmm7, xmm6, QWORD PTR [rcx+rdx*8] # D.56358, D.56358, *_45
vaddsd xmm3, xmm14, xmm2 # tmp, tmp, D.56358
vxorpd xmm6, xmm6, xmm6 # D.56358
Without the loop overhead, the work for each element is down to 8 fused-domain uops, and it's not a tiny loop that suffers from only issuing 2 uops every 3rd cycle (because 10 isn't a multiple of 4).
It could save the lea instructions by using displacements, e.g. [r8+rax*4 + 12]. IDK why gcc chooses not to.
Not even -ffast-math gets it to vectorize at all. There's probably no point since doing a gather from the sparse matrix would outweigh the benefit of doing a load of 4 or 8 contiguous values from the non-sparse vector. (insertps from memory is a 2-uop instruction that can't micro-fuse even with one-register addressing modes.)
On Broadwell or Skylake, vgatherdps may be fast enough to give a speedup over scalar loads. Probably a big speedup on Skylake. (It can gather 8 single-precision floats with a throughput of 8 floats per 5 clocks. Or vgatherqpd can gather 4 double-precision floats with a throughput of 4 doubles per 4 clocks.) This sets you up for a 256b vector FMA.
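A hedged sketch of what a gather-based row product might look like for the float-matrix / double-vector case. The function name and CSR-style arrays are assumptions, and since the indices here are 32-bit and the vector is double, it uses the vgatherdpd form (_mm256_i32gather_pd) rather than the two gather variants named above. It is untuned and I haven't measured it. The target attribute is GCC/clang-specific so it builds without -mavx2 -mfma.

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical helper: one sparse row (float values, int column indices)
// times a dense double vector x, 4 non-zeros per iteration.
__attribute__((target("avx2,fma")))
double row_dot_gather(const float *val, const int *col,
                      const double *x, std::size_t nnz) {
    __m256d acc = _mm256_setzero_pd();
    std::size_t k = 0;
    for (; k + 4 <= nnz; k += 4) {
        __m128i idx = _mm_loadu_si128(reinterpret_cast<const __m128i *>(col + k));
        __m256d xv = _mm256_i32gather_pd(x, idx, 8);           // x[col[k..k+3]]
        __m256d av = _mm256_cvtps_pd(_mm_loadu_ps(val + k));   // 4 floats -> 4 doubles
        acc = _mm256_fmadd_pd(av, xv, acc);                    // acc += av * xv
    }
    // One horizontal sum at the end.
    __m128d lo = _mm_add_pd(_mm256_castpd256_pd128(acc),
                            _mm256_extractf128_pd(acc, 1));
    double sum = _mm_cvtsd_f64(_mm_add_sd(lo, _mm_unpackhi_pd(lo, lo)));
    // Scalar cleanup for the last nnz % 4 elements.
    for (; k < nnz; ++k) sum += static_cast<double>(val[k]) * x[col[k]];
    return sum;
}
```

With a single vector accumulator the loop is limited by FMA latency, so a real version would probably want two or more accumulators.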

Loading 8 chars from memory into an __m256 variable as packed single precision floats

I am optimizing an algorithm for Gaussian blur on an image and I want to replace the usage of a float buffer[8] in the code below with an __m256 intrinsic variable. What series of instructions is best suited for this task?
// unsigned char *new_image is loaded with data
...
float buffer[8];
buffer[x ] = new_image[x];
buffer[x + 1] = new_image[x + 1];
buffer[x + 2] = new_image[x + 2];
buffer[x + 3] = new_image[x + 3];
buffer[x + 4] = new_image[x + 4];
buffer[x + 5] = new_image[x + 5];
buffer[x + 6] = new_image[x + 6];
buffer[x + 7] = new_image[x + 7];
// buffer is then used for further operations
...
//What I want instead in pseudocode:
__m256 b = [float(new_image[x+7]), float(new_image[x+6]), ... , float(new_image[x])];
If you're using AVX2, you can use PMOVZX to zero-extend your chars into 32-bit integers in a 256b register. From there, conversion to float can happen in-place.
; rsi = new_image
VPMOVZXBD ymm0, [rsi] ; or SX to sign-extend (Byte to DWord)
VCVTDQ2PS ymm0, ymm0 ; convert to packed float
This is a good strategy even if you want to do this for multiple vectors, but even better might be a 128-bit broadcast load to feed vpmovzxbd ymm,xmm and vpshufb ymm (_mm256_shuffle_epi8) for the high 64 bits, because Intel SnB-family CPUs don't micro-fuse a vpmovzx ymm,mem, only vpmovzx xmm,mem. (https://agner.org/optimize/). Broadcast loads are single uop with no ALU port required, running purely in a load port. So this is 3 total uops to bcast-load + vpmovzx + vpshufb.
(TODO: write an intrinsics version of that. It also sidesteps the problem of missed optimizations for _mm_loadl_epi64 -> _mm256_cvtepu8_epi32.)
Of course this requires a shuffle control vector in another register, so it's only worth it if you can use that multiple times.
vpshufb is usable because the data needed for each lane is there from the broadcast, and the high bit of the shuffle-control will zero the corresponding element.
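A minimal, untested-for-performance sketch of that broadcast-load + vpmovzxbd + vpshufb idea in intrinsics. The helper name is made up, and it stores to a float array only so the result is easy to check; in real code you'd keep using the two __m256 values. I'm assuming the compiler folds the 16-byte load into vbroadcasti128 ymm, [mem] and hoists the shuffle-control constant out of a loop, which you'd want to verify in the asm. Without AVX2 enabled it just falls back to a scalar loop.

```cpp
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: converts the 16 bytes at src to 16 floats at dst.
void bytes16_to_floats(const std::uint8_t* src, float* dst)
{
#if defined(__AVX2__)
    // Shuffle control for source bytes 8..15: each dword takes one byte,
    // and -1 (high bit set) zeroes the other three bytes of that dword.
    const __m256i ctrl = _mm256_setr_epi8(
         8, -1, -1, -1,   9, -1, -1, -1,  10, -1, -1, -1,  11, -1, -1, -1,
        12, -1, -1, -1,  13, -1, -1, -1,  14, -1, -1, -1,  15, -1, -1, -1);

    // Both 128-bit lanes now hold all 16 source bytes.
    __m256i both = _mm256_broadcastsi128_si256(
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(src)));

    __m256i lo = _mm256_cvtepu8_epi32(_mm256_castsi256_si128(both)); // bytes 0..7  (vpmovzxbd ymm, xmm)
    __m256i hi = _mm256_shuffle_epi8(both, ctrl);                    // bytes 8..15 (in-lane vpshufb)

    _mm256_storeu_ps(dst,     _mm256_cvtepi32_ps(lo));
    _mm256_storeu_ps(dst + 8, _mm256_cvtepi32_ps(hi));
#else
    for (int i = 0; i < 16; ++i)
        dst[i] = static_cast<float>(src[i]);
#endif
}
```

Note that this reads a full 16 bytes, so it's only appropriate when you actually want all 16 converted.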
This broadcast + shuffle strategy might be good on Ryzen; Agner Fog doesn't list uop counts for vpmovsx/zx ymm on it.
Do not do something like a 128-bit or 256-bit load and then shuffle that to feed further vpmovzx instructions. Total shuffle throughput will probably already be a bottleneck because vpmovzx is a shuffle. Intel Haswell/Skylake (the most common AVX2 uarches) have 1-per-clock shuffles but 2-per-clock loads. Using extra shuffle instructions instead of folding separate memory operands into vpmovzxbd is terrible. Only if you can reduce total uop count like I suggested with broadcast-load + vpmovzxbd + vpshufb is it a win.
My answer on Scaling byte pixel values (y=ax+b) with SSE2 (as floats)? may be relevant for converting back to uint8_t. The pack-back-to-bytes afterward part is semi-tricky if doing it with AVX2 packssdw/packuswb, because they work in-lane, unlike vpmovzx.
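To illustrate the in-lane problem, here's one possible way to pack 16 floats back down to 16 bytes, as a sketch rather than a tuned implementation. The helper name is hypothetical. It does the 32 -> 16 pack in-lane with vpackssdw, fixes up the element order with one lane-crossing vpermq, then finishes with a 128-bit packuswb. Other orderings of the pack and fixup steps are possible. The scalar fallback (used when AVX2 isn't enabled) rounds with the current rounding mode and clamps to 0..255, matching the vector path for inputs that fit in int16_t.

```cpp
#include <cmath>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: converts 16 floats to 16 bytes with rounding and
// unsigned saturation to 0..255.
void floats16_to_bytes(const float* src, std::uint8_t* dst)
{
#if defined(__AVX2__)
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(src));      // elements 0..7
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(src + 8));  // elements 8..15

    // In-lane pack to 16-bit: low lane = a0-3 b0-3, high lane = a4-7 b4-7.
    __m256i w = _mm256_packs_epi32(a, b);

    // Lane-crossing fixup: reorder qwords 0,2,1,3 to get a0-7 | b0-7.
    w = _mm256_permute4x64_epi64(w, _MM_SHUFFLE(3, 1, 2, 0));

    // Final 16 -> 8 pack with unsigned saturation, on the two 128-bit halves.
    __m128i bytes = _mm_packus_epi16(_mm256_castsi256_si128(w),
                                     _mm256_extracti128_si256(w, 1));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), bytes);
#else
    for (int i = 0; i < 16; ++i) {
        long r = std::lrint(src[i]);
        dst[i] = static_cast<std::uint8_t>(r < 0 ? 0 : (r > 255 ? 255 : r));
    }
#endif
}
```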
With only AVX1, not AVX2, you should do:
VPMOVZXBD xmm0, [rsi]
VPMOVZXBD xmm1, [rsi+4]
VINSERTF128 ymm0, ymm0, xmm1, 1 ; put the 2nd load of data into the high128 of ymm0
VCVTDQ2PS ymm0, ymm0 ; convert to packed float. Yes, works without AVX2
You of course never need an array of float, just __m256 vectors.
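For reference, the AVX1 sequence above might look roughly like this with intrinsics. This is a sketch under a few assumptions: the helper name is invented, the 4-byte loads go through memcpy + _mm_cvtsi32_si128 so nothing past src[7] is ever read, and whether the compiler folds those into vpmovzxbd xmm, [mem] memory operands is up to the compiler (check the asm). The store to a float array is only there to make the result easy to inspect. Without AVX enabled it falls back to a scalar loop.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: converts the 8 bytes at src to 8 floats at dst
// using only AVX1 (plus SSE4.1 pmovzxbd), no AVX2.
void bytes8_to_floats_avx1(const std::uint8_t* src, float* dst)
{
#if defined(__AVX__)
    std::int32_t lo4, hi4;
    std::memcpy(&lo4, src, 4);       // bytes 0..3
    std::memcpy(&hi4, src + 4, 4);   // bytes 4..7

    __m128i lo = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(lo4));  // pmovzxbd xmm
    __m128i hi = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(hi4));

    // Put the second group into the high 128 bits (vinsertf128).
    __m256i both = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);

    _mm256_storeu_ps(dst, _mm256_cvtepi32_ps(both));         // vcvtdq2ps ymm
#else
    for (int i = 0; i < 8; ++i)
        dst[i] = static_cast<float>(src[i]);
#endif
}
```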
GCC / MSVC missed optimizations for VPMOVZXBD ymm,[mem] with intrinsics
GCC and MSVC are bad at folding a _mm_loadl_epi64 into a memory operand for vpmovzx*. (But at least there is a load intrinsic of the right width, unlike for pmovzxbq xmm, word [mem].)
We get a vmovq load and then a separate vpmovzx with an XMM input. (With ICC and clang3.6+ we get safe + optimal code from using _mm_loadl_epi64, as we do from gcc9+.)
But gcc8.3 and earlier can fold a _mm_loadu_si128 16-byte load intrinsic into an 8-byte memory operand. This gives optimal asm at -O3 on GCC, but is unsafe at -O0 where it compiles to an actual vmovdqu load that touches more data than we actually use, and could go off the end of a page.
Two gcc bugs submitted because of this answer:
SSE/AVX movq load (_mm_cvtsi64_si128) not being folded into pmovzx (fixed for gcc9, but the fix breaks load folding for a 128-bit load so the workaround hack for old GCC makes gcc9 do worse.)
No intrinsic for x86 MOVQ m64, %xmm in 32bit mode. (TODO: report this for clang/LLVM as well?)
There's no intrinsic to use SSE4.1 pmovsx / pmovzx as a load, only with a __m128i source operand. But the asm instructions only read the amount of data they actually use, not a 16-byte __m128i memory source operand. Unlike punpck*, you can use this on the last 8B of a page without faulting. (And on unaligned addresses even with the non-AVX version).
So here's the evil solution I've come up with. Don't use this, #ifdef __OPTIMIZE__ is Bad, making it possible to create bugs that only happen in the debug build or only in the optimized build!
#include <immintrin.h>
#include <stdint.h>
#if !defined(__OPTIMIZE__)
// Making your code compile differently with/without optimization is a TERRIBLE idea
// great way to create Heisenbugs that disappear when you try to debug them.
// Even if you *plan* to always use -Og for debugging, instead of -O0, this is still evil
#define USE_MOVQ
#endif
__m256 load_bytes_to_m256(uint8_t *p)
{
#ifdef USE_MOVQ // compiles to an actual vmovq then vpmovzx ymm, xmm with gcc8.3 -O3
__m128i small_load = _mm_loadl_epi64( (const __m128i*)p);
#else // USE_LOADU // compiles to a 128b load with gcc -O0, potentially segfaulting
__m128i small_load = _mm_loadu_si128( (const __m128i*)p );
#endif
__m256i intvec = _mm256_cvtepu8_epi32( small_load );
//__m256i intvec = _mm256_cvtepu8_epi32( *(__m128i*)p ); // compiles to an aligned load with -O0
return _mm256_cvtepi32_ps(intvec);
}
With USE_MOVQ enabled, gcc -O3 (v5.3.0) emits the following (as does MSVC):
load_bytes_to_m256(unsigned char*):
vmovq xmm0, QWORD PTR [rdi]
vpmovzxbd ymm0, xmm0
vcvtdq2ps ymm0, ymm0
ret
The stupid vmovq is what we want to avoid. If you let it use the unsafe loadu_si128 version, it will make good optimized code.
GCC9, clang, and ICC emit:
load_bytes_to_m256(unsigned char*):
vpmovzxbd ymm0, qword ptr [rdi] # ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
vcvtdq2ps ymm0, ymm0
ret
Writing the AVX1-only version with intrinsics is left as an un-fun exercise for the reader. You asked for "instructions", not "intrinsics", and this is one place where there's a gap in the intrinsics. Having to use _mm_cvtsi64_si128 to avoid potentially loading from out-of-bounds addresses is stupid, IMO. I want to be able to think of intrinsics in terms of the instructions they map to, with the load/store intrinsics as informing the compiler about alignment guarantees or lack thereof. Having to use the intrinsic for an instruction I don't want is pretty dumb.
Also note that if you're looking in the Intel insn ref manual, there are two separate entries for movq:
movd/movq, the version that can have an integer register as a src/dest operand (66 REX.W 0F 6E (or VEX.128.66.0F.W1 6E) for (V)MOVQ xmm, r/m64). That's where you'll find the intrinsic that can accept a 64-bit integer, _mm_cvtsi64_si128. (Some compilers don't define it in 32-bit mode.)
movq: the version that can have two xmm registers as operands. This one is an extension of the MMXreg -> MMXreg instruction, which can also load/store, like MOVDQU. Its opcode is F3 0F 7E (VEX.128.F3.0F.WIG 7E) for MOVQ xmm, xmm/m64.
The asm ISA ref manual only lists the __m128i _mm_mov_epi64(__m128i a) intrinsic for zeroing the high 64b of a vector while copying it. But the intrinsics guide does list _mm_loadl_epi64(__m128i const* mem_addr) which has a stupid prototype (pointer to a 16-byte __m128i type when it really only loads 8 bytes). It is available on all 4 of the major x86 compilers, and should actually be safe. Note that the __m128i* is just passed to this opaque intrinsic, not actually dereferenced.
The more sane _mm_loadu_si64 (void const* mem_addr) is also listed, but gcc is missing that one.