Fastest downscaling of 8bit gray image with SSE - c++

I have a function which downscales an 8-bit image by a factor of two. I previously optimised the rgb32 case with SSE. Now I would like to do the same for the gray8 case.
At the core, there is a function taking two lines of pixel data, which works like this:
/**
* Calculates the average of two rows of gray8 pixels by averaging four pixels.
*/
void average2Rows(const uint8_t* row1, const uint8_t* row2, uint8_t* dst, int size)
{
for (int i = 0; i < size - 1; i += 2)
*(dst++) = ((row1[i]+row1[i+1]+row2[i]+row2[i+1])/4)&0xFF;
}
Now, I have come up with an SSE variant which is about three times faster, but it does involve a lot of shuffling and I think one might do better. Does anybody see what can be optimised here?
/* row1: 16 8-bit values A-P
* row2: 16 8-bit values a-p
* returns 16 8-bit values (A+B+a+b)/4, (C+D+c+d)/4, ..., (O+P+o+p)/4
*/
__m128i avg16Bytes(const __m128i& row1, const __m128i& row2)
{
static const __m128i zero = _mm_setzero_si128();
__m128i ABCDEFGHIJKLMNOP = _mm_avg_epu8(row1, row2);
__m128i ABCDEFGH = _mm_unpacklo_epi8(ABCDEFGHIJKLMNOP, zero);
__m128i IJKLMNOP = _mm_unpackhi_epi8(ABCDEFGHIJKLMNOP, zero);
__m128i AIBJCKDL = _mm_unpacklo_epi16( ABCDEFGH, IJKLMNOP );
__m128i EMFNGOHP = _mm_unpackhi_epi16( ABCDEFGH, IJKLMNOP );
__m128i AEIMBFJN = _mm_unpacklo_epi16( AIBJCKDL, EMFNGOHP );
__m128i CGKODHLP = _mm_unpackhi_epi16( AIBJCKDL, EMFNGOHP );
__m128i ACEGIKMO = _mm_unpacklo_epi16( AEIMBFJN, CGKODHLP );
__m128i BDFHJLNP = _mm_unpackhi_epi16( AEIMBFJN, CGKODHLP );
return _mm_avg_epu8(ACEGIKMO, BDFHJLNP);
}
/*
* Calculates the average of two rows of gray8 pixels by averaging four pixels.
*/
void average2Rows(const uint8_t* src1, const uint8_t* src2, uint8_t* dst, int size)
{
for (int i = 0; i < size - 31; i += 32)
{
__m128i tl = _mm_loadu_si128((__m128i const*)(src1+i));
__m128i tr = _mm_loadu_si128((__m128i const*)(src1+i+16));
__m128i bl = _mm_loadu_si128((__m128i const*)(src2+i));
__m128i br = _mm_loadu_si128((__m128i const*)(src2+i+16));
__m128i l_avg = avg16Bytes(tl, bl);
__m128i r_avg = avg16Bytes(tr, br);
_mm_storeu_si128((__m128i *)(dst+(i/2)), _mm_packus_epi16(l_avg, r_avg));
}
}
Notes:
I realise my function has slight (off by one) rounding errors, but I am willing to accept this.
For clarity I have assumed size is a multiple of 32.
EDIT: There is now a github repository implementing the answers to this question. The fastest solution was provided by user Peter Cordes. See his essay below for details:
__m128i avg16Bytes(const __m128i& row1, const __m128i& row2)
{
// Average the first 16 values of src1 and src2:
__m128i avg = _mm_avg_epu8(row1, row2);
// Unpack and horizontal add:
avg = _mm_maddubs_epi16(avg, _mm_set1_epi8(1));
// Divide by 2:
return _mm_srli_epi16(avg, 1);
}
It works like my original implementation by calculating ((A+a)/2 + (B+b)/2)/2 instead of (A+B+a+b)/4, so it has the same off-by-one rounding error. (For example, with A=3 and B=a=b=0, the double-rounded result is 1, while the exact average truncates to 0.)
Kudos to user Paul R for implementing a solution which is twice as fast as mine, but exact:
__m128i avg16Bytes(const __m128i& row1, const __m128i& row2)
{
// Unpack and horizontal add:
__m128i row1_sum = _mm_maddubs_epi16(row1, _mm_set1_epi8(1));
__m128i row2_sum = _mm_maddubs_epi16(row2, _mm_set1_epi8(1));
// Vertical add:
__m128i sum = _mm_add_epi16(row1_sum, row2_sum);
// Divide by 4:
return _mm_srli_epi16(sum, 2);
}

If you're willing to accept double-rounding from using pavgb twice, you can go faster than Paul R's answer by doing the vertical averaging first with pavgb, cutting in half the amount of data that needs to be unpacked to 16-bit elements. (And allowing half the loads to fold into memory operands for pavgb, reducing front-end bottlenecks on some CPUs.)
For horizontal averaging, your best bet is probably still pmaddubsw with set1(1) and shift by 1, then pack.
// SSSE3 version
// I used `__restrict__` to give the compiler more flexibility in unrolling
void average2Rows_doubleround(const uint8_t* __restrict__ src1, const uint8_t*__restrict__ src2,
uint8_t*__restrict__ dst, size_t size)
{
const __m128i vk1 = _mm_set1_epi8(1);
size_t dstsize = size/2;
for (size_t i = 0; i < dstsize - 15; i += 16)
{
__m128i v0 = _mm_load_si128((const __m128i *)&src1[i*2]);
__m128i v1 = _mm_load_si128((const __m128i *)&src1[i*2 + 16]);
__m128i v2 = _mm_load_si128((const __m128i *)&src2[i*2]);
__m128i v3 = _mm_load_si128((const __m128i *)&src2[i*2 + 16]);
__m128i left = _mm_avg_epu8(v0, v2);
__m128i right = _mm_avg_epu8(v1, v3);
__m128i w0 = _mm_maddubs_epi16(left, vk1); // unpack and horizontal add
__m128i w1 = _mm_maddubs_epi16(right, vk1);
w0 = _mm_srli_epi16(w0, 1); // divide by 2
w1 = _mm_srli_epi16(w1, 1);
w0 = _mm_packus_epi16(w0, w1); // pack
_mm_storeu_si128((__m128i *)&dst[i], w0);
}
}
The other option is _mm_srli_epi16(v, 8) to line up the odd elements with the even elements of every horizontal pair. But since there is no horizontal pack with truncation, you have to _mm_and_si128(v, _mm_set1_epi16(0x00FF)) both halves before you pack. It turns out to be slower than using SSSE3 pmaddubsw, especially without AVX where it takes extra MOVDQA instructions to copy registers.
void average2Rows_doubleround_SSE2(const uint8_t* __restrict__ src1, const uint8_t* __restrict__ src2, uint8_t* __restrict__ dst, size_t size)
{
size /= 2;
for (size_t i = 0; i < size - 15; i += 16)
{
__m128i v0 = _mm_load_si128((__m128i *)&src1[i*2]);
__m128i v1 = _mm_load_si128((__m128i *)&src1[i*2 + 16]);
__m128i v2 = _mm_load_si128((__m128i *)&src2[i*2]);
__m128i v3 = _mm_load_si128((__m128i *)&src2[i*2 + 16]);
__m128i left = _mm_avg_epu8(v0, v2);
__m128i right = _mm_avg_epu8(v1, v3);
__m128i l_odd = _mm_srli_epi16(left, 8); // line up horizontal pairs
__m128i r_odd = _mm_srli_epi16(right, 8);
__m128i l_avg = _mm_avg_epu8(left, l_odd); // leaves garbage in the high halves
__m128i r_avg = _mm_avg_epu8(right, r_odd);
l_avg = _mm_and_si128(l_avg, _mm_set1_epi16(0x00FF));
r_avg = _mm_and_si128(r_avg, _mm_set1_epi16(0x00FF));
__m128i avg = _mm_packus_epi16(l_avg, r_avg); // pack
_mm_storeu_si128((__m128i *)&dst[i], avg);
}
}
With AVX512BW, there's _mm_cvtepi16_epi8, but IACA says it's 2 uops on Skylake-AVX512, and it only takes 1 input and produces a half-width output. According to IACA, the memory-destination form is 4 total unfused-domain uops (same as reg,reg + separate store). I had to use _mm_mask_cvtepi16_storeu_epi8(&dst[i+0], -1, l_avg); to get it, because gcc and clang fail to fold a separate _mm_store into a memory destination for vpmovwb. (There is no non-masked store intrinsic, because compilers are supposed to do that for you, like they do with folding _mm_load into memory operands for typical ALU instructions.)
It's probably only useful when narrowing to 1/4 or 1/8th (cvtepi64_epi8), not just in half. Or maybe useful to avoid needing a second shuffle to deal with the in-lane behaviour of _mm512_packus_epi16. With AVX2, after a _mm256_packus_epi16 on [D C] [B A], you have [D B | C A], which you can fix with an AVX2 _mm256_permute4x64_epi64 (__m256i a, const int imm8) to shuffle in 64-bit chunks. But with AVX512, you'd need a vector shuffle-control for the vpermq. packus + a fixup shuffle is probably still a better option, though.
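For what it's worth, here is a minimal AVX2 sketch of the double-rounding loop with that packus + vpermq fixup. This is my own illustration rather than benchmarked code, and it assumes size is a multiple of 64 with 32-byte-aligned source rows:
#include <immintrin.h>
#include <cstdint>
#include <cstddef>
// Hypothetical AVX2 version of average2Rows_doubleround (untested sketch).
void average2Rows_doubleround_avx2(const uint8_t* __restrict__ src1, const uint8_t* __restrict__ src2,
                                   uint8_t* __restrict__ dst, size_t size)
{
    const __m256i vk1 = _mm256_set1_epi8(1);
    size_t dstsize = size / 2;
    for (size_t i = 0; i < dstsize - 31; i += 32)
    {
        __m256i v0 = _mm256_load_si256((const __m256i *)&src1[i*2]);
        __m256i v1 = _mm256_load_si256((const __m256i *)&src1[i*2 + 32]);
        __m256i v2 = _mm256_load_si256((const __m256i *)&src2[i*2]);
        __m256i v3 = _mm256_load_si256((const __m256i *)&src2[i*2 + 32]);
        __m256i left  = _mm256_avg_epu8(v0, v2);               // vertical average
        __m256i right = _mm256_avg_epu8(v1, v3);
        __m256i w0 = _mm256_maddubs_epi16(left, vk1);          // horizontal add of byte pairs
        __m256i w1 = _mm256_maddubs_epi16(right, vk1);
        w0 = _mm256_srli_epi16(w0, 1);                         // divide by 2
        w1 = _mm256_srli_epi16(w1, 1);
        __m256i packed = _mm256_packus_epi16(w0, w1);          // in-lane pack: [D B | C A]
        packed = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));  // -> [D C | B A]
        _mm256_storeu_si256((__m256i *)&dst[i], packed);
    }
}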
Once you do this, there aren't many vector instructions left in the loop, and there's a lot to gain from letting the compiler make tighter asm. Your loop is unfortunately difficult for compilers to do a good job with.
(This also helps Paul R's solution, since he copied the compiler-unfriendly loop structure from the question.)
Use the loop-counter in a way that gcc/clang can optimize better, and use types that avoid re-doing sign extension every time through the loop.
With your current loop, gcc/clang actually do an arithmetic right-shift for i/2, instead of incrementing by 16 (instead of 32) and using scaled-index addressing modes for the loads. It seems they don't realize that i is always even.
(full code + asm on Matt Godbolt's compiler explorer):
.LBB1_2: ## clang's inner loop for int i, dst[i/2] version
movdqu xmm1, xmmword ptr [rdi + rcx]
movdqu xmm2, xmmword ptr [rdi + rcx + 16]
movdqu xmm3, xmmword ptr [rsi + rcx]
movdqu xmm4, xmmword ptr [rsi + rcx + 16]
pavgb xmm3, xmm1
pavgb xmm4, xmm2
pmaddubsw xmm3, xmm0
pmaddubsw xmm4, xmm0
psrlw xmm3, 1
psrlw xmm4, 1
packuswb xmm3, xmm4
mov eax, ecx # This whole block is wasted instructions!!!
shr eax, 31
add eax, ecx
sar eax # eax = ecx/2, with correct rounding even for negative `i`
cdqe # sign-extend EAX into RAX
movdqu xmmword ptr [rdx + rax], xmm3
add rcx, 32 # i += 32
cmp rcx, r8
jl .LBB1_2 # }while(i < size-31)
gcc7.1 isn't quite so bad (just mov/sar/movsx), but gcc5.x and 6.x do separate pointer-increments for src1 and src2, and also for a counter/index for the stores. (Totally braindead behaviour, especially since they still do it with -march=sandybridge. Indexed movdqu stores and non-indexed movdqu loads give you the maximum loop overhead.)
Anyway, using dstsize and multiplying i inside the loop instead of dividing it gives much better results. Different versions of gcc and clang reliably compile it into a single loop-counter that they use with a scaled-index addressing mode for the loads. You get code like:
movdqa xmm1, xmmword ptr [rdi + 2*rax]
movdqa xmm2, xmmword ptr [rdi + 2*rax + 16]
pavgb xmm1, xmmword ptr [rsi + 2*rax]
pavgb xmm2, xmmword ptr [rsi + 2*rax + 16] # saving instructions with aligned loads, see below
...
movdqu xmmword ptr [rdx + rax], xmm1
add rax, 16
cmp rax, rcx
jb .LBB0_2
I used size_t i to match size_t size, to make sure gcc didn't waste any instructions sign-extending or zero-extending it to the width of a pointer. (zero-extension usually happens for free, though, so unsigned size and unsigned i might have been ok, and saved a couple REX prefixes.)
You could still get rid of the cmp but counting an index up towards 0, which would speed the loop up a little bit more than what I've done. I'm not sure how easy it would be to get compilers to not be stupid and omit the cmp instruction if you do count up towards zero. Indexing from the end of an object is no problem, though. src1+=size;. It does complicate things if you want to use an unaligned-cleanup loop, though.
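As a sketch of that count-up-towards-zero idea (my own illustration, not the benchmarked code): offset the pointers past the end of each buffer and run a signed index from -dstsize up to 0, so the loop branch can use the flags from the index add instead of a separate cmp. This assumes size is a multiple of 32 and 16-byte-aligned rows:
#include <tmmintrin.h>
#include <cstdint>
#include <cstddef>
// Hypothetical variant: the index counts up toward zero, so no separate cmp is needed.
void average2Rows_endindex(const uint8_t* __restrict__ src1, const uint8_t* __restrict__ src2,
                           uint8_t* __restrict__ dst, size_t size)
{
    const __m128i vk1 = _mm_set1_epi8(1);
    ptrdiff_t dstsize = (ptrdiff_t)(size / 2);
    src1 += size;  src2 += size;  dst += dstsize;     // point one-past-the-end of each buffer
    for (ptrdiff_t i = -dstsize; i < 0; i += 16)      // i runs from -dstsize up to 0
    {
        __m128i v0 = _mm_load_si128((const __m128i *)&src1[i*2]);
        __m128i v1 = _mm_load_si128((const __m128i *)&src1[i*2 + 16]);
        __m128i v2 = _mm_load_si128((const __m128i *)&src2[i*2]);
        __m128i v3 = _mm_load_si128((const __m128i *)&src2[i*2 + 16]);
        __m128i left  = _mm_avg_epu8(v0, v2);
        __m128i right = _mm_avg_epu8(v1, v3);
        __m128i w0 = _mm_srli_epi16(_mm_maddubs_epi16(left, vk1), 1);
        __m128i w1 = _mm_srli_epi16(_mm_maddubs_epi16(right, vk1), 1);
        _mm_storeu_si128((__m128i *)&dst[i], _mm_packus_epi16(w0, w1));
    }
}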
Tested on my Skylake i7-6700k (max turbo 4.4GHz, but look at the clock-cycle counts rather than the times): with g++7.1, this is the difference between ~2.7 seconds and ~3.3 seconds for 100M reps of 1024 bytes.
Performance counter stats for './grayscale-dowscale-by-2.inline.gcc-skylake-noavx' (2 runs):
2731.607950 task-clock (msec) # 1.000 CPUs utilized ( +- 0.40% )
2 context-switches # 0.001 K/sec ( +- 20.00% )
0 cpu-migrations # 0.000 K/sec
88 page-faults:u # 0.032 K/sec ( +- 0.57% )
11,917,723,707 cycles # 4.363 GHz ( +- 0.07% )
42,006,654,015 instructions # 3.52 insn per cycle ( +- 0.00% )
41,908,837,143 uops_issued_any # 15342.186 M/sec ( +- 0.00% )
49,409,631,052 uops_executed_thread # 18088.112 M/sec ( +- 0.00% )
3,301,193,901 branches # 1208.517 M/sec ( +- 0.00% )
100,013,629 branch-misses # 3.03% of all branches ( +- 0.01% )
2.731715466 seconds time elapsed ( +- 0.40% )
vs. Same vectorization, but with int i and dst[i/2] creating higher loop overhead (more scalar instructions):
Performance counter stats for './grayscale-dowscale-by-2.loopoverhead-aligned-inline.gcc-skylake-noavx' (2 runs):
3314.335833 task-clock (msec) # 1.000 CPUs utilized ( +- 0.02% )
4 context-switches # 0.001 K/sec ( +- 14.29% )
0 cpu-migrations # 0.000 K/sec
88 page-faults:u # 0.026 K/sec ( +- 0.57% )
14,531,925,552 cycles # 4.385 GHz ( +- 0.06% )
51,607,478,414 instructions # 3.55 insn per cycle ( +- 0.00% )
51,109,303,460 uops_issued_any # 15420.677 M/sec ( +- 0.00% )
55,810,234,508 uops_executed_thread # 16839.040 M/sec ( +- 0.00% )
3,301,344,602 branches # 996.080 M/sec ( +- 0.00% )
100,025,451 branch-misses # 3.03% of all branches ( +- 0.00% )
3.314418952 seconds time elapsed ( +- 0.02% )
vs. Paul R's version (optimized for lower loop overhead): exact but slower
Performance counter stats for './grayscale-dowscale-by-2.paulr-inline.gcc-skylake-noavx' (2 runs):
3751.990587 task-clock (msec) # 1.000 CPUs utilized ( +- 0.03% )
3 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
88 page-faults:u # 0.024 K/sec ( +- 0.56% )
16,323,525,446 cycles # 4.351 GHz ( +- 0.04% )
58,008,101,634 instructions # 3.55 insn per cycle ( +- 0.00% )
57,610,721,806 uops_issued_any # 15354.709 M/sec ( +- 0.00% )
55,505,321,456 uops_executed_thread # 14793.566 M/sec ( +- 0.00% )
3,301,456,435 branches # 879.921 M/sec ( +- 0.00% )
100,001,954 branch-misses # 3.03% of all branches ( +- 0.02% )
3.752086635 seconds time elapsed ( +- 0.03% )
vs. Paul R's original version with extra loop overhead:
Performance counter stats for './grayscale-dowscale-by-2.loopoverhead-paulr-inline.gcc-skylake-noavx' (2 runs):
4154.300887 task-clock (msec) # 1.000 CPUs utilized ( +- 0.01% )
3 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
90 page-faults:u # 0.022 K/sec ( +- 1.68% )
18,174,791,383 cycles # 4.375 GHz ( +- 0.03% )
67,608,724,157 instructions # 3.72 insn per cycle ( +- 0.00% )
66,937,292,129 uops_issued_any # 16112.769 M/sec ( +- 0.00% )
61,875,610,759 uops_executed_thread # 14894.350 M/sec ( +- 0.00% )
3,301,571,922 branches # 794.736 M/sec ( +- 0.00% )
100,029,270 branch-misses # 3.03% of all branches ( +- 0.00% )
4.154441330 seconds time elapsed ( +- 0.01% )
Note that branch-misses is about the same as the repeat count: the inner loop mispredicts at the end every time. Unrolling to keep the loop iteration count under about 22 would make the pattern short enough for Skylake's branch predictors to predict the not-taken condition correctly most of the time. Branch mispredicts are the only reason we're not getting ~4.0 uops per cycle through the pipeline, so avoiding branch misses would raise the IPC from 3.5 to over 4.0 (cmp/jcc macro-fusion puts 2 instructions in one uop).
These branch-misses probably hurt even if you're bottlenecked on L2 cache bandwidth (instead of the front-end). I didn't test that, though: my testing just wraps a for() loop around the function call from Paul R's test harness, so everything's hot in L1D cache. 32 iterations of the inner loop is close to the worst-case here: low enough for frequent mispredicts, but not so low that branch-prediction can pick up the pattern and avoid them.
My version should run in 3 cycles per iteration, bottlenecked only on the frontend, on Intel Sandybridge and later. (Nehalem will bottleneck on one load per clock.)
See http://agner.org/optimize/, and also Can x86's MOV really be "free"? Why can't I reproduce this at all? for more about fused-domain vs. unfused-domain uops and perf counters.
update: clang unrolls it for you, at least when the size is a compile-time constant... Strangely, it unrolls even the non-inline version of the dst[i/2] function (with unknown size), but not the lower-loop-overhead version.
With clang++-4.0 -O3 -march=skylake -mno-avx, my version (unrolled by 2 by the compiler) runs in: 9.61G cycles for 100M iters (2.2s). (35.6G uops issued (fused domain), 45.0G uops executed (unfused domain), near-zero branch-misses.) Probably not bottlenecked on the front-end anymore, but AVX would still hurt.
Paul R's (also unrolled by 2) runs in 12.29G cycles for 100M iters (2.8s). 48.4G uops issued (fused domain), 51.4G uops executed (unfused-domain). 50.1G instructions, for 4.08 IPC, probably still bottlenecked on the front-end (because it needs a couple movdqa instructions to copy a register before destroying it). AVX would help for non-destructive vector instructions, even without AVX2 for wider integer vectors.
With careful coding, you should be able to do about this well for runtime-variable sizes.
Use aligned pointers and aligned loads, so the compiler can use pavgb with a memory operand instead of using a separate unaligned-load instruction. This means fewer instructions and fewer uops for the front-end, which is a bottleneck for this loop.
This doesn't help Paul's version, because only the second operand for pmaddubsw can come from memory, and that's the one treated as signed bytes. If we used _mm_maddubs_epi16(_mm_set1_epi8(1), v0);, the 16-bit multiply result would be sign-extended instead of zero-extended. So 1+255 would come out to 0 instead of 256.
Folding a load requires alignment with SSE, but not with AVX. However, on Intel Haswell/Skylake, indexed addressing modes can only stay micro-fused with instructions which read-modify-write their destination register. vpavgb xmm0, xmm0, [rsi+rax*2] is un-laminated to 2 uops on Haswell/Skylake before it issues into the out-of-order part of the core, but pavgb xmm1, [rsi+rax*2] can stay micro-fused all the way through, so it issues as a single uop. The front-end issue bottleneck is 4 fused-domain uops per clock on mainstream x86 CPUs except Ryzen (i.e. not Atom/Silvermont). Folding half the loads into memory operands helps with that on all Intel CPUs except Sandybridge/Ivybridge, and all AMD CPUs.
gcc and clang will fold the loads when inlining into a test function that uses alignas(32), even if you use _mm_loadu intrinsics. They know the data is aligned, and take advantage.
Weird fact: compiling the 128b-vectorized code with AVX code-gen enabled (-march=native) actually slows it down on Haswell/Skylake, because it would make all 4 loads issue as separate uops even when they're memory operands for vpavgb, and there aren't any movdqa register-copying instructions that AVX would avoid. (Usually AVX comes out ahead anyway even for manually-vectorized code that still only uses 128b vectors, because of the benefit of 3-operand instructions not destroying one of their inputs.) In this case: 13.53G cycles ( +- 0.05% ) or 3094.195773 ms ( +- 0.20% ), up from 11.92G cycles in ~2.7 seconds. uops_issued = 48.508G, up from 41.908G. Instruction count and uops_executed counts are essentially the same.
OTOH, an actual 256b AVX2 version would run slightly less than twice as fast. Some unrolling to reduce the front-end bottleneck will definitely help. An AVX512 version might run close to 4x as fast on Skylake-AVX512 Xeons, but might bottleneck on ALU throughput since SKX shuts down execution port 1 when there are any 512b uops in the RS waiting to execute, according to @Mysticial's testing. (That explains why pavgb zmm has 1 per clock throughput while pavgb ymm is 2 per clock.)
To have both input rows aligned, store your image data in a format with a row stride that's a multiple of 16, even if the actual image dimensions are odd. Your storage stride doesn't have to match your actual image dimensions.
If you can only align either the source or dest (e.g. because you're downscaling a region that starts at an odd column in the source image), you should probably still align your source pointers.
Intel's optimization manual recommends aligning the destination instead of the source, if you can't align both, but doing 4x as many loads as stores probably changes the balance.
To handle unaligned at the start/end, do a potentially-overlapping unaligned vector of pixels from the start and end. It's fine for stores to overlap other stores, and since dst is separate from src, you can redo a partially-overlapping vector.
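A sketch of just the overlapping-tail part (my own illustration; it assumes dstsize >= 16 and that a main loop like the ones above has already produced the first dstsize & ~15 output bytes):
#include <tmmintrin.h>
#include <cstdint>
#include <cstddef>
// Hypothetical tail handler: redo the last 16 dst bytes with one unaligned vector whose
// source window ends exactly at the end of the row. The bytes that overlap the main
// loop's output get recomputed to the same values, so rewriting them is fine.
void average2Rows_tail(const uint8_t* __restrict__ src1, const uint8_t* __restrict__ src2,
                       uint8_t* __restrict__ dst, size_t size)
{
    size_t dstsize = size / 2;
    if ((dstsize & 15) && dstsize >= 16)
    {
        size_t i = dstsize - 16;                 // overlaps up to 15 already-written bytes
        __m128i v0 = _mm_loadu_si128((const __m128i *)&src1[i*2]);
        __m128i v1 = _mm_loadu_si128((const __m128i *)&src1[i*2 + 16]);
        __m128i v2 = _mm_loadu_si128((const __m128i *)&src2[i*2]);
        __m128i v3 = _mm_loadu_si128((const __m128i *)&src2[i*2 + 16]);
        __m128i left  = _mm_avg_epu8(v0, v2);
        __m128i right = _mm_avg_epu8(v1, v3);
        __m128i w0 = _mm_srli_epi16(_mm_maddubs_epi16(left,  _mm_set1_epi8(1)), 1);
        __m128i w1 = _mm_srli_epi16(_mm_maddubs_epi16(right, _mm_set1_epi8(1)), 1);
        _mm_storeu_si128((__m128i *)&dst[i], _mm_packus_epi16(w0, w1));
    }
}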
In Paul's test main(), I just added alignas(32) in front of every array.
AVX2:
Since you compile one version with -march=native, you can easily detect AVX2 at compile time with #ifdef __AVX2__. There's no simple way to use exactly the same code for 128b and 256b manual vectorization. All the intrinsics have different names, so you typically need to copy everything even if there are no other differences.
(There are some C++ wrapper libraries for the intrinsics that use operator-overloading and function overloading to let you write a templated version that uses the same logic on different widths of vector. e.g. Agner Fog's VCL is good, but unless your software is open-source, you can't use it because it's GPL licensed and you want to distribute a binary.)
To take advantage of AVX2 in your binary-distribution version, you'd have to do runtime detection/dispatching. In that case, you'd want to dispatch to versions of a function that loops over rows, so you don't have dispatch overhead inside your loop over rows. Or just let that version use SSSE3.

Here is an implementation which uses fewer instructions. I haven't benchmarked it against your code though, so it may not be significantly faster:
void average2Rows(const uint8_t* src1, const uint8_t* src2, uint8_t* dst, int size)
{
const __m128i vk1 = _mm_set1_epi8(1);
for (int i = 0; i < size - 31; i += 32)
{
__m128i v0 = _mm_loadu_si128((__m128i *)&src1[i]);
__m128i v1 = _mm_loadu_si128((__m128i *)&src1[i + 16]);
__m128i v2 = _mm_loadu_si128((__m128i *)&src2[i]);
__m128i v3 = _mm_loadu_si128((__m128i *)&src2[i + 16]);
__m128i w0 = _mm_maddubs_epi16(v0, vk1); // unpack and horizontal add
__m128i w1 = _mm_maddubs_epi16(v1, vk1);
__m128i w2 = _mm_maddubs_epi16(v2, vk1);
__m128i w3 = _mm_maddubs_epi16(v3, vk1);
w0 = _mm_add_epi16(w0, w2); // vertical add
w1 = _mm_add_epi16(w1, w3);
w0 = _mm_srli_epi16(w0, 2); // divide by 4
w1 = _mm_srli_epi16(w1, 2);
w0 = _mm_packus_epi16(w0, w1); // pack
_mm_storeu_si128((__m128i *)&dst[i / 2], w0);
}
}
Test harness:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <tmmintrin.h>
void average2Rows_ref(const uint8_t* row1, const uint8_t* row2, uint8_t* dst, int size)
{
for (int i = 0; i < size - 1; i += 2)
{
dst[i / 2] = (row1[i] + row1[i + 1] + row2[i] + row2[i + 1]) / 4;
}
}
void average2Rows(const uint8_t* src1, const uint8_t* src2, uint8_t* dst, int size)
{
const __m128i vk1 = _mm_set1_epi8(1);
for (int i = 0; i < size - 31; i += 32)
{
__m128i v0 = _mm_loadu_si128((__m128i *)&src1[i]);
__m128i v1 = _mm_loadu_si128((__m128i *)&src1[i + 16]);
__m128i v2 = _mm_loadu_si128((__m128i *)&src2[i]);
__m128i v3 = _mm_loadu_si128((__m128i *)&src2[i + 16]);
__m128i w0 = _mm_maddubs_epi16(v0, vk1); // unpack and horizontal add
__m128i w1 = _mm_maddubs_epi16(v1, vk1);
__m128i w2 = _mm_maddubs_epi16(v2, vk1);
__m128i w3 = _mm_maddubs_epi16(v3, vk1);
w0 = _mm_add_epi16(w0, w2); // vertical add
w1 = _mm_add_epi16(w1, w3);
w0 = _mm_srli_epi16(w0, 2); // divide by 4
w1 = _mm_srli_epi16(w1, 2);
w0 = _mm_packus_epi16(w0, w1); // pack
_mm_storeu_si128((__m128i *)&dst[i / 2], w0);
}
}
int main()
{
const int n = 1024;
uint8_t src1[n];
uint8_t src2[n];
uint8_t dest_ref[n / 2];
uint8_t dest_test[n / 2];
for (int i = 0; i < n; ++i)
{
src1[i] = rand();
src2[i] = rand();
}
for (int i = 0; i < n / 2; ++i)
{
dest_ref[i] = 0xaa;
dest_test[i] = 0x55;
}
average2Rows_ref(src1, src2, dest_ref, n);
average2Rows(src1, src2, dest_test, n);
for (int i = 0; i < n / 2; ++i)
{
if (dest_test[i] != dest_ref[i])
{
printf("%u %u %u %u: ref = %u, test = %u\n", src1[2 * i], src1[2 * i + 1], src2[2 * i], src2[2 * i + 1], dest_ref[i], dest_test[i]);
}
}
return 0;
}
Note that the output of the SIMD version exactly matches the output of the scalar reference code (no "off by one" rounding errors).

Related

Efficiently compute absolute values of std::complex<float> vector with AVX2

For some real-time DSP application I need to compute the absolute values of a complex valued vector.
The straightforward implementation would look like that
void computeAbsolute (std::complex<float>* complexSourceVec,
float* realValuedDestinationVec,
int vecLength)
{
for (int i = 0; i < vecLength; ++i)
realValuedDestinationVec[i] = std::abs (complexSourceVec[i]);
}
I want to replace this implementation with an AVX2 optimized version, based on AVX2 intrinsics. What would be the most efficient way to implement it?
Note: The source data is handed to me by an API I have no access to, so there is no chance to change the layout of the complex input vector for better efficiency.
Inspired by the answer of Dan M., I first implemented his version with some tweaks:
First I changed it to use the wider 256-bit registers, then I marked the temporary re and im arrays with __attribute__((aligned (32))) to be able to use aligned loads:
void computeAbsolute1 (const std::complex<float>* cplxIn, float* absOut, const int length)
{
for (int i = 0; i < length; i += 8)
{
float re[8] __attribute__((aligned (32))) = {cplxIn[i].real(), cplxIn[i + 1].real(), cplxIn[i + 2].real(), cplxIn[i + 3].real(), cplxIn[i + 4].real(), cplxIn[i + 5].real(), cplxIn[i + 6].real(), cplxIn[i + 7].real()};
float im[8] __attribute__((aligned (32))) = {cplxIn[i].imag(), cplxIn[i + 1].imag(), cplxIn[i + 2].imag(), cplxIn[i + 3].imag(), cplxIn[i + 4].imag(), cplxIn[i + 5].imag(), cplxIn[i + 6].imag(), cplxIn[i + 7].imag()};
__m256 x4 = _mm256_load_ps (re);
__m256 y4 = _mm256_load_ps (im);
__m256 b4 = _mm256_sqrt_ps (_mm256_add_ps (_mm256_mul_ps (x4,x4), _mm256_mul_ps (y4,y4)));
_mm256_storeu_ps (absOut + i, b4);
}
}
However, manually shuffling the values this way seemed like a task that could be sped up somehow. Now this is the solution I came up with, which runs 2 - 3 times faster in a quick test compiled by clang with full optimization:
#include <complex>
#include <immintrin.h>
void computeAbsolute2 (const std::complex<float>* __restrict cplxIn, float* __restrict absOut, const int length)
{
for (int i = 0; i < length; i += 8)
{
// load 8 complex values (--> 16 floats overall) into two SIMD registers
__m256 inLo = _mm256_loadu_ps (reinterpret_cast<const float*> (cplxIn + i ));
__m256 inHi = _mm256_loadu_ps (reinterpret_cast<const float*> (cplxIn + i + 4));
// separates the real and imaginary parts, however the values are in the wrong order
__m256 re = _mm256_shuffle_ps (inLo, inHi, _MM_SHUFFLE (2, 0, 2, 0));
__m256 im = _mm256_shuffle_ps (inLo, inHi, _MM_SHUFFLE (3, 1, 3, 1));
// do the heavy work on the unordered vectors
__m256 abs = _mm256_sqrt_ps (_mm256_add_ps (_mm256_mul_ps (re, re), _mm256_mul_ps (im, im)));
// reorder values prior to storing
__m256d ordered = _mm256_permute4x64_pd (_mm256_castps_pd(abs), _MM_SHUFFLE(3,1,2,0));
_mm256_storeu_ps (absOut + i, _mm256_castpd_ps(ordered));
}
}
I think I'll go with that implementation if no one comes up with a faster solution
This compiles efficiently with gcc and clang (on the Godbolt compiler explorer).
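As a side note (my addition, not part of the original answer): on an FMA-capable target, the multiply/add pair inside that loop can be contracted into one instruction. A minimal sketch of just that per-iteration computation, assuming -mfma (or equivalent) is enabled:
#include <immintrin.h>
// Hypothetical FMA variant of the abs computation: sqrt(re*re + im*im).
static inline __m256 abs8_fma(__m256 re, __m256 im)
{
    return _mm256_sqrt_ps(_mm256_fmadd_ps(re, re, _mm256_mul_ps(im, im)));
}
Compilers will often do this contraction themselves when FMA is enabled, so writing it explicitly mainly documents the intent.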
It's really hard (if possible at all) to write a "highly optimized AVX2" version of complex abs, since the way complex numbers are defined in the standard prevents (specifically due to all the inf/nan corner cases) a lot of optimization.
However, if you don't care about the correctness you can just use -ffast-math and some compilers would optimize the code for you. See gcc output: https://godbolt.org/z/QbZlBI
You can also take this output and create your own abs function with inline assembly.
But yes, as was already mentioned, if you really need performance, you probably want to swap std::complex for something else.
I was able to get a decent output for your specific case with all the required shuffles by manually filling small re and im arrays. See: https://godbolt.org/z/sWAAXo
This could be trivially extended for ymm registers.
Anyway, here is the ultimate solution adapted from this SO answer which uses intrinsics in combination with clever compiler optimizations:
#include <complex>
#include <cassert>
#include <immintrin.h>
static inline void cabs_soa4(const float *re, const float *im, float *b) {
__m128 x4 = _mm_loadu_ps(re);
__m128 y4 = _mm_loadu_ps(im);
__m128 b4 = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x4,x4), _mm_mul_ps(y4,y4)));
_mm_storeu_ps(b, b4);
}
void computeAbsolute (const std::complex<float>* src,
float* realValuedDestinationVec,
int vecLength)
{
for (int i = 0; i < vecLength; i += 4) {
float re[4] = {src[i].real(), src[i + 1].real(), src[i + 2].real(), src[i + 3].real()};
float im[4] = {src[i].imag(), src[i + 1].imag(), src[i + 2].imag(), src[i + 3].imag()};
cabs_soa4(re, im, realValuedDestinationVec);
}
}
which compiles to this simple loop:
_Z15computeAbsolutePKSt7complexIfEPfi:
test edx, edx
jle .L5
lea eax, [rdx-1]
shr eax, 2
sal rax, 5
lea rax, [rdi+32+rax]
.L3:
vmovups xmm0, XMMWORD PTR [rdi]
vmovups xmm2, XMMWORD PTR [rdi+16]
add rdi, 32
vshufps xmm1, xmm0, xmm2, 136
vmulps xmm1, xmm1, xmm1
vshufps xmm0, xmm0, xmm2, 221
vfmadd132ps xmm0, xmm1, xmm0
vsqrtps xmm0, xmm0
vmovups XMMWORD PTR [rsi], xmm0
cmp rax, rdi
jne .L3
.L5:
ret
https://godbolt.org/z/Yu64Wg

best way to shuffle across AVX lanes?

There are questions with similar titles, but my question relates to one very specific use case not covered elsewhere.
I have 4 __m128d registers (x0, x1, x2, x3) and I want to recombine their content into 5 __m256d registers (y0, y1, y2, y3, y4) as follows, in preparation of other calculations:
on entry:
x0 contains {a0, a1}
x1 contains {a2, a3}
x2 contains {a4, a5}
x3 contains {a6, a7}
on exit:
y0 contains {a0, a1, a2, a3}
y1 contains {a1, a2, a3, a4}
y2 contains {a2, a3, a4, a5}
y3 contains {a3, a4, a5, a6}
y4 contains {a4, a5, a6, a7}
My implementation here below is quite slow. Is there a better way?
y0 = _mm256_set_m128d(x1, x0);
__m128d lo = _mm_shuffle_pd(x0, x1, 1);
__m128d hi = _mm_shuffle_pd(x1, x2, 1);
y1 = _mm256_set_m128d(hi, lo);
y2 = _mm256_set_m128d(x2, x1);
lo = hi;
hi = _mm_shuffle_pd(x2, x3, 1);
y3 = _mm256_set_m128d(hi, lo);
y4 = _mm256_set_m128d(x3, x2);
With inputs in registers, you can do it in 5 shuffle instructions:
3x vinsertf128 to create y0, y2, and y4 by concatenating 2 xmm registers each.
2x vshufpd (in-lane shuffles) between those results to create y1 and y3.
Notice that the low lanes of y0 and y2 contain a1 and a2, the elements needed for the low lane of y1. And the same shuffle also works for the high lane.
#include <immintrin.h>
void merge(__m128d x0, __m128d x1, __m128d x2, __m128d x3,
__m256d *__restrict y0, __m256d *__restrict y1,
__m256d *__restrict y2, __m256d *__restrict y3, __m256d *__restrict y4)
{
*y0 = _mm256_set_m128d(x1, x0);
*y2 = _mm256_set_m128d(x2, x1);
*y4 = _mm256_set_m128d(x3, x2);
// take the high element from the first vector, low element from the 2nd.
*y1 = _mm256_shuffle_pd(*y0, *y2, 0b0101);
*y3 = _mm256_shuffle_pd(*y2, *y4, 0b0101);
}
Compiles pretty nicely (with gcc and clang -O3 -march=haswell on Godbolt) to:
merge(double __vector(2), double __vector(2), double __vector(2), double __vector(2), double __vector(4)*, double __vector(4)*, double __vector(4)*, double __vector(4)*, double __vector(4)*):
vinsertf128 ymm0, ymm0, xmm1, 0x1
vinsertf128 ymm3, ymm2, xmm3, 0x1
vinsertf128 ymm1, ymm1, xmm2, 0x1
# vmovapd YMMWORD PTR [rdi], ymm0
vshufpd ymm0, ymm0, ymm1, 5
# vmovapd YMMWORD PTR [rdx], ymm1
vshufpd ymm1, ymm1, ymm3, 5
# vmovapd YMMWORD PTR [r8], ymm3
# vmovapd YMMWORD PTR [rsi], ymm0
# vmovapd YMMWORD PTR [rcx], ymm1
# vzeroupper
# ret
I commented out the stores and stuff that would go away on inlining, so we really do just have the 5 shuffle instructions, vs. 9 shuffle instructions for the code in your question. (Also included in the Godbolt compiler explorer link).
This is very good on AMD, where vinsertf128 is super-cheap (because 256-bit registers are implemented as 2x 128-bit halves, so it's just a 128-bit copy without needing a special shuffle port). 256-bit lane-crossing shuffles are slow on AMD, but in-lane 256-bit shuffles like vshufpd are just 2 uops.
On Intel it's pretty good, but mainstream Intel CPUs with AVX only have 1 per clock shuffle throughput for 256-bit or FP shuffles. (Sandybridge and earlier have more throughput for integer 128-bit shuffles, but AVX2 CPUs dropped the extra shuffle units, and they didn't help anyway for this.)
So Intel CPUs can't exploit the instruction-level parallelism at all, but it's only 5 uops total which is nice. That's the minimum possible, because you need 5 results.
But especially if the surrounding code also bottlenecks on shuffles, it's worth considering a store/reload strategy with just 4 stores and 5 overlapping vector loads. Or maybe 2x vinsertf128 to construct y0 and y4, then 2x 256-bit stores + 3 overlapping reloads. That could let out-of-order exec get started on dependent instructions using just y0 or y4 while the store-forwarding stall resolved for y1..3.
Especially if you don't care much about Intel first-gen Sandybridge where unaligned 256-bit vector loads are less efficient. (Note that you'd want to compile with gcc -mtune=haswell to turn off the -mavx256-split-unaligned-load default / sandybridge tuning, if you're using GCC. Regardless of the compiler, -march=native is a good idea if making binaries to run on the machine where you compile it, to take full advantage of instruction sets and set tuning options.)
But if total uop throughput from the front-end is more where the bottleneck lies, then the shuffle implementation is best.
(See https://agner.org/optimize/ and other performance links in the x86 tag wiki for more about performance tuning. Also What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?, but really Agner Fog's guide is a more in-depth guide that explains what throughput vs. latency is actually about.)
I do not even need to save, as data is also already available in contiguous memory.
Then simply loading it with 5 overlapping loads is almost certainly the most efficient thing you could do.
Haswell can do 2 loads per clock from L1d, or less when any cross a cache-line boundary. So if you can align your block by 64, it's perfectly efficient with no cache-line-splits at all. Cache misses are slow, but reloading hot data from L1d cache is very cheap, and modern CPUs with AVX support generally have efficient unaligned-load support.
(Like I said earlier, if using gcc make sure you compile with -march=haswell or -mtune=haswell, not just -mavx, to avoid gcc's -mavx256-split-unaligned-load.)
4 loads + 1 vshufpd (y0, y2) might be a good way to balance load port pressure with ALU pressure, depending on bottlenecks in the surrounding code. Or even 3 loads + 2 shuffles, if the surrounding code is low on shuffle port pressure.
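A minimal sketch of the pure overlapping-loads version (my own illustration, assuming the eight doubles a0..a7 are contiguous in memory starting at a):
#include <immintrin.h>
// Hypothetical helper: each output is just an unaligned 256-bit load at a 1-element offset.
void load_sliding(const double *a, __m256d *y0, __m256d *y1, __m256d *y2,
                  __m256d *y3, __m256d *y4)
{
    *y0 = _mm256_loadu_pd(a + 0);   // {a0, a1, a2, a3}
    *y1 = _mm256_loadu_pd(a + 1);   // {a1, a2, a3, a4}
    *y2 = _mm256_loadu_pd(a + 2);   // {a2, a3, a4, a5}
    *y3 = _mm256_loadu_pd(a + 3);   // {a3, a4, a5, a6}
    *y4 = _mm256_loadu_pd(a + 4);   // {a4, a5, a6, a7}
}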
they are in registers from previous calculations which required them to be loaded.
If that previous calculation still has the source data in registers, you could have done 256-bit loads in the first place and just used their 128-bit low halves for the earlier calc. (An XMM register is the low 128 of the corresponding YMM register, and reading them doesn't disturb the upper lanes, so _mm256_castpd256_pd128 compiles to zero asm instructions.)
Do 256-bit loads for y0,y2, and y4, and use their low halves as x0, x1, and x2. (Construct y1 and y3 later with unaligned loads or shuffles).
Only x3 isn't already the low 128 bits of a 256-bit vector you also want.
Ideally a compiler would already notice this optimization when you do a _mm_loadu_pd and a _mm256_loadu_pd from the same address, but probably you need to hand-hold it by doing
__m256d y0 = _mm256_loadu_pd(base);
__m128d x0 = _mm256_castpd256_pd128(y0);
and so on, and either an extract ALU intrinsic (_mm256_extractf128_pd) or a 128-bit load for x3, depending on the surrounding code. If it's only needed once, letting it fold into a memory operand for whatever instruction uses it might be best.
Potential downside: slightly higher latency before the 128-bit calculation can start, or several cycles if the 256-bit loads were cache-line crossing where 128-bit loads weren't. But if your block of data is aligned by 64 bytes, this won't happen.
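A hedged sketch of that load-wide-then-reuse idea (the function name and signature are mine, just to show the intrinsics involved):
#include <immintrin.h>
// Hypothetical helper: do the 256-bit loads for y0/y2/y4 up front and reuse their
// low halves as x0/x1/x2 for the earlier 128-bit work; only x3 needs its own load.
void load_and_split(const double *base,
                    __m256d *y0, __m256d *y2, __m256d *y4,
                    __m128d *x0, __m128d *x1, __m128d *x2, __m128d *x3)
{
    *y0 = _mm256_loadu_pd(base + 0);      // {a0, a1, a2, a3}
    *y2 = _mm256_loadu_pd(base + 2);      // {a2, a3, a4, a5}
    *y4 = _mm256_loadu_pd(base + 4);      // {a4, a5, a6, a7}
    *x0 = _mm256_castpd256_pd128(*y0);    // {a0, a1}: free, no instruction needed
    *x1 = _mm256_castpd256_pd128(*y2);    // {a2, a3}
    *x2 = _mm256_castpd256_pd128(*y4);    // {a4, a5}
    *x3 = _mm_loadu_pd(base + 6);         // {a6, a7}: 128-bit load (or extract y4's high half)
}
(y1 and y3 would then be constructed later with unaligned loads or the vshufpd trick from the first code block.)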

Fastest way to perform AVX inner product operations with mixed (float, double) input vectors

I need to build a single-precision floating-point inner product routine for mixed single/double-precision floating-point vectors, exploiting the AVX instruction set for SIMD registers with 256 bits.
Problem: one input vector is float (x), while the other is double (yD).
Hence, before computing the actual inner product, I need to convert my input yD vector data from double to float.
Using the SSE2 instruction set, I was able to implement a very fast code doing what I needed, with performance very close to the case where both vectors x and y were float:
void vector_operation(const size_t i)
{
__m128 X = _mm_load_ps(x + i);
__m128 Y = _mm_movelh_ps(_mm_cvtpd_ps(_mm_load_pd(yD + i + 0)), _mm_cvtpd_ps(_mm_load_pd(yD + i + 2)));
//inner-products accumulation
res = _mm_add_ps(res, _mm_mul_ps(X, Y));
}
Now, with the hope of a further speed-up, I implemented a corresponding version with the AVX instruction set:
inline void vector_operation(const size_t i)
{
__m256 X = _mm256_load_ps(x + i);
__m128 yD1 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 0));
__m128 yD2 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 2));
__m128 yD3 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 4));
__m128 yD4 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 6));
__m128 Ylow = _mm_movelh_ps(yD1, yD2);
__m128 Yhigh = _mm_movelh_ps(yD3, yD4);
//Pack __m128 data inside __m256
__m256 Y = _mm256_permute2f128_ps(_mm256_castps128_ps256(Ylow), _mm256_castps128_ps256(Yhigh), 0x20);
//inner-products accumulation
res = _mm256_add_ps(res, _mm256_mul_ps(X, Y));
}
I also tested other AVX implementations using, for example, casting and insertion operations instead of permuting data. Performance was similarly poor compared to the case where both x and y vectors were float.
The problem with the AVX code is that no matter how I implemented it, its performance is far inferior to that achieved by using only float x and y vectors (i.e. no double-to-float conversion is needed).
The conversion from double to float for the yD vector seems pretty fast, while a lot of time is lost in the line where the data is inserted into the __m256 Y register.
Do you know if this is a well-known issue with AVX?
Do you have a solution that could preserve good performances?
Thanks in advance!
I rewrote your function and took better advantage of what AVX has to offer. I also used fused multiply-add at the end; if you can't use FMA, just replace that line with addition and multiplication. I only now see that I wrote an implementation that uses unaligned loads and yours uses aligned loads, but I'm not gonna lose any sleep over it. :)
__m256 foo(float*x, double* yD, const size_t i, __m256 res_prev)
{
__m256 X = _mm256_loadu_ps(x + i);
__m128 yD21 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 0));
__m128 yD43 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 4));
__m256 Y = _mm256_set_m128(yD43, yD21);
return _mm256_fmadd_ps(X, Y, res_prev);
}
I did a quick benchmark and compared running times of your and my implementation. I tried two different benchmark approaches with several repetitions and every time my code was around 15% faster. I used the MSVC 14.1 compiler and compiled the program with /O2 and /arch:AVX2 flags.
EDIT: this is the disassembly of the function:
vcvtpd2ps xmm3,ymmword ptr [rdx+r8*8+20h]
vcvtpd2ps xmm2,ymmword ptr [rdx+r8*8]
vmovups ymm0,ymmword ptr [rcx+r8*4]
vinsertf128 ymm3,ymm2,xmm3,1
vfmadd213ps ymm0,ymm3,ymmword ptr [r9]
EDIT 2: this is the disassembly of your AVX implementation of the same algorithm:
vcvtpd2ps xmm0,xmmword ptr [rdx+r8*8+30h]
vcvtpd2ps xmm1,xmmword ptr [rdx+r8*8+20h]
vmovlhps xmm3,xmm1,xmm0
vcvtpd2ps xmm0,xmmword ptr [rdx+r8*8+10h]
vcvtpd2ps xmm1,xmmword ptr [rdx+r8*8]
vmovlhps xmm2,xmm1,xmm0
vperm2f128 ymm3,ymm2,ymm3,20h
vmulps ymm0,ymm3,ymmword ptr [rcx+r8*4]
vaddps ymm0,ymm0,ymmword ptr [r9]
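For context (my addition, not from the original answer), here is a hedged sketch of how the foo() function above might be driven over a whole vector, with a final horizontal reduction of the 8 partial sums. A single accumulator serializes the FMA latency chain, so in practice you would want several accumulators; n is assumed to be a multiple of 8:
#include <immintrin.h>
#include <cstddef>
__m256 foo(float* x, double* yD, const size_t i, __m256 res_prev);   // defined above
// Hypothetical driver: accumulate, then reduce the 8 lanes to one float.
float dot_mixed(float *x, double *yD, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8)
        acc = foo(x, yD, i, acc);
    __m128 sum4 = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    sum4 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));        // lanes {0+2, 1+3, ...}
    sum4 = _mm_add_ss(sum4, _mm_shuffle_ps(sum4, sum4, 1));    // lane 0 = total
    return _mm_cvtss_f32(sum4);
}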

Fast weighted mean & variance of 10 bins

I would like to speed up a part of my code, but I don't think there is a better way to do the following calculation:
float invSum = 1.0f / float(sum);
for (int i = 0; i < numBins; ++i)
{
histVec[i] *= invSum;
}
for (int i = 0; i < numBins; ++i)
{
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
fmean += f * midPoint;
}
for (int i = 0; i < numBins; ++i)
{
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
float diff = midPoint - fmean;
var += f * hwk::sqr(diff);
}
numBins in the for-loops is typically 10, but this bit of code is called very often (80 frames per second, at least 8 times per frame).
I tried to use some SSE methods but it only slightly speeds up this code. I think I could avoid calculating midPoint twice but I am not sure how. Is there a better way to compute fmean and var?
Here is the SSE code:
// make hist contain a multiple of 4 valid values
for (int i = numBins; i < ((numBins + 3) & ~3); i++)
hist[i] = 0;
// find sum of bins in inHist
__m128i iSum4 = _mm_set1_epi32(0);
for (int i = 0; i < numBins; i += 4)
{
__m128i a = *((__m128i *) &inHist[i]);
iSum4 = _mm_add_epi32(iSum4, a);
}
int iSum = iSum4.m128i_i32[0] + iSum4.m128i_i32[1] + iSum4.m128i_i32[2] + iSum4.m128i_i32[3];
//float stdevB, meanB;
if (iSum == 0)
{
stdev = 0.0;
mean = 0.0;
}
else
{
// Set histVec to normalised values in inHist
__m128 invSum = _mm_set1_ps(1.0f / float(iSum));
for (int i = 0; i < numBins; i += 4)
{
__m128i a = *((__m128i *) &inHist[i]);
__m128 b = _mm_cvtepi32_ps(a);
__m128 c = _mm_mul_ps(b, invSum);
_mm_store_ps(&histVec[i], c);
}
float binSize = 256.0f / (float)numBins;
float halfBinSize = binSize * 0.5f;
float binOffset = halfBinSize;
__m128 binSizeMask = _mm_set1_ps(binSize);
__m128 binOffsetMask = _mm_set1_ps(binOffset);
__m128 fmean4 = _mm_set1_ps(0.0f);
for (int i = 0; i < numBins; i += 4)
{
__m128i idx4 = _mm_set_epi32(i + 3, i + 2, i + 1, i);
__m128 idx_m128 = _mm_cvtepi32_ps(idx4);
__m128 histVec4 = _mm_load_ps(&histVec[i]);
__m128 midPoint4 = _mm_add_ps(_mm_mul_ps(idx_m128, binSizeMask), binOffsetMask);
fmean4 = _mm_add_ps(fmean4, _mm_mul_ps(histVec4, midPoint4));
}
fmean4 = _mm_hadd_ps(fmean4, fmean4); // 01 23 01 23
fmean4 = _mm_hadd_ps(fmean4, fmean4); // 0123 0123 0123 0123
float fmean = fmean4.m128_f32[0];
//fmean4 = _mm_set1_ps(fmean);
__m128 var4 = _mm_set1_ps(0.0f);
for (int i = 0; i < numBins; i+=4)
{
__m128i idx4 = _mm_set_epi32(i + 3, i + 2, i + 1, i);
__m128 idx_m128 = _mm_cvtepi32_ps(idx4);
__m128 histVec4 = _mm_load_ps(&histVec[i]);
__m128 midPoint4 = _mm_add_ps(_mm_mul_ps(idx_m128, binSizeMask), binOffsetMask);
__m128 diff4 = _mm_sub_ps(midPoint4, fmean4);
var4 = _mm_add_ps(var4, _mm_mul_ps(histVec4, _mm_mul_ps(diff4, diff4)));
}
var4 = _mm_hadd_ps(var4, var4); // 01 23 01 23
var4 = _mm_hadd_ps(var4, var4); // 0123 0123 0123 0123
float var = var4.m128_f32[0];
stdev = sqrt(var);
mean = fmean;
}
I might be doing something wrong since I don't see as much improvement as I was expecting.
Is there something in the SSE code that might possibly slow down the process?
(editor's note: the SSE part of this question was originally asked as https://stackoverflow.com/questions/31837817/foor-loop-optimisation-sse-comparison, which was closed as a duplicate.)
I only just realized that your data array starts out as an array of int, since you didn't have declarations in your code. I can see in the SSE version that you start with integers, and only store a float version of it later.
Keeping everything integer will let us do the loop-counter-vector with a simple ivec = _mm_add_epi32(ivec, _mm_set1_epi32(4)); Aki Suihkonen's answer has some transformations that should let it optimize a lot better. Especially, the auto-vectorizer should be able to do more even without -ffast-math. In fact, it does quite well. You could do better with intrinsics, esp. saving some vector 32bit multiplies and shortening the dependency chain.
My old answer, based on just trying to optimize your code as written, assuming FP input:
You may be able to combine all 3 loops into one, using the algorithm Jason linked to. It might not be profitable, though, since it involves a division. For small numbers of bins, probably just loop multiple times.
Start by reading the guides at http://agner.org/optimize/. A couple of the techniques in his Optimising Assembly guide will speed up your SSE attempt (which I edited into this question for you).
combine your loops where possible, so you do more with the data for each time it's loaded / stored.
multiple accumulators to hide the latency of loop-carried dependency chains. (Even FP add takes 3 cycles on recent Intel CPUs.) This won't apply for really short arrays like your case.
instead of int->float conversion on every iteration, use a float loop counter as well as the int loop counter. (add a vector of _mm_set1_ps(4.0f) every iteration.) _mm_set... with variable args is something to avoid in loops, when possible. It takes several instructions (esp. when each arg to setr has to be calculated separately.)
gcc -O3 manages to auto-vectorize the first loop, but not the others. With -O3 -ffast-math, it auto-vectorizes more. -ffast-math allows it to do FP operations in a different order than the code specifies. e.g. adding up the array in 4 elements of a vector, and only combining the 4 accumulators at the end.
Telling gcc that the input pointer is aligned by 16 lets gcc auto-vectorize with a lot less overhead (no scalar loops for unaligned portions).
// return mean
float fpstats(float histVec[], float sum, float binSize, float binOffset, long numBins, float *variance_p)
{
numBins += 3;
numBins &= ~3; // round up to multiple of 4. This is just a quick hack to make the code fast and simple.
histVec = (float*)__builtin_assume_aligned(histVec, 16);
float invSum = 1.0f / float(sum);
float var = 0, fmean = 0;
for (int i = 0; i < numBins; ++i)
{
histVec[i] *= invSum;
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
fmean += f * midPoint;
}
for (int i = 0; i < numBins; ++i)
{
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
float diff = midPoint - fmean;
// var += f * hwk::sqr(diff);
var += f * (diff * diff);
}
*variance_p = var;
return fmean;
}
gcc generates some weird code for the 2nd loop.
# broadcasting fmean after the 1st loop
subss %xmm0, %xmm2 # fmean, D.2466
shufps $0, %xmm2, %xmm2 # vect_cst_.16
.L5: ## top of 2nd loop
movdqa %xmm3, %xmm5 # vect_vec_iv_.8, vect_vec_iv_.8
cvtdq2ps %xmm3, %xmm3 # vect_vec_iv_.8, vect__32.9
movq %rcx, %rsi # D.2465, D.2467
addq $1, %rcx #, D.2465
mulps %xmm1, %xmm3 # vect_cst_.11, vect__33.10
salq $4, %rsi #, D.2467
paddd %xmm7, %xmm5 # vect_cst_.7, vect_vec_iv_.8
addps %xmm2, %xmm3 # vect_cst_.16, vect_diff_39.15
mulps %xmm3, %xmm3 # vect_diff_39.15, vect_powmult_53.17
mulps (%rdi,%rsi), %xmm3 # MEM[base: histVec_10, index: _107, offset: 0B], vect__41.18
addps %xmm3, %xmm4 # vect__41.18, vect_var_42.19
cmpq %rcx, %rax # D.2465, bnd.26
ja .L8 #, ### <--- This is insane.
haddps %xmm4, %xmm4 # vect_var_42.19, tmp160
haddps %xmm4, %xmm4 # tmp160, vect_var_42.21
.L2:
movss %xmm4, (%rdx) # var, *variance_p_44(D)
ret
.p2align 4,,10
.p2align 3
.L8:
movdqa %xmm5, %xmm3 # vect_vec_iv_.8, vect_vec_iv_.8
jmp .L5 #
So instead of just jumping back to the top every iteration, gcc decides to jump ahead to copy a register, and then unconditionally jmp back to the top of the loop. The uop loop buffer may remove the front-end overhead of this silliness, but gcc should have structured the loop so it didn't copy xmm5->xmm3 and then xmm3->xmm5 every iteration, because that's silly. It should have the conditional jump just go to the top of the loop.
Also note the technique gcc used to get a float version of the loop counter: start with an integer vector of 1 2 3 4, and add set1_epi32(4). Use that as an input for packed int->float cvtdq2ps. On Intel HW, that instruction runs on the FP-add port, and has 3 cycle latency, same as packed FP add. gcc probably would have done better to just add a vector of set1_ps(4.0), even though this creates a 3-cycle loop-carried dependency chain, instead of a 1-cycle vector int add, with a 3-cycle convert forking off on every iteration.
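To make that concrete, here is a hedged intrinsics sketch of the mean loop with a float midpoint vector stepped by 4*binSize per iteration (my own code, not the asker's; it assumes numBins is a multiple of 4 and histVec is 16-byte aligned):
#include <immintrin.h>
// Hypothetical sketch: avoid per-iteration int->float conversion by keeping the
// midpoints in a float vector and adding a constant step each time.
float weightedMean(const float *histVec, int numBins, float binSize, float binOffset)
{
    __m128 midPoint4 = _mm_add_ps(_mm_mul_ps(_mm_setr_ps(0.f, 1.f, 2.f, 3.f),
                                             _mm_set1_ps(binSize)),
                                  _mm_set1_ps(binOffset));
    const __m128 step4 = _mm_set1_ps(4.0f * binSize);
    __m128 fmean4 = _mm_setzero_ps();
    for (int i = 0; i < numBins; i += 4)
    {
        __m128 h4 = _mm_load_ps(&histVec[i]);
        fmean4 = _mm_add_ps(fmean4, _mm_mul_ps(h4, midPoint4));
        midPoint4 = _mm_add_ps(midPoint4, step4);   // 3-cycle loop-carried FP add
    }
    // horizontal sum of the 4 accumulator elements
    __m128 shuf = _mm_shuffle_ps(fmean4, fmean4, _MM_SHUFFLE(2, 3, 0, 1));
    fmean4 = _mm_add_ps(fmean4, shuf);
    shuf = _mm_shuffle_ps(fmean4, fmean4, _MM_SHUFFLE(1, 0, 3, 2));
    fmean4 = _mm_add_ss(fmean4, shuf);
    return _mm_cvtss_f32(fmean4);
}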
small iteration count
You say this will often be used on exactly 10 bins? A specialized version for just 10 bins could give a big speedup, by avoiding all the loop overhead and keeping everything in registers.
With that small a problem size, you can have the FP weights just sitting there in memory, instead of re-computing them with integer->float conversion every time.
Also, 10 bins is going to mean a lot of horizontal operations relative to the amount of vertical operations, since you only have 2 and a half vectors worth of data.
If exactly 10 is really common, specialize a version for that. If under-16 is common, specialize a version for that. (They can and should share the const float weights[] = { 0.0f, 1.0f, 2.0f, ...}; array.)
You probably will want to use intrinsics for the specialized small-problem versions, rather than auto-vectorization.
Having zero-padding after the end of the useful data in your array might still be a good idea in your specialized version(s). However, you can load the last 2 floats and clear the upper 64b of a vector register with a movq instruction. (__m128i _mm_cvtsi64_si128 (__int64 a)). Cast this to __m128 and you're good to go.
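A tiny sketch of that last-two-floats load (my own illustration; p would point at histVec[8] in the 10-bin case):
#include <immintrin.h>
// Hypothetical helper: movq-style load of 2 floats into the low 64 bits, upper 64 bits zeroed.
static inline __m128 load_last2(const float *p)
{
    return _mm_castsi128_ps(_mm_loadl_epi64((const __m128i *)p));   // { p[0], p[1], 0, 0 }
}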
As peterchen mentioned, these operations are very trivial for current desktop processors. The function is linear, i.e. O(n). What's the typical size of numBins? If it's rather large (say, over 1000000), parallelization will help. This could be simple using a library like OpenMP. If numBins starts approaching MAXINT, you may consider GPGPU as an option (CUDA/OpenCL).
All that considered, you should try profiling your application. Chances are good that, if there is a performance constraint, it's not in this method. Michael Abrash's definition of "high-performance code" has helped me greatly in determining if/when to optimize:
Before we can create high-performance code, we must understand what high performance is. The objective (not always attained) in creating high-performance software is to make the software able to carry out its appointed tasks so rapidly that it responds instantaneously, as far as the user is concerned. In other words, high-performance code should ideally run so fast that any further improvement in the code would be pointless. Notice that the above definition most emphatically does not say anything about making the software as fast as possible.
Reference:
The Graphics Programming Black Book
The overall function to be calculated is
std = sqrt(SUM_i { hist[i]/sum * (midpoint_i - mean_midpoint)^2 })
Using the identity
Var (aX + b) = Var (X) * a^2
one can reduce the complexity of the overall operation considerably
1) the midpoint of a bin doesn't need the offset b
2) no need to prescale the bin midpoints by the bin width
3) no need to normalize histogram entries with the reciprocal of the sum
The optimized calculation goes as follows
float calcVariance(int histBin[], float binWidth)
{
int i;
int sum = 0;
int mid = 0;
int var = 0;
for (i = 0; i < 10; i++)
{
sum += histBin[i];
mid += i*histBin[i];
}
float inv_sum = 1.0f / (float)sum;
float mid_sum = mid * inv_sum;
for (i = 0; i < 10; i++)
{
int diff = i * sum - mid; // because mid is prescaled by sum
var += histBin[i] * diff * diff;
}
return sqrt(float(var) / (float)(sum * sum * sum)) * binWidth;
}
Minor changes are required if it's float histBin[];
Also I second padding histBin size to a multiple of 4 for better vectorization.
EDIT
Another way to calculate this with floats in the inner loop:
float inv_sum = 1.0f / (float)sum;
float mid_sum = mid * inv_sum;
float var = 0.0f;
for (i = 0; i < 10; i++)
{
float diff = (float)i - mid_sum;
var += (float)histBin[i] * diff * diff;
}
return sqrt(var * inv_sum) * binWidth;
Perform the scaling on the global results only and keep integers as long as possible.
Group all computation in a single loop, using Σ(X-m)²/N = ΣX²/N - m².
// Accumulate the histogram
int mean= 0, var= 0;
for (int i = 0; i < numBins; ++i)
{
mean+= i * histVec[i];
var+= i * i * histVec[i];
}
// Compute the reduced mean and variance
float fmean= (float(mean) / sum);
float fvar= float(var) / sum - fmean * fmean;
// Rescale
fmean= fmean * binSize + binOffset;
fvar= fvar * binSize * binSize;
The required integer type will depend on the maximum value in the bins. The SSE optimization of the loop can exploit the _mm_madd_epi16 instruction.
If the number of bins is as small as 10, consider fully unrolling the loop. Precompute the i and i² vectors in a table.
In the lucky case that the data fits in 16 bits and the sums in 32 bits, the accumulation is done with something like
static short I[16]= { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0 };
static short I2[16]= { 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 0, 0, 0, 0, 0, 0 };
// First group
__m128i i= _mm_load_si128((__m128i*)&I[0]);
__m128i i2= _mm_load_si128((__m128i*)&I2[0]);
__m128i h= _mm_load_si128((__m128i*)&inHist[0]);
__m128i mean= _mm_madd_epi16(i, h);
__m128i var= _mm_madd_epi16(i2, h);
// Second group
i= _mm_load_si128((__m128i*)&I[8]);
i2= _mm_load_si128((__m128i*)&I2[8]);
h= _mm_load_si128((__m128i*)&inHist[8]);
mean= _mm_add_epi32(mean, _mm_madd_epi16(i, h));
var= _mm_add_epi32(var, _mm_madd_epi16(i2, h));
CAUTION: unchecked

FPU,SSE single floating point. Which is faster? sub or mul

Just tell me which one is faster: sub or mul?
My target platform is X86; FPU and SSE.
example:
'LerpColorSolution1' uses multiply.
'LerpColorSolution2' uses subtract.
which is faster ?
void LerpColorSolution1(const float* a, const float* b, float alpha, float* out)
{
out[0] = a[0] + (b[0] - a[0]) * alpha;
out[1] = a[1] + (b[1] - a[1]) * alpha;
out[2] = a[2] + (b[2] - a[2]) * alpha;
out[3] = a[3] + (b[3] - a[3]) * alpha;
}
void LerpColorSolution2(const float* a, const float* b, float alpha, float* out)
{
float f = 1.0f - alpha;
out[0] = a[0]*f + b[0] * alpha;
out[1] = a[1]*f + b[1] * alpha;
out[2] = a[2]*f + b[2] * alpha;
out[3] = a[3]*f + b[3] * alpha;
}
Thanks to all ;)
Just for fun: assuming that you (or your compiler) vectorize both of your approaches (because of course you would if you're chasing performance), and you're targeting a recent x86 processor...
A direct translation of "LerpColorSolution1" into AVX instructions is as follows:
VSUBPS dst, b, a // b[] - a[]
VSHUFPS alpha, alpha, alpha, 0 // splat alpha
VMULPS dst, alpha, dst // alpha*(b[] - a[])
VADDPS dst, a, dst // a[] + alpha*(b[] - a[])
The long latency chain for this sequence is sub-mul-add, which has a total latency of 3+5+3 = 11 cycles on most recent Intel processors. Throughput (assuming that you do nothing but these operations) is limited by port 1 utilization, with a theoretical peak of one LERP every two cycles. (I'm intentionally overlooking load/store traffic and focusing solely on the mathematical operation being performed here).
If we look at your "LerpColorSolution2":
VSHUFPS alpha, alpha, alpha, 0 // splat alpha
VSUBPS dst, one, alpha // 1.0f - alpha, assumes "1.0f" kept in reg.
VMULPS tmp, alpha, b // alpha*b[]
VMULPS dst, dst, a // (1-alpha)*a[]
VADDPS dst, dst, tmp // (1-alpha)*a[] + alpha*b[]
Now the long latency chain is shuffle-sub-mul-add, which has a total latency of 1+3+5+3 = 12 cycles; Throughput is now limited by ports 0 and 1, but still has a peak of one LERP every two cycles. You need to retire one additional µop for each LERP operation, which may make throughput slightly slower depending on the surrounding context.
So your first solution is slightly better (which isn't surprising -- even without this detailed analysis, the rough guideline "fewer operations is better" is a good rule of thumb).
Haswell tilts things significantly in favor of the first solution; using FMA it requires only one µop on each of ports 0,1, and 5, allowing for a theoretical throughput of one LERP per cycle; while FMA also improves solution 2, it still requires four µops, including three that need to execute on port 0 or 1. This limits solution 2 to a theoretical peak of one LERP every 1.5 cycles -- 50% slower than solution 1.
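For reference (my addition, not part of the original answer), a minimal intrinsics sketch of solution 1 using FMA, matching the splat/sub/FMA sequence in the analysis above; the function name is mine and this is an untested sketch:
#include <immintrin.h>
// Hypothetical FMA version of LerpColorSolution1: out[] = a[] + (b[] - a[]) * alpha
void LerpColorFMA(const float* a, const float* b, float alpha, float* out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 valpha = _mm_set1_ps(alpha);                  // splat alpha
    __m128 diff = _mm_sub_ps(vb, va);                    // b[] - a[]
    _mm_storeu_ps(out, _mm_fmadd_ps(diff, valpha, va));  // a[] + alpha*(b[] - a[])
}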