Maximum optimization of element wise multiplication via ARM NEON assembly

Maximum optimization of element wise multiplication via ARM NEON assembly - c++

I'm optimizing an element wise multiplication of two single dimensional arrays for a dual Cortex-A9 processor. Linux is running on the board and I'm using the GCC 4.5.2 compiler.
So the following is my C++ inline assembler function. src1, src2 and dst are 16 byte aligned.
Update: Testable Code:
void Multiply(
const float* __restrict__ src1,
const float* __restrict__ src2,
float* __restrict__ dst,
const unsigned int width,
const unsigned int height)
{
int loopBound = (width * height) / 4;
asm volatile(
".loop: \n\t"
"vld1.32 {q1}, [%[src1]:128]! \n\t"
"vld1.32 {q2}, [%[src2]:128]! \n\t"
"vmul.f32 q0, q1, q2 \n\t"
"vst1.32 {q0}, [%[dst]:128]! \n\t"
"subs %[lBound], %[lBound], $1 \n\t"
"bge .loop \n\t"
:
:[dst] "r" (dst), [src1] "r" (src1), [src2] "r" (src2),
[lBound] "r" (loopBound)
:"memory", "d0", "d1", "d2", "d3", "d4", "d5
);
}
//The following function describes how to test the element wise multiplication
void Test()
{
const unsigned int width = 1024, height = 1024;
float* src1 __attribute__((aligned(16))) = new float[width * height];
float* src2 __attribute__((aligned(16))) = new float[width * height];
float* dst __attribute__((aligned(16))) = new float[width * height];
for(unsigned int i = 0; i < (width * height); i++)
{
src1[i] = (float)rand();
src2[i] = (float)rand();
}
Multiply(src1, src2, dst, width, height);
std::cout << dst[0] << std::endl;
}
The calculation of 1024*1024 values takes ~0.016 s. (Two threads - each thread calculates a half of the array). Naively interpreted, the calculation of one iteration takes 122 cycles. This seems to be a bit slow. But where is the bottleneck?
I even tried the pld command for preloading elements in the L2 cache, "unrolling" the loop by calculating up to 20 values per iteration and reordering the instructions to make shure the processor is not waiting for memory. I didn't get that much speedup (max 0.001 s faster).
Do you have any suggestions for speeding up the calculation?

I don't really know that much about the NEON. However, I think that you have data dependencies that cause performance issues. I would suggest you prime the loop with some loads and then place them between the multiply and store. I think the store is probably blocking until the multiply is done.
asm volatile(
"vld1.32 {q1}, [%[src1]:128]! \n\t"
"vld1.32 {q2}, [%[src2]:128]! \n\t"
".loop: \n\t"
"vmul.f32 q0, q1, q2 \n\t"
"vld1.32 {q1}, [%[src1]:128]! \n\t"
"vld1.32 {q2}, [%[src2]:128]! \n\t"
"vst1.32 {q0}, [%[dst]:128]! \n\t"
"subs %[lBound], %[lBound], $1 \n\t"
"bge .loop \n\t"
:
:[dst] "r" (dst), [src1] "r" (src1), [src2] "r" (src2),
[lBound] "r" (loopBound)
:"memory", "d0", "d1", "d2", "d3", "d4", "d5
);
This way you should be able to parallel the loads with the multiply. You will need to over-allocate the source arrays or change the loop index and do a final multiply and store. If the NEON ops are not affecting the condition codes, you can re-order the subs as well and place it earlier.
Edit: In fact, the Cortex A-9 Media processing Engine document recommends interleaving ARM and NEON instructions as they can execute in parallel. Also, the NEON instructions seem to set FPSCR and not the ARM CPSR so re-ordering the subs would decrease the execution time. You may also cache align the loop.

Related

C++ Optimize Memory Read Speed

I'm creating an int (32 bit) vector with 1024 * 1024 * 1024 elements like so:
std::vector<int> nums;
for (size_t i = 0; i < 1024 * 1024 * 1024; i++) {
nums.push_back(rand() % 1024);
}
which holds 4 GB of random data at that point. And then I'm simply summing up all the elements in the vector like so:
uint64_t total = 0;
for (auto cn = nums.begin(); cn < nums.end(); cn++) {
total += *cn;
}
This takes about ~0.18 seconds which means the data is processed at around 22.2 GB/s. I'm running this on an M1 with a much higher memory bandwidth of about 60GB/s. Is there a way to make the above code run faster on a single core?
EDIT:
Manual SIMD version:
int32x4_t simd_total = vmovq_n_s32(0);
for (auto cn = nums.begin(); cn < nums.end()-3; cn +=4) {
const int32_t v[4] = {cn[0], cn[1], cn[2], cn[3]}
simd_total = vaddq_s32(simd_total, vld1q_s32(v));
}
return vaddvq_s32(simd_total);
The SIMD version has the same performance as the non-manual-SIMD version.
EDIT 2:
Alright, so I changed the vector elements to uint32_t and also changed the result type to uint32_t(as suggested by #Peter Cordes):
uint32_t sum_ints_32(const std::vector<uint32_t>& nums) {
uint32_t total = 0;
for (auto cn = nums.begin(); cn < nums.end(); cn++) {
total += *cn;
}
return total;
}
This runs much faster (~45 GB/s). This is the disassembly:
0000000100002218 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE>:
100002218: a940200c ldp x12, x8, [x0]
10000221c: eb08019f cmp x12, x8
100002220: 54000102 b.cs 100002240 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x28> // b.hs, b.nlast
100002224: aa2c03e9 mvn x9, x12
100002228: 8b090109 add x9, x8, x9
10000222c: f1006d3f cmp x9, #0x1b
100002230: 540000c8 b.hi 100002248 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x30> // b.pmore
100002234: 52800000 mov w0, #0x0 // #0
100002238: aa0c03e9 mov x9, x12
10000223c: 14000016 b 100002294 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x7c>
100002240: 52800000 mov w0, #0x0 // #0
100002244: d65f03c0 ret
100002248: d342fd29 lsr x9, x9, #2
10000224c: 9100052a add x10, x9, #0x1
100002250: 927ded4b and x11, x10, #0x7ffffffffffffff8
100002254: 8b0b0989 add x9, x12, x11, lsl #2
100002258: 9100418c add x12, x12, #0x10
10000225c: 6f00e400 movi v0.2d, #0x0
100002260: aa0b03ed mov x13, x11
100002264: 6f00e401 movi v1.2d, #0x0
100002268: ad7f8d82 ldp q2, q3, [x12, #-16]
10000226c: 4ea08440 add v0.4s, v2.4s, v0.4s
100002270: 4ea18461 add v1.4s, v3.4s, v1.4s
100002274: 9100818c add x12, x12, #0x20
100002278: f10021ad subs x13, x13, #0x8
10000227c: 54ffff61 b.ne 100002268 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x50> // b.any
100002280: 4ea08420 add v0.4s, v1.4s, v0.4s
100002284: 4eb1b800 addv s0, v0.4s
100002288: 1e260000 fmov w0, s0
10000228c: eb0b015f cmp x10, x11
100002290: 540000a0 b.eq 1000022a4 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x8c> // b.none
100002294: b840452a ldr w10, [x9], #4
100002298: 0b000140 add w0, w10, w0
10000229c: eb08013f cmp x9, x8
1000022a0: 54ffffa3 b.cc 100002294 <__Z11sum_ints_32RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x7c> // b.lo, b.ul, b.last
1000022a4: d65f03c0 ret
I also rewrote the Manual-SIMD version:
uint32_t sum_ints_simd_2(const std::vector<uint32_t>& nums) {
uint32x4_t simd_total = vmovq_n_u32(0);
for (auto cn = nums.begin(); cn < nums.end()-3; cn +=4) {
const uint32_t v[4] = { cn[0], cn[1], cn[2], cn[3] };
simd_total = vaddq_u32(simd_total, vld1q_u32(v));
}
return vaddvq_u32(simd_total);
}
which still runs 2x slower than the non-manual-SIMD version and results in the following disassembly:
0000000100002464 <__Z15sum_ints_simd_2RKNSt3__16vectorIjNS_9allocatorIjEEEE>:
100002464: a9402408 ldp x8, x9, [x0]
100002468: d1003129 sub x9, x9, #0xc
10000246c: 6f00e400 movi v0.2d, #0x0
100002470: eb09011f cmp x8, x9
100002474: 540000c2 b.cs 10000248c <__Z15sum_ints_simd_2RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x28> // b.hs, b.nlast
100002478: 6f00e400 movi v0.2d, #0x0
10000247c: 3cc10501 ldr q1, [x8], #16
100002480: 4ea08420 add v0.4s, v1.4s, v0.4s
100002484: eb09011f cmp x8, x9
100002488: 54ffffa3 b.cc 10000247c <__Z15sum_ints_simd_2RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x18> // b.lo, b.ul, b.last
10000248c: 4eb1b800 addv s0, v0.4s
100002490: 1e260000 fmov w0, s0
100002494: d65f03c0 ret
To reach the same speed as the auto-vectorized version, we can use a uint32x4x2 instead of uint32x4 for our manual-SIMD version:
uint32_t sum_ints_simd_3(const std::vector<uint32_t>& nums) {
uint32x4x2_t simd_total;
simd_total.val[0] = vmovq_n_u32(0);
simd_total.val[1] = vmovq_n_u32(0);
for (auto cn = nums.begin(); cn < nums.end()-7; cn +=8) {
const uint32_t v[4] = { cn[0], cn[1], cn[2], cn[3] };
const uint32_t v2[4] = { cn[4], cn[5], cn[6], cn[7] };
simd_total.val[0] = vaddq_u32(simd_total.val[0], vld1q_u32(v));
simd_total.val[1] = vaddq_u32(simd_total.val[1], vld1q_u32(v2));
}
return vaddvq_u32(simd_total.val[0]) + vaddvq_u32(simd_total.val[1]);
}
And to gain even more speed we can leverage uint32x4x4 (which gets us about ~53 GB/s):
uint32_t sum_ints_simd_4(const std::vector<uint32_t>& nums) {
uint32x4x4_t simd_total;
simd_total.val[0] = vmovq_n_u32(0);
simd_total.val[1] = vmovq_n_u32(0);
simd_total.val[2] = vmovq_n_u32(0);
simd_total.val[3] = vmovq_n_u32(0);
for (auto cn = nums.begin(); cn < nums.end()-15; cn +=16) {
const uint32_t v[4] = { cn[0], cn[1], cn[2], cn[3] };
const uint32_t v2[4] = { cn[4], cn[5], cn[6], cn[7] };
const uint32_t v3[4] = { cn[8], cn[9], cn[10], cn[11] };
const uint32_t v4[4] = { cn[12], cn[13], cn[14], cn[15] };
simd_total.val[0] = vaddq_u32(simd_total.val[0], vld1q_u32(v));
simd_total.val[1] = vaddq_u32(simd_total.val[1], vld1q_u32(v2));
simd_total.val[2] = vaddq_u32(simd_total.val[2], vld1q_u32(v3));
simd_total.val[3] = vaddq_u32(simd_total.val[3], vld1q_u32(v4));
}
return vaddvq_u32(simd_total.val[0])
+ vaddvq_u32(simd_total.val[1])
+ vaddvq_u32(simd_total.val[2])
+ vaddvq_u32(simd_total.val[3]);
}
which gets us the following disassembly:
0000000100005e34 <__Z15sum_ints_simd_4RKNSt3__16vectorIjNS_9allocatorIjEEEE>:
100005e34: a9402408 ldp x8, x9, [x0]
100005e38: d100f129 sub x9, x9, #0x3c
100005e3c: 6f00e403 movi v3.2d, #0x0
100005e40: 6f00e402 movi v2.2d, #0x0
100005e44: 6f00e401 movi v1.2d, #0x0
100005e48: 6f00e400 movi v0.2d, #0x0
100005e4c: eb09011f cmp x8, x9
100005e50: 540001c2 b.cs 100005e88 <__Z15sum_ints_simd_4RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x54> // b.hs, b.nlast
100005e54: 6f00e400 movi v0.2d, #0x0
100005e58: 6f00e401 movi v1.2d, #0x0
100005e5c: 6f00e402 movi v2.2d, #0x0
100005e60: 6f00e403 movi v3.2d, #0x0
100005e64: ad401504 ldp q4, q5, [x8]
100005e68: ad411d06 ldp q6, q7, [x8, #32]
100005e6c: 4ea38483 add v3.4s, v4.4s, v3.4s
100005e70: 4ea284a2 add v2.4s, v5.4s, v2.4s
100005e74: 4ea184c1 add v1.4s, v6.4s, v1.4s
100005e78: 4ea084e0 add v0.4s, v7.4s, v0.4s
100005e7c: 91010108 add x8, x8, #0x40
100005e80: eb09011f cmp x8, x9
100005e84: 54ffff03 b.cc 100005e64 <__Z15sum_ints_simd_4RKNSt3__16vectorIjNS_9allocatorIjEEEE+0x30> // b.lo, b.ul, b.last
100005e88: 4eb1b863 addv s3, v3.4s
100005e8c: 1e260068 fmov w8, s3
100005e90: 4eb1b842 addv s2, v2.4s
100005e94: 1e260049 fmov w9, s2
100005e98: 0b080128 add w8, w9, w8
100005e9c: 4eb1b821 addv s1, v1.4s
100005ea0: 1e260029 fmov w9, s1
100005ea4: 0b090108 add w8, w8, w9
100005ea8: 4eb1b800 addv s0, v0.4s
100005eac: 1e260009 fmov w9, s0
100005eb0: 0b090100 add w0, w8, w9
100005eb4: d65f03c0 ret
Crazy stuff

Does -march=native help? IDK if there are any SIMD features that Apple clang won't already take advantage on the first generation of AArch64 MacOS CPUs, but clang might just be taking baseline AArch64 in general.
Can you go faster if you use uint32_t sums, so the compiler doesn't have to widen each element before adding? That means each SIMD instruction can only handle half as much data from memory as with same-sized accumulators.
https://godbolt.org/z/7c19913jE shows that Thomas Matthews' unrolling suggestion does actually get clang11 -O3 -march=apple-a13 to unroll the SIMD-vectorized asm loops it makes. That source change is not a win in general, e.g. much worse for x86-64 clang -O3 -march=haswell, but it does help here.
Another possibility is that a single core can't saturate memory bandwidth. But benchmark results published by Anandtech for example seem to rule that out: they found that even a single core can achieve 59GB/s, although that was probably running an optimize memcpy function.
(They say The fact that a single Firestorm core can almost saturate the memory controllers is astounding and something we’ve never seen in a design before. That sounds a bit weird; desktop / laptop Intel CPUs come pretty close, unlike their "server" chips. Maybe not as close as Apple?
M1 has pretty low memory latency compared to modern x86, so that probably helps a single core be able to track the incoming loads to keep the necessary latency x bandwidth product in flight, even with its high memory bandwidth.

Here are some techniques.
Loop Unrolling
uint64_t total = 0;
for (auto cn = nums.begin(); cn < nums.end(); cn += 4)
{
total += cn[0];
total += cn[1];
total += cn[2];
total += cn[3];
}
Register Prefetch
uint64_t total = 0;
for (auto cn = nums.begin(); cn < nums.end(); cn += 4)
{
const uint64 n0 = cn[0];
const uint64 n1 = cn[1];
const uint64 n2 = cn[2];
const uint64 n3 = cn[3];
total += n0;
total += n1;
total += n2;
total += n3;
}
You should print the assembly language for each of these at high optimization level and compare them.
Also, your processor may have some specialized instructions that you could. For example, the ARM processor can load multiple registers from memory with one instruction.
Also, look up SIMD instructions or search the internet for "C++ SIMD read memory".
I've argued with compilers (on embedded systems) and found out that the compiler's optimization strategies may be better or equal to instruction specialization or other techniques (timings were performed using Test Points and oscilloscope).
You'll have to remember that your task, on a one core machine, will most likely be swapped out more often that with a system with multiple cores or a specialized (embedded) system.

Consider precalculating as much as you can and using built-in STL functions, this will lead to as much optimal code as possible before trying SIMD or assembly approaches. If it's still too slow, then try the SIMD/assembly versions:
Avoid calling push_back on unreserved std::vectors: this causes the system to allocate more space when the capacity limit is reached. Since you know the size of the array before hand, reserve the space ahead of time: (for non-built-in types, consider emplace_back as well).
Additionally, the STL functions can reduce the boilerplate code down to two function calls.
Also, avoid rand().
const std::size_t GB = 1024 * 1024 * 1024;
std::vector<int> nums(4 * GB);
std::generate(std::begin(nums), std::end(nums), [](){ return rand() % 1024; });
//...
const auto sum = std::accumulate(std::begin(nums), std::end(nums), 0);

How to use c++ template to conditionally compile asm code?

There is a bool variable named "Enable", when "Enable" is false, I want to create following function:
void test_false()
{
float dst[4] = {1.0, 1.0, 1.0, 1.0};
float src[4] = {1.0, 2.0, 3.0, 4.0};
float * dst_addr = dst;
float * src_addr = src;
asm volatile (
"vld1.32 {q0}, [%[src]] \n"
"vld1.32 {q1}, [%[dst]] \n"
"vadd.f32 q0, q0, q1 \n"
"vadd.f32 q0, q0, q1 \n"
"vst1.32 {q0}, [%[dst]] \n"
:[src]"+r"(src_addr),
[dst]"+r"(dst_addr)
:
: "q0", "q1", "q2", "q3", "memory"
);
for (int i = 0; i < 4; i++)
{
printf("%f, ", dst[i]);//0.0 0.0 0.0 0.0
}
}
And when "Enable" is true, I want to create following function:
void test_true()
{
float dst[4] = {1.0, 1.0, 1.0, 1.0};
float src[4] = {1.0, 2.0, 3.0, 4.0};
float * dst_addr = dst;
float * src_addr = src;
asm volatile (
"vld1.32 {q0}, [%[src]] \n"
"vld1.32 {q1}, [%[dst]] \n"
"vadd.f32 q0, q0, q1 \n"
"vadd.f32 q0, q0, q1 \n"
"vadd.f32 q0, q0, q1 \n" //Only here is different from test_false()
"vst1.32 {q0}, [%[dst]] \n"
:[src]"+r"(src_addr),
[dst]"+r"(dst_addr)
:
: "q0", "q1", "q2", "q3", "memory"
);
for (int i = 0; i < 4; i++)
{
printf("%f, ", dst[i]);//0.0 0.0 0.0 0.0
}
}
But I don't want to save two copies of code, because most of them are the same. I want to use “c++ Template + Conditional Compile” to solve my problem. The code is as follows. But it didn't work. Whether the Enable is true or false, the compiler creates the code same as test_true().
template<bool Enable>
void test_tmp()
{
float dst[4] = {1.0, 1.0, 1.0, 1.0};
float src[4] = {1.0, 2.0, 3.0, 4.0};
float * dst_addr = dst;
float * src_addr = src;
if (Enable)
{
#define FUSE_
}
asm volatile (
"vld1.32 {q0}, [%[src]] \n"
"vld1.32 {q1}, [%[dst]] \n"
"vadd.f32 q0, q0, q1 \n"
"vadd.f32 q0, q0, q1 \n"
#ifdef FUSE_
"vadd.f32 q0, q0, q1 \n"
#endif
"vst1.32 {q0}, [%[dst]] \n"
:[src]"+r"(src_addr),
[dst]"+r"(dst_addr)
:
: "q0", "q1", "q2", "q3", "memory"
);
for (int i = 0; i < 4; i++)
{
printf("%f, ", dst[i]);//0.0 0.0 0.0 0.0
}
#undef FUSE_
}
template void test_tmp<true>();
template void test_tmp<false>();
It doesn't seem possible to write code like function test_tmp(). Does anyone know how to solve my problem? Thanks a lot.

If you use C temporaries and output operands for all live registers in the first half that line up with input constraints for the 2nd half, you should be able to split it up your inline asm without any performance loss, especially if you use specific memory input/output constraints instead of a catch-all "memory" clobber. But it will get a lot more complicated.
This obviously doesn't work, because the C preprocessor runs before the C++ compiler even looks at if() statements.
if (Enable) {
#define FUSE_ // always defined, regardless of Enable
}
But the GNU assembler has its own macro / conditional-assembly directives like .if which operate on the asm the compiler emits after making text substitutions into the asm() template, including actual numeric values for immediate input operands.
Use the bool as an input operand for an assembler .if directive
Use an "i" (Enable) input constraint. Normally the %0 or %[enable] expansion of that would be #0 or #1, because that's how to print an ARM immediate. But GCC has a %c0 / %c[enable] modifier that will print a constant without punctuation. (It's documented for x86, but works the same way for ARM and presumably all other architectures. Documentation for ARM / AArch64 operand modifiers is being worked on; I've been sitting on an email about that...)
".if %c[enable] \n\t" for [enable] "i" (c_var) will substitute as .if 0 or .if 1 into the inline-asm template, exactly what we need to make .if / .endif work at assemble time.
Full example:
template<bool Enable>
void test_tmp(float dst[4])
{
//float dst[4] = {1.0, 1.0, 1.0, 1.0};
// static const // non-static-const so we can see the memory clobber vs. dummy src stop this from optimizing away init of src[] on the stack
float src[4] = {1.0, 2.0, 3.0, 4.0};
float * dst_addr = dst;
const float * src_addr = src;
asm (
"vld1.32 {q1}, [%[dst]] # dummy dst = %[dummy_memdst]\n" // hopefully they pick the same regs?
"vld1.32 {q0}, [%[src]] # dummy src = %[dummy_memsrc]\n"
"vadd.f32 q0, q0, q1 \n" // TODO: optimize to q1+q1 first, without a dep on src
"vadd.f32 q0, q0, q1 \n" // allowing q0+=q1 and q1+=q1 in parallel if we need q0 += 3*q1
// #ifdef FUSE_
".if %c[enable]\n" // %c modifier: print constant without punctuation, same as documented for x86
"vadd.f32 q0, q0, q1 \n"
".endif \n"
// #endif
"vst1.32 {q0}, [%[dst]] \n"
: [dummy_memdst] "+m" (*(float(*)[4])dst_addr)
: [src]"r"(src_addr),
[dst]"r"(dst_addr),
[enable]"i"(Enable)
, [dummy_memsrc] "m" (*(const float(*)[4])src_addr)
: "q0", "q1", "q2", "q3" //, "memory"
);
/*
for (int i = 0; i < 4; i++)
{
printf("%f, ", dst[i]);//0.0 0.0 0.0 0.0
}
*/
}
float dst[4] = {1.0, 1.0, 1.0, 1.0};
template void test_tmp<true>(float *);
template void test_tmp<false>(float *);
compiles with GCC and Clang on the Godbolt compiler explorer
With gcc, you only get the compiler's .s output, so you have to turn off some of the usual compiler-explorer filters and look through the directives. All 3 vadd.f32 instructions are there in the false version, but one of them is surrounded by .if 0 / .endif.
But clang's built-in assembler processes assembler directives internally before turning things back into asm if that output is requested. (Normally clang/LLVM goes straight to machine code, unlike gcc which always runs a separate assembler).
Just to be clear, this works with gcc and clang, but it's just easier to see it on Godbolt with clang. (Because Godbolt doesn't have a "binary" mode that actually assembles and then disassembles, except for x86). Clang output for the false version
...
vld1.32 {d2, d3}, [r0] # dummy dst = [r0]
vld1.32 {d0, d1}, [r1] # dummy src = [r1]
vadd.f32 q0, q0, q1
vadd.f32 q0, q0, q1
vst1.32 {d0, d1}, [r0]
...
Notice that clang picked the same GP register for the raw pointers as it used for the memory operand. (gcc seems to choose [sp] for src_mem, but a different reg for the pointer input that you use manually inside an addressing mode). If you hadn't forced it to have the pointers in registers, it could have used an SP-relative addressing mode with an offset for the vector loads, potentially taking advantage of ARM addressing modes.
If you're really not going to modify the pointers inside the asm (e.g. with post-increment addressing modes), then "r" input-only operands makes the most sense. If we'd left in the printf loop, the compiler would have needed dst again after the asm, so it would benefit from having it still in a register. A "+r"(dst_addr) input forces the compiler to assume that that register is no longer usable as a copy of dst. Anyway, gcc always copies the registers, even when it doesn't need it later, whether I make it "r" or "+r", so that's weird.
Using (dummy) memory inputs / outputs means we can drop the volatile, so the compiler can optimize it normally as a pure function of its inputs. (And optimize it away if the result is unused.)
Hopefully this isn't worse code-gen that with the "memory" clobber. But it would probably be better if you just used the "=m" and "m" memory operands, and didn't ask for pointers in registers at all. (That doesn't help if you're going to loop over the array with inline asm, though.)
See also Looping over arrays with inline assembly

I haven't been doing ARM assembly for few years, and I never really bothered to learn GCC inline assembly properly, but I think your code can be rewritten like this, using intrinsics:
#include <cstdio>
#include <arm_neon.h>
template<bool Enable>
void test_tmp()
{
const float32x4_t src = {1.0, 2.0, 3.0, 4.0};
const float32x4_t src2 = {1.0, 1.0, 1.0, 1.0};
float32x4_t z;
z = vaddq_f32(src, src2);
z = vaddq_f32(z, src2);
if (Enable) z = vaddq_f32(z, src2);
float result[4];
vst1q_f32(result, z);
for (int i = 0; i < 4; i++)
{
printf("%f, ", result[i]);//0.0 0.0 0.0 0.0
}
}
template void test_tmp<true>();
template void test_tmp<false>();
You can see resulting machine code + toy around live at: https://godbolt.org/z/Fg7Tci
Compiled with ARM gcc8.2 and command line options "-O3 -mfloat-abi=softfp -mfpu=neon" the "true" variant is:
void test_tmp<true>():
vmov.f32 q9, #1.0e+0 # v4sf
vldr d16, .L6
vldr d17, .L6+8
# and the FALSE variant has one less vadd.f32 in this part
vadd.f32 q8, q8, q9
vadd.f32 q8, q8, q9
vadd.f32 q8, q8, q9
push {r4, r5, r6, lr}
sub sp, sp, #16
vst1.32 {d16-d17}, [sp:64]
mov r4, sp
ldr r5, .L6+16
add r6, sp, #16
.L2:
vldmia.32 r4!, {s15}
vcvt.f64.f32 d16, s15
mov r0, r5
vmov r2, r3, d16
bl printf
cmp r4, r6
bne .L2
add sp, sp, #16
pop {r4, r5, r6, pc}
.L6:
.word 1065353216
.word 1073741824
.word 1077936128
.word 1082130432
.word .LC0
.LC0:
.ascii "%f, \000"
This still leaves me profoundly confused by why the gcc doesn't simply calculate final string with values as string for output, as the inputs are constant. Maybe it's some math-rule about precision preventing it to do that in compile-time as the result could differ slightly from actual target HW platform FPU? I.e. with some fast-math switch it would probably drop that code completely and just produce one output string...
But I guess your code is not actually proper "MCVE" of what you are doing, and the test values would be fed into some real function you are testing, or something like that.
Anyway, if you are working on performance optimizations, you should probably rather avoid inline assembly completely and use intrinsics instead, as that allows the compiler to better allocate registers and optimize code around the calculations (I didn't track it precisely, but I think the last version of this experiment in godbolt was 2-4 instructions shorter/simpler than the original using inline assembly).
Plus you will avoid the incorrect asm constraints like your example code has, those are always tricky to get correctly and pure PITA to maintain if you keep modifying the inlined code often.

Fastest way to perform AVX inner product operations with mixed (float, double) input vectors

I need to build a single-precision floating-point inner product routine for mixed single/double-precision floating-point vectors, exploiting the AVX instruction set for SIMD registers with 256 bits.
Problem: one input vector is float (x), while the other is double (yD).
Hence, before to compute the true inner product operations, I need to convert my input yD vector data from double to float.
Using the SSE2 instruction set, I was able to implement a very fast code doing what I needed, and with speed performances very close to the case when both vectors x and y were float:
void vector_operation(const size_t i)
{
__m128 X = _mm_load_ps(x + i);
__m128 Y = _mm_movelh_ps(_mm_cvtpd_ps(_mm_load_pd(yD + i + 0)), _mm_cvtpd_ps(_mm_load_pd(yD + i + 2)));
//inner-products accumulation
res = _mm_add_ps(res, _mm_mul_ps(X, Y));
}
Now, with the hope to further speed-up, I implemented a correpsonding version with AVX instruction set:
inline void vector_operation(const size_t i)
{
__m256 X = _mm256_load_ps(x + i);
__m128 yD1 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 0));
__m128 yD2 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 2));
__m128 yD3 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 4));
__m128 yD4 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 6));
__m128 Ylow = _mm_movelh_ps(yD1, yD2);
__m128 Yhigh = _mm_movelh_ps(yD3, yD4);
//Pack __m128 data inside __m256
__m256 Y = _mm256_permute2f128_ps(_mm256_castps128_ps256(Ylow), _mm256_castps128_ps256(Yhigh), 0x20);
//inner-products accumulation
res = _mm256_add_ps(res, _mm256_mul_ps(X, Y));
}
I also tested other AVX implementations using, for example, casting and insertion operations instead of perfmuting data. Performances were comparably poor compared to the case where both x and y vectors were float.
The problem with the AVX code is that no matter how I implemented it, its performance is by far inferior to the ones achieved by using only float x and y vectors (i.e. no double-float conversion is needed).
The conversion from double to float for the yD vector seems pretty fast, while a lot of time is lost in the line where data is inserted in the _m256 Y register.
Do you know if this is a well-known issue with AVX?
Do you have a solution that could preserve good performances?
Thanks in advance!

I rewrote your function and took better advantage of what AVX has to offer. I also used fused multiply-add at the end; if you can't use FMA, just replace that line with addition and multiplication. I only now see that I wrote an implementation that uses unaligned loads and yours uses aligned loads, but I'm not gonna lose any sleep over it. :)
__m256 foo(float*x, double* yD, const size_t i, __m256 res_prev)
{
__m256 X = _mm256_loadu_ps(x + i);
__m128 yD21 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 0));
__m128 yD43 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 4));
__m256 Y = _mm256_set_m128(yD43, yD21);
return _mm256_fmadd_ps(X, Y, res_prev);
}
I did a quick benhmark and compared running times of your and my implementation. I tried two different benchmark approaches with several repetitions and every time my code was around 15% faster. I used MSVC 14.1 compiler and compiled the program with /O2 and /arch:AVX2 flags.
EDIT: this is the disassembly of the function:
vcvtpd2ps xmm3,ymmword ptr [rdx+r8*8+20h]
vcvtpd2ps xmm2,ymmword ptr [rdx+r8*8]
vmovups ymm0,ymmword ptr [rcx+r8*4]
vinsertf128 ymm3,ymm2,xmm3,1
vfmadd213ps ymm0,ymm3,ymmword ptr [r9]
EDIT 2: this is the disassembly of your AVX implementation of the same algorithm:
vcvtpd2ps xmm0,xmmword ptr [rdx+r8*8+30h]
vcvtpd2ps xmm1,xmmword ptr [rdx+r8*8+20h]
vmovlhps xmm3,xmm1,xmm0
vcvtpd2ps xmm0,xmmword ptr [rdx+r8*8+10h]
vcvtpd2ps xmm1,xmmword ptr [rdx+r8*8]
vmovlhps xmm2,xmm1,xmm0
vperm2f128 ymm3,ymm2,ymm3,20h
vmulps ymm0,ymm3,ymmword ptr [rcx+r8*4]
vaddps ymm0,ymm0,ymmword ptr [r9]

Fastest downscaling of 8bit gray image with SSE

I have a function which downscales an 8-Bit image by a factor of two. I have previously optimised the rgb32 case with SSE. Now I would like to do the same for the gray8 case.
At the core, there is a function taking two lines of pixel data, which works like this:
/**
* Calculates the average of two rows of gray8 pixels by averaging four pixels.
*/
void average2Rows(const uint8_t* row1, const uint8_t* row2, uint8_t* dst, int size)
{
for (int i = 0; i < size - 1; i += 2)
*(dst++) = ((row1[i]+row1[i+1]+row2[i]+row2[i+1])/4)&0xFF;
}
Now, I have come up with an SSE variant which is about three times faster, but it does involve a lot of shuffling and I think one might do better. Does anybody see what can be optimised here?
/* row1: 16 8-bit values A-P
* row2: 16 8-bit values a-p
* returns 16 8-bit values (A+B+a+b)/4, (C+D+c+d)/4, ..., (O+P+o+p)/4
*/
__m128i avg16Bytes(const __m128i& row1, const __m128i& row2)
{
static const __m128i zero = _mm_setzero_si128();
__m128i ABCDEFGHIJKLMNOP = _mm_avg_epu8(row1_u8, row2);
__m128i ABCDEFGH = _mm_unpacklo_epi8(ABCDEFGHIJKLMNOP, zero);
__m128i IJKLMNOP = _mm_unpackhi_epi8(ABCDEFGHIJKLMNOP, zero);
__m128i AIBJCKDL = _mm_unpacklo_epi16( ABCDEFGH, IJKLMNOP );
__m128i EMFNGOHP = _mm_unpackhi_epi16( ABCDEFGH, IJKLMNOP );
__m128i AEIMBFJN = _mm_unpacklo_epi16( AIBJCKDL, EMFNGOHP );
__m128i CGKODHLP = _mm_unpackhi_epi16( AIBJCKDL, EMFNGOHP );
__m128i ACEGIKMO = _mm_unpacklo_epi16( AEIMBFJN, CGKODHLP );
__m128i BDFHJLNP = _mm_unpackhi_epi16( AEIMBFJN, CGKODHLP );
return _mm_avg_epu8(ACEGIKMO, BDFHJLNP);
}
/*
* Calculates the average of two rows of gray8 pixels by averaging four pixels.
*/
void average2Rows(const uint8_t* src1, const uint8_t* src2, uint8_t* dst, int size)
{
for(int i = 0;i<size-31; i+=32)
{
__m128i tl = _mm_loadu_si128((__m128i const*)(src1+i));
__m128i tr = _mm_loadu_si128((__m128i const*)(src1+i+16));
__m128i bl = _mm_loadu_si128((__m128i const*)(src2+i));
__m128i br = _mm_loadu_si128((__m128i const*)(src2+i+16)))
__m128i l_avg = avg16Bytes(tl, bl);
__m128i r_avg = avg16Bytes(tr, br);
_mm_storeu_si128((__m128i *)(dst+(i/2)), _mm_packus_epi16(l_avg, r_avg));
}
}
Notes:
I realise my function has slight (off by one) rounding errors, but I am willing to accept this.
For clarity I have assumed size is a multiple of 32.
EDIT: There is now a github repository implementing the answers to this question. The fastest solution was provided by user Peter Cordes. See his essay below for details:
__m128i avg16Bytes(const __m128i& row1, const __m128i& row2)
{
// Average the first 16 values of src1 and src2:
__m128i avg = _mm_avg_epu8(row1, row2);
// Unpack and horizontal add:
avg = _mm_maddubs_epi16(avg, _mm_set1_epi8(1));
// Divide by 2:
return _mm_srli_epi16(avg, 1);
}
It works as my original implementation by calculating (a+b)/2 + (c+d)/2 as opposed to (a+b+c+d)/4, so it has the same off-by-one rounding error.
Cudos to user Paul R for implementing a solution which is twice as fast as mine, but exact:
__m128i avg16Bytes(const __m128i& row1, const __m128i& row2)
{
// Unpack and horizontal add:
__m128i row1 = _mm_maddubs_epi16(row1_u8, _mm_set1_epi8(1));
__m128i row2 = _mm_maddubs_epi16(row2_u8, _mm_set1_epi8(1));
// vertical add:
__m128i avg = _mm_add_epi16(row1_avg, row2_avg);
// divide by 4:
return _mm_srli_epi16(avg, 2);
}

If you're willing to accept double-rounding from using pavgb twice, you can go faster than Paul R's answer by doing the vertical averaging first with pavgb, cutting in half the amount of data that needs to be unpacked to 16-bit elements. (And allowing half the loads to fold into memory operands for pavgb, reducing front-end bottlenecks on some CPUs.)
For horizontal averaging, your best bet is probably still pmaddubsw with set1(1) and shift by 1, then pack.
// SSSE3 version
// I used `__restrict__` to give the compiler more flexibility in unrolling
void average2Rows_doubleround(const uint8_t* __restrict__ src1, const uint8_t*__restrict__ src2,
uint8_t*__restrict__ dst, size_t size)
{
const __m128i vk1 = _mm_set1_epi8(1);
size_t dstsize = size/2;
for (size_t i = 0; i < dstsize - 15; i += 16)
{
__m128i v0 = _mm_load_si128((const __m128i *)&src1[i*2]);
__m128i v1 = _mm_load_si128((const __m128i *)&src1[i*2 + 16]);
__m128i v2 = _mm_load_si128((const __m128i *)&src2[i*2]);
__m128i v3 = _mm_load_si128((const __m128i *)&src2[i*2 + 16]);
__m128i left = _mm_avg_epu8(v0, v2);
__m128i right = _mm_avg_epu8(v1, v3);
__m128i w0 = _mm_maddubs_epi16(left, vk1); // unpack and horizontal add
__m128i w1 = _mm_maddubs_epi16(right, vk1);
w0 = _mm_srli_epi16(w0, 1); // divide by 2
w1 = _mm_srli_epi16(w1, 1);
w0 = _mm_packus_epi16(w0, w1); // pack
_mm_storeu_si128((__m128i *)&dst[i], w0);
}
}
The other option is _mm_srli_epi16(v, 8) to line up the odd elements with the even elements of every horizontal pair. But since there is no horizontal pack with truncation, you have to _mm_and_si128(v, _mm_set1_epi16(0x00FF)) both halves before you pack. It turns out to be slower than using SSSE3 pmaddubsw, especially without AVX where it takes extra MOVDQA instructions to copy registers.
void average2Rows_doubleround_SSE2(const uint8_t* __restrict__ src1, const uint8_t* __restrict__ src2, uint8_t* __restrict__ dst, size_t size)
{
size /= 2;
for (size_t i = 0; i < size - 15; i += 16)
{
__m128i v0 = _mm_load_si128((__m128i *)&src1[i*2]);
__m128i v1 = _mm_load_si128((__m128i *)&src1[i*2 + 16]);
__m128i v2 = _mm_load_si128((__m128i *)&src2[i*2]);
__m128i v3 = _mm_load_si128((__m128i *)&src2[i*2 + 16]);
__m128i left = _mm_avg_epu8(v0, v2);
__m128i right = _mm_avg_epu8(v1, v3);
__m128i l_odd = _mm_srli_epi16(left, 8); // line up horizontal pairs
__m128i r_odd = _mm_srli_epi16(right, 8);
__m128i l_avg = _mm_avg_epu8(left, l_odd); // leaves garbage in the high halves
__m128i r_avg = _mm_avg_epu8(right, r_odd);
l_avg = _mm_and_si128(l_avg, _mm_set1_epi16(0x00FF));
r_avg = _mm_and_si128(r_avg, _mm_set1_epi16(0x00FF));
__m128i avg = _mm_packus_epi16(l_avg, r_avg); // pack
_mm_storeu_si128((__m128i *)&dst[i], avg);
}
}
With AVX512BW, there's _mm_cvtepi16_epi8, but IACA says it's 2 uops on Skylake-AVX512, and it only takes 1 input and produces a half-width output. According to IACA, the memory-destination form is 4 total unfused-domain uops (same as reg,reg + separate store). I had to use _mm_mask_cvtepi16_storeu_epi8(&dst\[i+0\], -1, l_avg); to get it, because gcc and clang fail to fold a separate _mm_store into a memory destination for vpmovwb. (There is no non-masked store intrinsic, because compilers are supposed to do that for you like they do with folding _mm_load into memory operands for typical ALU instructions).
It's probably only useful when narrowing to 1/4 or 1/8th (cvtepi64_epi8), not just in half. Or maybe useful to avoid needing a second shuffle to deal with the in-lane behaviour of _mm512_packus_epi16. With AVX2, after a _mm256_packus_epi16 on [D C] [B A], you have [D B | C A], which you can fix with an AVX2 _mm256_permute4x64_epi64 (__m256i a, const int imm8) to shuffle in 64-bit chunks. But with AVX512, you'd need a vector shuffle-control for the vpermq. packus + a fixup shuffle is probably still a better option, though.
Once you do this, there aren't many vector instructions left in the loop, and there's a lot to gain from letting the compiler make tighter asm. Your loop is unfortunately difficult for compilers to do a good job with.
(This also helps Paul R's solution, since he copied the compiler-unfriendly loop structure from the question.)
Use the loop-counter in a way that gcc/clang can optimize better, and use types that avoid re-doing sign extension every time through the loop.
With your current loop, gcc/clang actually do an arithmetic right-shift for i/2, instead of incrementing by 16 (instead of 32) and using scaled-index addressing modes for the loads. It seems they don't realize that i is always even.
(full code + asm on Matt Godbolt's compiler explorer):
.LBB1_2: ## clang's inner loop for int i, dst[i/2] version
movdqu xmm1, xmmword ptr [rdi + rcx]
movdqu xmm2, xmmword ptr [rdi + rcx + 16]
movdqu xmm3, xmmword ptr [rsi + rcx]
movdqu xmm4, xmmword ptr [rsi + rcx + 16]
pavgb xmm3, xmm1
pavgb xmm4, xmm2
pmaddubsw xmm3, xmm0
pmaddubsw xmm4, xmm0
psrlw xmm3, 1
psrlw xmm4, 1
packuswb xmm3, xmm4
mov eax, ecx # This whole block is wasted instructions!!!
shr eax, 31
add eax, ecx
sar eax # eax = ecx/2, with correct rounding even for negative `i`
cdqe # sign-extend EAX into RAX
movdqu xmmword ptr [rdx + rax], xmm3
add rcx, 32 # i += 32
cmp rcx, r8
jl .LBB1_2 # }while(i < size-31)
gcc7.1 isn't quite so bad, (just mov/sar/movsx), but gcc5.x and 6.x do separate pointer-increments for src1 and src2, and also for a counter/index for the stores. (Totally braindead behaviour, especially since they still do it with -march=sandybridge. Indexed movdqu stores and non-indexed movdqu loads gives you the maximum loop overhead.)
Anyway, using dstsize and multiplying i inside the loop instead of dividing it gives much better results. Different versions of gcc and clang reliably compile it into a single loop-counter that they use with a scaled-index addressing mode for the loads. You get code like:
movdqa xmm1, xmmword ptr [rdi + 2*rax]
movdqa xmm2, xmmword ptr [rdi + 2*rax + 16]
pavgb xmm1, xmmword ptr [rsi + 2*rax]
pavgb xmm2, xmmword ptr [rsi + 2*rax + 16] # saving instructions with aligned loads, see below
...
movdqu xmmword ptr [rdx + rax], xmm1
add rax, 16
cmp rax, rcx
jb .LBB0_2
I used size_t i to match size_t size, to make sure gcc didn't waste any instructions sign-extending or zero-extending it to the width of a pointer. (zero-extension usually happens for free, though, so unsigned size and unsigned i might have been ok, and saved a couple REX prefixes.)
You could still get rid of the cmp but counting an index up towards 0, which would speed the loop up a little bit more than what I've done. I'm not sure how easy it would be to get compilers to not be stupid and omit the cmp instruction if you do count up towards zero. Indexing from the end of an object is no problem, though. src1+=size;. It does complicate things if you want to use an unaligned-cleanup loop, though.
On my Skylake i7-6700k (max turbo 4.4GHz, but look at the clock-cycle counts rather than times). With g++7.1, this makes a difference of ~2.7 seconds for 100M reps of 1024 bytes vs. ~3.3 seconds.
Performance counter stats for './grayscale-dowscale-by-2.inline.gcc-skylake-noavx' (2 runs):
2731.607950 task-clock (msec) # 1.000 CPUs utilized ( +- 0.40% )
2 context-switches # 0.001 K/sec ( +- 20.00% )
0 cpu-migrations # 0.000 K/sec
88 page-faults:u # 0.032 K/sec ( +- 0.57% )
11,917,723,707 cycles # 4.363 GHz ( +- 0.07% )
42,006,654,015 instructions # 3.52 insn per cycle ( +- 0.00% )
41,908,837,143 uops_issued_any # 15342.186 M/sec ( +- 0.00% )
49,409,631,052 uops_executed_thread # 18088.112 M/sec ( +- 0.00% )
3,301,193,901 branches # 1208.517 M/sec ( +- 0.00% )
100,013,629 branch-misses # 3.03% of all branches ( +- 0.01% )
2.731715466 seconds time elapsed ( +- 0.40% )
vs. Same vectorization, but with int i and dst[i/2] creating higher loop overhead (more scalar instructions):
Performance counter stats for './grayscale-dowscale-by-2.loopoverhead-aligned-inline.gcc-skylake-noavx' (2 runs):
3314.335833 task-clock (msec) # 1.000 CPUs utilized ( +- 0.02% )
4 context-switches # 0.001 K/sec ( +- 14.29% )
0 cpu-migrations # 0.000 K/sec
88 page-faults:u # 0.026 K/sec ( +- 0.57% )
14,531,925,552 cycles # 4.385 GHz ( +- 0.06% )
51,607,478,414 instructions # 3.55 insn per cycle ( +- 0.00% )
51,109,303,460 uops_issued_any # 15420.677 M/sec ( +- 0.00% )
55,810,234,508 uops_executed_thread # 16839.040 M/sec ( +- 0.00% )
3,301,344,602 branches # 996.080 M/sec ( +- 0.00% )
100,025,451 branch-misses # 3.03% of all branches ( +- 0.00% )
3.314418952 seconds time elapsed ( +- 0.02% )
vs. Paul R's version (optimized for lower loop overhead): exact but slower
Performance counter stats for './grayscale-dowscale-by-2.paulr-inline.gcc-skylake-noavx' (2 runs):
3751.990587 task-clock (msec) # 1.000 CPUs utilized ( +- 0.03% )
3 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
88 page-faults:u # 0.024 K/sec ( +- 0.56% )
16,323,525,446 cycles # 4.351 GHz ( +- 0.04% )
58,008,101,634 instructions # 3.55 insn per cycle ( +- 0.00% )
57,610,721,806 uops_issued_any # 15354.709 M/sec ( +- 0.00% )
55,505,321,456 uops_executed_thread # 14793.566 M/sec ( +- 0.00% )
3,301,456,435 branches # 879.921 M/sec ( +- 0.00% )
100,001,954 branch-misses # 3.03% of all branches ( +- 0.02% )
3.752086635 seconds time elapsed ( +- 0.03% )
vs. Paul R's original version with extra loop overhead:
Performance counter stats for './grayscale-dowscale-by-2.loopoverhead-paulr-inline.gcc-skylake-noavx' (2 runs):
4154.300887 task-clock (msec) # 1.000 CPUs utilized ( +- 0.01% )
3 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
90 page-faults:u # 0.022 K/sec ( +- 1.68% )
18,174,791,383 cycles # 4.375 GHz ( +- 0.03% )
67,608,724,157 instructions # 3.72 insn per cycle ( +- 0.00% )
66,937,292,129 uops_issued_any # 16112.769 M/sec ( +- 0.00% )
61,875,610,759 uops_executed_thread # 14894.350 M/sec ( +- 0.00% )
3,301,571,922 branches # 794.736 M/sec ( +- 0.00% )
100,029,270 branch-misses # 3.03% of all branches ( +- 0.00% )
4.154441330 seconds time elapsed ( +- 0.01% )
Note that branch-misses is about the same as the repeat count: the inner loop mispredicts at the end every time. Unrolling to keep the loop iteration count under about 22 would make the pattern short enough for Skylake's branch predictors to predict the not-taken condition correctly most of the time. Branch mispredicts are the only reason we're not getting ~4.0 uops per cycle through the pipeline, so avoiding branch misses would raise the IPC from 3.5 to over 4.0 (cmp/jcc macro-fusion puts 2 instructions in one uop).
These branch-misses probably hurt even if you're bottlenecked on L2 cache bandwidth (instead of the front-end). I didn't test that, though: my testing just wraps a for() loop around the function call from Paul R's test harness, so everything's hot in L1D cache. 32 iterations of the inner loop is close to the worst-case here: low enough for frequent mispredicts, but not so low that branch-prediction can pick up the pattern and avoid them.
My version should run in 3 cycles per iteration, bottlenecked only on the frontend, on Intel Sandybridge and later. (Nehalem will bottleneck on one load per clock.)
See http://agner.org/optimize/, and also Can x86's MOV really be "free"? Why can't I reproduce this at all? for more about fused-domain vs. unfused-domain uops and perf counters.
update: clang unrolls it for you, at least when the size is a compile-time constant... Strangely, it unrolls even the non-inline version of the dst[i/2] function (with unknown size), but not the lower-loop-overhead version.
With clang++-4.0 -O3 -march=skylake -mno-avx, my version (unrolled by 2 by the compiler) runs in: 9.61G cycles for 100M iters (2.2s). (35.6G uops issued (fused domain), 45.0G uops executed (unfused domain), near-zero branch-misses.) Probably not bottlenecked on the front-end anymore, but AVX would still hurt.
Paul R's (also unrolled by 2) runs in 12.29G cycles for 100M iters (2.8s). 48.4G uops issued (fused domain), 51.4G uops executed (unfused-domain). 50.1G instructions, for 4.08 IPC, probably still bottlenecked on the front-end (because it needs a couple movdqa instructions to copy a register before destroying it). AVX would help for non-destructive vector instructions, even without AVX2 for wider integer vectors.
With careful coding, you should be able to do about this well for runtime-variable sizes.
Use aligned pointers and aligned loads, so the compiler can use pavgb with a memory operand instead of using a separate unaligned-load instruction. This means fewer instructions and fewer uops for the front-end, which is a bottleneck for this loop.
This doesn't help Paul's version, because only the second operand for pmaddubsw can come from memory, and that's the one treated as signed bytes. If we used _mm_maddubs_epi16(_mm_set1_epi8(1), v0);, the 16-bit multiply result would be sign-extended instead of zero-extended. So 1+255 would come out to 0 instead of 256.
Folding a load requires alignment with SSE, but not with AVX. However, on Intel Haswell/Skylake, indexed addressing modes can only stay micro-fused with instructions which read-modify-write their destination register. vpavgb xmm0, xmm0, [rsi+rax*2] is un-laminated to 2 uops on Haswell/Skylake before it issues into the out-of-order part of the core, but pavgb xmm1, [rsi+rax*2] can stay micro-fused all the way through, so it issues as a single uop. The front-end issue bottleneck is 4 fused-domain uops per clock on mainstream x86 CPUs except Ryzen (i.e. not Atom/Silvermont). Folding half the loads into memory operands helps with that on all Intel CPUs except Sandybridge/Ivybridge, and all AMD CPUs.
gcc and clang will fold the loads when inlining into a test function that uses alignas(32), even if you use _mm_loadu intrinsics. They know the data is aligned, and take advantage.
Weird fact: compiling the 128b-vectorized code with AVX code-gen enabled (-march=native) actually slows it down on Haswell/Skylake, because it would make all 4 loads issue as separate uops even when they're memory operands for vpavgb, and there aren't any movdqa register-copying instructions that AVX would avoid. (Usually AVX comes out ahead anyway even for manually-vectorized code that still only uses 128b vectors, because of the benefit of 3-operand instructions not destroying one of their inputs.) In this case, 13,53G cycles ( +- 0.05% ) or 3094.195773 ms ( +- 0.20% ), up from 11.92G cycles in ~2.7 seconds. uops_issued = 48.508G, up from 41,908. Instruction count and uops_executed counts are essentially the same.
OTOH, an actual 256b AVX2 version would run slightly a bit less than twice as fast. Some unrolling to reduce the front-end bottleneck will definitely help. An AVX512 version might run close to 4x as fast on Skylake-AVX512 Xeons, but might bottleneck on ALU throughput since SKX shuts down execution port1 when there are any 512b uops in the RS waiting to execute, according to #Mysticial's testing. (That explains why pavgb zmm has 1 per clock throughput while pavgb ymm is 2 per clock..)
To have both input rows aligned, store your image data in a format with a row stride that's a multiple of 16, even if the actual image dimensions are odd. Your storage stride doesn't have to match your actual image dimensions.
If you can only align either the source or dest (e.g. because you're downscaling a region that starts at an odd column in the source image), you should probably still align your source pointers.
Intel's optimization manual recommends aligning the destination instead of the source, if you can't align both, but doing 4x as many loads as stores probably changes the balance.
To handle unaligned at the start/end, do a potentially-overlapping unaligned vector of pixels from the start and end. It's fine for stores to overlap other stores, and since dst is separate from src, you can redo a partially-overlapping vector.
In Paul's test main(), I just added alignas(32) in front of every array.
AVX2:
Since you compile one version with -march=native, you can easily detect AVX2 at compile time with #ifdef __AVX2__. There's no simple way to use exactly the same code for 128b and 256b manual vectorization. All the intrinsics have different names, so you typically need to copy everything even if there are no other differences.
(There are some C++ wrapper libraries for the intrinsics that use operator-overloading and function overloading to let you write a templated version that uses the same logic on different widths of vector. e.g. Agner Fog's VCL is good, but unless your software is open-source, you can't use it because it's GPL licensed and you want to distribute a binary.)
To take advantage of AVX2 in your binary-distribution version, you'd have to do runtime detection/dispatching. In that case, you'd want to dispatch to versions of a function that loops over rows, so you don't have dispatch overhead inside your loop over rows. Or just let that version use SSSE3.

Here is an implementation which uses fewer instructions. I haven't benchmarked it against your code though, so it may not be significantly faster:
void average2Rows(const uint8_t* src1, const uint8_t* src2, uint8_t* dst, int size)
{
const __m128i vk1 = _mm_set1_epi8(1);
for (int i = 0; i < size - 31; i += 32)
{
__m128i v0 = _mm_loadu_si128((__m128i *)&src1[i]);
__m128i v1 = _mm_loadu_si128((__m128i *)&src1[i + 16]);
__m128i v2 = _mm_loadu_si128((__m128i *)&src2[i]);
__m128i v3 = _mm_loadu_si128((__m128i *)&src2[i + 16]);
__m128i w0 = _mm_maddubs_epi16(v0, vk1); // unpack and horizontal add
__m128i w1 = _mm_maddubs_epi16(v1, vk1);
__m128i w2 = _mm_maddubs_epi16(v2, vk1);
__m128i w3 = _mm_maddubs_epi16(v3, vk1);
w0 = _mm_add_epi16(w0, w2); // vertical add
w1 = _mm_add_epi16(w1, w3);
w0 = _mm_srli_epi16(w0, 2); // divide by 4
w1 = _mm_srli_epi16(w1, 2);
w0 = _mm_packus_epi16(w0, w1); // pack
_mm_storeu_si128((__m128i *)&dst[i / 2], w0);
}
}
Test harness:
#include <stdio.h>
#include <stdlib.h>
#include <tmmintrin.h>
void average2Rows_ref(const uint8_t* row1, const uint8_t* row2, uint8_t* dst, int size)
{
for (int i = 0; i < size - 1; i += 2)
{
dst[i / 2] = (row1[i] + row1[i + 1] + row2[i] + row2[i + 1]) / 4;
}
}
void average2Rows(const uint8_t* src1, const uint8_t* src2, uint8_t* dst, int size)
{
const __m128i vk1 = _mm_set1_epi8(1);
for (int i = 0; i < size - 31; i += 32)
{
__m128i v0 = _mm_loadu_si128((__m128i *)&src1[i]);
__m128i v1 = _mm_loadu_si128((__m128i *)&src1[i + 16]);
__m128i v2 = _mm_loadu_si128((__m128i *)&src2[i]);
__m128i v3 = _mm_loadu_si128((__m128i *)&src2[i + 16]);
__m128i w0 = _mm_maddubs_epi16(v0, vk1); // unpack and horizontal add
__m128i w1 = _mm_maddubs_epi16(v1, vk1);
__m128i w2 = _mm_maddubs_epi16(v2, vk1);
__m128i w3 = _mm_maddubs_epi16(v3, vk1);
w0 = _mm_add_epi16(w0, w2); // vertical add
w1 = _mm_add_epi16(w1, w3);
w0 = _mm_srli_epi16(w0, 2); // divide by 4
w1 = _mm_srli_epi16(w1, 2);
w0 = _mm_packus_epi16(w0, w1); // pack
_mm_storeu_si128((__m128i *)&dst[i / 2], w0);
}
}
int main()
{
const int n = 1024;
uint8_t src1[n];
uint8_t src2[n];
uint8_t dest_ref[n / 2];
uint8_t dest_test[n / 2];
for (int i = 0; i < n; ++i)
{
src1[i] = rand();
src2[i] = rand();
}
for (int i = 0; i < n / 2; ++i)
{
dest_ref[i] = 0xaa;
dest_test[i] = 0x55;
}
average2Rows_ref(src1, src2, dest_ref, n);
average2Rows(src1, src2, dest_test, n);
for (int i = 0; i < n / 2; ++i)
{
if (dest_test[i] != dest_ref[i])
{
printf("%u %u %u %u: ref = %u, test = %u\n", src1[2 * i], src1[2 * i + 1], src2[2 * i], src2[2 * i + 1], dest_ref[i], dest_test[i]);
}
}
return 0;
}
Note that the output of the SIMD version exactly matches the output of the scalar reference code (no "off by one" rounding errors).

NEON assembly code, how to convert BYTE to float?

now i have a code want to improve. the datatype of src data is byte. i want to calculate with float and store the result to byte. but i dont know how to convert datatype of data bwtween BYTE and float. I develop on Android NDK. the C++ code i want to improve followed:
void DoEffect(BYTE *pSrc, float rat){
//image data:BGRA
float red, green, blue;
red = pSrc[RED_CHANNEL] * rat;
green = pSrc[GREEN_CHANNEL] * rat;
blue = pSrc[BLUE_CHANNEL] * rat;
// some step to calculate the result;
// red = ...
// ...
//
pSrc[RED_CHANNEL] = (BYTE)red;
pSrc[GREEN_CHANNEL] = (BYTE)green;
pSrc[BLUE_CHANNEL] = (BYTE)blue;
}
and my neon asm code:
void DoEffect_neon_asm(BYTE *pSrc, float rat){
__asm__ __volatile__(
"vld4.8 {d0-d3},[%[src]] \n"
"vdupq.32 {d4, d5}, [%[rat]] \n"
"# convert src data to float? \n"
"code: convert byte to float \n"
"# calculate result \n"
".. \n"
"# convert result to byte \n"
"code: convert float to byte \n"
:[src]"r"(pSrc), [rat]"r"(rat)
);
}
my problem is how to code "code: convert byte to float" and "code: convert float to byte" listed in "neon asm code".

Converting bytes to floats is fairly straightforward. The code below will do this for one register of bytes:
"vmovl.u8 q3, d2 \n\t" //Expand to 16-bit
"vmovl.u16 q10, d6 \n\t" //Expand to 32-bit
"vmovl.u16 q11, d7 \n\t"
"vcvt.f32.u32 q10, q10 \n\t" //Convert to float
"vcvt.f32.u32 q11, q11 \n\t"
Converting back to bytes is almost exactly the reverse process. Use vcvt.u32.f32 and vmovn instead.
Also, if you're new to NEON, I would recommend reading the documentation thoroughly. It's a good way to get a handle on the instructions.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js