neon float multiplication is slower than expected - c++

I have two arrays ("tabs") of floats. I need to multiply each element of the first by the corresponding element of the second and store the result in a third array.
I would like to use NEON to parallelize the float multiplications: four at a time instead of one.
I expected a significant speed-up, but I achieved only about a 20% reduction in execution time. This is my code:
#include <stdlib.h>
#include <iostream>
#include <arm_neon.h>
const int n = 100; // table size
/* fill a tab with random floats */
void rand_tab(float *t) {
for (int i = 0; i < n; i++)
t[i] = (float)rand()/(float)RAND_MAX;
}
/* Multiply elements of two tabs and store results in third tab
- STANDARD processing. */
void mul_tab_standard(float *t1, float *t2, float *tr) {
for (int i = 0; i < n; i++)
tr[i] = t1[i] * t2[i];
}
/* Multiply elements of two tabs and store results in third tab
- NEON processing. */
void mul_tab_neon(float *t1, float *t2, float *tr) {
for (int i = 0; i < n; i+=4)
vst1q_f32(tr+i, vmulq_f32(vld1q_f32(t1+i), vld1q_f32(t2+i)));
}
int main() {
float t1[n], t2[n], tr[n];
/* fill tables with random values */
srand(1); rand_tab(t1); rand_tab(t2);
// I repeat table multiplication function 1000000 times for measuring purposes:
for (int k=0; k < 1000000; k++)
mul_tab_standard(t1, t2, tr); // switch to next line for comparison:
//mul_tab_neon(t1, t2, tr);
return 1;
}
I run the following command to compile:
g++ -mfpu=neon -ffast-math neon_test.cpp
My CPU: ARMv7 Processor rev 0 (v7l)
Do you have any ideas how I can achieve more significant speed-up?

Cortex-A8 and Cortex-A9 can do only two SP FP multiplications per cycle, so you can at most double the performance on those (most popular) CPUs. In practice, ARM CPUs have very low IPC, so it is preferable to unroll the loops as much as possible. If you want ultimate performance, write in assembly: gcc's code generator for ARM is nowhere near as good as the one for x86.
I also recommend using CPU-specific optimization options: "-O3 -mcpu=cortex-a9 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mthumb" for Cortex-A9; for Cortex-A15, Cortex-A8 and Cortex-A5, set -mcpu and -mtune to cortex-a15, cortex-a8 or cortex-a5 accordingly. gcc does not have optimizations for Qualcomm CPUs, so for Qualcomm Scorpion use the Cortex-A8 parameters (and also unroll even more than you usually do), and for Qualcomm Krait try the Cortex-A15 parameters (you will need a recent version of gcc which supports them).
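As a rough illustration of the unrolling advice, the loop below handles eight floats per iteration with two independent load/multiply/store chains. The name mul_tab_neon_unrolled, the count parameter and the scalar fallback are assumptions added for this sketch, and it assumes count is a multiple of 8. Whether it actually beats the 4-wide loop depends on the core and the compiler, so measure before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: element-wise multiply, eight floats per iteration.
// Assumes count is a multiple of 8.
void mul_tab_neon_unrolled(const float *t1, const float *t2, float *tr, std::size_t count) {
    assert(count % 8 == 0);
#if defined(__ARM_NEON)
    for (std::size_t i = 0; i < count; i += 8) {
        // Two independent chains give the pipeline more work between dependent instructions.
        float32x4_t a0 = vld1q_f32(t1 + i);
        float32x4_t a1 = vld1q_f32(t1 + i + 4);
        float32x4_t b0 = vld1q_f32(t2 + i);
        float32x4_t b1 = vld1q_f32(t2 + i + 4);
        vst1q_f32(tr + i,     vmulq_f32(a0, b0));
        vst1q_f32(tr + i + 4, vmulq_f32(a1, b1));
    }
#else
    // Scalar fallback so the sketch also builds on targets without NEON.
    for (std::size_t i = 0; i < count; ++i)
        tr[i] = t1[i] * t2[i];
#endif
}
```

With the question's n = 100 this would need a tail loop for the last four elements, or a table size rounded up to a multiple of 8.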

One shortcoming of NEON intrinsics is that you can't use auto-increment on loads, which shows up as extra instructions in your NEON implementation.
Compiled with gcc version 4.4.3 and options -c -std=c99 -mfpu=neon -O3 and dumped with objdump, this is the loop part of mul_tab_neon:
000000a4 <mul_tab_neon>:
ac: e0805003 add r5, r0, r3
b0: e0814003 add r4, r1, r3
b4: e082c003 add ip, r2, r3
b8: e2833010 add r3, r3, #16
bc: f4650a8f vld1.32 {d16-d17}, [r5]
c0: f4642a8f vld1.32 {d18-d19}, [r4]
c4: e3530e19 cmp r3, #400 ; 0x190
c8: f3400df2 vmul.f32 q8, q8, q9
cc: f44c0a8f vst1.32 {d16-d17}, [ip]
d0: 1afffff5 bne ac <mul_tab_neon+0x8>
and this is the loop part of mul_tab_standard:
00000000 <mul_tab_standard>:
58: ecf01b02 vldmia r0!, {d17}
5c: ecf10b02 vldmia r1!, {d16}
60: f3410db0 vmul.f32 d16, d17, d16
64: ece20b02 vstmia r2!, {d16}
68: e1520003 cmp r2, r3
6c: 1afffff9 bne 58 <mul_tab_standard+0x58>
As you can see, in the standard case the compiler creates a much tighter loop.

Related

ARM 7 Assembly - ADC with immediate 0

I have written a little c++ function on godbolt.org and I am curious about a certain line inside the assembly. Here is the function:
unsigned long long foo(uint64_t a, uint8_t b){
// unsigned long long fifteen = 15 * b;
// unsigned long long result = a + fifteen;
// unsigned long long resultfinal = result / 2;
// return resultfinal;
return (a+(15*b)) / 2;
}
The generated assembly:
rsb r2, r2, r2, lsl #4
adds r0, r2, r0
adc r1, r1, #0
lsrs r1, r1, #1
rrx r0, r0
Now I don't understand why the line with the ADC instruction is there. It adds 0 to the high half of the 64-bit number. Why does it do that?
Here is the link if you want to play yourself:
Link to assembly
ARM32 registers are only 32 bits wide, but the value 'a' is 64 bits. The instructions you are seeing allow computations on values larger than 32 bits.
rsb r2, r2, r2, lsl #4 # 15*b -> b*16-b
adds r0, r2, r0 # a+(15*b) !LOW 32 bits! could carry.
adc r1, r1, #0 # add a carry bit to the high portion
lsrs r1, r1, #1 # divide high part by 2; (a+(15*b))/2
rrx r0, r0 # The opposite of add with carry flowing down.
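Note: if you are confused by the adc instruction, then the rrx will probably be confusing too. It is the 'dual' of the add with carry: lsrs shifts the lowest bit of the high word out into the carry flag, and rrx rotates that carry into the top bit of the low word. For a division you need to take the underflow from the higher part and put it into the next lower part.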
I think the important point is that you can 'carry' this logic to arbitrarily large values. It has applications in cryptography, large value finance and other high accuracy science and engineering applications.
See: Gnu Multi-precision library, libtommath, etc.
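To make the carry handling concrete, here is a small sketch that performs the same computation on two explicit 32-bit halves, one step per instruction of the listing. The name foo_split is made up for this illustration; it is not what the compiler generates, only a model of it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration: compute (a + 15*b) / 2 on two 32-bit halves,
// mirroring the rsb / adds / adc / lsrs / rrx sequence.
std::uint64_t foo_split(std::uint64_t a, std::uint8_t b) {
    std::uint32_t lo = static_cast<std::uint32_t>(a);
    std::uint32_t hi = static_cast<std::uint32_t>(a >> 32);
    std::uint32_t t = (static_cast<std::uint32_t>(b) << 4) - b;  // rsb: 16*b - b

    std::uint32_t sum_lo = lo + t;                 // adds: low words, may wrap
    std::uint32_t carry = sum_lo < lo ? 1u : 0u;   // carry flag out of the low add
    std::uint32_t sum_hi = hi + carry;             // adc: high word plus carry

    std::uint32_t shifted_out = sum_hi & 1u;       // lsrs: bit 0 of the high word goes to carry
    std::uint32_t res_hi = sum_hi >> 1;
    std::uint32_t res_lo = (sum_lo >> 1) | (shifted_out << 31);  // rrx: carry enters bit 31

    return (static_cast<std::uint64_t>(res_hi) << 32) | res_lo;
}
```

A quick check such as assert(foo_split(a, b) == (a + 15ull * b) / 2) for a few values, including ones where the low word overflows, shows the two agree.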

How can we write optimized ARM code for C++ program using loop unrolling and code pre-loading?

Below are the C++ and ARM code for the same program. Can you tell me whether this ARM code is optimized, and how many cycles the loop requires? (The size of the array n is large and is a multiple of 64 elements; the routine applies a bit-wise exclusive-OR with the 8-bit mask and produces an output array outArr.) What should I do to optimize the code using loop unrolling (processing 4 elements at a time)?
c++ code:
// Gray scale image pixel inversion
void invert(unsigned char *outArr, unsigned char *inArr,
unsigned char k, int n)
{
for(int i=0; i<n; i++)
*outArr++ = *inArr++ ^ k; // ^ is bitwise xor
}
ARM CODE:
invert:
cmp r3, #0
bxle lr
add ip, r0, r3
.L3:
ldrb r3, [r1], #1 # zero_extendqisi2
eor r3, r3, r2
strb r3, [r0], #1
cmp ip, r0
bne .L3
bx lr
I have no idea what 'code preload' means. There is data preloading with the pld instruction. It would make sense in the context of the sample code.
Here is the basic 'C' version given the assumptions,
The size of the array n is large, and is a multiple of 64 elements and exclusive-OR bit-wise operation with the 8-bit mask and produces an output array outArr.
The code is probably not perfect, but meant to illustrate.
// Gray scale image pixel inversion
void invert(unsigned char *outArr, unsigned char *inArr,
unsigned char k, int n)
{
unsigned int *out = (void*)outArr;
unsigned int *in = (void*)inArr;
/* Replicate k into all four bytes of a word. */
unsigned int mask = (unsigned int)k * 0x01010101u;
/* Check arguments */
if( n % 64 != 0) return;
if((int)outArr & 3) return;
if((int)inArr & 3) return;
assert(sizeof(int)==4);
for(int i=0; i<n/sizeof(int); i+=64/sizeof(int)) {
/* 16 transfers per loop 64/4 */
*out++ = *in++ ^ mask; // 1
*out++ = *in++ ^ mask;
*out++ = *in++ ^ mask;
*out++ = *in++ ^ mask;
*out++ = *in++ ^ mask; // 5
*out++ = *in++ ^ mask;
*out++ = *in++ ^ mask;
*out++ = *in++ ^ mask;
*out++ = *in++ ^ mask; // 9
*out++ = *in++ ^ mask;
*out++ = *in++ ^ mask;
*out++ = *in++ ^ mask;
*out++ = *in++ ^ mask; // 13
*out++ = *in++ ^ mask;
*out++ = *in++ ^ mask;
*out++ = *in++ ^ mask;
}
}
You can view the output on godbolt.
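If the alignment of the buffers cannot be guaranteed, a variant along the following lines avoids the pointer casts by copying each word through memcpy, which compilers typically turn into a single load or store. This is only a sketch: the name invert_words is hypothetical, and it assumes n is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical variant: XOR four bytes at a time, going through memcpy so that
// neither pointer has to be 4-byte aligned. Assumes n is a multiple of 4.
void invert_words(unsigned char *outArr, const unsigned char *inArr,
                  unsigned char k, std::size_t n) {
    assert(n % 4 == 0);
    const std::uint32_t mask = 0x01010101u * k;  // replicate k into all four bytes
    for (std::size_t i = 0; i < n; i += 4) {
        std::uint32_t w;
        std::memcpy(&w, inArr + i, sizeof w);    // read one word from the input
        w ^= mask;
        std::memcpy(outArr + i, &w, sizeof w);   // write one word to the output
    }
}
```

Because XOR works byte by byte, the result does not depend on the endianness of the machine.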
The ldm and stm instructions can be used to load consecutive memory addresses to registers. We can not use all 16 ARM registers, so the core of the loop in assembler would look like this,
ldmia r1!, {r4,r5,r6,r7,r8,r9,r10,r11} # r1 is inArr, advanced by 32 bytes
eor r4,r4,r2 # r2 is expanded k
eor r5,r5,r2
eor r6,r6,r2
eor r7,r7,r2
eor r8,r8,r2
eor r9,r9,r2
eor r10,r10,r2
eor r11,r11,r2
stmia r0!, {r4,r5,r6,r7,r8,r9,r10,r11} # r0 is outArr, advanced by 32 bytes
This is repeated twice and the R0 or R1 can be checked against the array limits stored in R3. You need to save all of the callee saved registers if you want to be EABI compliant. The register set r4-r11 can generally be used, but it will depend on the system. You can also use lr, fp, etc if you save them and are not exception safe.
From the comments,
I am trying to find that how many cycles does this subroutine take per
array element when it is optimized and when it isn't.
Cycle counts are extremely difficult on modern CPUs. However you have five instructions in the core with a simple loop,
.L3:
ldrb r3, [r1], #1 # zero_extendqisi2
eor r3, r3, r2
strb r3, [r0], #1
cmp ip, r0
bne .L3
To do 32 bytes, this is 32 * 5 (160) instructions, with 32 * 2 memory accesses.
The expanded option is just one 32-byte memory read and one 32-byte write. These will complete with the lowest register first. Then there is just a single EOR instruction per word. So it is just 10 instructions versus 160. On modern processors the memory will be the limiting factor. Because of memory stalls, it may be better to process only four words at a time, such as,
ldmia r1!, {r4,r5,r6,r7} # r1 is inArr
eor r4,r4,r2 # r2 is expanded k
eor r5,r5,r2
eor r6,r6,r2
eor r7,r7,r2
ldmia r1!, {r8,r9,r10,r11} # next four words from inArr
stmia r0!, {r4,r5,r6,r7} # r0 is outArr
...
This (or some permutation) will allow the load/store unit and the 'eor' to not block each other, but this will depend on the particular CPU type. This topic is called instruction scheduling; it is more powerful than pld or data preloading. As well, you can use NEON or ARM64 instructions so that the body of the loop can do more eor operations before a load/store.
These days, this is done like this:
void invert(unsigned char* const outArr, unsigned char const* const inArr,
unsigned char const k, std::size_t const n) noexcept
{
std::transform(std::execution::unseq, inArr, inArr + n, outArr,
[k](auto const i)noexcept{return i ^ k;});
}
You set -Ofast, cross your fingers and hope that good code will be generated.
EDIT: You can also try this:
void invert(unsigned char* const outArr, unsigned char const* const inArr,
unsigned char const k, std::size_t const n) noexcept
{
std::transform(std::execution::unseq,
reinterpret_cast<std::uint32_t const*>(inArr),
reinterpret_cast<std::uint32_t const*>(inArr) + n/4,
reinterpret_cast<std::uint32_t*>(outArr),
[k=std::uint32_t(k<<24|k<<16|k<<8|k)](auto const i)noexcept{return i ^ k;});
}

Optimizing the backward solve for a sparse lower triangular linear system

I have the compressed sparse column (csc) representation of the n x n lower-triangular matrix A with zeros on the main diagonal, and would like to solve for x in
(A + I)' * x = b
This is the routine I have for computing this:
void backsolve(const int*__restrict__ Lp,
const int*__restrict__ Li,
const double*__restrict__ Lx,
const int n,
double*__restrict__ x) {
for (int i=n-1; i>=0; --i) {
for (int j=Lp[i]; j<Lp[i+1]; ++j) {
x[i] -= Lx[j] * x[Li[j]];
}
}
}
Thus, b is passed in via the argument x, and is overwritten by the solution. Lp, Li, Lx are respectively the column pointers, row indices, and data values of the standard csc representation of sparse matrices. This function is the top hotspot in the program, with the line
x[i] -= Lx[j] * x[Li[j]];
being the bulk of the time spent. Compiling with gcc-8.3 -O3 -mfma -mavx -mavx512f gives
backsolve(int const*, int const*, double const*, int, double*):
lea eax, [rcx-1]
movsx r11, eax
lea r9, [r8+r11*8]
test eax, eax
js .L9
.L5:
movsx rax, DWORD PTR [rdi+r11*4]
mov r10d, DWORD PTR [rdi+4+r11*4]
cmp eax, r10d
jge .L6
vmovsd xmm0, QWORD PTR [r9]
.L7:
movsx rcx, DWORD PTR [rsi+rax*4]
vmovsd xmm1, QWORD PTR [rdx+rax*8]
add rax, 1
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r9], xmm0
cmp r10d, eax
jg .L7
.L6:
sub r11, 1
sub r9, 8
test r11d, r11d
jns .L5
ret
.L9:
ret
According to vtune,
vmovsd QWORD PTR [r9], xmm0
is the slowest part. I have almost no experience with assembly, and am at a loss as to how to further diagnose or optimize this operation. I have tried compiling with different flags to enable/disable SSE, FMA, etc, but nothing has worked.
Processor: Xeon Skylake
Question: What can I do to optimize this function?
This should depend quite a bit on the exact sparsity pattern of the matrix and the platform being used. I tested a few things with gcc 8.3.0 and compiler flags -O3 -march=native (which is -march=skylake on my CPU) on the lower triangle of this matrix of dimension 3006 with 19554 nonzero entries. Hopefully this is somewhat close to your setup, but in any case I hope these can give you an idea of where to start.
For timing I used google/benchmark with this source file. It defines benchBacksolveBaseline which benchmarks the implementation given in the question and benchBacksolveOptimized which benchmarks the proposed "optimized" implementations. There is also benchFillRhs which separately benchmarks the function that is used in both to generate some not completely trivial values for the right hand side. To get the time of the "pure" backsolves, the time that benchFillRhs takes should be subtracted.
1. Iterating strictly backwards
The outer loop in your implementation iterates through the columns backwards, while the inner loop iterates through the current column forwards. Seems like it would be more consistent to iterate through each column backwards as well:
for (int i=n-1; i>=0; --i) {
for (int j=Lp[i+1]-1; j>=Lp[i]; --j) {
x[i] -= Lx[j] * x[Li[j]];
}
}
This barely changes the assembly (https://godbolt.org/z/CBZAT5), but the benchmark timings show a measurable improvement:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2737 ns 2734 ns 5120000
benchBacksolveBaseline 17412 ns 17421 ns 829630
benchBacksolveOptimized 16046 ns 16040 ns 853333
I assume this is caused by somehow more predictable cache access, but I did not look into it much further.
2. Fewer loads/stores in inner loop
As A is lower triangular, we have i < Li[j]. Therefore we know that x[Li[j]] will not change due to the changes to x[i] in the inner loop. We can put this knowledge into our implementation by using a temporary variable:
for (int i=n-1; i>=0; --i) {
double xi_temp = x[i];
for (int j=Lp[i+1]-1; j>=Lp[i]; --j) {
xi_temp -= Lx[j] * x[Li[j]];
}
x[i] = xi_temp;
}
This makes gcc 8.3.0 move the store to memory from inside the inner loop to directly after its end (https://godbolt.org/z/vM4gPD). The benchmark for the test matrix on my system shows a small improvement:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2737 ns 2740 ns 5120000
benchBacksolveBaseline 17410 ns 17418 ns 814545
benchBacksolveOptimized 15155 ns 15147 ns 887129
3. Unroll the loop
While clang already starts unrolling the loop after the first suggested code change, gcc 8.3.0 still has not. So let's give that a try by additionally passing -funroll-loops.
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2733 ns 2734 ns 5120000
benchBacksolveBaseline 15079 ns 15081 ns 953191
benchBacksolveOptimized 14392 ns 14385 ns 963441
Note that the baseline also improves, as the loop in that implementation is also unrolled. Our optimized version also benefits a bit from loop unrolling, but maybe not as much as we may have liked. Looking into the generated assembly (https://godbolt.org/z/_LJC5f), it seems like gcc might have gone a little far with 8 unrolls. For my setup, I can in fact do a little better with just one simple manual unroll. So drop the flag -funroll-loops again and implement the unrolling with something like this:
for (int i=n-1; i>=0; --i) {
const int col_begin = Lp[i];
const int col_end = Lp[i+1];
const bool is_col_nnz_odd = (col_end - col_begin) & 1;
double xi_temp = x[i];
int j = col_end - 1;
if (is_col_nnz_odd) {
xi_temp -= Lx[j] * x[Li[j]];
--j;
}
for (; j >= col_begin; j -= 2) {
xi_temp -= Lx[j - 0] * x[Li[j - 0]] +
Lx[j - 1] * x[Li[j - 1]];
}
x[i] = xi_temp;
}
With that I measure:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2728 ns 2729 ns 5090909
benchBacksolveBaseline 17451 ns 17449 ns 822018
benchBacksolveOptimized 13440 ns 13443 ns 1018182
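For checking any of these variants against a known answer, a tiny self-contained test helps. The sketch below wraps the temporary-accumulator version from step 2 in a function and runs it on a hand-made 3x3 matrix; the names backsolve_temp and solve_example and the matrix itself are made up for illustration and are not part of the benchmark.

```cpp
#include <cassert>
#include <vector>

// Sketch of the accumulator variant: solves (A + I)' * x = b in place, where A is
// strictly lower triangular in CSC form (Lp: column pointers, Li: row indices, Lx: values).
void backsolve_temp(const int *Lp, const int *Li, const double *Lx, int n, double *x) {
    assert(n >= 0);
    for (int i = n - 1; i >= 0; --i) {
        double xi_temp = x[i];                 // running value stays out of memory
        for (int j = Lp[i + 1] - 1; j >= Lp[i]; --j)
            xi_temp -= Lx[j] * x[Li[j]];       // Li[j] > i, so x[Li[j]] is already final
        x[i] = xi_temp;                        // single store per column
    }
}

// Hypothetical 3x3 example: A = [[0,0,0],[2,0,0],[3,4,0]] and b = (14, 14, 3),
// chosen so that the solution is x = (1, 2, 3).
std::vector<double> solve_example() {
    const int Lp[] = {0, 2, 3, 3};
    const int Li[] = {1, 2, 2};
    const double Lx[] = {2.0, 3.0, 4.0};
    std::vector<double> x = {14.0, 14.0, 3.0};
    backsolve_temp(Lp, Li, Lx, 3, x.data());
    return x;
}
```

The same driver can be pointed at the unrolled version to confirm that the odd/even column handling gives identical results.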
Other algorithms
All of these versions still use the same simple implementation of the backward solve on the sparse matrix structure. Inherently, operating on sparse matrix structures like these can have significant problems with memory traffic. At least for matrix factorizations, there are more sophisticated methods that operate on dense submatrices assembled from the sparse structure. Examples are supernodal and multifrontal methods. I am a bit fuzzy on this, but I think that such methods also apply this idea to the data layout and use dense matrix operations for lower triangular backward solves (for example for Cholesky-type factorizations). So it might be worth looking into those kinds of methods, if you are not forced to stick to the simple method that works on the sparse structure directly. See for example this survey by Davis.
You might shave a few cycles by using unsigned instead of int for the index types, which must be >= 0 anyway:
void backsolve(const unsigned * __restrict__ Lp,
const unsigned * __restrict__ Li,
const double * __restrict__ Lx,
const unsigned n,
double * __restrict__ x) {
for (unsigned i = n; i-- > 0; ) {
for (unsigned j = Lp[i]; j < Lp[i + 1]; ++j) {
x[i] -= Lx[j] * x[Li[j]];
}
}
}
Compiling with Godbolt's compiler explorer shows slightly different code for the inner loop, potentially making better use of the CPU pipeline. I cannot test it, but you could try.
Here is the generated code for the inner loop:
.L8:
mov rax, rcx
.L5:
mov ecx, DWORD PTR [r10+rax*4]
vmovsd xmm1, QWORD PTR [r11+rax*8]
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
lea rcx, [rax+1]
vmovsd QWORD PTR [r9], xmm0
cmp rdi, rax
jne .L8

SIMD for float threshold operation

I would like to make some vector computation faster, and I believe that SIMD instructions for float comparison and manipulation could help, here is the operation:
void func(const double* left, const double* right, double* res, const size_t size, const double th, const double drop) {
for (size_t i = 0; i < size; ++i) {
res[i] = right[i] >= th ? left[i] : (left[i] - drop) ;
}
}
Mainly, it lowers the left value by drop when the right value is below the threshold.
The size is around 128-256 (not that big), but the computation is called heavily.
I tried to start with loop unrolling, but did not gain much performance; maybe some compiler flags are needed.
Could you please suggest some improvement into the code for faster computation?
Clang already auto-vectorizes this pretty much the way Soonts suggested doing manually. Use __restrict on your pointers so it doesn't need a fallback version that works for overlap between some of the arrays. It still auto-vectorizes, but it bloats the function.
Unfortunately gcc only auto-vectorizes with -ffast-math. It turns out only -fno-trapping-math is required: that's generally safe especially if you aren't using fenv access to unmask any FP exceptions (feenableexcept) or looking at MXCSR sticky FP exception flags (fetestexcept).
With that option, GCC too will use (v)blendvpd with -march=nehalem or -march=znver1. See it on Godbolt
Also, your C function is broken. th and drop are scalar double, but you declare them as const double *
AVX512F would let you do a !(right[i] >= thresh) compare and use the resulting mask for a merge-masked subtract.
Elements where the predicate was true will get left[i] - drop, other elements will keep their left[i] value, because you merge into a vector of left values.
Unfortunately GCC with -march=skylake-avx512 uses a normal vsubpd and then a separate vmovapd zmm2{k1}, zmm5 to blend, which is obviously a missed optimization. The blend destination is already one of the inputs to the SUB.
Using AVX512VL for 256-bit vectors (in case the rest of your program can't efficiently use 512-bit, so you don't suffer reduced turbo clock speeds):
__m256d left = ...;
__m256d right = ...;
__mmask8 cmp = _mm256_cmp_pd_mask(right, _mm256_set1_pd(th), _CMP_NGE_UQ);
__m256d res = _mm256_mask_sub_pd (left, cmp, left, _mm256_set1_pd(drop));
So (besides the loads and store) it's 2 instructions with AVX512F / VL.
If you don't need the specific NaN behaviour of your version, GCC can auto-vectorize too
And it's more efficient with all compilers because you just need an AND, not a variable-blend. So it's significantly better with just SSE2, and also better on most CPUs even when they do support SSE4.1 blendvpd, because that instruction isn't as efficient.
You can subtract 0.0 or drop from left[i] based on the compare result.
Producing 0.0 or a constant based on a compare result is extremely efficient: just an andps instruction. (The bit-pattern for 0.0 is all-zeros, and SIMD compares produce vectors of all-1 or all-0 bits. So AND keeps the old value or zeros it.)
We can also add -drop instead of subtracting drop. This costs an extra negation on input, but with AVX allows a memory-source operand for vaddpd. GCC chooses to use an indexed addressing mode so that doesn't actually help reduce the front-end uop count on Intel CPUs, though; it will "unlaminate". But even with -ffast-math, gcc doesn't do this optimization on its own to allow folding a load. (It wouldn't be worth doing separate pointer increments unless we unroll the loop, though.)
void func3(const double *__restrict left, const double *__restrict right, double *__restrict res,
const size_t size, const double th, const double drop)
{
for (size_t i = 0; i < size; ++i) {
double add = right[i] >= th ? 0.0 : -drop;
res[i] = left[i] + add;
}
}
GCC 9.1's inner loop with -march=skylake (and without -ffast-math) from the Godbolt link above:
# func3 main loop
# gcc -O3 -march=skylake (without fast-math)
.L33:
vcmplepd ymm2, ymm4, YMMWORD PTR [rsi+rax]
vandnpd ymm2, ymm2, ymm3
vaddpd ymm2, ymm2, YMMWORD PTR [rdi+rax]
vmovupd YMMWORD PTR [rdx+rax], ymm2
add rax, 32
cmp r8, rax
jne .L33
Or the plain SSE2 version has an inner loop that's basically the same as with left - zero_or_drop instead of left + zero_or_minus_drop, so unless you can promise the compiler 16-byte alignment or you're making an AVX version, negating drop is just extra overhead.
Negating drop takes a constant from memory (to XOR the sign bit), and that's the only static constant this function needs, so that tradeoff is worth considering for your case where the loop doesn't run a huge number of times. (Unless th or drop are also compile-time constants after inlining, and are getting loaded anyway. Or especially if -drop can be computed at compile time. Or if you can get your program to work with a negative drop.)
Another difference between adding and subtracting is that subtracting doesn't destroy the sign of zero. -0.0 - 0.0 = -0.0, +0.0 - 0.0 = +0.0. In case that matters.
# gcc9.1 -O3
.L26:
movupd xmm5, XMMWORD PTR [rsi+rax]
movapd xmm2, xmm4 # duplicate th
movupd xmm6, XMMWORD PTR [rdi+rax]
cmplepd xmm2, xmm5 # destroy the copy of th
andnpd xmm2, xmm3 # _mm_andnot_pd
addpd xmm2, xmm6 # _mm_add_pd
movups XMMWORD PTR [rdx+rax], xmm2
add rax, 16
cmp r8, rax
jne .L26
GCC uses unaligned loads so (without AVX) it can't fold a memory source operand into cmppd or subpd
Here you go (untested), I’ve tried to explain in the comments what they do.
void func_sse41( const double* left, const double* right, double* res,
const size_t size, double th, double drop )
{
// Verify the size is even.
// If it's not, you'll need extra code at the end to process last value the old way.
assert( 0 == ( size % 2 ) );
// Load scalar values into 2 registers.
const __m128d threshold = _mm_set1_pd( th );
const __m128d dropVec = _mm_set1_pd( drop );
for( size_t i = 0; i < size; i += 2 )
{
// Load 4 double values into registers, 2 from right, 2 from left
const __m128d r = _mm_loadu_pd( right + i );
const __m128d l = _mm_loadu_pd( left + i );
// Compare ( r >= threshold ) for 2 values at once
const __m128d comp = _mm_cmpge_pd( r, threshold );
// Compute ( left[ i ] - drop ), for 2 values at once
const __m128d dropped = _mm_sub_pd( l, dropVec );
// Select either left or ( left - drop ) based on the comparison.
// This is the only instruction here that requires SSE 4.1.
const __m128d result = _mm_blendv_pd( dropped, l, comp ); // takes l where comp is set, dropped elsewhere
// Store the 2 result values
_mm_storeu_pd( res + i, result );
}
}
The code will crash with an “invalid instruction” runtime error if the CPU doesn’t have SSE 4.1. For best results, detect it with CPUID to fail gracefully. I think now in 2019 it’s quite reasonable to assume it’s supported: Intel added it in 2008, AMD in 2011, and the Steam survey says “96.3%”. If you want to support older CPUs, it is possible to emulate _mm_blendv_pd with 3 other instructions: _mm_and_pd, _mm_andnot_pd, _mm_or_pd.
If you can guarantee the data is aligned, replacing the loads with _mm_load_pd will be slightly faster. _mm_cmpge_pd compiles into CMPPD https://www.felixcloutier.com/x86/cmppd which can take one of the arguments directly from RAM.
Potentially, you can get further 2x improvement by writing AVX version. But I hope even SSE version is faster than your code, it handles 2 values per iteration, and doesn’t have conditions inside the loop. If you’re unlucky, AVX will be slower, many CPUs need some time to power on their AVX units, takes many thousands of cycles. Until powered, AVX code runs very slowly.
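The SSE2-only emulation mentioned above could look roughly like this. It is an untested-on-your-data sketch under the same assumption as before (size is even); the name func_sse2 is made up, and the select is built from _mm_and_pd, _mm_andnot_pd and _mm_or_pd instead of _mm_blendv_pd.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2

// Hypothetical SSE2-only variant: select with AND / ANDNOT / OR instead of _mm_blendv_pd.
// Assumes size is even.
void func_sse2(const double *left, const double *right, double *res,
               std::size_t size, double th, double drop) {
    assert(size % 2 == 0);
    const __m128d threshold = _mm_set1_pd(th);
    const __m128d dropVec = _mm_set1_pd(drop);
    for (std::size_t i = 0; i < size; i += 2) {
        const __m128d r = _mm_loadu_pd(right + i);
        const __m128d l = _mm_loadu_pd(left + i);
        const __m128d comp = _mm_cmpge_pd(r, threshold);     // all-ones where r >= th
        const __m128d dropped = _mm_sub_pd(l, dropVec);
        const __m128d keep = _mm_and_pd(comp, l);            // left where r >= th, else 0
        const __m128d other = _mm_andnot_pd(comp, dropped);  // left - drop where r < th, else 0
        _mm_storeu_pd(res + i, _mm_or_pd(keep, other));
    }
}
```

Each lane is non-zero in at most one of keep and other, so the OR simply picks whichever one applies.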
You can use GCC's and Clang's vector extensions to implement a ternary select function (see https://stackoverflow.com/a/48538557/2542702).
#include <stddef.h>
#include <inttypes.h>
#if defined(__clang__)
typedef double double4 __attribute__ ((ext_vector_type(4)));
typedef int64_t long4 __attribute__ ((ext_vector_type(4)));
#else
typedef double double4 __attribute__ ((vector_size (sizeof(double)*4)));
typedef int64_t long4 __attribute__ ((vector_size (sizeof(int64_t)*4)));
#endif
double4 select(long4 s, double4 a, double4 b) {
double4 c;
#if defined(__GNUC__) && !defined(__INTEL_COMPILER) && !defined(__clang__)
c = s ? a : b;
#else
for(int i=0; i<4; i++) c[i] = s[i] ? a[i] : b[i];
#endif
return c;
}
void func(double* left, double* right, double* res, size_t size, double th, double drop) {
size_t i;
for (i = 0; i<(size&-4); i+=4) {
double4 leftv = *(double4*)&left[i];
double4 rightv = *(double4*)&right[i];
*(double4*)&res[i] = select(rightv >= th, leftv, leftv - drop);
}
for(;i<size; i++) res[i] = right[i] >= th ? left[i] : (left[i] - drop);
}
https://godbolt.org/z/h4OKMl

SIMD XOR operation is not as effective as Integer XOR?

I have a task to calculate xor-sum of bytes in an array:
X = char1 XOR char2 XOR char3 ... charN;
I'm trying to parallelize it by XORing __m128 values instead. This should give a speed-up factor of 16.
Also, to recheck the algorithm I use int. This should give speed up factor 4.
The test program is 100 lines long, I can't make it shorter, but it is simple:
#include "xmmintrin.h" // simulation of the SSE instruction
#include <ctime>
#include <iostream>
using namespace std;
#include <stdlib.h> // rand
const int NIter = 100;
const int N = 40000000; // array size. Has to be divisible by 16.
unsigned char str[N] __attribute__ ((aligned(16)));
template< typename T >
T Sum(const T* data, const int N)
{
T sum = 0;
for ( int i = 0; i < N; ++i )
sum = sum ^ data[i];
return sum;
}
template<>
__m128 Sum(const __m128* data, const int N)
{
__m128 sum = _mm_set_ps1(0);
for ( int i = 0; i < N; ++i )
sum = _mm_xor_ps(sum,data[i]);
return sum;
}
int main() {
// fill string by random values
for( int i = 0; i < N; i++ ) {
str[i] = 256 * ( double(rand()) / RAND_MAX ); // put a random value, from 0 to 255
}
/// -- CALCULATE --
/// SCALAR
unsigned char sumS = 0;
std::clock_t c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ )
sumS = Sum<unsigned char>( str, N );
double tScal = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// SIMD
unsigned char sumV = 0;
const int m128CharLen = 4*4;
const int NV = N/m128CharLen;
c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ ) {
__m128 sumVV = _mm_set_ps1(0);
sumVV = Sum<__m128>( reinterpret_cast<__m128*>(str), NV );
unsigned char *sumVS = reinterpret_cast<unsigned char*>(&sumVV);
sumV = sumVS[0];
for ( int iE = 1; iE < m128CharLen; ++iE )
sumV ^= sumVS[iE];
}
double tSIMD = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// SCALAR INTEGER
unsigned char sumI = 0;
const int intCharLen = 4;
const int NI = N/intCharLen;
c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ ) {
int sumII = Sum<int>( reinterpret_cast<int*>(str), NI );
unsigned char *sumIS = reinterpret_cast<unsigned char*>(&sumII);
sumI = sumIS[0];
for ( int iE = 1; iE < intCharLen; ++iE )
sumI ^= sumIS[iE];
}
double tINT = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// -- OUTPUT --
cout << "Time scalar: " << tScal << " ms " << endl;
cout << "Time INT: " << tINT << " ms, speed up " << tScal/tINT << endl;
cout << "Time SIMD: " << tSIMD << " ms, speed up " << tScal/tSIMD << endl;
if(sumV == sumS && sumI == sumS )
std::cout << "Results are the same." << std::endl;
else
std::cout << "ERROR! Results are not the same." << std::endl;
return 1;
}
The typical results:
[10:46:20]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms
Time INT: 890 ms, speed up 3.97753
Time SIMD: 280 ms, speed up 12.6429
Results are the same.
[10:46:27]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms
Time INT: 890 ms, speed up 3.97753
Time SIMD: 280 ms, speed up 12.6429
Results are the same.
[10:46:35]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 880 ms, speed up 4.13636
Time SIMD: 290 ms, speed up 12.5517
Results are the same.
As you can see, the int version works ideally, but the SIMD version loses 25% of the speed, and this is stable. I tried changing the array sizes; this doesn't help.
Also, if I switch to -O2 I lose 75% of the speed in simd version:
[10:50:25]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 880 ms, speed up 4.13636
Time SIMD: 890 ms, speed up 4.08989
Results are the same.
[10:51:16]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 900 ms, speed up 4.04444
Time SIMD: 880 ms, speed up 4.13636
Results are the same.
Can someone explain this to me?
Additional info:
I have g++ (GCC) 4.7.3; Intel(R) Xeon(R) CPU E7-4860
I use -fno-tree-vectorize to prevent auto vectorization. Without this flag with -O3 the
expected speed up is 1, since the task is simple. This is what I get:
[10:55:40]$ g++ test.cpp -O3; ./a.out
Time scalar: 270 ms
Time INT: 270 ms, speed up 1
Time SIMD: 280 ms, speed up 0.964286
Results are the same.
but with -O2 result is still strange:
[10:55:02]$ g++ test.cpp -O2; ./a.out
Time scalar: 3540 ms
Time INT: 990 ms, speed up 3.57576
Time SIMD: 880 ms, speed up 4.02273
Results are the same.
When I change
for ( int i = 0; i < N; i+=1 )
sum = sum ^ data[i];
to equivalent of:
for ( int i = 0; i < N; i+=8 )
sum = (data[i] ^ data[i+1]) ^ (data[i+2] ^ data[i+3]) ^ (data[i+4] ^ data[i+5]) ^ (data[i+6] ^ data[i+7]) ^ sum;
I do see an improvement in scalar speed by a factor of 2. But I don't see improvements in the speed-up. Before: intSpeedUp 3.98416, SIMDSpeedUP 12.5283. After: intSpeedUp 3.5572, SIMDSpeedUP 6.8523.
I think you may be bumping into the upper limits of memory bandwidth. This might be the reason for the 12.6x speedup instead of 16x speedup in the -O3 case.
However, gcc 4.7.3 puts a useless store instruction into the tiny not-unrolled vector loop when inlining, but not in the scalar or int SWAR loops (see below), so that might be the explanation instead.
The -O2 reduction in vector throughput is all due to gcc 4.7.3 doing an even worse job there and sending the accumulator on a round trip to memory (store-forwarding).
For analysis of the implications of that extra store instruction, see the section at the end.
TL;DR: Nehalem likes a bit more loop unrolling than SnB-family requires, and gcc has made major improvements in SSE code-generation in gcc5.
And typically use _mm_xor_si128, not _mm_xor_ps for bulk xor work like this.
Memory bandwidth.
N is huge (40MB), so memory/cache bandwidth is a concern. A Xeon E7-4860 is a 32nm Nehalem microarchitecture, with 256kiB of L2 cache (per core), and 24MiB of shared L3 cache. It has a quad-channel memory controller supporting up to DDR3-1066 (compared to dual-channel DDR3-1333 or DDR3-1600 for typical desktop CPUs like SnB or Haswell).
A typical 3GHz desktop Intel CPU can sustain a load bandwidth of something like ~8B / cycle from DRAM, in theory. (e.g. 25.6GB/s theoretical max memory BW for an i5-4670 with dual channel DDR3-1600). Achieving this in an actual single thread might not work, esp. when using integer 4B or 8B loads. For a slower CPU like a 2267MHz Nehalem Xeon, with quad-channel (but also slower) memory, 16B per clock is probably pushing the upper limits.
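The bandwidth figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming each DDR3 channel is 64 bits (8 bytes) wide and that the nominal transfer rate is sustained; the function names are made up for illustration, and real single-thread bandwidth will be lower than this theoretical peak.

```cpp
#include <cassert>

// Theoretical peak DRAM bandwidth in bytes per second.
// Assumes each DDR3 channel transfers 8 bytes per transfer.
double peak_dram_bytes_per_sec(int channels, double mega_transfers_per_sec) {
    assert(channels > 0 && mega_transfers_per_sec > 0);
    return channels * 8.0 * mega_transfers_per_sec * 1e6;
}

// Best-case bytes that DRAM could deliver per core clock cycle.
double bytes_per_cycle(double bytes_per_sec, double core_hz) {
    assert(core_hz > 0);
    return bytes_per_sec / core_hz;
}
```

For dual-channel DDR3-1600 this gives 25.6 GB/s, or about 8.5 bytes per cycle at 3GHz. For quad-channel DDR3-1066 at 2267MHz it gives roughly 15 bytes per cycle, which is why 16B per clock looks like the upper limit on this machine.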
I had a look at the asm from the original unchanged code with gcc 4.7.3 on godbolt.
The stand-alone version looks fine (but the inlined version doesn't; see below), with the loop being
## float __vector Sum(...) non-inlined version
.L3:
xorps xmm0, XMMWORD PTR [rdi]
add rdi, 16
cmp rdi, rax
jne .L3
That's 3 fused-domain uops, and should issue and execute at one iteration per clock. Actually, it can't because xorps and fused compare-and-branch both need port5.
N is huge, so the overhead of the clunky char-at-a-time horizontal XOR doesn't come into play, even though gcc 4.7 emits abysmal code for it (multiple copies of sumVV stored to the stack, etc. etc.). (See Fastest way to do horizontal float vector sum on x86 for ways to reduce down to 4B with SIMD. It might be faster to then movd the data into integer regs and use integer shift/xor there for the last 4B -> 1B, esp. if you're not using AVX. The compiler might be able to take advantage of al/ah low and high 8bit component regs.)
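As a rough illustration of the horizontal reduction described above, here is a sketch that folds a 16-byte vector down to 4 bytes with SSE2 shuffles and then finishes in an integer register. The names `hxor_bytes` and `hxor_bytes_ref` are hypothetical, not from the question's code, and this is one possible way to do it rather than a tuned implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// XOR all 16 bytes of v together into a single byte.
inline std::uint8_t hxor_bytes(__m128i v) {
    // 16B -> 8B: fold the high 64 bits onto the low 64 bits.
    v = _mm_xor_si128(v, _mm_unpackhi_epi64(v, v));
    // 8B -> 4B: fold 32-bit element 1 onto element 0.
    v = _mm_xor_si128(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 1, 1, 1)));
    // Move the low 4 bytes to an integer register: 4B -> 2B -> 1B.
    std::uint32_t x = static_cast<std::uint32_t>(_mm_cvtsi128_si32(v));
    x ^= x >> 16;
    x ^= x >> 8;
    return static_cast<std::uint8_t>(x);
}

// Scalar byte-at-a-time reference, for checking the SIMD version.
inline std::uint8_t hxor_bytes_ref(const std::uint8_t* p, std::size_t n) {
    assert(p != nullptr);
    std::uint8_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc ^= p[i];
    return acc;
}
```

Since this runs once after the main loop, its cost is negligible for large N; it mostly matters for short inputs.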
The vector loop was inlined stupidly:
## float __vector Sum(...) inlined into main at -O3
.L12:
xorps xmm0, XMMWORD PTR [rdx]
add rdx, 16
cmp rdx, rbx
movaps XMMWORD PTR [rsp+64], xmm0
jne .L12
It's storing the accumulator every iteration, instead of just after the last iteration! Since gcc doesn't / didn't default to optimizing for macro-fusion, it didn't even put the cmp/jne next to each other where they can fuse into a single uop on Intel and AMD CPUs, so the loop has 5 fused-domain uops. This means it can only issue at one per 2 clocks, if the Nehalem frontend / loop buffer is anything like the Sandybridge loop buffer. uops issue in groups of 4, and a predicted-taken branch ends an issue block. So it issues in a 4/1/4/1 uop pattern, not 4/4/4/4. This means we can get at best one 16B load per 2 clocks of sustained throughput.
-mtune=core2 might double the throughput, because it puts the cmp/jne together. The store can micro-fuse into a single uop, and so can the xorps with a memory source operand. A gcc that old doesn't support -mtune=nehalem, or the more generic -mtune=intel. Nehalem can sustain one load and one store per clock, but obviously it would be far better not to have a store in the loop at all.
Compiling with -O2 makes even worse code with that gcc version:
The inlined inner loop now loads the accumulator from memory as well as storing it, so there's a store-forwarding round trip in the loop-carried dependency that the accumulator is part of:
## float __vector Sum(...) inlined at -O2
.L14:
movaps xmm0, XMMWORD PTR [rsp+16] # reload sum
xorps xmm0, XMMWORD PTR [rdx] # load data[i]
add rdx, 16
cmp rdx, rbx
movaps XMMWORD PTR [rsp+16], xmm0 # spill sum
jne .L14
At least with -O2 the horizontal byte-xor compiles to just a plain integer byte loop without spewing 15 copies of xmm0 onto the stack.
This is just totally braindead code, because we haven't let a reference / pointer to sumVV escape the function, so there are no other threads that could be observing the accumulator in progress. (And even if so, there's no synchronization stopping gcc from just accumulating in a reg and storing the final result). The non-inlined version is still fine.
That massive performance bug is still present all the way up to gcc 4.9.2, with -O2 -fno-tree-vectorize, even when I rename the function from main to something else, so it gets the full benefit of gcc's optimization efforts. (Don't put microbenchmarks inside main, because gcc marks it as "cold" and optimizes less.)
gcc 5.1 makes good code for the inlined version of template<> __m128 Sum(const __m128* data, const int N). I didn't check with clang.
This extra loop-carried dep chain is almost certainly why the vector version has a smaller speedup with -O2. i.e. it's a compiler bug that's fixed in gcc5.
The scalar version with -O2 is
.L12:
xor bpl, BYTE PTR [rdx] # sumS, MEM[base: D.27594_156, offset: 0B]
add rdx, 1 # ivtmp.135,
cmp rdx, rbx # ivtmp.135, D.27613
jne .L12 #,
so it's basically optimal. Nehalem can only sustain one load per clock, so there's no need to use more accumulators.
The int version is
.L18:
xor ecx, DWORD PTR [rdx] # sum, MEM[base: D.27549_296, offset: 0B]
add rdx, 4 # ivtmp.135,
cmp rbx, rdx # D.27613, ivtmp.135
jne .L18 #,
so again, it's what you'd expect. It should be sustaining one load per clock.
For uarches that can sustain two loads per clock (Intel SnB-family, and AMD), you should be using two accumulators. Compiler-implemented -funroll-loops usually just reduces loop overhead without introducing multiple accumulators. :(
You want the compiler to make code like:
xorps xmm0, xmm0
xorps xmm1, xmm1
.Lunrolled:
pxor xmm0, XMMWORD PTR [rdi]
pxor xmm1, XMMWORD PTR [rdi+16]
pxor xmm0, XMMWORD PTR [rdi+32]
pxor xmm1, XMMWORD PTR [rdi+48]
add rdi, 64
cmp rdi, rax
jb .Lunrolled
pxor xmm0, xmm1
# horizontal xor of xmm0
movhlps xmm1, xmm0
pxor xmm0, xmm1
...
Unrolling by two (pxor / pxor / add / cmp/jne) would make a loop that can issue at one iteration per 1c, but requires four ALU execution ports. Only Haswell and later can keep up with that throughput. (Or AMD Bulldozer-family, because vector and integer instructions don't compete for execution ports, but conversely there are only two integer ALU pipes, so they only max out their instruction throughput with mixed code.)
This unroll by four is 6 fused-domain uops in the loop, so it can easily issue at one per 2c, and SnB/IvB can keep up with three ALU uops per clock.
Note that on Intel Nehalem through Broadwell, pxor (_mm_xor_si128) has better throughput than xorps (_mm_xor_ps), because it can run on more execution ports. If you're using AVX but not AVX2, it can make sense to use 256b _mm256_xor_ps instead of _mm_xor_si128, because _mm256_xor_si256 requires AVX2.
If it's not memory bandwidth, why is it only 12.6x speedup?
Nehalem's loop buffer (aka Loop Stream Decoder or LSD) has a "one clock delay" (according to Agner Fog's microarch pdf), so a loop with N uops will take ceil(N/4.0) + 1 cycles to issue out of the loop buffer if I understand him correctly. He doesn't explicitly say what happens to the last group of uops if there are less than 4, but SnB-family CPUs work this way (divide by 4 and round up). They can't issue uops from the next iteration following the taken branch. I tried to google about nehalem, but couldn't find anything useful.
So the char and int loops are presumably running at one load & xor per 2 clocks (since they're 3 fused-domain uops). Loop unrolling could ~double their throughput up to the point where they saturate the load port. SnB-family CPUs don't have that one clock delay, so they can run tiny loops at one clock per iteration.
Using perf counters or at least microbenchmarks to make sure that your absolute throughput is what you expect is a good idea. With just your relative measurements, you have no indication without this kind of analysis that you're leaving half your performance on the table.
The vector -O3 loop is 5 fused-domain uops, so it should be taking three clock cycles to issue. Doing 16x as much work, but taking 3 cycles per iteration instead of 2 would give us a speedup of 16 * 2/3 = 10.66. We're actually getting somewhat better than that, which I don't understand.
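The issue-rate estimate above can be written out as a small model. This is only a sketch of the reasoning in this answer (groups of 4 uops plus the one-clock loop-buffer delay, as I read Agner Fog's guide), not a measured property of Nehalem; the function names are made up for illustration.

```cpp
#include <cassert>

// Cycles to issue one iteration of a loop with `uops` fused-domain uops,
// under the model ceil(uops / 4) + 1 described in the text.
int nehalem_issue_cycles(int uops) {
    assert(uops > 0);
    return (uops + 3) / 4 + 1;
}

// Predicted speedup of a loop doing `work_ratio` times as much work per
// iteration as the baseline loop, if both are limited by issue rate alone.
double predicted_speedup(double work_ratio, int base_uops, int wide_uops) {
    return work_ratio * nehalem_issue_cycles(base_uops)
                      / nehalem_issue_cycles(wide_uops);
}
```

With a 3-uop scalar loop and a 5-uop vector loop, `predicted_speedup(16, 3, 5)` evaluates 16 * 2/3, the 10.66 figure quoted above.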
I'm going to stop here, instead of digging out a nehalem laptop and running actual benchmarks, since Nehalem is too old to be interesting to tune for at this level of detail.
Did you maybe compile with -mtune=core2? Or maybe your gcc had a different default tune setting, and didn't split up the compare-and-branch? In that case, the frontend probably wasn't the bottleneck, and throughput was maybe slightly limited by memory bandwidth, or by memory false dependencies:
"Core 2 and Nehalem both have a false dependence between memory addresses with the same set and offset, i.e. with a distance that is a multiple of 4 kB."
This might cause a short bubble in the pipeline every 4k.
Before I checked on Nehalem's loop buffer and found the extra 1c per loop, I had a theory which I'm now confident is incorrect:
I thought the extra store uop in the loop that bumps it up over 4 uops would essentially cut the speed in half, so you'd see a speedup of ~6. However, maybe there are some execution bottlenecks that make the frontend issue throughput not the bottleneck after all?
Or maybe Nehalem's loop buffer is different from SnB's, and doesn't end an issue group at a predicted-taken branch. This would give a throughput speedup of 16 * 4/5 = 12.8 for the -O3 vector loop, if its 5 fused-domain uops can issue at a consistent 4 per clock. This matches the experimental data of 12.6429 speedup factor very well: slightly less than 12.8 is to be expected because of increased bandwidth requirements (occasional cache miss stalls when the prefetcher falls behind).
(The scalar loops still just run one iteration per clock: issuing more than one iteration per clock just means they bottleneck on one load per clock, and the 1 cycle xor loop-carried dependency.)
This can't be right because xorps in Nehalem can only run on port5, same as a fused compare-and-branch. So there's no way the non-unrolled vector loop could be running at more than one iteration per 2 cycles.
According to Agner Fog's tables, conditional branches have a throughput of one per 2c on Nehalem, further confirming that this is a bogus theory.
SSE2 is optimal when operating on completely parallel data. e.g.
for (int i = 0 ; i < N ; ++i)
z[i] = _mm_xor_ps(x[i], y[i]);
But in your case, each iteration of the loop depends upon the output of the previous iteration. This is known as a dependency chain. In short, it means that each consecutive xor is going to have to wait for the entire latency of the previous one before it can continue so it lowers the throughput.
jaket has already explained the likely problem: a dependency chain. I'll give it a try:
template<>
__m128 Sum(const __m128* data, const int N)
{
__m128 sum1 = _mm_set_ps1(0);
__m128 sum2 = _mm_set_ps1(0);
for (int i = 0; i < N; i += 2) {
sum1 = _mm_xor_ps(sum1, data[i + 0]);
sum2 = _mm_xor_ps(sum2, data[i + 1]);
}
return _mm_xor_ps(sum1, sum2);
}
Now there are no dependencies at all between the two lanes. Try expanding this to more lanes (e.g. 4).
You could also try using the integer version of these instructions (using __m128i). I do not understand the difference so this is just a hint.
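For reference, an integer-domain version of the same two-accumulator idea might look like the sketch below. The bitwise result of `_mm_xor_si128` is the same as that of `_mm_xor_ps`; the difference mentioned in the other answer is which execution ports the instruction can use. The function name `xor_sum_si128` is hypothetical, and the sketch assumes `n` is even.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// XOR-reduce n 16-byte vectors using two independent accumulators,
// so that consecutive XORs do not all wait on a single register.
__m128i xor_sum_si128(const __m128i* data, std::size_t n) {
    assert(n % 2 == 0);  // assumption: even count, no tail handling
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    for (std::size_t i = 0; i < n; i += 2) {
        acc0 = _mm_xor_si128(acc0, _mm_loadu_si128(data + i));
        acc1 = _mm_xor_si128(acc1, _mm_loadu_si128(data + i + 1));
    }
    // Combine the two partial results once, after the loop.
    return _mm_xor_si128(acc0, acc1);
}
```

If the data is known to be 16-byte aligned, `_mm_load_si128` could replace the unaligned loads. Whether this is actually faster depends on how the compiler schedules the loop, so it is worth checking the generated asm.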
In fact, the gcc compiler is optimized for SIMD. That explains why performance decreases significantly when you use -O2. You can re-check with -O1.