Loop unroll issue with Visual Studio compiler

Loop unroll issue with Visual Studio compiler - c++

I have some simple setup, where I noticed that VS compiler seems not smart enough to unroll loop, but other compilers like clang or gcc do so. Do I miss some optimization flag for VS?
#include <cstddef>
struct A
{
double data[4];
double *begin() { return data; }
double *end() { return data + 4; }
double const *begin() const { return data; }
double const *end() const { return data + 4; }
};
double sum_index(A const &a) {
double ret = 0;
for(std::size_t i = 0; i < 4; ++i)
{
ret += a.data[i];
}
return ret;
}
double sum_iter(A const &a) {
double ret = 0;
for(auto const &v : a)
{
ret += v;
}
return ret;
}
I used https://godbolt.org/ compiler explorer to generate assembler code.
gcc 11.2 with -O3:
sum_index(A const&):
pxor xmm0, xmm0
addsd xmm0, QWORD PTR [rdi]
addsd xmm0, QWORD PTR [rdi+8]
addsd xmm0, QWORD PTR [rdi+16]
addsd xmm0, QWORD PTR [rdi+24]
ret
sum_iter(A const&):
movsd xmm1, QWORD PTR [rdi]
addsd xmm1, QWORD PTR .LC0[rip]
movsd xmm0, QWORD PTR [rdi+8]
addsd xmm1, xmm0
movupd xmm0, XMMWORD PTR [rdi+16]
addsd xmm1, xmm0
unpckhpd xmm0, xmm0
addsd xmm0, xmm1
ret
.LC0:
.long 0
.long 0
clang 13.0.1 with -O3:
sum_index(A const&): # #sum_index(A const&)
xorpd xmm0, xmm0
addsd xmm0, qword ptr [rdi]
addsd xmm0, qword ptr [rdi + 8]
addsd xmm0, qword ptr [rdi + 16]
addsd xmm0, qword ptr [rdi + 24]
ret
sum_iter(A const&): # #sum_iter(A const&)
xorpd xmm0, xmm0
addsd xmm0, qword ptr [rdi]
addsd xmm0, qword ptr [rdi + 8]
addsd xmm0, qword ptr [rdi + 16]
addsd xmm0, qword ptr [rdi + 24]
ret
MSVC 19.30 with /O2 (there is no /O3?):
this$ = 8
double const * A::begin(void)const PROC ; A::begin, COMDAT
mov rax, rcx
ret 0
double const * A::begin(void)const ENDP ; A::begin
this$ = 8
double const * A::end(void)const PROC ; A::end, COMDAT
lea rax, QWORD PTR [rcx+32]
ret 0
double const * A::end(void)const ENDP ; A::end
a$ = 8
double sum_index(A const &) PROC ; sum_index, COMDAT
movsd xmm0, QWORD PTR [rcx]
xorps xmm1, xmm1
addsd xmm0, xmm1
addsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rcx+16]
addsd xmm0, QWORD PTR [rcx+24]
ret 0
double sum_index(A const &) ENDP ; sum_index
a$ = 8
double sum_iter(A const &) PROC ; sum_iter, COMDAT
lea rax, QWORD PTR [rcx+32]
xorps xmm0, xmm0
cmp rcx, rax
je SHORT $LN12#sum_iter
npad 4
$LL8#sum_iter:
addsd xmm0, QWORD PTR [rcx]
add rcx, 8
cmp rcx, rax
jne SHORT $LL8#sum_iter
$LN12#sum_iter:
ret 0
double sum_iter(A const &) ENDP ; sum_iter
Obviously there is problem with unrolling the loop for MSVC. Is there some additional optimization flag I have to set?
Thanks for help!

Related

Does pointer to pointer dereference has any performance impact?

Which program is better?
Does storing pointer reference in a local copy make a considerable difference if dereferencing is done often?
Program1:
void program1(){
for (int i =0; i<1000; i++)
a[i] = ptr1->ptr2->ptr3->structure[i].variable;
}
Program2
void program2(){
int* local_copy = ptr1->ptr2;
for (int i =0; i<1000; i++)
a[i] = local_copy->structure[i].variable;
}

Small test : https://godbolt.org/z/Pesfsoq1z
Inner loop :
a[n] = ptr1->ptr2->ptr3->structure[n].variable;
Compiles to :
mov rcx, qword ptr [rax + 32]
mov qword ptr [rsp + 32], rcx
movups xmm0, xmmword ptr [rax]
movups xmm1, xmmword ptr [rax + 16]
movaps xmmword ptr [rsp + 16], xmm1
movaps xmmword ptr [rsp], xmm0
xor ebx, ebx
And inner loop :
a[n] = p->structure[n].variable;
Compiles to :
mov rcx, qword ptr [rax + 32]
mov qword ptr [rsp + 32], rcx
movups xmm0, xmmword ptr [rax]
movups xmm1, xmmword ptr [rax + 16]
movaps xmmword ptr [rsp + 16], xmm1
movaps xmmword ptr [rsp], xmm0
xor ebx, ebx
Which is the exact same assembly, so the answer to your question (for clang 15) is : It doesn't matter. (The compiler has spotted the invariants and optimized)
PS. I used unique pointers to not have memory leaks. It is a small overhead to pay

Differences in custom and std fetch_add on floats

This is an attempt at implementing fetch_add on floats without C++20.
void fetch_add(volatile float* x, float y)
{
bool success = false;
auto xi = (volatile std::int32_t*)x;
while(!success)
{
union {
std::int32_t sumint;
float sum;
};
auto tmp = __atomic_load_n(xi, __ATOMIC_RELAXED);
sumint = tmp;
sum += y;
success = __atomic_compare_exchange_n(xi, &tmp, sumint, true, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}
}
To my great confusion, when I compare the assembly from gcc10.1 -O2 -std=c++2a for x86-64, they differ.
fetch_add(float volatile*, float):
.L2:
mov eax, DWORD PTR [rdi]
movd xmm1, eax
addss xmm1, xmm0
movd edx, xmm1
lock cmpxchg DWORD PTR [rdi], edx
jne .L2
ret
fetch_add_std(std::atomic<float>&, float):
mov eax, DWORD PTR [rdi]
movaps xmm1, xmm0
movd xmm0, eax
mov DWORD PTR [rsp-4], eax
addss xmm0, xmm1
.L9:
mov eax, DWORD PTR [rsp-4]
movd edx, xmm0
lock cmpxchg DWORD PTR [rdi], edx
je .L6
mov DWORD PTR [rsp-4], eax
movss xmm0, DWORD PTR [rsp-4]
addss xmm0, xmm1
jmp .L9
.L6:
ret
My ability to read assembly is near non-existent, but the custom version looks correct to me, which implies it is either incorrect, inefficient or somehow the standard library is rather broken. I don't quite believe the third case, which leads me to ask, is the custom version incorrect or inefficient?
After some comments, a second version without reloading after cmpxchg is written. They do still differ.

Loop unrolling and SSE -- clang vs gcc

Disclaimer: full code can be found here.
16 byte alignment
Given a fairly simple type to support proper SSE alignment
struct alignas(16) simd_pack
{
std::int32_t data[4];
};
and a function that adds two arrays together
void add_packed(simd_pack* lhs_and_result, simd_pack* rhs, std::size_t size)
{
for (std::size_t i = 0; i < size; i++)
for (std::size_t j = 0; j < 4; j++)
lhs_and_result[i].data[j] += rhs[i].data[j];
}
compile the code with clang and gcc using -O3.
Clang produces the following assembly:
add_packed(simd_pack*, simd_pack*, unsigned long): # #add_packed(simd_pack*, simd_pack*, unsigned long)
test rdx, rdx
je .LBB0_3
mov eax, 12
.LBB0_2: # =>This Inner Loop Header: Depth=1
mov ecx, dword ptr [rsi + rax - 12]
add dword ptr [rdi + rax - 12], ecx
mov ecx, dword ptr [rsi + rax - 8]
add dword ptr [rdi + rax - 8], ecx
mov ecx, dword ptr [rsi + rax - 4]
add dword ptr [rdi + rax - 4], ecx
mov ecx, dword ptr [rsi + rax]
add dword ptr [rdi + rax], ecx
add rax, 16
add rdx, -1
jne .LBB0_2
.LBB0_3:
ret
I'm not very literate in assembly but to me it looks like clang is simply unrolling the inner for loop. If we take a look at gcc we get:
add_packed(simd_pack*, simd_pack*, unsigned long):
test rdx, rdx
je .L1
sal rdx, 4
xor eax, eax
.L3:
movdqa xmm0, XMMWORD PTR [rdi+rax]
paddd xmm0, XMMWORD PTR [rsi+rax]
movaps XMMWORD PTR [rdi+rax], xmm0
add rax, 16
cmp rax, rdx
jne .L3
.L1:
ret
which is what I expect.
64 byte alignment
The difference gets even bigger (obviously) if we go to 64 byte alignment (which usually is a cache line if I'm not mistaken)
struct alignas(64) cache_line
{
std::int32_t data[16];
};
void add_cache_line(cache_line* lhs_and_result, cache_line* rhs, std::size_t size)
{
for (std::size_t i = 0; i < size; i++)
for (std::size_t j = 0; j < 16; j++)
lhs_and_result[i].data[j] += rhs[i].data[j];
}
Clang keeps simply unrolling:
add_cache_line(cache_line*, cache_line*, unsigned long): # #add_cache_line(cache_line*, cache_line*, unsigned long)
test rdx, rdx
je .LBB1_3
mov eax, 60
.LBB1_2: # =>This Inner Loop Header: Depth=1
mov ecx, dword ptr [rsi + rax - 60]
add dword ptr [rdi + rax - 60], ecx
mov ecx, dword ptr [rsi + rax - 56]
add dword ptr [rdi + rax - 56], ecx
mov ecx, dword ptr [rsi + rax - 52]
add dword ptr [rdi + rax - 52], ecx
mov ecx, dword ptr [rsi + rax - 48]
add dword ptr [rdi + rax - 48], ecx
mov ecx, dword ptr [rsi + rax - 44]
add dword ptr [rdi + rax - 44], ecx
mov ecx, dword ptr [rsi + rax - 40]
add dword ptr [rdi + rax - 40], ecx
mov ecx, dword ptr [rsi + rax - 36]
add dword ptr [rdi + rax - 36], ecx
mov ecx, dword ptr [rsi + rax - 32]
add dword ptr [rdi + rax - 32], ecx
mov ecx, dword ptr [rsi + rax - 28]
add dword ptr [rdi + rax - 28], ecx
mov ecx, dword ptr [rsi + rax - 24]
add dword ptr [rdi + rax - 24], ecx
mov ecx, dword ptr [rsi + rax - 20]
add dword ptr [rdi + rax - 20], ecx
mov ecx, dword ptr [rsi + rax - 16]
add dword ptr [rdi + rax - 16], ecx
mov ecx, dword ptr [rsi + rax - 12]
add dword ptr [rdi + rax - 12], ecx
mov ecx, dword ptr [rsi + rax - 8]
add dword ptr [rdi + rax - 8], ecx
mov ecx, dword ptr [rsi + rax - 4]
add dword ptr [rdi + rax - 4], ecx
mov ecx, dword ptr [rsi + rax]
add dword ptr [rdi + rax], ecx
add rax, 64
add rdx, -1
jne .LBB1_2
.LBB1_3:
ret
while gcc uses SSE and also unrolls that:
add_cache_line(cache_line*, cache_line*, unsigned long):
mov rcx, rdx
test rdx, rdx
je .L9
sal rcx, 6
mov rax, rdi
mov rdx, rsi
add rcx, rdi
.L11:
movdqa xmm2, XMMWORD PTR [rdx+16]
movdqa xmm3, XMMWORD PTR [rax]
add rax, 64
add rdx, 64
movdqa xmm1, XMMWORD PTR [rdx-32]
movdqa xmm0, XMMWORD PTR [rdx-16]
paddd xmm3, XMMWORD PTR [rdx-64]
paddd xmm2, XMMWORD PTR [rax-48]
paddd xmm1, XMMWORD PTR [rax-32]
paddd xmm0, XMMWORD PTR [rax-16]
movaps XMMWORD PTR [rax-64], xmm3
movaps XMMWORD PTR [rax-48], xmm2
movaps XMMWORD PTR [rax-32], xmm1
movaps XMMWORD PTR [rax-16], xmm0
cmp rax, rcx
jne .L11
.L9:
ret
No alignment
It's getting interesting if we use plain 32 bit integer arrays with no alignment at all. We use the exact same compiler flags.
void add_unaligned(std::int32_t* lhs_and_result, std::int32_t* rhs, std::size_t size)
{
for (std::size_t i = 0; i < size; i++)
lhs_and_result[i] += rhs[i];
}
Clang
Clang's assembly exploaded a fair bit by adding some branches:
add_unaligned(int*, int*, unsigned long): # #add_unaligned(int*, int*, unsigned long)
test rdx, rdx
je .LBB2_16
cmp rdx, 7
jbe .LBB2_2
lea rax, [rsi + 4*rdx]
cmp rax, rdi
jbe .LBB2_9
lea rax, [rdi + 4*rdx]
cmp rax, rsi
jbe .LBB2_9
.LBB2_2:
xor r10d, r10d
.LBB2_3:
mov r8, r10
not r8
add r8, rdx
mov rcx, rdx
and rcx, 3
je .LBB2_5
.LBB2_4: # =>This Inner Loop Header: Depth=1
mov eax, dword ptr [rsi + 4*r10]
add dword ptr [rdi + 4*r10], eax
add r10, 1
add rcx, -1
jne .LBB2_4
.LBB2_5:
cmp r8, 3
jb .LBB2_16
.LBB2_6: # =>This Inner Loop Header: Depth=1
mov eax, dword ptr [rsi + 4*r10]
add dword ptr [rdi + 4*r10], eax
mov eax, dword ptr [rsi + 4*r10 + 4]
add dword ptr [rdi + 4*r10 + 4], eax
mov eax, dword ptr [rsi + 4*r10 + 8]
add dword ptr [rdi + 4*r10 + 8], eax
mov eax, dword ptr [rsi + 4*r10 + 12]
add dword ptr [rdi + 4*r10 + 12], eax
add r10, 4
cmp rdx, r10
jne .LBB2_6
jmp .LBB2_16
.LBB2_9:
mov r10, rdx
and r10, -8
lea rax, [r10 - 8]
mov r9, rax
shr r9, 3
add r9, 1
mov r8d, r9d
and r8d, 1
test rax, rax
je .LBB2_10
sub r9, r8
xor ecx, ecx
.LBB2_12: # =>This Inner Loop Header: Depth=1
movdqu xmm0, xmmword ptr [rsi + 4*rcx]
movdqu xmm1, xmmword ptr [rsi + 4*rcx + 16]
movdqu xmm2, xmmword ptr [rdi + 4*rcx]
paddd xmm2, xmm0
movdqu xmm0, xmmword ptr [rdi + 4*rcx + 16]
paddd xmm0, xmm1
movdqu xmm1, xmmword ptr [rdi + 4*rcx + 32]
movdqu xmm3, xmmword ptr [rdi + 4*rcx + 48]
movdqu xmmword ptr [rdi + 4*rcx], xmm2
movdqu xmmword ptr [rdi + 4*rcx + 16], xmm0
movdqu xmm0, xmmword ptr [rsi + 4*rcx + 32]
paddd xmm0, xmm1
movdqu xmm1, xmmword ptr [rsi + 4*rcx + 48]
paddd xmm1, xmm3
movdqu xmmword ptr [rdi + 4*rcx + 32], xmm0
movdqu xmmword ptr [rdi + 4*rcx + 48], xmm1
add rcx, 16
add r9, -2
jne .LBB2_12
test r8, r8
je .LBB2_15
.LBB2_14:
movdqu xmm0, xmmword ptr [rsi + 4*rcx]
movdqu xmm1, xmmword ptr [rsi + 4*rcx + 16]
movdqu xmm2, xmmword ptr [rdi + 4*rcx]
paddd xmm2, xmm0
movdqu xmm0, xmmword ptr [rdi + 4*rcx + 16]
paddd xmm0, xmm1
movdqu xmmword ptr [rdi + 4*rcx], xmm2
movdqu xmmword ptr [rdi + 4*rcx + 16], xmm0
.LBB2_15:
cmp r10, rdx
jne .LBB2_3
.LBB2_16:
ret
.LBB2_10:
xor ecx, ecx
test r8, r8
jne .LBB2_14
jmp .LBB2_15
What is happening at .LBB2_4 and .LBB2_6? It looks like it's unrolling a loop again but I'm not sure what happens there (mainly because of the registers used).
In .LBB2_12 it even unrolls the SSE part. I think it's only unrolled two-fold though because it needs two SIMD registers to load each operand because they are unaligned now. .LBB2_14 contains the SSE part without the unrolling.
How is the control flow here? I'm assuming it should be:
keep using the unrolled SSE part until the remaining data is too small to fill all the registers (xmm0..3)
switch to the single stage SSE part and do it once if we have enough data remaining to fill xmm0 (4 integers in our case)
process the remaining data (3 operations at max, otherwise it would be SSE suitable again)
The order of the labels and the jump instructions are confusing, is that (approx.) what happens here?
GCC
Gcc's assembly is a bit easier to read:
add_unaligned(int*, int*, unsigned long):
test rdx, rdx
je .L16
lea rcx, [rsi+4]
mov rax, rdi
sub rax, rcx
cmp rax, 8
jbe .L22
lea rax, [rdx-1]
cmp rax, 2
jbe .L22
mov rcx, rdx
xor eax, eax
shr rcx, 2
sal rcx, 4
.L19:
movdqu xmm0, XMMWORD PTR [rdi+rax]
movdqu xmm1, XMMWORD PTR [rsi+rax]
paddd xmm0, xmm1
movups XMMWORD PTR [rdi+rax], xmm0
add rax, 16
cmp rax, rcx
jne .L19
mov rax, rdx
and rax, -4
test dl, 3
je .L16
mov ecx, DWORD PTR [rsi+rax*4]
add DWORD PTR [rdi+rax*4], ecx
lea rcx, [rax+1]
cmp rdx, rcx
jbe .L16
add rax, 2
mov r8d, DWORD PTR [rsi+rcx*4]
add DWORD PTR [rdi+rcx*4], r8d
cmp rdx, rax
jbe .L16
mov edx, DWORD PTR [rsi+rax*4]
add DWORD PTR [rdi+rax*4], edx
ret
.L22:
xor eax, eax
.L18:
mov ecx, DWORD PTR [rsi+rax*4]
add DWORD PTR [rdi+rax*4], ecx
add rax, 1
cmp rdx, rax
jne .L18
.L16:
ret
I assume the control flow is similar to clang
keep using the single stage SSE part until the remaining data is too small to fill xmm0 and xmm1
process the remaining data (3 operations at max, otherwise it would be SSE suitable again)
It looks like exactly this is happening in .L19 but what is .L18 doing then?
Summary
Here is the full code, including assembly. My question are:
Why is clang unrolling the functions that use aligned data instead of using SSE or a combination of both (like gcc)?
What are .LBB2_4 and .LBB2_6 in clang's assembly doing?
Are my assumptions about the control flow of the function with the unaligned data correct?
What is .L18 in gcc's assembly doing?

GCC std::sin vectorization bug?

The next code (with -O3 -ffast-math):
#include <cmath>
float a[4];
void sin1() {
for(unsigned i = 0; i < 4; i++) a[i] = sinf(a[i]);
}
Compiles vectorized version of sinf (_ZGVbN4v_sinf):
sin1():
sub rsp, 8
movaps xmm0, XMMWORD PTR a[rip]
call _ZGVbN4v_sinf
movaps XMMWORD PTR a[rip], xmm0
add rsp, 8
ret
But when i use c++ version of sinf (std::sin) no vectorization occurrs:
void sin2() {
for(unsigned i = 0; i < 4; i++) a[i] = std::sin(a[i]);
}
sin2():
sub rsp, 8
movss xmm0, DWORD PTR a[rip]
call sinf
movss DWORD PTR a[rip], xmm0
movss xmm0, DWORD PTR a[rip+4]
call sinf
movss DWORD PTR a[rip+4], xmm0
movss xmm0, DWORD PTR a[rip+8]
call sinf
movss DWORD PTR a[rip+8], xmm0
movss xmm0, DWORD PTR a[rip+12]
call sinf
movss DWORD PTR a[rip+12], xmm0
add rsp, 8
ret
Compiler Explorer Code

SSE2 - 16-byte aligned dynamic allocation of memory

EDIT:
This is a followup to SSE2 Compiler Error
This is the real bug I experienced before and have reproduced below by changing the _mm_malloc statement as Michael Burr suggested:
Unhandled exception at 0x00415116 in SO.exe: 0xC0000005: Access violation reading
location 0xffffffff.
At line label: movdqa xmm0, xmmword ptr [t1+eax]
I'm trying to dynamically allocate t1 and t2 and according to this tutorial, I've used _mm_malloc:
#include <emmintrin.h>
int main(int argc, char* argv[])
{
int *t1, *t2;
const int n = 100000;
t1 = (int*)_mm_malloc(n*sizeof(int),16);
t2 = (int*)_mm_malloc(n*sizeof(int),16);
__m128i mul1, mul2;
for (int j = 0; j < n; j++)
{
t1[j] = j;
t2[j] = (j+1);
} // set temporary variables to random values
_asm
{
mov eax, 0
label: movdqa xmm0, xmmword ptr [t1+eax]
movdqa xmm1, xmmword ptr [t2+eax]
pmuludq xmm0, xmm1
movdqa mul1, xmm0
movdqa xmm0, xmmword ptr [t1+eax]
pshufd xmm0, xmm0, 05fh
pshufd xmm1, xmm1, 05fh
pmuludq xmm0, xmm1
movdqa mul2, xmm0
add eax, 16
cmp eax, 100000
jnge label
}
_mm_free(t1);
_mm_free(t2);
return 0;
}

I think the 2nd problem is that you're reading at an offset from the pointer variable (not an offset from what the pointer points to).
Change:
label: movdqa xmm0, xmmword ptr [t1+eax]
To something like:
mov ebx, [t1]
label: movdqa xmm0, xmmword ptr [ebx+eax]
And similarly for your accesses through the t2 pointer.
This might be even better (though I haven't had an opportunity to test it, so it might not even work):
_asm
{
mov eax, [t1]
mov ebx, [t1]
lea ecx, [eax + (100000*4)]
label: movdqa xmm0, xmmword ptr [eax]
movdqa xmm1, xmmword ptr [ebx]
pmuludq xmm0, xmm1
movdqa mul1, xmm0
movdqa xmm0, xmmword ptr [eax]
pshufd xmm0, xmm0, 05fh
pshufd xmm1, xmm1, 05fh
pmuludq xmm0, xmm1
movdqa mul2, xmm0
add eax, 16
add ebx, 16
cmp eax, ecx
jnge label
}

You're not allocating enough memory:
t1 = (int*)_mm_malloc(n * sizeof( int),16);
t2 = (int*)_mm_malloc(n * sizeof( int),16);

Perhaps:
t1 = (int*)_mm_malloc(n*sizeof(int),16);

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Loop unroll issue with Visual Studio compiler - c++

Related

Does pointer to pointer dereference has any performance impact?

Differences in custom and std fetch_add on floats

Loop unrolling and SSE -- clang vs gcc

GCC std::sin vectorization bug?

SSE2 - 16-byte aligned dynamic allocation of memory

Categories

Resources