I tried to change this code to handle std::vector<int>.
float accumulate(const std::vector<float>& v)
{
    // copy the length of v and a pointer to the data onto the local stack
    const size_t N = v.size();
    const float* p = (N > 0) ? &v.front() : NULL;

    __m128 mmSum = _mm_setzero_ps();
    size_t i = 0;

    // unrolled loop that adds up 4 elements at a time
    for (; i < ROUND_DOWN(N, 4); i += 4)
    {
        mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i));
    }

    // add up single values until all elements are covered
    for (; i < N; i++)
    {
        mmSum = _mm_add_ss(mmSum, _mm_load_ss(p + i));
    }

    // add up the four float values from mmSum into a single value and return
    mmSum = _mm_hadd_ps(mmSum, mmSum);
    mmSum = _mm_hadd_ps(mmSum, mmSum);
    return _mm_cvtss_f32(mmSum);
}
Ref: http://fastcpp.blogspot.com.au/2011/04/how-to-process-stl-vector-using-sse.html
I changed _mm_setzero_ps to _mm_setzero_si128, _mm_loadu_ps to _mm_loadl_epi64, and _mm_add_ps to _mm_add_epi64.
I get this error:
error: cannot convert ‘const int*’ to ‘const __m128i* {aka const __vector(2) long long int*}’ for argument ‘1’ to ‘__m128i _mm_loadl_epi64(const __m128i*)’
mmSum = _mm_add_epi64(mmSum, _mm_loadl_epi64(p + i + 0));
I am novice in this field. Is there any good source to learn these things?
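For what it's worth, the error message is only about the pointer type: the SSE integer load intrinsics take a __m128i pointer, so the int* needs an explicit cast. A minimal sketch of the offending line with just that cast added (the answer below then replaces the 64-bit approach with proper 32-bit adds, which is what you want for summing ints):

// sketch: cast the int* to const __m128i* as the intrinsic expects; note that
// _mm_loadl_epi64 only loads two ints and _mm_add_epi64 adds 64-bit lanes,
// which is why the answer below uses _mm_loadu_si128 / _mm_add_epi32 instead
mmSum = _mm_add_epi64(mmSum, _mm_loadl_epi64((const __m128i*)(p + i)));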
Here is an int version which I just threw together:
#include <iostream>
#include <vector>
#include <smmintrin.h> // SSE4
#define ROUND_DOWN(m, n) ((m) & ~((n) - 1))
static int accumulate(const std::vector<int>& v)
{
    // copy the length of v and a pointer to the data onto the local stack
    const size_t N = v.size();
    const int* p = (N > 0) ? &v.front() : NULL;

    __m128i mmSum = _mm_setzero_si128();
    int sum = 0;
    size_t i = 0;

    // unrolled loop that adds up 4 elements at a time
    for (; i < ROUND_DOWN(N, 4); i += 4)
    {
        mmSum = _mm_add_epi32(mmSum, _mm_loadu_si128((__m128i *)(p + i)));
    }

    // add up the four int values from mmSum into a single value
    mmSum = _mm_hadd_epi32(mmSum, mmSum);
    mmSum = _mm_hadd_epi32(mmSum, mmSum);
    sum = _mm_extract_epi32(mmSum, 0);

    // add up single values until all elements are covered
    for (; i < N; i++)
    {
        sum += p[i];
    }
    return sum;
}

int main()
{
    std::vector<int> v;
    for (int i = 0; i < 10; ++i)
    {
        v.push_back(i);
    }
    int sum = accumulate(v);
    std::cout << sum << std::endl;
    return 0;
}
Compile and run:
$ g++ -Wall -msse4 -O3 accumulate.cpp && ./a.out
45
The ideal way to do this is to let the compiler auto-vectorize your code, keeping your code simple and readable. You should not need anything more than
int sum = 0;
for(int i=0; i<v.size(); i++) sum += v[i];
The author of the link you pointed to, http://fastcpp.blogspot.com.au/2011/04/how-to-process-stl-vector-using-sse.html, does not seem to understand how to make the compiler vectorize the code.
For floating point, which is what that link uses, what you need to know is that floating-point arithmetic is not associative, so the result of a reduction depends on the order in which you do the additions. GCC, MSVC, and Clang will not auto-vectorize a reduction unless you tell them to use a different floating-point model; otherwise your result could depend on your hardware. ICC, however, defaults to associative floating-point math, so it will vectorize the code with e.g. -O3.
Not only do GCC, MSVC, and Clang refuse to vectorize the reduction unless associative math is allowed, they also won't unroll the loop into independent partial sums to overcome the latency of the summation. Even then, only Clang and ICC unroll to partial sums: Clang unrolls four times and ICC twice.
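To make the "unroll to partial sums" idea concrete, here is a minimal scalar sketch (the function name is mine, not from the link): four independent accumulators hide the latency of the adds, and the reordering of the additions is exactly what requires associative floating-point math.

float sum_partial(const float *x, size_t n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;    // four independent dependency chains
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++) sum += x[i];          // leftover elements
    return sum;
}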
One way to enable associative floating-point arithmetic with GCC is the -Ofast flag. With MSVC, use /fp:fast.
I tested the code below with GCC 4.9.2, Xeon E5-1620 (IVB) @ 3.60 GHz, Ubuntu 15.04.
-O3 -mavx -fopenmp 0.93 s
-Ofast -mavx -fopenmp 0.19 s
-Ofast -mavx -fopenmp -funroll-loops 0.19 s
That's about a five-times speed-up. Although GCC does unroll the loop eight times, it does not do independent partial sums (see the assembly below). This is the reason the unrolled version is no better.
I only used OpenMP for its convenient cross-platform/compiler timing function: omp_get_wtime().
Another advantage of auto-vectorization is that it works for AVX simply by enabling a compiler switch (e.g. -mavx). Otherwise, if you wanted AVX, you would have to rewrite your code to use the AVX intrinsics and maybe have to ask another question on SO about how to do this.
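For comparison, a hand-written AVX version of the float sum might look something like the sketch below (my own illustration, assuming AVX is available and using unaligned loads); this is what the compiler switch saves you from writing and maintaining:

#include <cstddef>
#include <immintrin.h>

float sum_avx(const float *p, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)                        // 8 floats per 256-bit vector
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    // horizontal reduction: fold the two 128-bit halves, then the lanes
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float sum = _mm_cvtss_f32(s);
    for (; i < n; i++) sum += p[i];                   // leftover elements
    return sum;
}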
So currently the only compiler which will auto-vectorize your loop as well as unroll to four partial sums is Clang. See the code and assembly at the end of this answer.
Here is the code I used to test the performance
#include <stdio.h>
#include <omp.h>
#include <vector>
float sumf(float *x, int n)
{
    float sum = 0;
    for (int i = 0; i < n; i++) sum += x[i];
    return sum;
}

#define N 10000 // the link used this value

int main(void)
{
    std::vector<float> x;
    for (int i = 0; i < N; i++) x.push_back(1 - 2*(i%2 == 0));
    //float x[N]; for(int i=0; i<N; i++) x[i] = 1 - 2*(i%2==0);
    float sum = 0;
    sum += sumf(x.data(), N);
    double dtime = -omp_get_wtime();
    for (int r = 0; r < 100000; r++) {
        sum += sumf(x.data(), N);
    }
    dtime += omp_get_wtime();
    printf("sum %f time %f\n", sum, dtime);
}
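For reference, a compile-and-run line matching the flags in the timing table above (the file name is just a placeholder):

$ g++ -Ofast -mavx -fopenmp sum_bench.cpp && ./a.out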
Edit:
I should have taken my own advice and looked at the assembly.
The main loop for -O3. It's clear it only does a scalar sum.
.L3:
vaddss (%rdi), %xmm0, %xmm0
addq $4, %rdi
cmpq %rax, %rdi
jne .L3
The main loop for -Ofast. It does a vector sum but no unrolling.
.L8:
addl $1, %eax
vaddps (%r8), %ymm1, %ymm1
addq $32, %r8
cmpl %eax, %ecx
ja .L8
The main loop for -Ofast -funroll-loops. Vector sum with 8x unrolling.
.L8:
vaddps (%rax), %ymm1, %ymm2
addl $8, %ebx
addq $256, %rax
vaddps -224(%rax), %ymm2, %ymm3
vaddps -192(%rax), %ymm3, %ymm4
vaddps -160(%rax), %ymm4, %ymm5
vaddps -128(%rax), %ymm5, %ymm6
vaddps -96(%rax), %ymm6, %ymm7
vaddps -64(%rax), %ymm7, %ymm8
vaddps -32(%rax), %ymm8, %ymm1
cmpl %ebx, %r9d
ja .L8
Edit:
Putting the following code in Clang 3.7 (-O3 -fverbose-asm -mavx)
float sumi(int *x)
{
    x = (int*)__builtin_assume_aligned(x, 64);
    int sum = 0;
    for (int i = 0; i < 2048; i++) sum += x[i];
    return sum;
}
produces the following assembly. Notice that it's vectorized to four independent partial sums.
sumi(int*): # #sumi(int*)
vpxor xmm0, xmm0, xmm0
xor eax, eax
vpxor xmm1, xmm1, xmm1
vpxor xmm2, xmm2, xmm2
vpxor xmm3, xmm3, xmm3
.LBB0_1: # %vector.body
vpaddd xmm0, xmm0, xmmword ptr [rdi + 4*rax]
vpaddd xmm1, xmm1, xmmword ptr [rdi + 4*rax + 16]
vpaddd xmm2, xmm2, xmmword ptr [rdi + 4*rax + 32]
vpaddd xmm3, xmm3, xmmword ptr [rdi + 4*rax + 48]
vpaddd xmm0, xmm0, xmmword ptr [rdi + 4*rax + 64]
vpaddd xmm1, xmm1, xmmword ptr [rdi + 4*rax + 80]
vpaddd xmm2, xmm2, xmmword ptr [rdi + 4*rax + 96]
vpaddd xmm3, xmm3, xmmword ptr [rdi + 4*rax + 112]
add rax, 32
cmp rax, 2048
jne .LBB0_1
vpaddd xmm0, xmm1, xmm0
vpaddd xmm0, xmm2, xmm0
vpaddd xmm0, xmm3, xmm0
vpshufd xmm1, xmm0, 78 # xmm1 = xmm0[2,3,0,1]
vpaddd xmm0, xmm0, xmm1
vphaddd xmm0, xmm0, xmm0
vmovd eax, xmm0
vxorps xmm0, xmm0, xmm0
vcvtsi2ss xmm0, xmm0, eax
ret
static inline int32_t accumulate(const int32_t *data, size_t size) {
    constexpr const static size_t batch = 256 / 8 / sizeof(int32_t);
    int32_t sum = 0;
    size_t pos = 0;
    if (size >= batch) {
        // 7
        __m256i mmSum = _mm256_loadu_si256((__m256i *)(data));
        pos = batch;
        // main vector loop: one 256-bit vector (8 ints) per iteration
        for (; pos + batch < size; pos += batch) {
            // 1 + 7
            mmSum =
                _mm256_add_epi32(mmSum, _mm256_loadu_si256((__m256i *)(data + pos)));
        }
        mmSum = _mm256_hadd_epi32(mmSum, mmSum);
        mmSum = _mm256_hadd_epi32(mmSum, mmSum);
        // 2 + 1 + 3 + 0
        sum = _mm_cvtsi128_si32(_mm_add_epi32(_mm256_extractf128_si256(mmSum, 1),
                                              _mm256_castsi256_si128(mmSum)));
    }
    // add up remaining values
    while (pos < size) {
        sum += data[pos++];
    }
    return sum;
}
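A hypothetical call site for the function above, just to show the intended usage (the element type is assumed to be int32_t so that data() and size() match the signature):

#include <cstdint>
#include <vector>
// accumulate() as defined above

int main()
{
    std::vector<int32_t> v(1000, 1);                    // 1000 ones
    return accumulate(v.data(), v.size()) == 1000 ? 0 : 1;
}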
Related
My program adds float arrays and is unrolled 4x when compiled with max optimizations by MSVC and G++. I didn't understand why both compilers chose to unroll 4x, so I did some testing: only occasionally did a t-test on runtimes for manually unrolling 1-vs-2 or 1-vs-4 iterations give a p-value of ~0.03; 2-vs-4 was rarely < 0.05, and 2-vs-8+ was always > 0.05.
If I set the compiler to use 128-bit vectors or 256-bit vectors it always unrolled 4x, which is a multiple of 64-byte cache lines (significant or coincidence?).
The reason I'm thinking about cache lines is because I didn't expect unrolling to have any impact for a memory-bound program that sequentially reads and writes gigabytes of floats. Should there be a benefit to unrolling in this case? It's also possible there was no significant difference and my sample size wasn't large enough.
I found this blog that says manually unrolling an array copy is faster for medium-sized arrays and streaming is fastest for longer arrays. Their AvxAsyncPFCopier and AvxAsyncPFUnrollCopier functions seem to benefit from using whole cache lines as well as manual unrolling. Benchmark in the blog with source here.
#include <iostream>
#include <immintrin.h>
int main() {
    // example of manually unrolling float arrays
    size_t bytes = sizeof(__m256) * 10;
    size_t alignment = sizeof(__m256);
    // 10 x 32-byte vectors
    __m256* a = (__m256*) _mm_malloc(bytes, alignment);
    __m256* b = (__m256*) _mm_malloc(bytes, alignment);
    __m256* c = (__m256*) _mm_malloc(bytes, alignment);
    for (int i = 0; i < 10; i += 2) {
        // cache miss?
        // load 2 x 64-byte cache lines:
        //   2 x 32-byte vectors from b
        //   2 x 32-byte vectors from c
        a[i + 0] = _mm256_add_ps(b[i + 0], c[i + 0]);
        // cache hit?
        a[i + 1] = _mm256_add_ps(b[i + 1], c[i + 1]);
        // special bonus for consuming whole cache lines?
    }
}
Original source for 3 unique float arrays
for (int64_t i = 0; i < size; ++i) {
    a[i] = b[i] + c[i];
}
MSVC with AVX2 instructions
a[i] = b[i] + c[i];
00007FF7E2522370 vmovups ymm2,ymmword ptr [rax+rcx]
00007FF7E2522375 vmovups ymm1,ymmword ptr [rcx+rax-20h]
00007FF7E252237B vaddps ymm1,ymm1,ymmword ptr [rax-20h]
00007FF7E2522380 vmovups ymmword ptr [rdx+rax-20h],ymm1
00007FF7E2522386 vaddps ymm1,ymm2,ymmword ptr [rax]
00007FF7E252238A vmovups ymm2,ymmword ptr [rcx+rax+20h]
00007FF7E2522390 vmovups ymmword ptr [rdx+rax],ymm1
00007FF7E2522395 vaddps ymm1,ymm2,ymmword ptr [rax+20h]
00007FF7E252239A vmovups ymm2,ymmword ptr [rcx+rax+40h]
00007FF7E25223A0 vmovups ymmword ptr [rdx+rax+20h],ymm1
00007FF7E25223A6 vaddps ymm1,ymm2,ymmword ptr [rax+40h]
00007FF7E25223AB add r9,20h
00007FF7E25223AF vmovups ymmword ptr [rdx+rax+40h],ymm1
00007FF7E25223B5 lea rax,[rax+80h]
00007FF7E25223BC cmp r9,r10
00007FF7E25223BF jle main$omp$2+0E0h (07FF7E2522370h)
MSVC with default instructions
a[i] = b[i] + c[i];
00007FF71ECB2372 movups xmm0,xmmword ptr [rax-10h]
00007FF71ECB2376 add r9,10h
00007FF71ECB237A movups xmm1,xmmword ptr [rcx+rax-10h]
00007FF71ECB237F movups xmm2,xmmword ptr [rax+rcx]
00007FF71ECB2383 addps xmm1,xmm0
00007FF71ECB2386 movups xmm0,xmmword ptr [rax]
00007FF71ECB2389 addps xmm2,xmm0
00007FF71ECB238C movups xmm0,xmmword ptr [rax+10h]
00007FF71ECB2390 movups xmmword ptr [rdx+rax-10h],xmm1
00007FF71ECB2395 movups xmm1,xmmword ptr [rcx+rax+10h]
00007FF71ECB239A movups xmmword ptr [rdx+rax],xmm2
00007FF71ECB239E movups xmm2,xmmword ptr [rcx+rax+20h]
00007FF71ECB23A3 addps xmm1,xmm0
00007FF71ECB23A6 movups xmm0,xmmword ptr [rax+20h]
00007FF71ECB23AA addps xmm2,xmm0
00007FF71ECB23AD movups xmmword ptr [rdx+rax+10h],xmm1
00007FF71ECB23B2 movups xmmword ptr [rdx+rax+20h],xmm2
00007FF71ECB23B7 add rax,40h
00007FF71ECB23BB cmp r9,r10
00007FF71ECB23BE jle main$omp$2+0D2h (07FF71ECB2372h)
I have been playing around with SIMD OMP instructions and I am not getting the compiler to emit ANDPS in my scenario.
What I'm trying to do:
This is an implementation of this problem (tl;dr: find a pair of users with a common friend). My approach is to pack 64 friendship bits (whether somebody is a friend or not) into an unsigned long long.
My SIMD approach: take the AND of two relationship vectors and reduce with an OR, which nicely fits the reduction pattern of OMP.
g++ invocation (on a 2019 Intel i7 MacBook Pro):
g++-11 friends.cpp -S -O3 -fopenmp -fsanitize=address -Wshadow -Wall -march=native --std=c++17;
My implementation below
#include <vector>
#include <algorithm>
#include <iostream>
#include <cmath>
#include <numeric>
#include <string>   // for std::string / std::stoull used in main
typedef long long ll;
typedef unsigned long long ull;
using namespace std;
ull find_sol(vector<vector<ull>> & input_data, int q) {
    bool not_friend = false;
    ull cnt = 0;
    int size_arr = (int) input_data[0].size();
    for (int i = 0; i < q; ++i) // from these friends
    {
        for (int j = i+1; j < q; ++j) // to these friends
        {
            int step = j/64;
            int remainder = j - 64*step;
            not_friend = (input_data[i].at(step) >> remainder) % 2 == 0;
            if (not_friend) {
                bool counter = false;
                vector<ull> & v1 = input_data[i];
                vector<ull> & v2 = input_data[j];
                #pragma omp simd reduction(|:counter)
                for (int c = 0; c < size_arr; ++c)
                {
                    __asm__ ("entry");
                    counter |= (v1[c] & v2[c]) > 0;
                    __asm__ ("exit");
                }
                if (counter > 0)
                    cnt++;
            }
        }
    }
    return cnt << 1;
}

int main() {
    int q;
    cin >> q;
    vector<vector<ull>> input_data(q, vector<ull>(1 + q/64, 0ULL));
    for (int i = 0; i < q; ++i)
    {
        string s;
        cin >> s;
        for (int j = 0; j < 1 + q/64; ++j)
        {
            string str = s.substr(j*64, 64);
            reverse(str.begin(), str.end());
            ull ul = std::stoull(str, nullptr, 2);
            input_data.at(i).at(j) = ul;
        }
    }
    cout << find_sol(input_data, q) << endl;
}
Looking at the assembly inside the loop, I would expect some SIMD instructions (specifically andps), but I can't see them. What's preventing my compiler from emitting them? Also, is there a way to get the compiler to emit a warning about what's wrong? That would be very helpful.
entry
# 0 "" 2
cmpb $0, (%rbx)
jne L53
movq (%r8), %rdx
leaq 0(,%rax,8), %rdi
addq %rdi, %rdx
movq %rdx, %r15
shrq $3, %r15
cmpb $0, (%r15,%rcx)
jne L54
cmpb $0, (%r11)
movq (%rdx), %rdx
jne L55
addq (%r9), %rdi
movq %rdi, %r15
shrq $3, %r15
cmpb $0, (%r15,%rcx)
jne L56
andq (%rdi), %rdx
movzbl (%r12), %edx
setne %dil
cmpb %r13b, %dl
jg L21
testb %dl, %dl
jne L57
L21:
orb %dil, -32(%r10)
EDIT 1:
Following Peter's 1st and 2nd suggestions, I moved the markers out of the loop and replaced the booleanization with a plain OR. I'm still not getting SIMD instructions though:
ull counter = 0;
vector<ull> & v1 = input_data[i];
vector<ull> & v2 = input_data[j];
__asm__ ("entry" :::);
#pragma omp simd reduction(|:counter)
for (int c = 0; c < size_arr; ++c)
{
    counter |= v1[c] & v2[c];
}
__asm__ ("exit" :::);
if (counter != 0)
    cnt++;
First problem: asm. In recent GCC, non-empty Basic Asm statements like __asm__ ("entry"); have an implicit ::: "memory" clobber, making it impossible for the compiler to combine array accesses across iterations. Maybe try __asm__ ("entry" :::); if you really want these markers. (Extended asm without a memory clobber).
Or better, use better tools for looking at compiler output, such as the Godbolt compiler explorer (https://godbolt.org/) which lets you right click on a source line and go to the corresponding asm. (Optimization can make this a bit wonky, so sometimes you have to find the asm and mouseover it to make sure it comes from that source line.)
See How to remove "noise" from GCC/clang assembly output?
Second problem: -fsanitize=address makes it harder for the compiler to optimize. I only looked at GCC output without that option.
Vectorizing the OR reduction
After fixing those showstoppers:
You're forcing the compiler to booleanize to an 8-bit bool inside the inner loop, instead of just reducing the integer AND results with |= into a variable of the same type. (Which you check once after the loop.) This is probably part of why GCC has a hard time; it often makes a mess with different-sized integer types when it vectorizes at all.
(v1[c] & v2[c]) > 0 would need SSE4.1 pcmpeqq, vs. just a SIMD OR in the loop and checking counter for != 0 after the loop. (You had bool counter, which was really surprising given counter > 0 as a semantically weird way to check an unsigned value for non-zero. Even more unexpected for a bool.)
After changing that, GCC auto-vectorizes the way I expected without OpenMP, if you use -O3 (which includes -ftree-vectorize). It of course uses vpand, not vandps, since FP booleans have lower throughput on some CPUs. (You didn't say what -march=native is for you; if you only had AVX1, e.g. on Sandybridge, then vandps is plausible.)
ull counter = 0;
// #pragma omp simd reduction(|:counter)
for (int c = 0; c < size_arr; ++c)
{
    //__asm__ ("entry");
    counter |= (v1[c] & v2[c]);
    //__asm__ ("exit");
}
if (counter != 0)
    cnt++;
From the Godbolt compiler explorer (which you should use instead of littering your code with asm statements)
# g++ 11.2 -O3 -march=skylake **without** OpenMP
.L7: # the vector part of the inner-most loop
vmovdqu ymm2, YMMWORD PTR [rsi+rax]
vpand ymm0, ymm2, YMMWORD PTR [rcx+rax]
add rax, 32
vpor ymm1, ymm1, ymm0
cmp rax, r8
jne .L7
vextracti128 xmm0, ymm1, 0x1
vpor xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 8
... (horizontal OR reduction of that one SIMD vector, eventually vmovq to RAX)
GCC OpenMP does vectorize, but badly / weirdly
With OpenMP, there is a vectorized version of the loop, but it sucks a lot, doing shuffles and gather loads, and storing results into a local buffer which it later reads. I don't know OpenMP that well, but unless you're using it wrong, this is a major missed optimization. Possibly it's scaling a loop counter with multiplies instead of incrementing a pointer, which is just horrible.
(Godbolt)
# g++ 11.2 -Wall -O3 -fopenmp -march=skylake -std=gnu++17
# with the #pragma uncommented
.L10:
vmovdqa ymm0, ymm3
vpermq ymm0, ymm0, 216
vpshufd ymm1, ymm0, 80 # unpack for 32x32 => 64-bit multiplies?
vpmuldq ymm1, ymm1, ymm4
vpshufd ymm0, ymm0, 250
vpmuldq ymm0, ymm0, ymm4
vmovdqa ymm7, ymm6 # ymm6 = set1(-1) outside the loop, gather mask
add rsi, 64
vpaddq ymm1, ymm1, ymm5
vpgatherqq ymm2, QWORD PTR [0+ymm1*1], ymm7
vpaddq ymm0, ymm0, ymm5
vmovdqa ymm7, ymm6
vpgatherqq ymm1, QWORD PTR [0+ymm0*1], ymm7
vpand ymm0, ymm1, YMMWORD PTR [rsi-32] # memory source = one array
vpand ymm1, ymm2, YMMWORD PTR [rsi-64]
vpor ymm0, ymm0, YMMWORD PTR [rsp+64] # OR with old contents of local buffer
vpor ymm1, ymm1, YMMWORD PTR [rsp+32]
vpaddd ymm3, ymm3, ymm4
vmovdqa YMMWORD PTR [rsp+32], ymm1 # and store back into it.
vmovdqa YMMWORD PTR [rsp+64], ymm0
cmp r9, rsi
jne .L10
mov edi, DWORD PTR [rsp+16] # outer loop tail
cmp DWORD PTR [rsp+20], edi
je .L7
This buffer of 64 bytes is read at the top of .L7 (an outer loop)
.L7:
vmovdqa ymm2, YMMWORD PTR [rsp+32]
vpor ymm1, ymm2, YMMWORD PTR [rsp+64]
vextracti128 xmm0, ymm1, 0x1
vpor xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 8
vpor xmm0, xmm0, xmm1
vmovq rsi, xmm0
cmp rsi, 1 # sets CF unless RSI=0
sbb r13, -1 # R13 -= -1 +CF i.e. increment if CF=0
IDK if there's a way to hand-hold the compiler into making better asm; perhaps with pointer-width loop counters?
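For what it's worth, a sketch of that pointer-width-counter idea (whether it actually helps GCC's OpenMP code-gen is only a guess; the reduction itself is unchanged from the edited code above):

ull counter = 0;
#pragma omp simd reduction(|:counter)
for (size_t c = 0; c < (size_t)size_arr; ++c)   // size_t instead of int
{
    counter |= v1[c] & v2[c];
}
if (counter != 0)
    cnt++;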
GCC5.4 -O3 -fopenmp -march=haswell -std=gnu++17 makes sane asm, with just vpand / vpor and an array index increment in the loop. The stuff outside the loop is a bit different with OpenMP vs. plain vectorization, with OpenMP using vector store / scalar reload for the horizontal OR reduction of the final vector.
I have the following C/C++ code:
#define SIZE 2
typedef struct vec {
    float data[SIZE];
} vec;

vec add(vec a, vec b) {
    vec result;
    for (size_t i = 0; i < SIZE; ++i) {
        result.data[i] = a.data[i] + b.data[i];
    }
    return result;
}
I was wondering how clang would optimize this vector addition and the compiler output surprised me, as it looks quite unoptimal. This is at -O3 and with -march=skylake. (Godbolt with clang 10.1)
add(vec, vec):
vaddss xmm2, xmm0, xmm1 # res[0] = a[0] + b[0]
vmovss dword ptr [rsp - 8], xmm2 # mem[1] = res[0]
vmovshdup xmm0, xmm0 # a[0] = a[1]
vmovshdup xmm1, xmm1 # b[0] = b[1]
vaddss xmm0, xmm0, xmm1 # a[0] = a[0] + b[0]
vmovss dword ptr [rsp - 4], xmm0 # mem[0] = a[0]
vmovsd xmm0, qword ptr [rsp - 8] # xmm0 = mem[0],mem[1],zero,zero
ret
From the looks of it, a and b are stored in xmm0 and xmm1 respectively. However, only the lowest single-precision float in these registers is being used for addition. This leads to two separate additions. Why isn't vaddps used instead, which would allow for adding both values simultaneously?
The only thing I could come up with is that clang tries to preserve the higher two floats in the xmm registers. This is why I also tried increasing SIZE to 4, but now I get:
add(vec, vec):
vaddps xmm0, xmm0, xmm2
vaddps xmm1, xmm1, xmm3
vmovlhps xmm0, xmm0, xmm1
vmovaps xmmword ptr [rsp - 24], xmm0
vmovsd xmm0, qword ptr [rsp - 24]
vmovsd xmm1, qword ptr [rsp - 16]
ret
So for whatever reason, clang now doesn't even use the highest two floats and spreads the vectors across xmm0 to xmm3. An xmm register is 128 bits wide, so it should be able to fit all four floats. Then this code would be much simpler and only a single addition would be necessary.
(See Compiler Explorer)
Have a look at this example I constructed for a 4D dot product:
#pragma omp declare simd
double dot(double x0, double y0, double z0, double w0, double x1, double y1, double z1, double w1)
{
    return x0 * x1 + y0 * y1 + z0 * z1 + w0 * w1;
}

#define SIMD 4

int main(int argc, char **argv)
{
    double x[SIMD];
    double y[SIMD];
    double z[SIMD];
    double w[SIMD];
    double r[SIMD];

    for (int i = 0; i < SIMD; i++)
    {
        x[i] = y[i] = z[i] = 1;
        w[i] = 0;
    }

    #pragma omp simd
    for (int i = 0; i < SIMD; i++)
    {
        r[i] = dot(x[i], y[i], z[i], w[i], x[i], y[i], z[i], w[i]);
    }

    double s = 0;
    for (int i = 0; i < SIMD; i++)
    {
        s += r[i];
    }
    return s;
}
In the compiler output you can see that it generates a few functions called _XXXXXXvvvvvvvv_dot. I assume that these are the functions used for different lengths of input to the dot function, or at least that is what they are supposed to be. However, these functions do not seem to be actually used by the compiler. Line 94 of the output reads call dot(…). Does that call one of these functions? What do I have to do to use them?
Don't try to call the SIMD versions manually: let the compiler do that from a loop that it's auto-vectorizing.
You didn't enable optimization so GCC doesn't auto-vectorize your loops. Thus it only calls the scalar version of the function.
The GCC default is -O0, i.e. anti-optimize for debugging, so of course the code is total garbage and not actually auto-vectorized (no addpd or mulpd instructions).
Enable optimization with -O3. GCC will simply inline the calls when it can see the definition. The #pragma omp declare simd thing lets the compiler emit calls to vectorized versions of the function even if it can't see the definition. (Or for larger functions that it chooses not to inline.)
You can use __attribute__((noinline)) on dot to see how it works even for your small function:
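That is, something along these lines (the function body is unchanged from the question; only the attribute is added):

#pragma omp declare simd
__attribute__((noinline))
double dot(double x0, double y0, double z0, double w0,
           double x1, double y1, double z1, double w1)
{
    return x0 * x1 + y0 * y1 + z0 * z1 + w0 * w1;
}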
On Godbolt with GCC9.1 -O3 -fopenmp, with that change:
# gcc9.1 -O3 -fopenmp
main:
sub rsp, 40
movapd xmm0, XMMWORD PTR .LC0[rip] # {1, 1}
pxor xmm7, xmm7 # {0, 0}
movapd xmm3, xmm7
movapd xmm6, xmm0 # duplicate the 1,1 vector for several args
movapd xmm5, xmm0
movapd xmm4, xmm0
movapd xmm2, xmm0
movapd xmm1, xmm0
call _ZGVbN2vvvvvvvv_dot(double, double, double, double, double, double, double, double)
movaps XMMWORD PTR [rsp], xmm0 # store to the stack
movaps XMMWORD PTR [rsp+16], xmm0 # twice
pxor xmm0, xmm0 # 0.0
addsd xmm0, QWORD PTR [rsp] # 0 + v[0]
addsd xmm0, QWORD PTR [rsp+8] # ... += v[1]
addsd xmm0, QWORD PTR [rsp+16]
addsd xmm0, QWORD PTR [rsp+24] # stupid inefficient horizontal sum
add rsp, 40
cvttsd2si eax, xmm0 # truncate to integer as main's return value
ret
With your tiny #define SIMD 4, main doesn't actually need to loop at all; just two 16-byte vectors are sufficient. The arrays with compile-time-constant initializers get optimized away; GCC just materializes the constants into registers with pxor-zeroing for 0.0 and loading + copying from static constant data for 1.0.
So anyway, there's only one call to a SIMD version of dot(), but this is it. I think GCC knows that the same call will give the same result, which is why it calls once but stores the result twice.
IDK why GCC's OpenMP horizontal sum is so dumb. Obviously it would be better to addpd xmm0,xmm0 instead of storing it twice, and a shuffle could avoid a store/reload. Also using an addsd to do 0.0 + x is pointless; just use the low element of the register that you stored from.
The scalar version of dot() has the usual C++ name mangling for a function. The other versions have special name-mangling conventions, maybe specific to GCC's OpenMP, IDK.
Interestingly, gcc makes a few different versions of dot, including an AVX version using YMM registers. And some that spill to the stack and use scalar math in a loop; IDK why those exist.
So I guess that means that even if you compile this source file without -march=skylake-avx512, another loop that is compiled that way can still emit a call to _ZGVeN8vvvvvvvv_dot and get the AVX512 definition:
_ZGVeN8vvvvvvvv_dot(double, double, double, double, double, double, double, double):
vmulpd zmm1, zmm1, zmm5
vfmadd132pd zmm0, zmm1, zmm4
vfmadd231pd zmm0, zmm2, zmm6
vfmadd231pd zmm0, zmm3, zmm7
Strangely I don't see an AVX+FMA definition that uses FMA on YMM regs, only SSE2 and AVX definitions that use vmulpd / vaddpd.
The first version does an optimisation by moving a value from memory to a local variable. The second version does not.
I was expecting the compiler might choose to do the localValue optimisation here anyway and not read and write the value from memory for each iteration of the loop. Why doesn't it?
class Example
{
public:
    void processSamples(float * x, int num)
    {
        float localValue = v1;
        for (int i = 0; i < num; ++i)
        {
            x[i] = x[i] + localValue;
            localValue = 0.5 * x[i];
        }
        v1 = localValue;
    }

    void processSamples2(float * x, int num)
    {
        for (int i = 0; i < num; ++i)
        {
            x[i] = x[i] + v1;
            v1 = 0.5 * x[i];
        }
    }

    float v1;
};
processSamples assembles to code like this:
.L4:
addss xmm0, DWORD PTR [rax]
movss DWORD PTR [rax], xmm0
mulss xmm0, xmm1
add rax, 4
cmp rax, rcx
jne .L4
processSamples2 to this:
.L5:
movss xmm0, DWORD PTR [rax]
addss xmm0, DWORD PTR example[rip]
movss DWORD PTR [rax], xmm0
mulss xmm0, xmm1
movss DWORD PTR example[rip], xmm0
add rax, 4
cmp rax, rdx
jne .L5
Since the compiler doesn't have to worry about threads (v1 isn't atomic), can't it just assume nothing else will be looking at this value and go ahead and keep it in a register while the loop is spinning?
See https://godbolt.org/g/RiF3B4 for the full assembly and a selection of compilers to choose from!
Because of aliasing: v1 is a member variable, and it could be that x points at it. Thus, one of the writes to the elements of x might change v1.
In C99, you can use the restrict keyword on a function argument of pointer type to inform the compiler that it doesn't alias anything else that is in the scope of the function. Some C++ compilers also support it, although it is not standard.
(Copied from one of my comments.)
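For illustration, a sketch of the second version using the common (non-standard) __restrict extension; it promises the compiler that x doesn't alias anything else the function touches, including the member v1, so v1 can be kept in a register across the loop:

void processSamples2(float * __restrict x, int num)
{
    for (int i = 0; i < num; ++i)
    {
        x[i] = x[i] + v1;
        v1 = 0.5 * x[i];   // the compiler may now cache v1 and store it once after the loop
    }
}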