I have been playing around with SIMD OMP instructions and I am not getting the compiler to emit ANDPS in my scenario.
What I'm trying to do:
This is an implementation of this problem (tl;dr: find a pair of users with a common friend). My approach is to pack the friendship flags (friend or not) 64 at a time into an unsigned long long.
My SIMD approach: AND two users' relationship vectors together and reduce with an OR, which fits OMP's reduction pattern nicely.
g++ invocation (on a 2019 Intel i7 MacBook Pro):
g++-11 friends.cpp -S -O3 -fopenmp -fsanitize=address -Wshadow -Wall -march=native --std=c++17;
My implementation is below:
#include <vector>
#include <algorithm>
#include <iostream>
#include <string>
#include <cmath>
#include <numeric>

typedef long long ll;
typedef unsigned long long ull;
using namespace std;

ull find_sol(vector<vector<ull>> & input_data, int q) {
    bool not_friend = false;
    ull cnt = 0;
    int size_arr = (int) input_data[0].size();
    for (int i = 0; i < q; ++i) // from these friends
    {
        for (int j = i+1; j < q; ++j) // to these friends
        {
            int step = j/64;
            int remainder = j - 64*step;
            not_friend = (input_data[i].at(step) >> remainder) % 2 == 0;
            if(not_friend){
                bool counter = false;
                vector<ull> & v1 = input_data[i];
                vector<ull> & v2 = input_data[j];
                #pragma omp simd reduction(|:counter)
                for (int c = 0; c < size_arr; ++c)
                {
                    __asm__ ("entry");
                    counter |= (v1[c] & v2[c])>0;
                    __asm__ ("exit");
                }
                if(counter>0)
                    cnt++;
            }
        }
    }
    return cnt << 1;
}
int main(){
    int q;
    cin >> q;
    vector<vector<ull>> input_data(q, vector<ull>(1 + q/64, 0ULL));
    for (int i = 0; i < q; ++i)
    {
        string s;
        cin >> s;
        for (int j = 0; j < 1 + q/64; ++j)
        {
            string str = s.substr(j*64, 64);
            reverse(str.begin(), str.end());
            ull ul = std::stoull(str, nullptr, 2);
            input_data.at(i).at(j) = ul;
        }
    }
    cout << find_sol(input_data, q) << endl;
}
Looking at the assembly inside the loop, I would expect some SIMD instructions (specifically andps), but I can't see any. What's preventing the compiler from emitting them? Also, is there a way to get the compiler to emit a warning about what's wrong (that would be very helpful)?
entry
# 0 "" 2
cmpb $0, (%rbx)
jne L53
movq (%r8), %rdx
leaq 0(,%rax,8), %rdi
addq %rdi, %rdx
movq %rdx, %r15
shrq $3, %r15
cmpb $0, (%r15,%rcx)
jne L54
cmpb $0, (%r11)
movq (%rdx), %rdx
jne L55
addq (%r9), %rdi
movq %rdi, %r15
shrq $3, %r15
cmpb $0, (%r15,%rcx)
jne L56
andq (%rdi), %rdx
movzbl (%r12), %edx
setne %dil
cmpb %r13b, %dl
jg L21
testb %dl, %dl
jne L57
L21:
orb %dil, -32(%r10)
EDIT 1:
Following Peter's 1st and 2nd suggestions, I moved the markers out of the loop and replaced the booleanization with a simple OR. I'm still not getting SIMD instructions though:
ull counter = 0;
vector<ull> & v1 = input_data[i];
vector<ull> & v2 = input_data[j];
__asm__ ("entry" :::);
#pragma omp simd reduction(|:counter)
for (int c = 0; c < size_arr; ++c)
{
counter |= v1[c] & v2[c];
}
__asm__ ("exit" :::);
if(counter!=0)
cnt++;
First problem: asm. In recent GCC, non-empty Basic Asm statements like __asm__ ("entry"); have an implicit ::: "memory" clobber, making it impossible for the compiler to combine array accesses across iterations. Maybe try __asm__ ("entry" :::); if you really want these markers. (Extended asm without a memory clobber).
Better yet, use proper tools for looking at compiler output, such as the Godbolt compiler explorer (https://godbolt.org/) which lets you right click on a source line and go to the corresponding asm. (Optimization can make this a bit wonky, so sometimes you have to find the asm and mouseover it to make sure it comes from that source line.)
See How to remove "noise" from GCC/clang assembly output?
Second problem: -fsanitize=address makes it harder for the compiler to optimize. I only looked at GCC output without that option.
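For reference, that's the compile command from the question with just the sanitizer dropped (everything else unchanged):
g++-11 friends.cpp -S -O3 -fopenmp -Wshadow -Wall -march=native --std=c++17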
Vectorizing the OR reduction
After fixing those showstoppers:
You're forcing the compiler to booleanize to an 8-bit bool inside the inner loop, instead of just reducing the integer AND results with |= into a variable of the same type. (Which you check once after the loop.) This is probably part of why GCC has a hard time; it often makes a mess with different-sized integer types when it vectorizes at all.
Booleanizing with (v1[c] & v2[c]) > 0 would need SSE4.1 pcmpeqq, vs. just doing a SIMD OR in the loop and checking counter for != 0 after the loop. (You had bool counter, which was really surprising given counter > 0 as a semantically weird way to check an unsigned value for non-zero; even more unexpected for a bool.)
After changing that, GCC auto-vectorizes the way I expected without OpenMP, if you use -O3 (which includes -ftree-vectorize). It of course uses vpand, not vandps, since FP booleans like vandps have lower throughput on some CPUs. (You didn't say what -march=native is for you; if you only had AVX1, e.g. on Sandy Bridge, then vandps would be plausible.)
ull counter = 0;
// #pragma omp simd reduction(|:counter)
for (int c = 0; c < size_arr; ++c)
{
//__asm__ ("entry");
counter |= (v1[c] & v2[c]);
//__asm__ ("exit");
}
if(counter != 0)
cnt++;
From the Godbolt compiler explorer (which you should use instead of littering your code with asm statements)
# g++ 11.2 -O3 -march=skylake **without** OpenMP
.L7: # the vector part of the inner-most loop
vmovdqu ymm2, YMMWORD PTR [rsi+rax]
vpand ymm0, ymm2, YMMWORD PTR [rcx+rax]
add rax, 32
vpor ymm1, ymm1, ymm0
cmp rax, r8
jne .L7
vextracti128 xmm0, ymm1, 0x1
vpor xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 8
... (horizontal OR reduction of that one SIMD vector, eventually vmovq to RAX)
GCC OpenMP does vectorize, but badly / weirdly
With OpenMP, there is a vectorized version of the loop, but it sucks a lot, doing shuffles and gather loads, and storing results into a local buffer which it later reads. I don't know OpenMP that well, but unless you're using it wrong, this is a major missed optimization. Possibly it's scaling a loop counter with multiplies instead of incrementing a pointer, which is just horrible.
(Godbolt)
# g++ 11.2 -Wall -O3 -fopenmp -march=skylake -std=gnu++17
# with the #pragma uncommented
.L10:
vmovdqa ymm0, ymm3
vpermq ymm0, ymm0, 216
vpshufd ymm1, ymm0, 80 # unpack for 32x32 => 64-bit multiplies?
vpmuldq ymm1, ymm1, ymm4
vpshufd ymm0, ymm0, 250
vpmuldq ymm0, ymm0, ymm4
vmovdqa ymm7, ymm6 # ymm6 = set1(-1) outside the loop, gather mask
add rsi, 64
vpaddq ymm1, ymm1, ymm5
vpgatherqq ymm2, QWORD PTR [0+ymm1*1], ymm7
vpaddq ymm0, ymm0, ymm5
vmovdqa ymm7, ymm6
vpgatherqq ymm1, QWORD PTR [0+ymm0*1], ymm7
vpand ymm0, ymm1, YMMWORD PTR [rsi-32] # memory source = one array
vpand ymm1, ymm2, YMMWORD PTR [rsi-64]
vpor ymm0, ymm0, YMMWORD PTR [rsp+64] # OR with old contents of local buffer
vpor ymm1, ymm1, YMMWORD PTR [rsp+32]
vpaddd ymm3, ymm3, ymm4
vmovdqa YMMWORD PTR [rsp+32], ymm1 # and store back into it.
vmovdqa YMMWORD PTR [rsp+64], ymm0
cmp r9, rsi
jne .L10
mov edi, DWORD PTR [rsp+16] # outer loop tail
cmp DWORD PTR [rsp+20], edi
je .L7
This buffer of 64 bytes is read at the top of .L7 (an outer loop)
.L7:
vmovdqa ymm2, YMMWORD PTR [rsp+32]
vpor ymm1, ymm2, YMMWORD PTR [rsp+64]
vextracti128 xmm0, ymm1, 0x1
vpor xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 8
vpor xmm0, xmm0, xmm1
vmovq rsi, xmm0
cmp rsi, 1 # sets CF unless RSI=0
sbb r13, -1 # R13 -= -1 +CF i.e. increment if CF=0
IDK if there's a way to hand-hold the compiler into making better asm; perhaps with pointer-width loop counters?
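For example, a sketch of that idea (my guess, untested; whether it actually improves GCC's code-gen here is an open question). ptrdiff_t comes from <cstddef>:
ull counter = 0;
const ull *p1 = v1.data();
const ull *p2 = v2.data();
#pragma omp simd reduction(|:counter)
for (ptrdiff_t c = 0; c < (ptrdiff_t)size_arr; ++c)
    counter |= p1[c] & p2[c];
if (counter != 0)
    cnt++;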
GCC5.4 -O3 -fopenmp -march=haswell -std=gnu++17 makes sane asm, with just vpand / vpor and an array index increment in the loop. The stuff outside the loop is a bit different with OpenMP vs. plain vectorization, with OpenMP using vector store / scalar reload for the horizontal OR reduction of the final vector.
Related
My program adds float arrays and is unrolled 4x when compiled with max optimizations by MSVC and G++. I didn't understand why both compilers chose to unroll 4x, so I did some testing: a t-test on runtimes of manual unrolling only occasionally showed a difference, with 1-vs-2 or 1-vs-4 iterations sometimes giving a p-value around 0.03, 2-vs-4 rarely < 0.05, and 2-vs-8+ always > 0.05.
Whether I set the compiler to use 128-bit or 256-bit vectors, it always unrolled 4x, which works out to a multiple of the 64-byte cache line (significant, or a coincidence?).
The reason I'm thinking about cache lines is that I didn't expect unrolling to have any impact for a memory-bound program that sequentially reads and writes gigabytes of floats. Should there be a benefit to unrolling in this case? It's also possible there was no significant difference and my sample size wasn't large enough.
I found this blog that says manually unrolling an array copy is faster for medium-sized arrays and streaming is fastest for longer arrays. Their AvxAsyncPFCopier and AvxAsyncPFUnrollCopier functions seem to benefit from using whole cache lines as well as manual unrolling. Benchmark in the blog, with source here.
#include <iostream>
#include <immintrin.h>
int main() {
// example of manually unrolling float arrays
size_t bytes = sizeof(__m256) * 10;
size_t alignment = sizeof(__m256);
// 10 x 32-byte vectors
__m256* a = (__m256*) _mm_malloc(bytes, alignment);
__m256* b = (__m256*) _mm_malloc(bytes, alignment);
__m256* c = (__m256*) _mm_malloc(bytes, alignment);
for (int i = 0; i < 10; i += 2) {
// cache miss?
// load 2 x 64-byte cache lines:
// 2 x 32-byte vectors from b
// 2 x 32-byte vectors from c
a[i + 0] = _mm256_add_ps(b[i + 0], c[i + 0]);
// cache hit?
a[i + 1] = _mm256_add_ps(b[i + 1], c[i + 1]);
// special bonus for consuming whole cache lines?
}
}
Original source for 3 unique float arrays
for (int64_t i = 0; i < size; ++i) {
a[i] = b[i] + c[i];
}
MSVC with AVX2 instructions
a[i] = b[i] + c[i];
00007FF7E2522370 vmovups ymm2,ymmword ptr [rax+rcx]
00007FF7E2522375 vmovups ymm1,ymmword ptr [rcx+rax-20h]
00007FF7E252237B vaddps ymm1,ymm1,ymmword ptr [rax-20h]
00007FF7E2522380 vmovups ymmword ptr [rdx+rax-20h],ymm1
00007FF7E2522386 vaddps ymm1,ymm2,ymmword ptr [rax]
00007FF7E252238A vmovups ymm2,ymmword ptr [rcx+rax+20h]
00007FF7E2522390 vmovups ymmword ptr [rdx+rax],ymm1
00007FF7E2522395 vaddps ymm1,ymm2,ymmword ptr [rax+20h]
00007FF7E252239A vmovups ymm2,ymmword ptr [rcx+rax+40h]
00007FF7E25223A0 vmovups ymmword ptr [rdx+rax+20h],ymm1
00007FF7E25223A6 vaddps ymm1,ymm2,ymmword ptr [rax+40h]
00007FF7E25223AB add r9,20h
00007FF7E25223AF vmovups ymmword ptr [rdx+rax+40h],ymm1
00007FF7E25223B5 lea rax,[rax+80h]
00007FF7E25223BC cmp r9,r10
00007FF7E25223BF jle main$omp$2+0E0h (07FF7E2522370h)
MSVC with default instructions
a[i] = b[i] + c[i];
00007FF71ECB2372 movups xmm0,xmmword ptr [rax-10h]
00007FF71ECB2376 add r9,10h
00007FF71ECB237A movups xmm1,xmmword ptr [rcx+rax-10h]
00007FF71ECB237F movups xmm2,xmmword ptr [rax+rcx]
00007FF71ECB2383 addps xmm1,xmm0
00007FF71ECB2386 movups xmm0,xmmword ptr [rax]
00007FF71ECB2389 addps xmm2,xmm0
00007FF71ECB238C movups xmm0,xmmword ptr [rax+10h]
00007FF71ECB2390 movups xmmword ptr [rdx+rax-10h],xmm1
00007FF71ECB2395 movups xmm1,xmmword ptr [rcx+rax+10h]
00007FF71ECB239A movups xmmword ptr [rdx+rax],xmm2
00007FF71ECB239E movups xmm2,xmmword ptr [rcx+rax+20h]
00007FF71ECB23A3 addps xmm1,xmm0
00007FF71ECB23A6 movups xmm0,xmmword ptr [rax+20h]
00007FF71ECB23AA addps xmm2,xmm0
00007FF71ECB23AD movups xmmword ptr [rdx+rax+10h],xmm1
00007FF71ECB23B2 movups xmmword ptr [rdx+rax+20h],xmm2
00007FF71ECB23B7 add rax,40h
00007FF71ECB23BB cmp r9,r10
00007FF71ECB23BE jle main$omp$2+0D2h (07FF71ECB2372h)
This question is similar to [1]. However, I didn't quite understand how it addressed inserting into the high quadwords of a ymm using a GPR. Additionally, I want the operation not to use any intermediate memory accesses.
Can it be done with AVX2 or below (I don't have AVX512)?
[1] How to move double in %rax into particular qword position on %ymm or %zmm? (Kaby Lake or later)
My answer on the linked question didn't show a way to do that because it can't be done very efficiently without AVX512F for a masked broadcast (vpbroadcastq zmm0{k1}, rax). But it's actually not all that bad using a scratch register, about the same cost as a vpinsrq + an immediate blend.
(On Intel, 3 uops total: 2 uops for port 5 (vmovq + broadcast), plus an immediate blend that can run on any port. See https://agner.org/optimize/.)
I updated my answer there with asm for this. In C++ with Intel's intrinsics, you'd do something like:
#include <immintrin.h>
#include <stdint.h>
// integer version. An FP version would still use _mm256_set1_epi64x, then a cast
template<unsigned elem>
static inline
__m256i merge_epi64(__m256i v, int64_t newval)
{
static_assert(elem <= 3, "a __m256i only has 4 qword elements");
__m256i splat = _mm256_set1_epi64x(newval);
constexpr unsigned dword_blendmask = 0b11 << (elem*2); // vpblendd uses 2 bits per qword
return _mm256_blend_epi32(v, splat, dword_blendmask);
}
Clang compiles this nearly perfectly efficiently for all 4 possible element positions, which really shows off how nice its shuffle optimizer is. It takes advantage of all the special cases. And as a bonus, it comments its asm to show you which elements come from where in blends and shuffles.
From the Godbolt compiler explorer, some test functions to see what happens with args in regs.
__m256i merge3(__m256i v, int64_t newval) {
return merge_epi64<3> (v, newval);
}
// and so on for 2..0
# clang7.0 -O3 -march=haswell
merge3(long long __vector(4), long):
vmovq xmm1, rdi
vpbroadcastq ymm1, xmm1
vpblendd ymm0, ymm0, ymm1, 192 # ymm0 = ymm0[0,1,2,3,4,5],ymm1[6,7]
# 192 = 0xC0 = 0b11000000
ret
merge2(long long __vector(4), long):
vmovq xmm1, rdi
vinserti128 ymm1, ymm0, xmm1, 1 # Runs on more ports than vbroadcast on AMD Ryzen
# But it introduced a dependency on v (ymm0) before the blend for no reason, for the low half of ymm1. Could have used xmm1, xmm1.
vpblendd ymm0, ymm0, ymm1, 48 # ymm0 = ymm0[0,1,2,3],ymm1[4,5],ymm0[6,7]
ret
merge1(long long __vector(4), long):
vmovq xmm1, rdi
vpbroadcastq xmm1, xmm1 # only an *XMM* broadcast, 1c latency instead of 3.
vpblendd ymm0, ymm0, ymm1, 12 # ymm0 = ymm0[0,1],ymm1[2,3],ymm0[4,5,6,7]
ret
merge0(long long __vector(4), long):
vmovq xmm1, rdi
# broadcast optimized away, newval is already in the low element
vpblendd ymm0, ymm0, ymm1, 3 # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
ret
Other compilers blindly broadcast to the full YMM and then blend, even for elem=0. You can specialize the template, or add if() conditions in the template that will optimize away, e.g. choosing splat based on elem so the broadcast is skipped for elem==0 (see the sketch below). You could capture the other optimizations, too, if you wanted.
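For instance, a sketch of that elem==0 special case (my addition, not from the original answer; it assumes C++17 for if constexpr and produces the same vpblendd immediate as before):
template<unsigned elem>
static inline __m256i merge_epi64_v2(__m256i v, int64_t newval)
{
    static_assert(elem <= 3, "a __m256i only has 4 qword elements");
    __m256i splat;
    if constexpr (elem == 0)
        // just a vmovq; the upper lanes of splat don't matter because the
        // blend only reads its low qword for elem==0
        splat = _mm256_castsi128_si256(_mm_cvtsi64_si128(newval));
    else
        splat = _mm256_set1_epi64x(newval);
    constexpr unsigned dword_blendmask = 0b11 << (elem*2); // vpblendd uses 2 bits per qword
    return _mm256_blend_epi32(v, splat, dword_blendmask);
}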
GCC 8.x and earlier use a normally-bad way of broadcasting the integer: they store/reload. This avoids using any ALU shuffle ports because broadcast-loads are free on Intel CPUs, but it introduces store-forwarding latency into the chain from the integer to the final vector result.
This is fixed in current trunk for gcc9, but I don't know if there's a workaround to get non-silly code-gen with earlier gcc. Normally -march=<an intel uarch> favours ALU instead of store/reload for integer -> vector and vice versa, but in this case the cost model still picks store/reload with -march=haswell.
# gcc8.2 -O3 -march=haswell
merge0(long long __vector(4), long):
push rbp
mov rbp, rsp
and rsp, -32 # align the stack even though no YMM is spilled/loaded
mov QWORD PTR [rsp-8], rdi
vpbroadcastq ymm1, QWORD PTR [rsp-8] # 1 uop on Intel
vpblendd ymm0, ymm0, ymm1, 3
leave
ret
; GCC trunk: g++ (GCC-Explorer-Build) 9.0.0 20190103 (experimental)
; MSVC and ICC do this, too. (For MSVC, make sure to compile with -arch:AVX2)
merge0(long long __vector(4), long):
vmovq xmm2, rdi
vpbroadcastq ymm1, xmm2
vpblendd ymm0, ymm0, ymm1, 3
ret
For a runtime-variable element position, the shuffle still works but you'd have to create a blend mask vector with the high bit set in the right element. e.g. with a vpmovsxbq load from mask[3-elem] in alignas(8) int8_t mask[] = { 0,0,0,-1,0,0,0 };. But vpblendvb or vblendvpd is slower than an immediate blend, especially on Haswell, so avoid that if possible.
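A sketch of that fallback (my addition, untested; the mask table is the one described above, and reading 4 bytes for the vpmovsxbq source is my assumption):
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

__m256i merge_epi64_var(__m256i v, int64_t newval, unsigned elem)
{
    // the -1 byte ends up in qword 'elem' after sign extension
    alignas(8) static const int8_t maskbytes[] = { 0,0,0,-1,0,0,0 };
    int32_t m;
    memcpy(&m, &maskbytes[3 - elem], sizeof(m));                      // 4 bytes, one per qword
    __m256i blendmask = _mm256_cvtepi8_epi64(_mm_cvtsi32_si128(m));   // vpmovsxbq
    __m256i splat = _mm256_set1_epi64x(newval);
    return _mm256_blendv_epi8(v, splat, blendmask);                   // vpblendvb: slower than an immediate blend
}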
Consider the following small search function:
template <uint32_t N>
int32_t countsearch(const uint32_t *base, uint32_t needle) {
uint32_t count = 0;
#pragma clang loop vectorize(disable)
for (const uint32_t *probe = base; probe < base + N; probe++) {
if (*probe < needle)
count++;
}
return count;
}
At -O2 or higher, clang vectorizes this search, e.g., resulting in code like this (for 10 elements):
int countsearch<10u>(unsigned int const*, unsigned int): # #int countsearch<10u>(unsigned int const*, unsigned int)
vmovd xmm0, esi
vpbroadcastd ymm0, xmm0
vpbroadcastd ymm1, dword ptr [rip + .LCPI0_0] # ymm1 = [2147483648,2147483648,2147483648,2147483648,2147483648,2147483648,2147483648,2147483648]
vpxor ymm2, ymm1, ymmword ptr [rdi]
vpxor ymm0, ymm0, ymm1
vpcmpgtd ymm0, ymm0, ymm2
cmp dword ptr [rdi + 32], esi
vpsrld ymm1, ymm0, 31
vextracti128 xmm1, ymm1, 1
vpsubd ymm0, ymm1, ymm0
vpshufd xmm1, xmm0, 78 # xmm1 = xmm0[2,3,0,1]
vpaddd ymm0, ymm0, ymm1
vphaddd ymm0, ymm0, ymm0
vmovd eax, xmm0
adc eax, 0
cmp dword ptr [rdi + 36], esi
adc eax, 0
vzeroupper
ret
How can I disable this vectorization on the command line or using a #pragma in the code?
I tried the following command line arguments, none of which prevented the vectorization:
-disable-loop-vectorization
-disable-vectorization
-fno-vectorize
-fno-tree-vectorize
I also tried #pragma clang loop vectorize(disable) above the loop, as you can see in the code above, without luck.
Turn off SLP vectorization:
clang++ -O2 -fno-slp-vectorize
Godbolt Link
(The #pragma clang loop vectorize(disable) only disables the loop vectorizer; clang fully unrolls this short, known-trip-count loop and the SLP vectorizer then vectorizes the resulting straight-line code, which is why -fno-slp-vectorize is the flag that helps.)
Is there a way to XOR horizontally an AVX register—specifically, to XOR the four 64-bit components of a 256-bit register?
The goal is to get the XOR of all 4 64-bit components of an AVX register. It would essentially be doing the same thing as a horizontal add (_mm256_hadd_epi32()), except that I want to XOR instead of ADD.
The scalar code is:
inline uint64_t HorizontalXor(__m256i t) {
return t.m256i_u64[0] ^ t.m256i_u64[1] ^ t.m256i_u64[2] ^ t.m256i_u64[3];
}
As stated in the comments, the fastest code very likely uses scalar operations, doing everything in the integer registers. All you need to do is extract the four packed 64-bit integers, then you have three XOR instructions, and you're done. This can be done pretty efficiently, and it leaves the result in an integer register, which is what your sample code suggests that you would want.
MSVC already generates pretty good code for the scalar function that you show as an example in the question:
inline uint64_t HorizontalXor(__m256i t) {
return t.m256i_u64[0] ^ t.m256i_u64[1] ^ t.m256i_u64[2] ^ t.m256i_u64[3];
}
Assuming that t is in ymm1, the resulting disassembly will be something like this:
vextractf128 xmm0, ymm1, 1
vpextrq rax, xmm0, 1
vmovq rcx, xmm1
xor rax, rcx
vpextrq rcx, xmm1, 1
vextractf128 xmm0, ymm1, 1
xor rax, rcx
vmovq rcx, xmm0
xor rax, rcx
…with the result left in RAX. If this accurately reflects what you need (a scalar uint64_t result), then this code would be sufficient.
You can slightly improve it by using intrinsics:
inline uint64_t _mm256_hxor_epu64(__m256i x)
{
const __m128i temp = _mm256_extracti128_si256(x, 1);
return (uint64_t&)x
^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
^ (uint64_t&)(temp)
^ (uint64_t)(_mm_extract_epi64(temp, 1));
}
Then you'll get the following disassembly (again, assuming that x is in ymm1):
vextracti128 xmm2, ymm1, 1
vpextrq rcx, xmm2, 1
vpextrq rax, xmm1, 1
xor rax, rcx
vmovq rcx, xmm1
xor rax, rcx
vmovq rcx, xmm2
xor rax, rcx
Notice that we were able to elide one extraction instruction, and that we've ensured VEXTRACTI128 was used instead of VEXTRACTF128 (although, this choice probably does not matter).
You'll see similar output on other compilers. For example, here's GCC 7.1 (with x assumed to be in ymm0):
vextracti128 xmm2, ymm0, 0x1
vpextrq rax, xmm0, 1
vmovq rdx, xmm2
vpextrq rcx, xmm2, 1
xor rax, rdx
vmovq rdx, xmm0
xor rax, rdx
xor rax, rcx
The same instructions are there, but they've been slightly reordered. The intrinsics allow the compiler's scheduler to order as it deems best. Clang 4.0 schedules them differently yet:
vmovq rax, xmm0
vpextrq rcx, xmm0, 1
xor rcx, rax
vextracti128 xmm0, ymm0, 1
vmovq rdx, xmm0
xor rdx, rcx
vpextrq rax, xmm0, 1
xor rax, rdx
And, of course, this ordering is always subject to change when the code gets inlined.
On the other hand, if you want the result to be in an AVX register, then you first need to decide how you want it to be stored. I guess you would just store the single 64-bit result as a scalar, something like:
inline __m256i _mm256_hxor(__m256i x)
{
const __m128i temp = _mm256_extracti128_si256(x, 1);
return _mm256_set1_epi64x((uint64_t&)x
^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
^ (uint64_t&)(temp)
^ (uint64_t)(_mm_extract_epi64(temp, 1)));
}
But now you're doing a lot of data shuffling, negating any performance boost that you would possibly see from vectorizing the code.
Speaking of which, I'm not really sure how you got yourself into a situation where you need to do horizontal operations like this in the first place. SIMD operations are designed to scale vertically, not horizontally. If you're still in the implementation phase, it may be appropriate to reconsider the design. In particular, you should be generating the 4 integer values in 4 different AVX registers, rather than packing them all into one.
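To illustrate the scale-vertically point (a sketch of my own, not from the question): if the values come from a stream of SIMD computations, XOR the vectors together vertically and do a single horizontal XOR only at the very end:
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

uint64_t xor_reduce_stream(const __m256i *v, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; ++i)
        acc = _mm256_xor_si256(acc, _mm256_loadu_si256(v + i));  // vertical XOR, no shuffles
    // one horizontal XOR of the final accumulator
    __m128i x128 = _mm_xor_si128(_mm256_castsi256_si128(acc),
                                 _mm256_extracti128_si256(acc, 1));
    __m128i x64  = _mm_xor_si128(x128, _mm_unpackhi_epi64(x128, x128));
    return (uint64_t)_mm_cvtsi128_si64(x64);
}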
If you actually want 4 copies of the result packed into an AVX register, then you could do something like this:
inline __m256i _mm256_hxor(__m256i x)
{
const __m256i temp = _mm256_xor_si256(x,
_mm256_permute2f128_si256(x, x, 1));
return _mm256_xor_si256(temp,
_mm256_shuffle_epi32(temp, _MM_SHUFFLE(1, 0, 3, 2)));
}
This still exploits a bit of parallelism by doing two XORs at once, meaning that only two XOR operations are required in all, instead of three.
If it helps to visualize it, this basically does:
A B C D ⟵ input
XOR XOR XOR XOR
C D A B ⟵ permuted input
=====================================
A^C B^D A^C B^D ⟵ intermediate result
XOR XOR XOR XOR
B^D A^C B^D A^C ⟵ shuffled intermediate result
======================================
A^C^B^D A^C^B^D A^C^B^D A^C^B^D ⟵ final result
On practically all compilers, these intrinsics will produce the following assembly code:
vperm2f128 ymm0, ymm1, ymm1, 1 ; input is in YMM1
vpxor ymm2, ymm0, ymm1
vpshufd ymm1, ymm2, 78
vpxor ymm0, ymm1, ymm2
(I had come up with this on my way to bed after first posting this answer, and planned to come back and update the answer, but I see that wim beat me to the punch on posting it. Oh well, it's still a better approach than what I first had, so it still merits being included here.)
And, of course, if you wanted this in an integer register, you would just need a simple VMOVQ:
vperm2f128 ymm0, ymm1, ymm1, 1 ; input is in YMM1
vpxor ymm2, ymm0, ymm1
vpshufd ymm1, ymm2, 78
vpxor ymm0, ymm1, ymm2
vmovq rax, xmm0
The question is, would this be faster than the scalar code above. And the answer is, yes, probably. Although you are doing the XORs using the AVX execution units, instead of the completely separate integer execution units, there are fewer AVX shuffles/permutes/extracts that need to be done, which means less overhead. So I might also have to eat my words on scalar code being the fastest implementation. But it really depends on what you're doing and how the instructions can be scheduled/interleaved.
Vectorization is likely to be useful if the input of the horizontal xor-function is already in an AVX register, i.e. your t is the result of some SIMD computation.
Otherwise, scalar code is likely to be faster, as already mentioned by @Cody Gray.
Often you can do horizontal SIMD operations in about log_2(SIMD_width) 'steps'.
In this case one step is a 'shuffle/permute' and a 'xor'. This is slightly more efficient than @Cody Gray's _mm256_hxor function:
__m256i _mm256_hxor_v2(__m256i x)
{
__m256i x0 = _mm256_permute2x128_si256(x,x,1); // swap the 128 bit high and low lane
__m256i x1 = _mm256_xor_si256(x,x0);
__m256i x2 = _mm256_shuffle_epi32(x1,0b01001110); // swap 64 bit lanes
__m256i x3 = _mm256_xor_si256(x1,x2);
return x3;
}
This compiles to:
vperm2i128 $1, %ymm0, %ymm0, %ymm1
vpxor %ymm1, %ymm0, %ymm0
vpshufd $78, %ymm0, %ymm1
vpxor %ymm1, %ymm0, %ymm0
If you want the result in a scalar register:
uint64_t _mm256_hxor_v2_uint64(__m256i x)
{
__m256i x0 = _mm256_permute2x128_si256(x,x,1);
__m256i x1 = _mm256_xor_si256(x,x0);
__m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
__m256i x3 = _mm256_xor_si256(x1,x2);
return _mm_cvtsi128_si64x(_mm256_castsi256_si128(x3)) ;
}
Which compiles to:
vperm2i128 $1, %ymm0, %ymm0, %ymm1
vpxor %ymm1, %ymm0, %ymm0
vpshufd $78, %ymm0, %ymm1
vpxor %ymm1, %ymm0, %ymm0
vmovq %xmm0, %rax
Full test code:
#include <stdio.h>
#include <x86intrin.h>
#include <stdint.h>
/* gcc -O3 -Wall -m64 -march=broadwell hor_xor.c */
int print_vec_uint64(__m256i v);
__m256i _mm256_hxor_v2(__m256i x)
{
__m256i x0 = _mm256_permute2x128_si256(x,x,1);
__m256i x1 = _mm256_xor_si256(x,x0);
__m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
__m256i x3 = _mm256_xor_si256(x1,x2);
/* Uncomment the next few lines to print the values of the intermediate variables */
/*
printf("3...0 = 3 2 1 0\n");
printf("x = ");print_vec_uint64(x );
printf("x0 = ");print_vec_uint64(x0 );
printf("x1 = ");print_vec_uint64(x1 );
printf("x2 = ");print_vec_uint64(x2 );
printf("x3 = ");print_vec_uint64(x3 );
*/
return x3;
}
uint64_t _mm256_hxor_v2_uint64(__m256i x)
{
__m256i x0 = _mm256_permute2x128_si256(x,x,1);
__m256i x1 = _mm256_xor_si256(x,x0);
__m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
__m256i x3 = _mm256_xor_si256(x1,x2);
return _mm_cvtsi128_si64x(_mm256_castsi256_si128(x3)) ;
}
int main() {
__m256i x = _mm256_set_epi64x(0x7, 0x5, 0x2, 0xB);
// __m256i x = _mm256_set_epi64x(4235566778345231, 1123312566778345423, 72345566778345673, 967856775433457);
printf("x = ");print_vec_uint64(x);
__m256i y = _mm256_hxor_v2(x);
printf("y = ");print_vec_uint64(y);
uint64_t z = _mm256_hxor_v2_uint64(x);
printf("z = %10lX \n",z);
return 0;
}
int print_vec_uint64(__m256i v){
uint64_t t[4];
_mm256_storeu_si256((__m256i *)t,v);
printf("%10lX %10lX %10lX %10lX \n",t[3],t[2],t[1],t[0]);
return 0;
}
An implementation of a direct analogue of _mm256_hadd_epi32() for XOR would look something like this:
#include <immintrin.h>
template<int imm> inline __m256i _mm256_shuffle_epi32(__m256i a, __m256i b)
{
return _mm256_castps_si256(_mm256_shuffle_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(b), imm));
}
inline __m256i _mm256_hxor_epi32(__m256i a, __m256i b)
{
return _mm256_xor_si256(_mm256_shuffle_epi32<0x88>(a, b), _mm256_shuffle_epi32<0xDD>(a, b));
}
int main()
{
__m256i a = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
__m256i b = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
__m256i c = _mm256_hxor_epi32(a, b);
return 0;
}
I tried to change this code to handle std::vector<int>.
float accumulate(const std::vector<float>& v)
{
// copy the length of v and a pointer to the data onto the local stack
const size_t N = v.size();
const float* p = (N > 0) ? &v.front() : NULL;
__m128 mmSum = _mm_setzero_ps();
size_t i = 0;
// unrolled loop that adds up 4 elements at a time
for(; i < ROUND_DOWN(N, 4); i+=4)
{
mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i));
}
// add up single values until all elements are covered
for(; i < N; i++)
{
mmSum = _mm_add_ss(mmSum, _mm_load_ss(p + i));
}
// add up the four float values from mmSum into a single value and return
mmSum = _mm_hadd_ps(mmSum, mmSum);
mmSum = _mm_hadd_ps(mmSum, mmSum);
return _mm_cvtss_f32(mmSum);
}
Ref: http://fastcpp.blogspot.com.au/2011/04/how-to-process-stl-vector-using-sse.html
I changed _mm_setzero_ps to _mm_setzero_si128, _mm_loadu_ps to _mm_loadl_epi64, and _mm_add_ps to _mm_add_epi64.
I get this error:
error: cannot convert ‘const int*’ to ‘const __m128i* {aka const __vector(2) long long int*}’ for argument ‘1’ to ‘__m128i _mm_loadl_epi64(const __m128i*)’
mmSum = _mm_add_epi64(mmSum, _mm_loadl_epi64(p + i + 0));
I am novice in this field. Is there any good source to learn these things?
Here is an int version which I just threw together:
#include <iostream>
#include <vector>
#include <smmintrin.h> // SSE4
#define ROUND_DOWN(m, n) ((m) & ~((n) - 1))
static int accumulate(const std::vector<int>& v)
{
// copy the length of v and a pointer to the data onto the local stack
const size_t N = v.size();
const int* p = (N > 0) ? &v.front() : NULL;
__m128i mmSum = _mm_setzero_si128();
int sum = 0;
size_t i = 0;
// unrolled loop that adds up 4 elements at a time
for(; i < ROUND_DOWN(N, 4); i+=4)
{
mmSum = _mm_add_epi32(mmSum, _mm_loadu_si128((__m128i *)(p + i)));
}
// add up the four int values from mmSum into a single value
mmSum = _mm_hadd_epi32(mmSum, mmSum);
mmSum = _mm_hadd_epi32(mmSum, mmSum);
sum = _mm_extract_epi32(mmSum, 0);
// add up single values until all elements are covered
for(; i < N; i++)
{
sum += p[i];
}
return sum;
}
int main()
{
std::vector<int> v;
for (int i = 0; i < 10; ++i)
{
v.push_back(i);
}
int sum = accumulate(v);
std::cout << sum << std::endl;
return 0;
}
Compile and run:
$ g++ -Wall -msse4 -O3 accumulate.cpp && ./a.out
45
The ideal way to do this is to let the compiler auto-vectorize your code and keep your code simple and readable. You should not need anything more than
int sum = 0;
for(int i=0; i<v.size(); i++) sum += v[i];
The link you pointed to, http://fastcpp.blogspot.com.au/2011/04/how-to-process-stl-vector-using-sse.html, does not seem to understand how to make the compiler vectorize the code.
For floating point, which is what that link uses, what you need to know is that floating-point arithmetic is not associative, so the result depends on the order in which you do the reduction. GCC, MSVC, and Clang will not auto-vectorize a reduction unless you tell them to use a different floating-point model; otherwise your result could depend on your hardware. ICC, however, defaults to associative floating-point math, so it will vectorize the code with e.g. -O3.
Not only will GCC, MSVC, and Clang not vectorize unless associative math is allowed, they also won't unroll the loop into partial sums to overcome the latency of the summation. In this case only Clang and ICC unroll into partial sums anyway: Clang unrolls four times and ICC twice.
One way to enable associative floating point arithmetic with GCC is with the -Ofast flag. With MSVC use /fp:fast
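(As an alternative that doesn't change the FP model globally, an OpenMP SIMD reduction also gives the compiler permission to reorder the sum; a sketch, assuming you compile with -fopenmp or -fopenmp-simd:)
float sumf_omp(const float *x, int n)
{
    float sum = 0;
    #pragma omp simd reduction(+:sum)
    for(int i=0; i<n; i++) sum += x[i];
    return sum;
}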
I tested the code below with GCC 4.9.2, Xeon E5-1620 (IVB) @ 3.60GHz, Ubuntu 15.04.
-O3 -mavx -fopenmp 0.93 s
-Ofast -mavx -fopenmp 0.19 s
-Ofast -mavx -fopenmp -funroll-loops 0.19 s
That's about a five-times speed-up. Although GCC does unroll the loop eight times, it does not use independent partial sums (see the assembly below). This is the reason the unrolled version is no better.
I only used OpenMP for its convenient cross-platform/compiler timing function: omp_get_wtime().
Another advantage auto-vectorization has is it works for AVX simply by enabling a compiler switch (e.g. -mavx). Otherwise, if you wanted AVX, you would have to rewrite your code to use the AVX intrinsics and maybe have to ask another question on SO on how to do this.
So currently the only compiler which will auto-vectorize your loop as well as unroll to four partial sums is Clang. See the code and assembly at the end of this answer.
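If you want partial sums regardless of compiler, you can also write them by hand. A sketch (my addition, untested; whether a given compiler then maps the four accumulators onto vector lanes still depends on its floating-point model):
float sumf_partial(const float *x, int n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {  // four independent dependency chains
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++) sum += x[i];  // leftovers
    return sum;
}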
Here is the code I used to test the performance
#include <stdio.h>
#include <omp.h>
#include <vector>
float sumf(float *x, int n)
{
float sum = 0;
for(int i=0; i<n; i++) sum += x[i];
return sum;
}
#define N 10000 // the link used this value
int main(void)
{
std::vector<float> x;
for(int i=0; i<N; i++) x.push_back(1 -2*(i%2==0));
//float x[N]; for(int i=0; i<N; i++) x[i] = 1 -2*(i%2==0);
float sum = 0;
sum += sumf(x.data(),N);
double dtime = -omp_get_wtime();
for(int r=0; r<100000; r++) {
sum += sumf(x.data(),N);
}
dtime +=omp_get_wtime();
printf("sum %f time %f\n", sum, dtime);
}
Edit:
I should have taken my own advice and looked at the assembly.
The main loop for -O3. It's clear it only does a scalar sum.
.L3:
vaddss (%rdi), %xmm0, %xmm0
addq $4, %rdi
cmpq %rax, %rdi
jne .L3
The main loop for -Ofast. It does a vector sum but no unrolling.
.L8:
addl $1, %eax
vaddps (%r8), %ymm1, %ymm1
addq $32, %r8
cmpl %eax, %ecx
ja .L8
The main loop for -Ofast -funroll-loops. Vector sum with 8x unroll
.L8:
vaddps (%rax), %ymm1, %ymm2
addl $8, %ebx
addq $256, %rax
vaddps -224(%rax), %ymm2, %ymm3
vaddps -192(%rax), %ymm3, %ymm4
vaddps -160(%rax), %ymm4, %ymm5
vaddps -128(%rax), %ymm5, %ymm6
vaddps -96(%rax), %ymm6, %ymm7
vaddps -64(%rax), %ymm7, %ymm8
vaddps -32(%rax), %ymm8, %ymm1
cmpl %ebx, %r9d
ja .L8
Edit:
Putting the following code in Clang 3.7 (-O3 -fverbose-asm -mavx)
float sumi(int *x)
{
x = (int*)__builtin_assume_aligned(x, 64);
int sum = 0;
for(int i=0; i<2048; i++) sum += x[i];
return sum;
}
produces the following assembly. Notice that it's vectorized to four independent partial sums.
sumi(int*): # #sumi(int*)
vpxor xmm0, xmm0, xmm0
xor eax, eax
vpxor xmm1, xmm1, xmm1
vpxor xmm2, xmm2, xmm2
vpxor xmm3, xmm3, xmm3
.LBB0_1: # %vector.body
vpaddd xmm0, xmm0, xmmword ptr [rdi + 4*rax]
vpaddd xmm1, xmm1, xmmword ptr [rdi + 4*rax + 16]
vpaddd xmm2, xmm2, xmmword ptr [rdi + 4*rax + 32]
vpaddd xmm3, xmm3, xmmword ptr [rdi + 4*rax + 48]
vpaddd xmm0, xmm0, xmmword ptr [rdi + 4*rax + 64]
vpaddd xmm1, xmm1, xmmword ptr [rdi + 4*rax + 80]
vpaddd xmm2, xmm2, xmmword ptr [rdi + 4*rax + 96]
vpaddd xmm3, xmm3, xmmword ptr [rdi + 4*rax + 112]
add rax, 32
cmp rax, 2048
jne .LBB0_1
vpaddd xmm0, xmm1, xmm0
vpaddd xmm0, xmm2, xmm0
vpaddd xmm0, xmm3, xmm0
vpshufd xmm1, xmm0, 78 # xmm1 = xmm0[2,3,0,1]
vpaddd xmm0, xmm0, xmm1
vphaddd xmm0, xmm0, xmm0
vmovd eax, xmm0
vxorps xmm0, xmm0, xmm0
vcvtsi2ss xmm0, xmm0, eax
ret
#include <immintrin.h>  // AVX2 intrinsics
#include <cstddef>
#include <cstdint>

static inline int32_t accumulate(const int32_t *data, size_t size) {
    constexpr const static size_t batch = 256 / 8 / sizeof(int32_t);  // 8 ints per __m256i
    int32_t sum = 0;
    size_t pos = 0;
    if (size >= batch) {
        // 7
        __m256i mmSum = _mm256_loadu_si256((const __m256i *)(data));
        pos = batch;
        // main vector loop, 8 ints per iteration
        for (; pos + batch < size; pos += batch) {
            // 1 + 7
            mmSum =
                _mm256_add_epi32(mmSum, _mm256_loadu_si256((const __m256i *)(data + pos)));
        }
        mmSum = _mm256_hadd_epi32(mmSum, mmSum);
        mmSum = _mm256_hadd_epi32(mmSum, mmSum);
        // 2 + 1 + 3 + 0
        sum = _mm_cvtsi128_si32(_mm_add_epi32(_mm256_extractf128_si256(mmSum, 1),
                                              _mm256_castsi256_si128(mmSum)));
    }
    // add up remaining values
    while (pos < size) {
        sum += data[pos++];
    }
    return sum;
}
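For reference, a minimal way to exercise the function above (my addition, not part of the original answer; build with AVX2 enabled, e.g. -mavx2):
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<int32_t> v(1000);
    std::iota(v.begin(), v.end(), 1);                       // 1, 2, ..., 1000
    std::printf("%d\n", accumulate(v.data(), v.size()));    // expect 500500
}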