Is pow(x,p) faster when the exponent is an integer? - c++

In code using pow(double x, double p) (where a large share of the calls have p = 2.0), I observed that my code executes clearly faster when p = 2.0 than when p = 2.000000001. I conclude that, on my compiler (gcc 4.8.5), the implementation of pow detects at runtime when it is computing a square.
Following this observation, I concluded that I don't need a specific implementation when I know that p is 2. But my code must be cross-platform, hence my question:
Is pow optimized when the exponent is an integer in most C++03 compilers?
In my current context, "most compilers" = "gcc >= 4.8, Intel with MSVC, Intel on Unix".
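For reference, a minimal harness in the spirit of this observation (a sketch, not the original benchmark; the volatile sink keeps the calls from being optimized away):

#include <chrono>
#include <cmath>
#include <iostream>

static double time_pow(double p)
{
    volatile double sink = 0.0;   // defeats constant folding / dead-code elimination
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 1; i <= 10000000; ++i)
        sink = sink + std::pow((double)i, p);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    std::cout << "p=2.0:         " << time_pow(2.0) << " s\n";
    std::cout << "p=2.000000001: " << time_pow(2.000000001) << " s\n";
}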

Yes, the standard libraries do attempt runtime optimization if the exponent is detected to be a natural number. Looking at the current glibc i386 version of pow, you can find the following code:
/* First see whether `y' is a natural number. In this case we
can use a more precise algorithm. */
fld %st // y : y : x
fistpll (%esp) // y : x
fildll (%esp) // int(y) : y : x
fucomp %st(1) // y : x
fnstsw
sahf
jne 3f
embedded in the implementation. The full code can be found on GitHub.
Note that for other versions of glibc and other architectures the answer may differ.
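If a given platform's libm does not do this check, a portable fallback is to special-case the exponent you care about at the call site (a sketch; pow_fast is a made-up name):

#include <cmath>

inline double pow_fast(double x, double p)
{
    if (p == 2.0)
        return x * x;        // the common case: one multiply, no libm call
    return std::pow(x, p);   // everything else goes to the library
}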

EDIT
The answer below misinterprets the OP's question, which was specifically about RUNTIME optimisation, whereas I investigated compile-time optimisation.
Original Answer
Adding to my comment: as long as the exponent is a constant int less than or equal to INT_MAX, you get:
#include <cmath>
double pow(double a)
{
    return std::pow(a, (int)2147483647);
}
generates
pow(double):
movapd xmm4, xmm0
mulsd xmm4, xmm0
movapd xmm5, xmm4
mulsd xmm5, xmm4
mulsd xmm4, xmm0
movapd xmm6, xmm5
mulsd xmm4, xmm5
mulsd xmm6, xmm5
movapd xmm3, xmm6
mulsd xmm3, xmm6
mulsd xmm3, xmm0
movapd xmm0, xmm4
movapd xmm2, xmm3
movapd xmm1, xmm3
mulsd xmm2, xmm6
mulsd xmm1, xmm3
mulsd xmm2, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm2
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm4
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm4
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm4
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm4
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm4
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm4
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm4
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm1, xmm1
mulsd xmm0, xmm1
ret
but you have to be careful to use an int literal
#include <cmath>
double pow(double a)
{
    return std::pow(a, (unsigned int) 2147483647);
}
generates
pow(double):
movsd xmm1, QWORD PTR .LC0[rip]
jmp pow
.LC0:
.long 4290772992
.long 1105199103
EDIT
I seem to be wrong. The above was tested with an early version of GCC. In early versions of GCC and CLANG the multiplication is inlined, but in later versions this does not happen. If you switch compiler versions on godbolt you can see that the above inlining DOES NOT OCCUR.
For example
#include <cmath>
double pow_v2(double a)
{
    return std::pow(a, 2);
}
double pow_v3(double a)
{
    return std::pow(a, 3);
}
which for CLANG 10.0 generates
pow_v2(double): # #pow_v2(double)
mulsd xmm0, xmm0
ret
.LCPI1_0:
.quad 4613937818241073152 # double 3
pow_v3(double): # #pow_v3(double)
movsd xmm1, qword ptr [rip + .LCPI1_0] # xmm1 = mem[0],zero
jmp pow # TAILCALL
but for CLANG 5.0 it generates
pow_v2(double): # #pow_v2(double)
mulsd xmm0, xmm0
ret
pow_v3(double): # #pow_v3(double)
movapd xmm1, xmm0
mulsd xmm1, xmm1
mulsd xmm1, xmm0
movapd xmm0, xmm1
ret
It seems that for later versions of the compilers, calling the pow library function is faster than inlining the multiplications, so the compilers changed their strategy.
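For reference, the transformation those older compilers applied is ordinary exponentiation by squaring, which you can always spell out yourself if you want the multiply chain regardless of compiler version (a sketch, assuming a non-negative integer exponent):

inline double powi(double x, unsigned n)
{
    // O(log n) multiplications; essentially what the unrolled mulsd sequence above computes
    double result = 1.0;
    while (n != 0)
    {
        if (n & 1)       // low bit set: fold the current power of x into the result
            result *= x;
        x *= x;          // square the base for the next bit
        n >>= 1;
    }
    return result;
}

Like the inlined code, this can be slightly less accurate than a correctly rounded pow.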

Related

Why does -mprefer-vector-width=128 run 3x faster than -mprefer-vector-width=512 for implicitly vectorized code?

With Godbolt.org benchmarking:
-march=native: https://godbolt.org/z/Ys6rqbGGe
-march=cascadelake: https://godbolt.org/z/W3EM5hKsq
the -mprefer-vector-width=128 flag completes each sqrt operation in 0.67 cycles, while -mprefer-vector-width=512 takes 1.95 cycles per sqrt operation.
Source code:
#include <omp.h>
#include <iostream>
#include <string>
#include <functional>
#include <cmath>

template<typename Type, int Simd>
struct KernelData
{
    alignas(32)
    Type data[Simd];

    inline void readFrom(const Type * const __restrict__ ptr) noexcept
    {
        #pragma GCC ivdep
        for(int i=0;i<Simd;i++)
        {
            data[i] = ptr[i];
        }
    }

    inline void writeTo(Type * const __restrict__ ptr) const noexcept
    {
        #pragma GCC ivdep
        for(int i=0;i<Simd;i++)
        {
            ptr[i] = data[i];
        }
    }

    inline const KernelData<Type,Simd> sqrt() const noexcept
    {
        KernelData<Type,Simd> result;
        #pragma GCC ivdep
        for(int i=0;i<Simd;i++)
        {
            result.data[i] = std::sqrt(data[i]);
        }
        return result;
    }
};

template<int mask>
struct KernelDataFactory
{
    KernelDataFactory()
    {
    }

    template<typename Type>
    inline
    KernelData<Type,mask> generate() const
    {
        return KernelData<Type,mask>();
    }
};

template<int SimdWidth, typename... Args>
class Kernel
{
public:
    Kernel(std::function<void(int,int, Args...)> kernelPrm)
    {
        kernel = kernelPrm;
    }

    void run(int n, Args... args)
    {
        const int nLoop = (n/SimdWidth);
        for(int i=0;i<nLoop;i++)
        {
            kernel(i*SimdWidth,SimdWidth, args...);
        }

        if((n/SimdWidth)*SimdWidth != n)
        {
            const int m = n%SimdWidth;
            for(int i=0;i<m;i++)
            {
                kernel(nLoop*SimdWidth+i,1, args...);
            }
        }
    }

private:
    std::function<void(int,int, Args...)> kernel;
};

// cpu cycles from stackoverflow
#include <stdint.h>  // <cstdint> is preferred in C++, but stdint.h works.
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

inline
uint64_t readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
    uint64_t tsc = __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
    return tsc;
}

int main(int argC, char** argV)
{
    constexpr int simd = 16;
    constexpr int n = 1003;
    Kernel<simd, float *, float *> kernel([](int simdGroupId, int simdWidth, float * input, float * output){
        const int id = simdGroupId;
        if(simdWidth == simd)
        {
            const KernelDataFactory<simd> factory;
            auto a = factory.generate<float>();
            a.readFrom(input+id);
            const auto b = a.sqrt().sqrt().sqrt().sqrt().sqrt().
                             sqrt().sqrt().sqrt().sqrt().sqrt().
                             sqrt().sqrt().sqrt().sqrt().sqrt();
            b.writeTo(output+id);
        }
        else
        {
            const KernelDataFactory<1> factory;
            auto a = factory.generate<float>();
            a.readFrom(input+id);
            const auto b = a.sqrt().sqrt().sqrt().sqrt().sqrt().
                             sqrt().sqrt().sqrt().sqrt().sqrt().
                             sqrt().sqrt().sqrt().sqrt().sqrt();
            b.writeTo(output+id);
        }
    });

    alignas(32)
    float i[n],o[n];
    for(int j=0;j<n;j++)
        i[j]=j;

    auto t1 = readTSC();
    for(int k=0;k<10000;k++)
        kernel.run(n,i,o);
    auto t2 = readTSC();

    for(int i=n-10;i<n;i++)
    {
        std::cout<<"i="<<i<<" value="<<o[i]<<std::endl;
    }
    std::cout<<0.0001f*(t2-t1)/(float)(15*n)<<" cycles per sqrt"<<std::endl;
    return 0;
}
Godbolt output for 128 bit preference (-std=c++2a -O3 -march=native -mprefer-vector-width=128 -ftree-vectorize -fno-math-errno -lpthread):
push rbp
add rsi, rcx
mov rbp, rsp
and rsp, -32
sub rsp, 8
vsqrtps xmm0, XMMWORD PTR [rdx]
vsqrtps xmm3, XMMWORD PTR [rdx+16]
vsqrtps xmm2, XMMWORD PTR [rdx+32]
vsqrtps xmm1, XMMWORD PTR [rdx+48]
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vsqrtps xmm0, xmm0
vsqrtps xmm3, xmm3
vsqrtps xmm2, xmm2
vsqrtps xmm1, xmm1
vmovaps XMMWORD PTR [rsp-40], xmm3
vmovaps XMMWORD PTR [rsp-24], xmm2
vmovaps XMMWORD PTR [rsp-8], xmm1
vmovdqu XMMWORD PTR [rsi], xmm0
vmovdqa xmm4, XMMWORD PTR [rsp-40]
vmovdqu XMMWORD PTR [rsi+16], xmm4
vmovdqa xmm5, XMMWORD PTR [rsp-24]
vmovdqu XMMWORD PTR [rsi+32], xmm5
vmovdqa xmm6, XMMWORD PTR [rsp-8]
vmovdqu XMMWORD PTR [rsi+48], xmm6
leave
ret
Godbolt output for 512 bit preference (-std=c++2a -O3 -march=native -mprefer-vector-width=512 -ftree-vectorize -fno-math-errno -lpthread):
push rbp
add rsi, rcx
mov rbp, rsp
and rsp, -64
sub rsp, 8
vmovdqu xmm2, XMMWORD PTR [rdx]
vmovdqu xmm3, XMMWORD PTR [rdx+16]
vmovdqu xmm4, XMMWORD PTR [rdx+32]
vmovdqu xmm5, XMMWORD PTR [rdx+48]
vmovdqa XMMWORD PTR [rsp-120], xmm2
vmovdqa XMMWORD PTR [rsp-104], xmm3
vmovdqa XMMWORD PTR [rsp-88], xmm4
vmovdqa XMMWORD PTR [rsp-72], xmm5
vsqrtps zmm0, ZMMWORD PTR [rsp-120]
vsqrtps zmm0, zmm0
vsqrtps zmm0, zmm0
vsqrtps zmm0, zmm0
vsqrtps zmm0, zmm0
vsqrtps zmm0, zmm0
vsqrtps zmm0, zmm0
vsqrtps zmm0, zmm0
vsqrtps zmm0, zmm0
vsqrtps zmm0, zmm0
vsqrtps zmm0, zmm0
vsqrtps zmm0, zmm0
vsqrtps zmm0, zmm0
vsqrtps zmm0, zmm0
vsqrtps zmm0, zmm0
vmovaps ZMMWORD PTR [rsp-56], zmm0
vmovdqu XMMWORD PTR [rsi], xmm0
vmovdqa xmm6, XMMWORD PTR [rsp-40]
vmovdqu XMMWORD PTR [rsi+16], xmm6
vmovdqa xmm7, XMMWORD PTR [rsp-24]
vmovdqu XMMWORD PTR [rsi+32], xmm7
vmovdqa xmm1, XMMWORD PTR [rsp-8]
vmovdqu XMMWORD PTR [rsi+48], xmm1
vzeroupper
leave
ret
How does the AVX512 packed square root operation end up running so much slower than the 128-bit one?

Auto-vectorization for hand-unrolled initialized tiled-computation versus simple loop with no initialization

While optimizing the innermost 4-versus-4 comparison part of an AABB collision detection algorithm, I am stuck at simplifying the code while gaining (or at least retaining) performance.
Here is the version with hand-unrolled initialization:
https://godbolt.org/z/TMGMhdsss
inline
const int intersectDim(const float minx, const float maxx, const float minx2, const float maxx2) noexcept
{
    return !((maxx < minx2) || (maxx2 < minx));
}

inline
void comp4vs4( const int * const __restrict__ partId1, const int * const __restrict__ partId2,
               const float * const __restrict__ minx1, const float * const __restrict__ minx2,
               const float * const __restrict__ miny1, const float * const __restrict__ miny2,
               const float * const __restrict__ minz1, const float * const __restrict__ minz2,
               const float * const __restrict__ maxx1, const float * const __restrict__ maxx2,
               const float * const __restrict__ maxy1, const float * const __restrict__ maxy2,
               const float * const __restrict__ maxz1, const float * const __restrict__ maxz2,
               int * const __restrict__ out
             )
{
    alignas(32)
    int result[16]={
        // 0v0 0v1 0v2 0v3
        // 1v0 1v1 1v2 1v3
        // 2v0 2v1 2v2 2v3
        // 3v0 3v1 3v2 3v3
        0, 0, 0, 0,
        0, 0, 0, 0,
        0, 0, 0, 0,
        0, 0, 0, 0
    };

    alignas(32)
    int tileId1[16]={
        // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
        partId1[0],partId1[1],partId1[2],partId1[3],
        partId1[0],partId1[1],partId1[2],partId1[3],
        partId1[0],partId1[1],partId1[2],partId1[3],
        partId1[0],partId1[1],partId1[2],partId1[3]
    };

    alignas(32)
    int tileId2[16]={
        // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
        partId2[0],partId2[0],partId2[0],partId2[0],
        partId2[1],partId2[1],partId2[1],partId2[1],
        partId2[2],partId2[2],partId2[2],partId2[2],
        partId2[3],partId2[3],partId2[3],partId2[3]
    };

    alignas(32)
    float tileMinX1[16]={
        // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
        minx1[0],minx1[1],minx1[2],minx1[3],
        minx1[0],minx1[1],minx1[2],minx1[3],
        minx1[0],minx1[1],minx1[2],minx1[3],
        minx1[0],minx1[1],minx1[2],minx1[3]
    };

    alignas(32)
    float tileMinX2[16]={
        // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
        minx2[0],minx2[0],minx2[0],minx2[0],
        minx2[1],minx2[1],minx2[1],minx2[1],
        minx2[2],minx2[2],minx2[2],minx2[2],
        minx2[3],minx2[3],minx2[3],minx2[3]
    };

    alignas(32)
    float tileMinY1[16]={
        // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
        miny1[0],miny1[1],miny1[2],miny1[3],
        miny1[0],miny1[1],miny1[2],miny1[3],
        miny1[0],miny1[1],miny1[2],miny1[3],
        miny1[0],miny1[1],miny1[2],miny1[3]
    };

    alignas(32)
    float tileMinY2[16]={
        // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
        miny2[0],miny2[0],miny2[0],miny2[0],
        miny2[1],miny2[1],miny2[1],miny2[1],
        miny2[2],miny2[2],miny2[2],miny2[2],
        miny2[3],miny2[3],miny2[3],miny2[3]
    };

    alignas(32)
    float tileMinZ1[16]={
        // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
        minz1[0],minz1[1],minz1[2],minz1[3],
        minz1[0],minz1[1],minz1[2],minz1[3],
        minz1[0],minz1[1],minz1[2],minz1[3],
        minz1[0],minz1[1],minz1[2],minz1[3]
    };

    alignas(32)
    float tileMinZ2[16]={
        // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
        minz2[0],minz2[0],minz2[0],minz2[0],
        minz2[1],minz2[1],minz2[1],minz2[1],
        minz2[2],minz2[2],minz2[2],minz2[2],
        minz2[3],minz2[3],minz2[3],minz2[3]
    };

    alignas(32)
    float tileMaxX1[16]={
        // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
        maxx1[0],maxx1[1],maxx1[2],maxx1[3],
        maxx1[0],maxx1[1],maxx1[2],maxx1[3],
        maxx1[0],maxx1[1],maxx1[2],maxx1[3],
        maxx1[0],maxx1[1],maxx1[2],maxx1[3]
    };

    alignas(32)
    float tileMaxX2[16]={
        // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
        maxx2[0],maxx2[0],maxx2[0],maxx2[0],
        maxx2[1],maxx2[1],maxx2[1],maxx2[1],
        maxx2[2],maxx2[2],maxx2[2],maxx2[2],
        maxx2[3],maxx2[3],maxx2[3],maxx2[3]
    };

    alignas(32)
    float tileMaxY1[16]={
        // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
        maxy1[0],maxy1[1],maxy1[2],maxy1[3],
        maxy1[0],maxy1[1],maxy1[2],maxy1[3],
        maxy1[0],maxy1[1],maxy1[2],maxy1[3],
        maxy1[0],maxy1[1],maxy1[2],maxy1[3]
    };

    alignas(32)
    float tileMaxY2[16]={
        // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
        maxy2[0],maxy2[0],maxy2[0],maxy2[0],
        maxy2[1],maxy2[1],maxy2[1],maxy2[1],
        maxy2[2],maxy2[2],maxy2[2],maxy2[2],
        maxy2[3],maxy2[3],maxy2[3],maxy2[3]
    };

    alignas(32)
    float tileMaxZ1[16]={
        // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3
        maxz1[0],maxz1[1],maxz1[2],maxz1[3],
        maxz1[0],maxz1[1],maxz1[2],maxz1[3],
        maxz1[0],maxz1[1],maxz1[2],maxz1[3],
        maxz1[0],maxz1[1],maxz1[2],maxz1[3]
    };

    alignas(32)
    float tileMaxZ2[16]={
        // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
        maxz2[0],maxz2[0],maxz2[0],maxz2[0],
        maxz2[1],maxz2[1],maxz2[1],maxz2[1],
        maxz2[2],maxz2[2],maxz2[2],maxz2[2],
        maxz2[3],maxz2[3],maxz2[3],maxz2[3]
    };

    for(int i=0;i<16;i++)
        result[i] = (tileId1[i] < tileId2[i]);

    for(int i=0;i<16;i++)
        result[i] = result[i] &&
                    intersectDim(tileMinX1[i], tileMaxX1[i], tileMinX2[i], tileMaxX2[i]) &&
                    intersectDim(tileMinY1[i], tileMaxY1[i], tileMinY2[i], tileMaxY2[i]) &&
                    intersectDim(tileMinZ1[i], tileMaxZ1[i], tileMinZ2[i], tileMaxZ2[i]);

    for(int i=0;i<16;i++)
        out[i]=result[i];
}

#include <iostream>
int main()
{
    int tile1[4];int tile2[4];
    float tile3[4];float tile4[4];
    float tile5[4];float tile6[4];
    float tile7[4];float tile8[4];
    float tile9[4];float tile10[4];
    float tile11[4];float tile12[4];
    float tile13[4];float tile14[4];
    for(int i=0;i<4;i++)
    {
        std::cin>>tile1[i];
        std::cin>>tile2[i];
        std::cin>>tile3[i];
        std::cin>>tile4[i];
        std::cin>>tile5[i];
        std::cin>>tile6[i];
        std::cin>>tile7[i];
        std::cin>>tile8[i];
        std::cin>>tile9[i];
        std::cin>>tile10[i];
        std::cin>>tile11[i];
        std::cin>>tile12[i];
        std::cin>>tile13[i];
        std::cin>>tile14[i];
    }
    int out[16];
    comp4vs4(tile1,tile2,tile3,tile4,tile5,tile6,tile7,tile8,tile9,
             tile10,tile11,tile12,tile13,tile14,out);
    for(int i=0;i<16;i++)
        std::cout<<out[i];
    return 0;
}
and its output from godbolt:
comp4vs4(int const*, int const*, float const*, float const*, float const*, float const*, float const*, float const*, float const*, float const*, float const*, float const*, float const*, float const*, int*):
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 8
mov rax, QWORD PTR [rbp+80]
vmovups xmm0, XMMWORD PTR [rdx]
mov rdx, QWORD PTR [rbp+16]
vmovups xmm6, XMMWORD PTR [rcx]
vmovups xmm5, XMMWORD PTR [r9]
vmovups xmm9, XMMWORD PTR [r8]
vmovdqu xmm15, XMMWORD PTR [rsi]
vmovdqu xmm8, XMMWORD PTR [rdi]
vmovups xmm2, XMMWORD PTR [rdx]
mov rdx, QWORD PTR [rbp+24]
vpermilps xmm1, xmm6, 0
vmovdqa XMMWORD PTR [rsp-88], xmm15
vmovups xmm4, XMMWORD PTR [rdx]
mov rdx, QWORD PTR [rbp+32]
vmovups xmm14, XMMWORD PTR [rdx]
mov rdx, QWORD PTR [rbp+40]
vmovups xmm11, XMMWORD PTR [rdx]
mov rdx, QWORD PTR [rbp+48]
vcmpleps xmm3, xmm1, xmm14
vmovups xmm13, XMMWORD PTR [rdx]
mov rdx, QWORD PTR [rbp+56]
vpermilps xmm1, xmm11, 0
vcmpleps xmm1, xmm0, xmm1
vmovups xmm10, XMMWORD PTR [rdx]
mov rdx, QWORD PTR [rbp+64]
vpand xmm1, xmm1, xmm3
vpermilps xmm3, xmm5, 0
vcmpleps xmm3, xmm3, xmm13
vmovups xmm7, XMMWORD PTR [rdx]
mov rdx, QWORD PTR [rbp+72]
vpand xmm1, xmm1, xmm3
vpermilps xmm3, xmm10, 0
vmovaps XMMWORD PTR [rsp-72], xmm7
vcmpleps xmm3, xmm9, xmm3
vmovups xmm7, XMMWORD PTR [rdx]
vpand xmm1, xmm1, xmm3
vpshufd xmm3, xmm15, 0
vpcomltd xmm3, xmm8, xmm3
vpand xmm1, xmm1, xmm3
vpermilps xmm3, xmm7, 0
vcmpleps xmm12, xmm2, xmm3
vpermilps xmm3, xmm4, 0
vcmpleps xmm3, xmm3, XMMWORD PTR [rsp-72]
vpand xmm3, xmm3, xmm12
vmovdqa xmm12, XMMWORD PTR .LC0[rip]
vpand xmm3, xmm3, xmm12
vpand xmm1, xmm1, xmm3
vmovdqa XMMWORD PTR [rsp-104], xmm1
vpermilps xmm1, xmm6, 85
vcmpleps xmm3, xmm1, xmm14
vpermilps xmm1, xmm11, 85
vcmpleps xmm1, xmm0, xmm1
vpand xmm1, xmm1, xmm3
vpermilps xmm3, xmm5, 85
vcmpleps xmm3, xmm3, xmm13
vpand xmm1, xmm1, xmm3
vpermilps xmm3, xmm10, 85
vcmpleps xmm3, xmm9, xmm3
vpand xmm1, xmm1, xmm3
vpshufd xmm3, xmm15, 85
vpermilps xmm15, xmm4, 85
vpcomltd xmm3, xmm8, xmm3
vpand xmm1, xmm1, xmm3
vpermilps xmm3, xmm7, 85
vcmpleps xmm15, xmm15, XMMWORD PTR [rsp-72]
vcmpleps xmm3, xmm2, xmm3
vpand xmm3, xmm3, xmm15
vpermilps xmm15, xmm4, 170
vpand xmm3, xmm3, xmm12
vpermilps xmm4, xmm4, 255
vcmpleps xmm15, xmm15, XMMWORD PTR [rsp-72]
vpand xmm1, xmm1, xmm3
vcmpleps xmm4, xmm4, XMMWORD PTR [rsp-72]
vmovdqa XMMWORD PTR [rsp-120], xmm1
vpermilps xmm1, xmm6, 170
vpermilps xmm6, xmm6, 255
vcmpleps xmm3, xmm1, xmm14
vpermilps xmm1, xmm11, 170
vpermilps xmm11, xmm11, 255
vcmpleps xmm6, xmm6, xmm14
vcmpleps xmm1, xmm0, xmm1
vcmpleps xmm11, xmm0, xmm11
vpshufd xmm0, XMMWORD PTR [rsp-88], 255
vpand xmm1, xmm1, xmm3
vpermilps xmm3, xmm5, 170
vpermilps xmm5, xmm5, 255
vcmpleps xmm3, xmm3, xmm13
vpand xmm6, xmm11, xmm6
vcmpleps xmm13, xmm5, xmm13
vmovdqa xmm5, XMMWORD PTR [rsp-104]
vpand xmm1, xmm1, xmm3
vpermilps xmm3, xmm10, 170
vpermilps xmm10, xmm10, 255
vcmpleps xmm3, xmm9, xmm3
vpand xmm6, xmm6, xmm13
vmovdqu XMMWORD PTR [rax], xmm5
vcmpleps xmm9, xmm9, xmm10
vpand xmm1, xmm1, xmm3
vpshufd xmm3, XMMWORD PTR [rsp-88], 170
vpand xmm9, xmm6, xmm9
vpcomltd xmm3, xmm8, xmm3
vpand xmm1, xmm1, xmm3
vpcomltd xmm8, xmm8, xmm0
vmovdqa xmm0, XMMWORD PTR [rsp-120]
vpermilps xmm3, xmm7, 170
vpermilps xmm7, xmm7, 255
vcmpleps xmm3, xmm2, xmm3
vpand xmm8, xmm9, xmm8
vcmpleps xmm2, xmm2, xmm7
vmovdqu XMMWORD PTR [rax+16], xmm0
vpand xmm3, xmm3, xmm15
vpand xmm2, xmm2, xmm4
vpand xmm3, xmm3, xmm12
vpand xmm12, xmm2, xmm12
vpand xmm3, xmm1, xmm3
vpand xmm12, xmm8, xmm12
vmovdqu XMMWORD PTR [rax+32], xmm3
vmovdqu XMMWORD PTR [rax+48], xmm12
leave
ret
main: // character limit 30k
It is ~123 lines of vector instructions. Since it runs with somewhat OK performance, I tried to simplify it with simple bitwise operations:
https://godbolt.org/z/zKqe49a73
inline
const int intersectDim(const float minx, const float maxx, const float minx2, const float maxx2) noexcept
{
    return !((maxx < minx2) || (maxx2 < minx));
}

inline
void comp4vs4( const int * const __restrict__ partId1, const int * const __restrict__ partId2,
               const float * const __restrict__ minx1, const float * const __restrict__ minx2,
               const float * const __restrict__ miny1, const float * const __restrict__ miny2,
               const float * const __restrict__ minz1, const float * const __restrict__ minz2,
               const float * const __restrict__ maxx1, const float * const __restrict__ maxx2,
               const float * const __restrict__ maxy1, const float * const __restrict__ maxy2,
               const float * const __restrict__ maxz1, const float * const __restrict__ maxz2,
               int * const __restrict__ out
             )
{
    alignas(32)
    int result[16]={
        // 0v0 0v1 0v2 0v3
        // 1v0 1v1 1v2 1v3
        // 2v0 2v1 2v2 2v3
        // 3v0 3v1 3v2 3v3
        0, 0, 0, 0,
        0, 0, 0, 0,
        0, 0, 0, 0,
        0, 0, 0, 0
    };

    for(int i=0;i<16;i++)
        result[i] = partId1[i&3]<partId2[i/4];

    for(int i=0;i<16;i++)
        result[i] = result[i] &&
                    intersectDim(minx1[i&3], maxx1[i&3], minx2[i/4], maxx2[i/4]) &&
                    intersectDim(miny1[i&3], maxy1[i&3], miny2[i/4], maxy2[i/4]) &&
                    intersectDim(minz1[i&3], maxz1[i&3], minz2[i/4], maxz2[i/4]);

    for(int i=0;i<16;i++)
        out[i]=result[i];
}

#include <iostream>
int main()
{
    int tile1[4];int tile2[4];
    float tile3[4];float tile4[4];
    float tile5[4];float tile6[4];
    float tile7[4];float tile8[4];
    float tile9[4];float tile10[4];
    float tile11[4];float tile12[4];
    float tile13[4];float tile14[4];
    for(int i=0;i<4;i++)
    {
        std::cin>>tile1[i];
        std::cin>>tile2[i];
        std::cin>>tile3[i];
        std::cin>>tile4[i];
        std::cin>>tile5[i];
        std::cin>>tile6[i];
        std::cin>>tile7[i];
        std::cin>>tile8[i];
        std::cin>>tile9[i];
        std::cin>>tile10[i];
        std::cin>>tile11[i];
        std::cin>>tile12[i];
        std::cin>>tile13[i];
        std::cin>>tile14[i];
    }
    int out[16];
    comp4vs4(tile1,tile2,tile3,tile4,tile5,tile6,tile7,tile8,tile9,
             tile10,tile11,tile12,tile13,tile14,out);
    for(int i=0;i<16;i++)
        std::cout<<out[i];
    return 0;
}
and godbolt's output:
main:
// character limit 30k
vpxor xmm0, xmm0, xmm0
vmovdqa xmm3, XMMWORD PTR .LC0[rip]
lea rax, [rsp+240]
vpxor xmm4, xmm4, xmm4
vmovdqa XMMWORD PTR [rsp+224], xmm0
vmovdqa XMMWORD PTR [rsp+240], xmm0
vmovdqa XMMWORD PTR [rsp+256], xmm0
vmovdqa XMMWORD PTR [rsp+272], xmm0
vpcmpeqd xmm0, xmm0, xmm0
vmovdqa xmm7, xmm0
vmovdqa xmm6, xmm0
vmovdqa xmm5, xmm0
vpgatherdd xmm2, DWORD PTR [rsp+16+xmm4*4], xmm7
vmovdqa xmm4, XMMWORD PTR .LC1[rip]
vpgatherdd xmm1, DWORD PTR [rdx+xmm3*4], xmm6
vmovdqa xmm7, xmm0
vmovdqa xmm6, xmm0
vpcomltd xmm1, xmm1, xmm2
vpand xmm1, xmm1, xmm4
vmovdqa XMMWORD PTR [rsp+224], xmm1
vpgatherdd xmm1, DWORD PTR [rdx+xmm3*4], xmm6
vpgatherdd xmm2, DWORD PTR [rsp+16+xmm4*4], xmm7
vmovdqa xmm6, xmm0
vmovdqa xmm7, xmm0
vpcomltd xmm1, xmm1, xmm2
vpand xmm1, xmm1, xmm4
vmovdqa XMMWORD PTR [rsp+240], xmm1
vpgatherdd xmm1, DWORD PTR [rdx+xmm3*4], xmm5
vmovdqa xmm5, XMMWORD PTR .LC2[rip]
vpgatherdd xmm2, DWORD PTR [rsp+16+xmm5*4], xmm6
vmovdqa xmm5, XMMWORD PTR .LC3[rip]
vmovdqa xmm6, xmm0
vpcomltd xmm1, xmm1, xmm2
vpand xmm1, xmm1, xmm4
vmovdqa XMMWORD PTR [rsp+256], xmm1
vpgatherdd xmm1, DWORD PTR [rdx+xmm3*4], xmm7
vpgatherdd xmm0, DWORD PTR [rsp+16+xmm5*4], xmm6
vmovdqa xmm7, XMMWORD PTR .LC4[rip]
vpxor xmm6, xmm6, xmm6
lea rdx, [rsp+304]
vpcomltd xmm0, xmm1, xmm0
vpand xmm0, xmm0, xmm4
vmovdqa XMMWORD PTR [rsp+272], xmm0
.L3:
vmovdqa xmm0, XMMWORD PTR [rax-16]
vmovdqa xmm2, xmm3
prefetcht0 [rax]
add rax, 16
vpaddd xmm3, xmm3, xmm7
vpsrad xmm8, xmm2, 2
vpand xmm2, xmm2, xmm5
vpcomneqd xmm1, xmm0, xmm6
vmovaps xmm0, xmm1
vmovaps xmm11, xmm1
vmovaps xmm12, xmm1
vmovaps xmm13, xmm1
vgatherdps xmm11, DWORD PTR [rsp+144+xmm8*4], xmm0
vmovaps xmm14, xmm1
vmovaps xmm0, xmm1
vmovaps xmm10, xmm1
vmovaps xmm9, xmm1
vgatherdps xmm10, DWORD PTR [rsp+128+xmm2*4], xmm13
vgatherdps xmm0, DWORD PTR [r13+0+xmm8*4], xmm12
vgatherdps xmm9, DWORD PTR [rsp+32+xmm2*4], xmm14
vcmpleps xmm0, xmm0, xmm10
vcmpleps xmm9, xmm9, xmm11
vpand xmm0, xmm0, xmm9
vpand xmm1, xmm0, xmm1
vmovaps xmm0, xmm1
vmovaps xmm11, xmm1
vmovaps xmm15, xmm1
vmovaps xmm10, xmm1
vgatherdps xmm11, DWORD PTR [r15+xmm8*4], xmm0
vmovaps xmm12, xmm1
vmovaps xmm0, xmm1
vmovaps xmm9, xmm1
vmovaps xmm13, xmm1
vgatherdps xmm10, DWORD PTR [r12+xmm2*4], xmm12
vgatherdps xmm0, DWORD PTR [rsp+80+xmm8*4], xmm15
vgatherdps xmm9, DWORD PTR [rsp+64+xmm2*4], xmm13
vcmpleps xmm0, xmm0, xmm10
vcmpleps xmm9, xmm9, xmm11
vpand xmm0, xmm0, xmm9
vpand xmm0, xmm0, xmm1
vmovaps xmm1, xmm0
vmovaps xmm10, xmm0
vmovaps xmm9, xmm0
vmovaps xmm14, xmm0
vgatherdps xmm10, DWORD PTR [rsp+208+xmm8*4], xmm1
vmovaps xmm1, xmm0
vgatherdps xmm9, DWORD PTR [r14+xmm8*4], xmm1
vmovaps xmm1, xmm0
vmovaps xmm8, xmm0
vgatherdps xmm8, DWORD PTR [rsp+192+xmm2*4], xmm1
vmovaps xmm1, xmm0
vgatherdps xmm1, DWORD PTR [rsp+96+xmm2*4], xmm14
vcmpleps xmm2, xmm9, xmm8
vcmpleps xmm1, xmm1, xmm10
vpand xmm1, xmm1, xmm2
vpand xmm1, xmm1, xmm4
vpand xmm0, xmm0, xmm1
vmovdqa XMMWORD PTR [rax-32], xmm0
cmp rdx, rax
jne .L3
vmovdqa xmm5, XMMWORD PTR [rsp+224]
vmovdqa xmm7, XMMWORD PTR [rsp+240]
vmovdqa xmm4, XMMWORD PTR [rsp+256]
lea rbx, [rsp+288]
lea r12, [rsp+352]
vmovdqa XMMWORD PTR [rsp+288], xmm5
vmovdqa xmm5, XMMWORD PTR [rsp+272]
vmovdqa XMMWORD PTR [rsp+304], xmm7
vmovdqa XMMWORD PTR [rsp+320], xmm4
vmovdqa XMMWORD PTR [rsp+336], xmm5
.L4:
// character limit 30k
It has ~110 lines of vector instructions. Despite having fewer instructions than the first version, it runs at half the performance (at least with the bdver1 compiler flag). Is it because of the "AND" and division operations used for indexing?
Also, the parameters marked with the restrict keyword occasionally point to the same memory. Could this be a problem for performance?
If it helps, here are performance-test source codes on an online service with an AVX512 CPU (maximum 32 AABBs per leaf node):
Unrolled version: https://rextester.com/YKDN52107
Readable version: https://rextester.com/TAFR72415 (slow)
There is a bit less performance difference with 128 AABBs per leaf node (tested on the godbolt server):
Unrolled: https://godbolt.org/z/rx13zorjr
Readable: https://godbolt.org/z/e1cfbEPKn
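For reference, the index mapping the readable version relies on can also be written explicitly with a mask and a shift (a sketch of just the ID comparison loop; for non-negative i these are exactly i%4 and i/4, and compilers lower the division to a shift anyway):

void ids4vs4(const int * const __restrict__ partId1, const int * const __restrict__ partId2, int * const __restrict__ result)
{
    for(int i=0;i<16;i++)
    {
        const int a = i & 3;   // column index: 0,1,2,3,0,1,2,3,...
        const int b = i >> 2;  // row index:    0,0,0,0,1,1,1,1,...
        result[i] = partId1[a] < partId2[b];
    }
}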

Why is my SSE code slower than native C++ code?

First of all, I am new to SSE. I decided to accelerate my code, but it seems that it runs slower than my native code.
This is an example that calculates the sum of squares. On my Intel i7-6700HQ, it takes 0.43 s for the native code and 0.52 s for the SSE version. So, where is the bottleneck?
#include <chrono>       // added: for high_resolution_clock
#include <iostream>     // added: for std::cout
#include <xmmintrin.h>  // added: for the _mm_* SSE intrinsics
using namespace std::chrono;

inline float squared_sum(const float x, const float y)
{
    return x * x + y * y;
}

#define USE_SIMD
void calculations()
{
    high_resolution_clock::time_point t1, t2;
    int result_v = 0;
    t1 = high_resolution_clock::now();

    alignas(16) float data_x[4];
    alignas(16) float data_y[4];
    alignas(16) float result[4];
    __m128 v_x, v_y, v_res;

    for (int y = 0; y < 5120; y++)
    {
        data_y[0] = y;
        data_y[1] = y + 1;
        data_y[2] = y + 2;
        data_y[3] = y + 3;

        for (int x = 0; x < 5120; x++)
        {
            data_x[0] = x;
            data_x[1] = x + 1;
            data_x[2] = x + 2;
            data_x[3] = x + 3;

#ifdef USE_SIMD
            v_x = _mm_load_ps(data_x);
            v_y = _mm_load_ps(data_y);
            v_x = _mm_mul_ps(v_x, v_x);
            v_y = _mm_mul_ps(v_y, v_y);
            v_res = _mm_add_ps(v_x, v_y);
            _mm_store_ps(result, v_res);
#else
            result[0] = squared_sum(data_x[0], data_y[0]);
            result[1] = squared_sum(data_x[1], data_y[1]);
            result[2] = squared_sum(data_x[2], data_y[2]);
            result[3] = squared_sum(data_x[3], data_y[3]);
#endif
            result_v += (int)(result[0] + result[1] + result[2] + result[3]);
        }
    }
    t2 = high_resolution_clock::now();
    duration<double> time_span1 = duration_cast<duration<double>>(t2 - t1);
    std::cout << "Exec time:\t" << time_span1.count() << " s\n";
}
UPDATE: fixed code according to comments.
I am using Visual Studio 2017. Compiled for x64.
Optimization: Maximum Optimization (Favor Speed) (/O2);
Inline Function Expansion: Any Suitable (/Ob2);
Favor Size or Speed: Favor fast code (/Ot);
Omit Frame Pointers: Yes (/Oy)
Conclusion
Compilers already generate well-optimized code, so nowadays it is hard to accelerate it much further. The one thing you can still do to speed code up more is parallelization.
Thanks for the answers. They are mainly the same, so I accepted Søren V. Poulsen's answer because it was the first.
Modern compilers are incredible machines and will already use SIMD instructions if possible (and with the correct compilation flags).
One general strategy to determine what the compiler is doing is to look at the disassembly of your code. If you don't want to do it on your own machine you can use an online service like Godbolt: https://gcc.godbolt.org/z/T6GooQ.
One tip is to avoid atomic for storing intermediate results like you are doing here. Atomic values are used to ensure synchronization between threads, and this may come at a very high computational cost, relatively speaking.
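For example (a sketch, assuming the pre-edit code accumulated into a std::atomic<int>): accumulate into a plain local variable and publish to the shared atomic once at the end.

#include <atomic>

std::atomic<int> shared_result{0};   // hypothetical shared accumulator

void accumulate()
{
    int local = 0;                   // plain int: no synchronization cost per iteration
    for (int i = 0; i < 1000000; ++i)
        local += i & 7;              // stand-in for the real per-element work
    shared_result += local;          // one atomic update instead of a million
}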
Looking through the assembly for the compiler-generated code (without your SIMD stuff):
calculations():
pxor xmm2, xmm2
xor edx, edx
movdqa xmm0, XMMWORD PTR .LC0[rip]
movdqa xmm11, XMMWORD PTR .LC1[rip]
movdqa xmm9, XMMWORD PTR .LC2[rip]
movdqa xmm8, XMMWORD PTR .LC3[rip]
movdqa xmm7, XMMWORD PTR .LC4[rip]
.L4:
movdqa xmm5, xmm0
movdqa xmm4, xmm0
cvtdq2ps xmm6, xmm0
movdqa xmm10, xmm0
paddd xmm0, xmm7
cvtdq2ps xmm3, xmm0
paddd xmm5, xmm9
paddd xmm4, xmm8
cvtdq2ps xmm5, xmm5
cvtdq2ps xmm4, xmm4
mulps xmm6, xmm6
mov eax, 5120
paddd xmm10, xmm11
mulps xmm5, xmm5
mulps xmm4, xmm4
mulps xmm3, xmm3
pxor xmm12, xmm12
.L2:
movdqa xmm1, xmm12
cvtdq2ps xmm14, xmm12
mulps xmm14, xmm14
movdqa xmm13, xmm12
paddd xmm12, xmm7
cvtdq2ps xmm12, xmm12
paddd xmm1, xmm9
cvtdq2ps xmm0, xmm1
mulps xmm0, xmm0
paddd xmm13, xmm8
cvtdq2ps xmm13, xmm13
sub eax, 1
mulps xmm13, xmm13
addps xmm14, xmm6
mulps xmm12, xmm12
addps xmm0, xmm5
addps xmm13, xmm4
addps xmm12, xmm3
addps xmm0, xmm14
addps xmm0, xmm13
addps xmm0, xmm12
movdqa xmm12, xmm1
cvttps2dq xmm0, xmm0
paddd xmm2, xmm0
jne .L2
add edx, 1
movdqa xmm0, xmm10
cmp edx, 1280
jne .L4
movdqa xmm0, xmm2
psrldq xmm0, 8
paddd xmm2, xmm0
movdqa xmm0, xmm2
psrldq xmm0, 4
paddd xmm2, xmm0
movd eax, xmm2
ret
main:
xor eax, eax
ret
_GLOBAL__sub_I_calculations():
sub rsp, 8
mov edi, OFFSET FLAT:_ZStL8__ioinit
call std::ios_base::Init::Init() [complete object constructor]
mov edx, OFFSET FLAT:__dso_handle
mov esi, OFFSET FLAT:_ZStL8__ioinit
mov edi, OFFSET FLAT:_ZNSt8ios_base4InitD1Ev
add rsp, 8
jmp __cxa_atexit
.LC0:
.long 0
.long 1
.long 2
.long 3
.LC1:
.long 4
.long 4
.long 4
.long 4
.LC2:
.long 1
.long 1
.long 1
.long 1
.LC3:
.long 2
.long 2
.long 2
.long 2
.LC4:
.long 3
.long 3
.long 3
.long 3
Your SIMD code generates:
calculations():
pxor xmm5, xmm5
xor eax, eax
mov r8d, 1
movabs rdi, -4294967296
cvtsi2ss xmm5, eax
.L4:
mov r9d, r8d
mov esi, 1
movd edx, xmm5
pxor xmm5, xmm5
pxor xmm4, xmm4
mov ecx, edx
mov rdx, QWORD PTR [rsp-24]
cvtsi2ss xmm5, r8d
add r8d, 1
cvtsi2ss xmm4, r8d
and rdx, rdi
or rdx, rcx
pxor xmm2, xmm2
mov edx, edx
movd ecx, xmm5
sal rcx, 32
or rdx, rcx
mov QWORD PTR [rsp-24], rdx
movd edx, xmm4
pxor xmm4, xmm4
mov ecx, edx
mov rdx, QWORD PTR [rsp-16]
and rdx, rdi
or rdx, rcx
lea ecx, [r9+2]
mov edx, edx
cvtsi2ss xmm4, ecx
movd ecx, xmm4
sal rcx, 32
or rdx, rcx
mov QWORD PTR [rsp-16], rdx
movaps xmm4, XMMWORD PTR [rsp-24]
mulps xmm4, xmm4
.L2:
movd edx, xmm2
mov r10d, esi
pxor xmm2, xmm2
pxor xmm7, xmm7
mov ecx, edx
mov rdx, QWORD PTR [rsp-40]
cvtsi2ss xmm2, esi
add esi, 1
and rdx, rdi
cvtsi2ss xmm7, esi
or rdx, rcx
mov ecx, edx
movd r11d, xmm2
movd edx, xmm7
sal r11, 32
or rcx, r11
pxor xmm7, xmm7
mov QWORD PTR [rsp-40], rcx
mov ecx, edx
mov rdx, QWORD PTR [rsp-32]
and rdx, rdi
or rdx, rcx
lea ecx, [r10+2]
mov edx, edx
cvtsi2ss xmm7, ecx
movd ecx, xmm7
sal rcx, 32
or rdx, rcx
mov QWORD PTR [rsp-32], rdx
movaps xmm0, XMMWORD PTR [rsp-40]
mulps xmm0, xmm0
addps xmm0, xmm4
movaps xmm3, xmm0
movaps xmm1, xmm0
shufps xmm3, xmm0, 85
addss xmm1, xmm3
movaps xmm3, xmm0
unpckhps xmm3, xmm0
shufps xmm0, xmm0, 255
addss xmm1, xmm3
addss xmm0, xmm1
cvttss2si edx, xmm0
add eax, edx
cmp r10d, 5120
jne .L2
cmp r9d, 5120
jne .L4
rep ret
main:
xor eax, eax
ret
_GLOBAL__sub_I_calculations():
sub rsp, 8
mov edi, OFFSET FLAT:_ZStL8__ioinit
call std::ios_base::Init::Init() [complete object constructor]
mov edx, OFFSET FLAT:__dso_handle
mov esi, OFFSET FLAT:_ZStL8__ioinit
mov edi, OFFSET FLAT:_ZNSt8ios_base4InitD1Ev
add rsp, 8
jmp __cxa_atexit
Note that the compiler's version uses paddd, cvtdq2ps, mulps, addps, and cvttps2dq. All of these are SIMD instructions. By combining them effectively, the compiler generates fast code.
In contrast, your code generates a lot of add, and, cvtsi2ss, lea, mov, movd, or, pxor, and sal instructions, which are not SIMD instructions.
I suspect the compiler does a better job of dealing with data type conversion and data rearrangement than you do, and that this allows it to arrange its math more effectively.
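To illustrate the point, here is a sketch (not a drop-in replacement): it keeps the counters in registers and converts with SIMD, so nothing round-trips through scalar stores inside the loop. Its per-lane truncation can differ in the low bits from truncating the scalar sum, and its accumulator wraps like the original int does.

#include <immintrin.h>

int calculations_simd()
{
    __m128i acc = _mm_setzero_si128();
    const __m128 lane = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);
    const __m128 one  = _mm_set1_ps(1.0f);
    for (int y = 0; y < 5120; y++)
    {
        const __m128 vy  = _mm_add_ps(_mm_set1_ps((float)y), lane); // y, y+1, y+2, y+3
        const __m128 vy2 = _mm_mul_ps(vy, vy);
        __m128 vx = lane;                                           // x, x+1, x+2, x+3
        for (int x = 0; x < 5120; x++)
        {
            const __m128 vres = _mm_add_ps(_mm_mul_ps(vx, vx), vy2);
            acc = _mm_add_epi32(acc, _mm_cvttps_epi32(vres));       // truncate each lane
            vx = _mm_add_ps(vx, one);                               // advance x by 1
        }
    }
    // horizontal sum of the four int lanes
    acc = _mm_add_epi32(acc, _mm_unpackhi_epi64(acc, acc));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, 1));
    return _mm_cvtsi128_si32(acc);
}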

Fetch component of std::complex as reference

Do std::real(my_complex) and my_complex.real() make copies of the real part? Is there a way I can access by reference instead of value?
For background,
I am writing some performance critical code. Within tight loops I have to do some complex * real multiplies. I found it is faster to do two real multiplies than a complex multiply, because I know one of the operands is real. To support real multiplies, I store my complex data as SOA, std::complex<std::vector<short>>. Maybe this is a bad idea but I thought it would make it obvious to the reader that this is complex data stored as structure of arrays.
Anyway, in tight loop I do something like the following:
std::real(complex_data)[0] * all_real_data[0]
std::imag(complex_data)[0] * all_real_data[0]
Turns out the real and imag lookups are a big offender in the CPU usage report.
I tried complex_data.real()[0] * all_real_data[0], but it seems to be no different.
I then abstracted the real/imag dereference out of the loop, like
std::vector<short>& my_complex_real = std::real(complex_data), and it is 2x faster.
I guess the subquestion is "Is SOA inside a std::complex a bad idea?"
Both std::real and std::complex::real give you the real part by value, which means they make a copy.
The only way you can access the real and imaginary parts of a std::complex<T> by reference is to cast it to an array. If you have
std::complex<T> foo;
Then
reinterpret_cast<T(&)[2]>(foo)[0]
gives you a reference to the real part and
reinterpret_cast<T(&)[2]>(foo)[1]
gives you a reference to the imaginary part. This is mandated to work per the standard ([complex.numbers]/4) so it is not undefined behavior.
You should also note that std::complex is only defined for std::complex<float>, std::complex<double>, and std::complex<long double>. Any other instantiation is unspecified per [complex.numbers]/2.
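A minimal usage sketch of that cast:

#include <complex>
#include <iostream>

int main()
{
    std::complex<double> z(1.0, 2.0);
    double (&parts)[2] = reinterpret_cast<double(&)[2]>(z);
    parts[0] *= 3.0;        // scales the real part in place, no copy
    std::cout << z << '\n'; // prints (3,2)
}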
I don't think the SOA idea here will be particularly productive. I assume you are having to put in global arithmetic overloads for the std::vector to make this work. But internally this also means there are two resizable vectors and two extra pointers, which is a fair bit of overhead for the kind of applications where SOA vs AOS matters. It also explains why there is significant cost in extracting the real part: the vector itself is almost certainly being copied.
@NathanOliver's answer above gives a way to get at the std::complex as an array, which will likely save the copying, but I expect you will want to at least use a custom class instead of std::vector<short>. Realistically, complex arithmetic is simple enough to implement that it may be faster to just do that part yourself.
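For instance, an explicit SOA container might look like this (a sketch with made-up names, not the OP's code; nothing here copies a vector on access):

#include <vector>

struct ComplexSoA {
    std::vector<short> re, im;   // parallel arrays for real and imaginary parts
};

// complex * real, done as two real multiplies per element
inline void scale(ComplexSoA& c, const std::vector<short>& k)
{
    for (std::size_t i = 0; i < c.re.size(); ++i) {
        c.re[i] = static_cast<short>(c.re[i] * k[i]);
        c.im[i] = static_cast<short>(c.im[i] * k[i]);
    }
}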
(Daniel H's answer is better than mine in indicating it isn't allowed by the spec and calling out cache locality specifically. You really don't want to do this.)
Using std::complex<std::vector<short>> is unspecified behavior. The only allowed specializations, unless you have a compiler extension, are std::complex<float>, std::complex<double>, and std::complex<long double>. Other arithmetic types, like std::complex<short>, are at least more likely to have sane results in practice, even if they don't have any stronger requirements in theory.
Because of cache locality, I would expect that std::vector<std::complex<short>> would have better performance, even if both types happen to work well in your implementation.
Either way, as NathanOliver points out above, reinterpret_cast<T(&)[2]>(z)[0] and reinterpret_cast<T(&)[2]>(z)[1] should give references to the real and imaginary parts, but note that complex numbers define an operator* for multiplying by the real type, so this shouldn’t be necessary.
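For instance (a sketch): the mixed complex-times-real overload already performs just the two real multiplies.

#include <complex>

std::complex<double> scale(std::complex<double> z, double k)
{
    return z * k;   // operator*(complex<double>, double) multiplies re and im by k
}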
So I knocked up this little example on godbolt.
I'm struggling to see the problem with taking copies:
#include <complex>
#include <array>

std::complex<double> foo(std::complex<double> (& complex_data)[10], double (&all_real_data)[10], int i)
{
    return std::complex<double>(std::real(complex_data[i]) * all_real_data[i],
                                std::imag(complex_data[i]) * all_real_data[i]);
}

std::array<std::complex<double>, 10>
calc(std::complex<double> (& complex_data)[10], double (&all_real_data)[10])
{
    std::array<std::complex<double>, 10> result;
    for (int i = 0 ; i < 10 ; ++i)
    {
        result[i] = foo(complex_data, all_real_data, i);
    }
    return result;
}
Compiling with -O3 on gcc yields, for calc:
calc(std::complex<double> (&) [10], double (&) [10]):
movsd xmm0, QWORD PTR [rdx]
mov rax, rdi
movsd xmm1, QWORD PTR [rsi+8]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi]
movsd QWORD PTR [rdi+8], xmm1
movsd xmm1, QWORD PTR [rsi+24]
movsd QWORD PTR [rdi], xmm0
movsd xmm0, QWORD PTR [rdx+8]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+16]
movsd QWORD PTR [rdi+24], xmm1
movsd xmm1, QWORD PTR [rsi+40]
movsd QWORD PTR [rdi+16], xmm0
movsd xmm0, QWORD PTR [rdx+16]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+32]
movsd QWORD PTR [rdi+40], xmm1
movsd xmm1, QWORD PTR [rsi+56]
movsd QWORD PTR [rdi+32], xmm0
movsd xmm0, QWORD PTR [rdx+24]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+48]
movsd QWORD PTR [rdi+56], xmm1
movsd xmm1, QWORD PTR [rsi+72]
movsd QWORD PTR [rdi+48], xmm0
movsd xmm0, QWORD PTR [rdx+32]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+64]
movsd QWORD PTR [rdi+72], xmm1
movsd xmm1, QWORD PTR [rsi+88]
movsd QWORD PTR [rdi+64], xmm0
movsd xmm0, QWORD PTR [rdx+40]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+80]
movsd QWORD PTR [rdi+88], xmm1
movsd xmm1, QWORD PTR [rsi+104]
movsd QWORD PTR [rdi+80], xmm0
movsd xmm0, QWORD PTR [rdx+48]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+96]
movsd QWORD PTR [rdi+104], xmm1
movsd xmm1, QWORD PTR [rsi+120]
movsd QWORD PTR [rdi+96], xmm0
movsd xmm0, QWORD PTR [rdx+56]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+112]
movsd QWORD PTR [rdi+120], xmm1
movsd xmm1, QWORD PTR [rsi+136]
movsd QWORD PTR [rdi+112], xmm0
movsd xmm0, QWORD PTR [rdx+64]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+128]
movsd QWORD PTR [rdi+136], xmm1
movsd xmm1, QWORD PTR [rsi+152]
movsd QWORD PTR [rdi+128], xmm0
movsd xmm0, QWORD PTR [rdx+72]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+144]
movsd QWORD PTR [rdi+152], xmm1
movsd QWORD PTR [rdi+144], xmm0
ret
with -march=native we touch memory fewer times
calc(std::complex<double> (&) [10], double (&) [10]):
vmovupd ymm1, YMMWORD PTR [rsi]
vmovupd ymm0, YMMWORD PTR [rsi+32]
mov rax, rdi
vmovupd ymm3, YMMWORD PTR [rdx]
vunpckhpd ymm2, ymm1, ymm0
vunpcklpd ymm0, ymm1, ymm0
vpermpd ymm2, ymm2, 216
vpermpd ymm0, ymm0, 216
vmulpd ymm0, ymm0, ymm3
vmulpd ymm2, ymm2, ymm3
vpermpd ymm1, ymm0, 68
vpermpd ymm0, ymm0, 238
vpermpd ymm3, ymm2, 68
vpermpd ymm2, ymm2, 238
vshufpd ymm1, ymm1, ymm3, 12
vshufpd ymm0, ymm0, ymm2, 12
vmovupd YMMWORD PTR [rdi], ymm1
vmovupd ymm1, YMMWORD PTR [rsi+64]
vmovupd YMMWORD PTR [rdi+32], ymm0
vmovupd ymm0, YMMWORD PTR [rsi+96]
vmovupd ymm3, YMMWORD PTR [rdx+32]
vunpckhpd ymm2, ymm1, ymm0
vunpcklpd ymm0, ymm1, ymm0
vpermpd ymm2, ymm2, 216
vpermpd ymm0, ymm0, 216
vmulpd ymm0, ymm0, ymm3
vmulpd ymm2, ymm2, ymm3
vpermpd ymm1, ymm0, 68
vpermpd ymm0, ymm0, 238
vpermpd ymm3, ymm2, 68
vpermpd ymm2, ymm2, 238
vshufpd ymm1, ymm1, ymm3, 12
vshufpd ymm0, ymm0, ymm2, 12
vmovupd YMMWORD PTR [rdi+64], ymm1
vmovupd YMMWORD PTR [rdi+96], ymm0
vmovsd xmm0, QWORD PTR [rdx+64]
vmulsd xmm1, xmm0, QWORD PTR [rsi+136]
vmulsd xmm0, xmm0, QWORD PTR [rsi+128]
vmovsd QWORD PTR [rdi+136], xmm1
vmovsd QWORD PTR [rdi+128], xmm0
vmovsd xmm0, QWORD PTR [rdx+72]
vmulsd xmm1, xmm0, QWORD PTR [rsi+152]
vmulsd xmm0, xmm0, QWORD PTR [rsi+144]
vmovsd QWORD PTR [rdi+152], xmm1
vmovsd QWORD PTR [rdi+144], xmm0
vzeroupper
ret

Bug in VC++ 14.0 (2015) compiler?

I've been running into some issues that only occurred during Release x86 mode and not during Release x64 or any Debug mode. I managed to reproduce the bug using the following code:
#include <stdio.h>
#include <iostream>
using namespace std;

struct WMatrix {
    float _11, _12, _13, _14;
    float _21, _22, _23, _24;
    float _31, _32, _33, _34;
    float _41, _42, _43, _44;

    WMatrix(float f11, float f12, float f13, float f14,
            float f21, float f22, float f23, float f24,
            float f31, float f32, float f33, float f34,
            float f41, float f42, float f43, float f44) :
        _11(f11), _12(f12), _13(f13), _14(f14),
        _21(f21), _22(f22), _23(f23), _24(f24),
        _31(f31), _32(f32), _33(f33), _34(f34),
        _41(f41), _42(f42), _43(f43), _44(f44) {
    }
};

void printmtx(WMatrix m1) {
    char str[256];
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._11, m1._12, m1._13, m1._14);
    cout << str << "\n";
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._21, m1._22, m1._23, m1._24);
    cout << str << "\n";
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._31, m1._32, m1._33, m1._34);
    cout << str << "\n";
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._41, m1._42, m1._43, m1._44);
    cout << str << "\n";
}

WMatrix mul1(WMatrix m, float f) {
    WMatrix out = m;
    for (unsigned int i = 0; i < 4; i++) {
        for (unsigned int j = 0; j < 4; j++) {
            unsigned int idx = i * 4 + j; // critical code
            *(&out._11 + idx) *= f;       // critical code
        }
    }
    return out;
}

WMatrix mul2(WMatrix m, float f) {
    WMatrix out = m;
    unsigned int idx2 = 0;
    for (unsigned int i = 0; i < 4; i++) {
        for (unsigned int j = 0; j < 4; j++) {
            unsigned int idx = i * 4 + j; // critical code
            bool b = idx == idx2;         // critical code
            *(&out._11 + idx) *= f;       // critical code
            idx2++;
        }
    }
    return out;
}

int main() {
    WMatrix m1(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
    WMatrix m2 = mul1(m1, 0.5f);
    WMatrix m3 = mul2(m1, 0.5f);

    printmtx(m1);
    cout << "\n";
    printmtx(m2);
    cout << "\n";
    printmtx(m3);

    int x;
    cin >> x;
}
In the above code, mul2 works, but mul1 does not. mul1 and mul2 are simply trying to iterate over the floats in the WMatrix and multiply them by f, but the way mul1 indexes (i*4+j) somehow evaluates to incorrect results. All mul2 does differently is compare the index before using it, and then it works (there are many other ways of tinkering with the index that make it work). Notice that if you remove the line "bool b = idx == idx2" then mul2 also breaks...
Here is the output:
1.000, 2.000, 3.000, 4.000
5.000, 6.000, 7.000, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000
0.500, 0.500, 0.375, 0.250
0.625, 1.500, 3.500, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000
0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000
Correct output should be...
1.000, 2.000, 3.000, 4.000
5.000, 6.000, 7.000, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000
0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000
0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000
Am I missing something? Or is it actually a bug in the compiler?
This afflicts only the 32-bit compiler; x86-64 builds are not affected, regardless of optimization settings. However, you see the problem manifest in 32-bit builds whether optimizing for speed (/O2) or size (/O1). As you mentioned, it works as expected in debugging builds with optimization disabled.
Wimmel's suggestion of changing the packing, accurate though it is, does not change the behavior. (The code below assumes the packing is correctly set to 1 for WMatrix.)
I can't reproduce it in VS 2010, but I can in VS 2013 and 2015. I don't have 2012 installed. That's good enough, though, to allow us to analyze the difference between the object code produced by the two compilers.
Here is the code for mul1 from VS 2010 (the "working" code):
(Actually, in many cases, the compiler inlined the code from this function at the call site. But the compiler will still output disassembly files containing the code it generated for the individual functions prior to inlining. That's what we're looking at here, because it is less cluttered. The behavior of the code is entirely equivalent whether it's been inlined or not.)
PUBLIC mul1
_TEXT SEGMENT
_m$ = 8 ; size = 64
_f$ = 72 ; size = 4
mul1 PROC
___$ReturnUdt$ = eax
push esi
push edi
; WMatrix out = m;
mov ecx, 16 ; 00000010H
lea esi, DWORD PTR _m$[esp+4]
mov edi, eax
rep movsd
; for (unsigned int i = 0; i < 4; i++)
; {
; for (unsigned int j = 0; j < 4; j++)
; {
; unsigned int idx = i * 4 + j; // critical code
; *(&out._11 + idx) *= f; // critical code
movss xmm0, DWORD PTR [eax]
cvtps2pd xmm1, xmm0
movss xmm0, DWORD PTR _f$[esp+4]
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax], xmm1
movss xmm1, DWORD PTR [eax+4]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+4], xmm1
movss xmm1, DWORD PTR [eax+8]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+8], xmm1
movss xmm1, DWORD PTR [eax+12]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+12], xmm1
movss xmm2, DWORD PTR [eax+16]
cvtps2pd xmm2, xmm2
cvtps2pd xmm1, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+16], xmm1
movss xmm1, DWORD PTR [eax+20]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+20], xmm1
movss xmm1, DWORD PTR [eax+24]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+24], xmm1
movss xmm1, DWORD PTR [eax+28]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+28], xmm1
movss xmm1, DWORD PTR [eax+32]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+32], xmm1
movss xmm1, DWORD PTR [eax+36]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+36], xmm1
movss xmm2, DWORD PTR [eax+40]
cvtps2pd xmm2, xmm2
cvtps2pd xmm1, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+40], xmm1
movss xmm1, DWORD PTR [eax+44]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+44], xmm1
movss xmm2, DWORD PTR [eax+48]
cvtps2pd xmm1, xmm0
cvtps2pd xmm2, xmm2
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+48], xmm1
movss xmm1, DWORD PTR [eax+52]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+52], xmm1
movss xmm1, DWORD PTR [eax+56]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
cvtps2pd xmm0, xmm0
movss DWORD PTR [eax+56], xmm1
movss xmm1, DWORD PTR [eax+60]
cvtps2pd xmm1, xmm1
mulsd xmm1, xmm0
pop edi
cvtpd2ps xmm0, xmm1
movss DWORD PTR [eax+60], xmm0
pop esi
; return out;
ret 0
mul1 ENDP
Compare that to the code for mul1 generated by VS 2015:
mul1 PROC
_m$ = 8 ; size = 64
; ___$ReturnUdt$ = ecx
; _f$ = xmm2s
; WMatrix out = m;
movups xmm0, XMMWORD PTR _m$[esp-4]
; for (unsigned int i = 0; i < 4; i++)
xor eax, eax
movaps xmm1, xmm2
movups XMMWORD PTR [ecx], xmm0
movups xmm0, XMMWORD PTR _m$[esp+12]
shufps xmm1, xmm1, 0
movups XMMWORD PTR [ecx+16], xmm0
movups xmm0, XMMWORD PTR _m$[esp+28]
movups XMMWORD PTR [ecx+32], xmm0
movups xmm0, XMMWORD PTR _m$[esp+44]
movups XMMWORD PTR [ecx+48], xmm0
npad 4
$LL4#mul1:
; for (unsigned int j = 0; j < 4; j++)
; {
; unsigned int idx = i * 4 + j; // critical code
; *(&out._11 + idx) *= f; // critical code
movups xmm0, XMMWORD PTR [ecx+eax*4]
mulps xmm0, xmm1
movups XMMWORD PTR [ecx+eax*4], xmm0
inc eax
cmp eax, 4
jb SHORT $LL4#mul1
; return out;
mov eax, ecx
ret 0
?mul1##YA?AUWMatrix##U1#M#Z ENDP ; mul1
_TEXT ENDS
It is immediately obvious how much shorter the code is. Apparently the optimizer got a lot smarter between VS 2010 and VS 2015. Unfortunately, sometimes the source of the optimizer's "smarts" is the exploitation of bugs in your code.
Looking at the code that matches up with the loops, you can see that VS 2010 is unrolling the loops. All of the computations are done inline so that there are no branches. This is kind of what you'd expect for loops with upper and lower bounds that are known at compile time and, as in this case, reasonably small.
What happened in VS 2015? Well, it didn't unroll anything. There are 5 lines of code, and then a conditional jump JB back to the top of the loop sequence. That alone doesn't tell you much. What does look highly suspicious is that it only loops 4 times (see the cmp eax, 4 statement that sets flags right before doing the jb, effectively continuing the loop as long as the counter is less than 4). Well, that might be okay if it had merged the two loops into one. Let's see what it's doing inside of the loop:
$LL4#mul1:
movups xmm0, XMMWORD PTR [ecx+eax*4] ; load a packed unaligned value into XMM0
mulps xmm0, xmm1 ; do a packed multiplication of XMM0 by XMM1,
; storing the result in XMM0
movups XMMWORD PTR [ecx+eax*4], xmm0 ; store the result of the previous multiplication
; back into the memory location that we
; initially loaded from
inc eax ; one iteration done, increment loop counter
cmp eax, 4 ; see how many loops we've done
jb $LL4#mul1 ; keep looping if < 4 iterations
The code reads a value from memory (an XMM-sized value from the location determined by ecx + eax * 4) into XMM0, multiplies it by a value in XMM1 (which was set outside the loop, based on the f parameter), and then stores the result back into the original memory location.
Compare that to the code for the corresponding loop in mul2:
$LL4#mul2:
lea eax, DWORD PTR [eax+16]
movups xmm0, XMMWORD PTR [eax-24]
mulps xmm0, xmm2
movups XMMWORD PTR [eax-24], xmm0
sub ecx, 1
jne $LL4#mul2
Aside from a different loop control sequence (this sets ECX to 4 outside of the loop, subtracts 1 each time through, and keeps looping as long as ECX != 0), the big difference here is the actual XMM values that it manipulates in memory. Instead of loading from [ecx+eax*4], it loads from [eax-24] (after having previously added 16 to EAX).
What's different about mul2? You had added code to track a separate index in idx2, incrementing it each time through the loop. Now, this alone would not be enough. If you comment out the assignment to the bool variable b, mul1 and mul2 result in identical object code. Clearly without the comparison of idx to idx2, the compiler is able to deduce that idx2 is completely unused, and therefore eliminate it, turning mul2 into mul1. But with that comparison, the compiler apparently becomes unable to eliminate idx2, and its presence ever so slightly changes what optimizations are deemed possible for the function, resulting in the output discrepancy.
Now the question turns to why is this happening. Is it an optimizer bug, as you first suspected? Well, no—and as some of the commenters have mentioned, it should never be your first instinct to blame the compiler/optimizer. Always assume that there are bugs in your code unless you can prove otherwise. That proof would always involve looking at the disassembly, and preferably referencing the relevant portions of the language standard if you really want to be taken seriously.
In this case, Mysticial has already nailed the problem. Your code exhibits undefined behavior when it does *(&out._11 + idx). This makes certain assumptions about the layout of the WMatrix struct in memory, which you cannot legally make, even after explicitly setting the packing.
This is why undefined behavior is evil—it results in code that seems to work sometimes, but other times it doesn't. It is very sensitive to compiler flags, especially optimizations, but also target platforms (as we saw at the top of this answer). mul2 only works by accident. Both mul1 and mul2 are wrong. Unfortunately, the bug is in your code. Worse, the compiler didn't issue a warning that might have alerted you to your use of undefined behavior.
If we look at the generated code, the problem is fairly clear. Ignoring a few bits and pieces that aren't related to the problem at hand, mul1 produces code like this:
movss xmm1, DWORD PTR _f$[esp-4] ; load xmm1 from _11 of source
; ...
shufps xmm1, xmm1, 0 ; duplicate _11 across floats of xmm1
; ...
for ecx = 0 to 3 {
    movups xmm0, XMMWORD PTR [dest+ecx*4]  ; load 4 floats from dest
    mulps xmm0, xmm1                       ; multiply each by _11
    movups XMMWORD PTR [dest+ecx*4], xmm0  ; store result back to dest
}
So, instead of multiplying each element of one matrix by the corresponding element of the other matrix, it's multiplying each element of one matrix by _11 of the other matrix.
Although it's impossible to confirm exactly how it happened (without looking through the compiler's source code), this certainly fits with @Mysticial's guess about how the problem arose.
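For completeness, a sketch of a fix (not from the original answers): give the matrix an actual array member, so the indexed access is in bounds and well-defined.

struct WMatrixFixed {
    float m[16];   // _11.._44 stored as one real array
};

WMatrixFixed mul(WMatrixFixed in, float f)
{
    for (unsigned int idx = 0; idx < 16; idx++)
        in.m[idx] *= f;   // in-bounds array indexing: no undefined behavior
    return in;
}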