Fetch component of std::complex as reference - c++
Do std::real(my_complex) and my_complex.real() make copies of the real part? Is there a way I can access them by reference instead of by value?
For background,
I am writing some performance-critical code. Within tight loops I have to do some complex * real multiplies. I found it is faster to do two real multiplies than a complex multiply, because I know one of the operands is real. To support the real multiplies, I store my complex data as SOA: std::complex<std::vector<short>>. Maybe this is a bad idea, but I thought it would make it obvious to the reader that this is complex data stored as a structure of arrays.
Anyway, in the tight loop I do something like the following:
std::real(complex_data)[0] * all_real_data[0]
std::imag(complex_data)[0] * all_real_data[0]
It turns out the real and imag lookups are a big offender in the CPU usage report.
I tried complex_data.real()[0] * all_real_data[0], but it seems to be no different.
I then abstracted the real/imag dereference out of the loop, like
std::vector<short>& my_complex_real = std::real(complex_data)
and it is 2x faster.
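Roughly, the hoisted version looks like this (sketched with const references so the temporaries returned by std::real/std::imag live for the whole loop; the output buffers are just placeholders):
const std::vector<short>& reals = std::real(complex_data);
const std::vector<short>& imags = std::imag(complex_data);
for (std::size_t i = 0; i < reals.size(); ++i)
{
    out_re[i] = static_cast<short>(reals[i] * all_real_data[i]);
    out_im[i] = static_cast<short>(imags[i] * all_real_data[i]);
}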
I guess the subquestion is "Is SOA inside a std::complex a bad idea?"
Both std::real and std::complex::real give you the real part by value, which means they make a copy.
The only way you can access the real and imaginary parts of a std::complex<T> by reference is to treat it as an array. If you have
std::complex<T> foo;
Then
reinterpret_cast<T(&)[2]>(foo)[0]
gives you a reference to the real part and
reinterpret_cast<T(&)[2]>(foo)[1]
gives you a reference to the imaginary part. This is mandated to work by the standard ([complex.numbers]/4), so it is not undefined behavior.
You should also note that std::complex is only specified for std::complex<float>, std::complex<double>, and std::complex<long double>. Any other instantiation is unspecified per [complex.numbers]/2.
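A minimal illustration of that array view (the variable name is just for the example):
#include <complex>
#include <iostream>

int main()
{
    std::complex<double> foo(1.0, 2.0);

    // [complex.numbers]/4 guarantees this view: element 0 is the real part, element 1 the imaginary part.
    double (&parts)[2] = reinterpret_cast<double(&)[2]>(foo);

    parts[0] *= 3.0;            // modify the real part in place
    parts[1] *= 3.0;            // modify the imaginary part in place

    std::cout << foo << '\n';   // prints (3,6)
}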
I don't think the SOA idea here will be particularly productive. I assume you are having to put in global arithmetic overloads for the std::vector to make this work. But internally this also means there are two resizable vectors and two extra pointers, which is a fair bit of overhead for the kind of application where SOA vs AOS matters. It also explains why there is significant cost in extracting the real part: the vector itself is almost certainly being copied.
NathanOliver's answer above gives a way to treat the std::complex as an array, which will likely avoid the copying, but I expect you will want to at least use a custom class instead of std::complex<std::vector<short>>. Realistically, complex arithmetic is simple enough that it may be faster to just implement that part yourself.
(Daniel H's answer is better than mine in indicating it isn't allowed by the spec and calling out cache locality specifically. You really don't want to do this.)
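For what it's worth, the custom class could be as simple as two parallel vectors. A rough sketch (the scaling loop is only illustrative, not code from the question):
#include <cstddef>
#include <vector>

// Plain structure-of-arrays layout: no per-element complex objects, no copies on access.
struct ComplexSoA
{
    std::vector<short> re;
    std::vector<short> im;
};

// Multiply every element by a real factor, touching each contiguous array once.
void scale_by_real(ComplexSoA& data, const std::vector<short>& factors)
{
    for (std::size_t i = 0; i < data.re.size(); ++i)
    {
        data.re[i] = static_cast<short>(data.re[i] * factors[i]);
        data.im[i] = static_cast<short>(data.im[i] * factors[i]);
    }
}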
Using std::complex<std::vector<short>> is unspecified behavior. The only allowed specializations, unless you have a compiler extension, are std::complex<float>, std::complex<double>, and std::complex<long double>. Other arithmetic types, like std::complex<short>, are at least more likely to have sane results in practice, even if they don't have any stronger requirements in theory.
Because of cache locality, I would expect that std::vector<std::complex<short>> would have better performance, even if both types happen to work well in your implementation.
Either way, as NathanOliver points out above, reinterpret_cast<T(&)[2]>(z)[0] and reinterpret_cast<T(&)[2]>(z)[1] should give references to the real and imaginary parts, but note that complex numbers define an operator* for multiplying by the real type, so this shouldn’t be necessary.
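For instance (sketched with float, since that is one of the specified instantiations; the function is only illustrative):
#include <complex>
#include <cstddef>
#include <vector>

// Scale each complex sample by a real factor. std::complex provides operator*= for the
// underlying real type, so no component access or casting is required.
void scale(std::vector<std::complex<float>>& data, const std::vector<float>& factors)
{
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] *= factors[i];
}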
So I knocked up this little example on godbolt.
I'm struggling to see the problem with taking copies:
#include <complex>
#include <array>

std::complex<double> foo(std::complex<double> (&complex_data)[10], double (&all_real_data)[10], int i)
{
    return std::complex<double>(std::real(complex_data[i]) * all_real_data[i],
                                std::imag(complex_data[i]) * all_real_data[i]);
}

std::array<std::complex<double>, 10>
calc(std::complex<double> (&complex_data)[10], double (&all_real_data)[10])
{
    std::array<std::complex<double>, 10> result;
    for (int i = 0; i < 10; ++i)
    {
        result[i] = foo(complex_data, all_real_data, i);
    }
    return result;
}
Compiling with -O3 on gcc yields the following for calc:
calc(std::complex<double> (&) [10], double (&) [10]):
movsd xmm0, QWORD PTR [rdx]
mov rax, rdi
movsd xmm1, QWORD PTR [rsi+8]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi]
movsd QWORD PTR [rdi+8], xmm1
movsd xmm1, QWORD PTR [rsi+24]
movsd QWORD PTR [rdi], xmm0
movsd xmm0, QWORD PTR [rdx+8]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+16]
movsd QWORD PTR [rdi+24], xmm1
movsd xmm1, QWORD PTR [rsi+40]
movsd QWORD PTR [rdi+16], xmm0
movsd xmm0, QWORD PTR [rdx+16]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+32]
movsd QWORD PTR [rdi+40], xmm1
movsd xmm1, QWORD PTR [rsi+56]
movsd QWORD PTR [rdi+32], xmm0
movsd xmm0, QWORD PTR [rdx+24]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+48]
movsd QWORD PTR [rdi+56], xmm1
movsd xmm1, QWORD PTR [rsi+72]
movsd QWORD PTR [rdi+48], xmm0
movsd xmm0, QWORD PTR [rdx+32]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+64]
movsd QWORD PTR [rdi+72], xmm1
movsd xmm1, QWORD PTR [rsi+88]
movsd QWORD PTR [rdi+64], xmm0
movsd xmm0, QWORD PTR [rdx+40]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+80]
movsd QWORD PTR [rdi+88], xmm1
movsd xmm1, QWORD PTR [rsi+104]
movsd QWORD PTR [rdi+80], xmm0
movsd xmm0, QWORD PTR [rdx+48]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+96]
movsd QWORD PTR [rdi+104], xmm1
movsd xmm1, QWORD PTR [rsi+120]
movsd QWORD PTR [rdi+96], xmm0
movsd xmm0, QWORD PTR [rdx+56]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+112]
movsd QWORD PTR [rdi+120], xmm1
movsd xmm1, QWORD PTR [rsi+136]
movsd QWORD PTR [rdi+112], xmm0
movsd xmm0, QWORD PTR [rdx+64]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+128]
movsd QWORD PTR [rdi+136], xmm1
movsd xmm1, QWORD PTR [rsi+152]
movsd QWORD PTR [rdi+128], xmm0
movsd xmm0, QWORD PTR [rdx+72]
mulsd xmm1, xmm0
mulsd xmm0, QWORD PTR [rsi+144]
movsd QWORD PTR [rdi+152], xmm1
movsd QWORD PTR [rdi+144], xmm0
ret
With -march=native we touch memory fewer times:
calc(std::complex<double> (&) [10], double (&) [10]):
vmovupd ymm1, YMMWORD PTR [rsi]
vmovupd ymm0, YMMWORD PTR [rsi+32]
mov rax, rdi
vmovupd ymm3, YMMWORD PTR [rdx]
vunpckhpd ymm2, ymm1, ymm0
vunpcklpd ymm0, ymm1, ymm0
vpermpd ymm2, ymm2, 216
vpermpd ymm0, ymm0, 216
vmulpd ymm0, ymm0, ymm3
vmulpd ymm2, ymm2, ymm3
vpermpd ymm1, ymm0, 68
vpermpd ymm0, ymm0, 238
vpermpd ymm3, ymm2, 68
vpermpd ymm2, ymm2, 238
vshufpd ymm1, ymm1, ymm3, 12
vshufpd ymm0, ymm0, ymm2, 12
vmovupd YMMWORD PTR [rdi], ymm1
vmovupd ymm1, YMMWORD PTR [rsi+64]
vmovupd YMMWORD PTR [rdi+32], ymm0
vmovupd ymm0, YMMWORD PTR [rsi+96]
vmovupd ymm3, YMMWORD PTR [rdx+32]
vunpckhpd ymm2, ymm1, ymm0
vunpcklpd ymm0, ymm1, ymm0
vpermpd ymm2, ymm2, 216
vpermpd ymm0, ymm0, 216
vmulpd ymm0, ymm0, ymm3
vmulpd ymm2, ymm2, ymm3
vpermpd ymm1, ymm0, 68
vpermpd ymm0, ymm0, 238
vpermpd ymm3, ymm2, 68
vpermpd ymm2, ymm2, 238
vshufpd ymm1, ymm1, ymm3, 12
vshufpd ymm0, ymm0, ymm2, 12
vmovupd YMMWORD PTR [rdi+64], ymm1
vmovupd YMMWORD PTR [rdi+96], ymm0
vmovsd xmm0, QWORD PTR [rdx+64]
vmulsd xmm1, xmm0, QWORD PTR [rsi+136]
vmulsd xmm0, xmm0, QWORD PTR [rsi+128]
vmovsd QWORD PTR [rdi+136], xmm1
vmovsd QWORD PTR [rdi+128], xmm0
vmovsd xmm0, QWORD PTR [rdx+72]
vmulsd xmm1, xmm0, QWORD PTR [rsi+152]
vmulsd xmm0, xmm0, QWORD PTR [rsi+144]
vmovsd QWORD PTR [rdi+152], xmm1
vmovsd QWORD PTR [rdi+144], xmm0
vzeroupper
ret
Related
Auto-vectorization for hand-unrolled initialized tiled-computation versus simple loop with no initialization
In optimization for an AABB collision detection algorithm's inner-most 4-versus-4 comparison part, I am stuck at simplifying code at the same time gaining(or just retaining) performance. Here is the version with hand-unrolled initialization: https://godbolt.org/z/TMGMhdsss inline const int intersectDim(const float minx, const float maxx, const float minx2, const float maxx2) noexcept { return !((maxx < minx2) || (maxx2 < minx)); } inline void comp4vs4( const int * const __restrict__ partId1, const int * const __restrict__ partId2, const float * const __restrict__ minx1, const float * const __restrict__ minx2, const float * const __restrict__ miny1, const float * const __restrict__ miny2, const float * const __restrict__ minz1, const float * const __restrict__ minz2, const float * const __restrict__ maxx1, const float * const __restrict__ maxx2, const float * const __restrict__ maxy1, const float * const __restrict__ maxy2, const float * const __restrict__ maxz1, const float * const __restrict__ maxz2, int * const __restrict__ out ) { alignas(32) int result[16]={ // 0v0 0v1 0v2 0v3 // 1v0 1v1 1v2 1v3 // 2v0 2v1 2v2 2v3 // 3v0 3v1 3v2 3v3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; alignas(32) int tileId1[16]={ // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3 partId1[0],partId1[1],partId1[2],partId1[3], partId1[0],partId1[1],partId1[2],partId1[3], partId1[0],partId1[1],partId1[2],partId1[3], partId1[0],partId1[1],partId1[2],partId1[3] }; alignas(32) int tileId2[16]={ // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 partId2[0],partId2[0],partId2[0],partId2[0], partId2[1],partId2[1],partId2[1],partId2[1], partId2[2],partId2[2],partId2[2],partId2[2], partId2[3],partId2[3],partId2[3],partId2[3] }; alignas(32) float tileMinX1[16]={ // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3 minx1[0],minx1[1],minx1[2],minx1[3], minx1[0],minx1[1],minx1[2],minx1[3], minx1[0],minx1[1],minx1[2],minx1[3], minx1[0],minx1[1],minx1[2],minx1[3] }; alignas(32) float tileMinX2[16]={ // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 minx2[0],minx2[0],minx2[0],minx2[0], minx2[1],minx2[1],minx2[1],minx2[1], minx2[2],minx2[2],minx2[2],minx2[2], minx2[3],minx2[3],minx2[3],minx2[3] }; alignas(32) float tileMinY1[16]={ // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3 miny1[0],miny1[1],miny1[2],miny1[3], miny1[0],miny1[1],miny1[2],miny1[3], miny1[0],miny1[1],miny1[2],miny1[3], miny1[0],miny1[1],miny1[2],miny1[3] }; alignas(32) float tileMinY2[16]={ // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 miny2[0],miny2[0],miny2[0],miny2[0], miny2[1],miny2[1],miny2[1],miny2[1], miny2[2],miny2[2],miny2[2],miny2[2], miny2[3],miny2[3],miny2[3],miny2[3] }; alignas(32) float tileMinZ1[16]={ // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3 minz1[0],minz1[1],minz1[2],minz1[3], minz1[0],minz1[1],minz1[2],minz1[3], minz1[0],minz1[1],minz1[2],minz1[3], minz1[0],minz1[1],minz1[2],minz1[3] }; alignas(32) float tileMinZ2[16]={ // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 minz2[0],minz2[0],minz2[0],minz2[0], minz2[1],minz2[1],minz2[1],minz2[1], minz2[2],minz2[2],minz2[2],minz2[2], minz2[3],minz2[3],minz2[3],minz2[3] }; alignas(32) float tileMaxX1[16]={ // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3 maxx1[0],maxx1[1],maxx1[2],maxx1[3], maxx1[0],maxx1[1],maxx1[2],maxx1[3], maxx1[0],maxx1[1],maxx1[2],maxx1[3], maxx1[0],maxx1[1],maxx1[2],maxx1[3] }; alignas(32) float tileMaxX2[16]={ // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 maxx2[0],maxx2[0],maxx2[0],maxx2[0], maxx2[1],maxx2[1],maxx2[1],maxx2[1], maxx2[2],maxx2[2],maxx2[2],maxx2[2], maxx2[3],maxx2[3],maxx2[3],maxx2[3] }; alignas(32) float tileMaxY1[16]={ // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3 
maxy1[0],maxy1[1],maxy1[2],maxy1[3], maxy1[0],maxy1[1],maxy1[2],maxy1[3], maxy1[0],maxy1[1],maxy1[2],maxy1[3], maxy1[0],maxy1[1],maxy1[2],maxy1[3] }; alignas(32) float tileMaxY2[16]={ // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 maxy2[0],maxy2[0],maxy2[0],maxy2[0], maxy2[1],maxy2[1],maxy2[1],maxy2[1], maxy2[2],maxy2[2],maxy2[2],maxy2[2], maxy2[3],maxy2[3],maxy2[3],maxy2[3] }; alignas(32) float tileMaxZ1[16]={ // 0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3 maxz1[0],maxz1[1],maxz1[2],maxz1[3], maxz1[0],maxz1[1],maxz1[2],maxz1[3], maxz1[0],maxz1[1],maxz1[2],maxz1[3], maxz1[0],maxz1[1],maxz1[2],maxz1[3] }; alignas(32) float tileMaxZ2[16]={ // 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 maxz2[0],maxz2[0],maxz2[0],maxz2[0], maxz2[1],maxz2[1],maxz2[1],maxz2[1], maxz2[2],maxz2[2],maxz2[2],maxz2[2], maxz2[3],maxz2[3],maxz2[3],maxz2[3] }; for(int i=0;i<16;i++) result[i] = (tileId1[i] < tileId2[i]); for(int i=0;i<16;i++) result[i] = result[i] && intersectDim(tileMinX1[i], tileMaxX1[i], tileMinX2[i], tileMaxX2[i]) && intersectDim(tileMinY1[i], tileMaxY1[i], tileMinY2[i], tileMaxY2[i]) && intersectDim(tileMinZ1[i], tileMaxZ1[i], tileMinZ2[i], tileMaxZ2[i]); for(int i=0;i<16;i++) out[i]=result[i]; } #include<iostream> int main() { int tile1[4];int tile2[4]; float tile3[4];float tile4[4]; float tile5[4];float tile6[4]; float tile7[4];float tile8[4]; float tile9[4];float tile10[4]; float tile11[4];float tile12[4]; float tile13[4];float tile14[4]; for(int i=0;i<4;i++) { std::cin>>tile1[i]; std::cin>>tile2[i]; std::cin>>tile3[i]; std::cin>>tile4[i]; std::cin>>tile5[i]; std::cin>>tile6[i]; std::cin>>tile7[i]; std::cin>>tile8[i]; std::cin>>tile9[i]; std::cin>>tile10[i]; std::cin>>tile11[i]; std::cin>>tile12[i]; std::cin>>tile13[i]; std::cin>>tile14[i]; } int out[16]; comp4vs4(tile1,tile2,tile3,tile4,tile5,tile6,tile7,tile8,tile9, tile10,tile11,tile12,tile13,tile14,out); for(int i=0;i<16;i++) std::cout<<out[i]; return 0; } and its output from godbolt: comp4vs4(int const*, int const*, float const*, float const*, float const*, float const*, float const*, float const*, float const*, float const*, float const*, float const*, float const*, float const*, int*): push rbp mov rbp, rsp and rsp, -32 sub rsp, 8 mov rax, QWORD PTR [rbp+80] vmovups xmm0, XMMWORD PTR [rdx] mov rdx, QWORD PTR [rbp+16] vmovups xmm6, XMMWORD PTR [rcx] vmovups xmm5, XMMWORD PTR [r9] vmovups xmm9, XMMWORD PTR [r8] vmovdqu xmm15, XMMWORD PTR [rsi] vmovdqu xmm8, XMMWORD PTR [rdi] vmovups xmm2, XMMWORD PTR [rdx] mov rdx, QWORD PTR [rbp+24] vpermilps xmm1, xmm6, 0 vmovdqa XMMWORD PTR [rsp-88], xmm15 vmovups xmm4, XMMWORD PTR [rdx] mov rdx, QWORD PTR [rbp+32] vmovups xmm14, XMMWORD PTR [rdx] mov rdx, QWORD PTR [rbp+40] vmovups xmm11, XMMWORD PTR [rdx] mov rdx, QWORD PTR [rbp+48] vcmpleps xmm3, xmm1, xmm14 vmovups xmm13, XMMWORD PTR [rdx] mov rdx, QWORD PTR [rbp+56] vpermilps xmm1, xmm11, 0 vcmpleps xmm1, xmm0, xmm1 vmovups xmm10, XMMWORD PTR [rdx] mov rdx, QWORD PTR [rbp+64] vpand xmm1, xmm1, xmm3 vpermilps xmm3, xmm5, 0 vcmpleps xmm3, xmm3, xmm13 vmovups xmm7, XMMWORD PTR [rdx] mov rdx, QWORD PTR [rbp+72] vpand xmm1, xmm1, xmm3 vpermilps xmm3, xmm10, 0 vmovaps XMMWORD PTR [rsp-72], xmm7 vcmpleps xmm3, xmm9, xmm3 vmovups xmm7, XMMWORD PTR [rdx] vpand xmm1, xmm1, xmm3 vpshufd xmm3, xmm15, 0 vpcomltd xmm3, xmm8, xmm3 vpand xmm1, xmm1, xmm3 vpermilps xmm3, xmm7, 0 vcmpleps xmm12, xmm2, xmm3 vpermilps xmm3, xmm4, 0 vcmpleps xmm3, xmm3, XMMWORD PTR [rsp-72] vpand xmm3, xmm3, xmm12 vmovdqa xmm12, XMMWORD PTR .LC0[rip] vpand xmm3, xmm3, xmm12 vpand xmm1, xmm1, xmm3 vmovdqa XMMWORD PTR 
[rsp-104], xmm1 vpermilps xmm1, xmm6, 85 vcmpleps xmm3, xmm1, xmm14 vpermilps xmm1, xmm11, 85 vcmpleps xmm1, xmm0, xmm1 vpand xmm1, xmm1, xmm3 vpermilps xmm3, xmm5, 85 vcmpleps xmm3, xmm3, xmm13 vpand xmm1, xmm1, xmm3 vpermilps xmm3, xmm10, 85 vcmpleps xmm3, xmm9, xmm3 vpand xmm1, xmm1, xmm3 vpshufd xmm3, xmm15, 85 vpermilps xmm15, xmm4, 85 vpcomltd xmm3, xmm8, xmm3 vpand xmm1, xmm1, xmm3 vpermilps xmm3, xmm7, 85 vcmpleps xmm15, xmm15, XMMWORD PTR [rsp-72] vcmpleps xmm3, xmm2, xmm3 vpand xmm3, xmm3, xmm15 vpermilps xmm15, xmm4, 170 vpand xmm3, xmm3, xmm12 vpermilps xmm4, xmm4, 255 vcmpleps xmm15, xmm15, XMMWORD PTR [rsp-72] vpand xmm1, xmm1, xmm3 vcmpleps xmm4, xmm4, XMMWORD PTR [rsp-72] vmovdqa XMMWORD PTR [rsp-120], xmm1 vpermilps xmm1, xmm6, 170 vpermilps xmm6, xmm6, 255 vcmpleps xmm3, xmm1, xmm14 vpermilps xmm1, xmm11, 170 vpermilps xmm11, xmm11, 255 vcmpleps xmm6, xmm6, xmm14 vcmpleps xmm1, xmm0, xmm1 vcmpleps xmm11, xmm0, xmm11 vpshufd xmm0, XMMWORD PTR [rsp-88], 255 vpand xmm1, xmm1, xmm3 vpermilps xmm3, xmm5, 170 vpermilps xmm5, xmm5, 255 vcmpleps xmm3, xmm3, xmm13 vpand xmm6, xmm11, xmm6 vcmpleps xmm13, xmm5, xmm13 vmovdqa xmm5, XMMWORD PTR [rsp-104] vpand xmm1, xmm1, xmm3 vpermilps xmm3, xmm10, 170 vpermilps xmm10, xmm10, 255 vcmpleps xmm3, xmm9, xmm3 vpand xmm6, xmm6, xmm13 vmovdqu XMMWORD PTR [rax], xmm5 vcmpleps xmm9, xmm9, xmm10 vpand xmm1, xmm1, xmm3 vpshufd xmm3, XMMWORD PTR [rsp-88], 170 vpand xmm9, xmm6, xmm9 vpcomltd xmm3, xmm8, xmm3 vpand xmm1, xmm1, xmm3 vpcomltd xmm8, xmm8, xmm0 vmovdqa xmm0, XMMWORD PTR [rsp-120] vpermilps xmm3, xmm7, 170 vpermilps xmm7, xmm7, 255 vcmpleps xmm3, xmm2, xmm3 vpand xmm8, xmm9, xmm8 vcmpleps xmm2, xmm2, xmm7 vmovdqu XMMWORD PTR [rax+16], xmm0 vpand xmm3, xmm3, xmm15 vpand xmm2, xmm2, xmm4 vpand xmm3, xmm3, xmm12 vpand xmm12, xmm2, xmm12 vpand xmm3, xmm1, xmm3 vpand xmm12, xmm8, xmm12 vmovdqu XMMWORD PTR [rax+32], xmm3 vmovdqu XMMWORD PTR [rax+48], xmm12 leave ret main: // character limit 30k It is ~123 lines of vector instructions. 
Since it runs somewhat ok performance, I tried to simplify it with simple bitwise operations: https://godbolt.org/z/zKqe49a73 inline const int intersectDim(const float minx, const float maxx, const float minx2, const float maxx2) noexcept { return !((maxx < minx2) || (maxx2 < minx)); } inline void comp4vs4( const int * const __restrict__ partId1, const int * const __restrict__ partId2, const float * const __restrict__ minx1, const float * const __restrict__ minx2, const float * const __restrict__ miny1, const float * const __restrict__ miny2, const float * const __restrict__ minz1, const float * const __restrict__ minz2, const float * const __restrict__ maxx1, const float * const __restrict__ maxx2, const float * const __restrict__ maxy1, const float * const __restrict__ maxy2, const float * const __restrict__ maxz1, const float * const __restrict__ maxz2, int * const __restrict__ out ) { alignas(32) int result[16]={ // 0v0 0v1 0v2 0v3 // 1v0 1v1 1v2 1v3 // 2v0 2v1 2v2 2v3 // 3v0 3v1 3v2 3v3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; for(int i=0;i<16;i++) result[i] = partId1[i&3]<partId2[i/4]; for(int i=0;i<16;i++) result[i] = result[i] && intersectDim(minx1[i&3], maxx1[i&3], minx2[i/4], maxx2[i/4]) && intersectDim(miny1[i&3], maxy1[i&3], miny2[i/4], maxy2[i/4]) && intersectDim(minz1[i&3], maxz1[i&3], minz2[i/4], maxz2[i/4]); for(int i=0;i<16;i++) out[i]=result[i]; } #include<iostream> int main() { int tile1[4];int tile2[4]; float tile3[4];float tile4[4]; float tile5[4];float tile6[4]; float tile7[4];float tile8[4]; float tile9[4];float tile10[4]; float tile11[4];float tile12[4]; float tile13[4];float tile14[4]; for(int i=0;i<4;i++) { std::cin>>tile1[i]; std::cin>>tile2[i]; std::cin>>tile3[i]; std::cin>>tile4[i]; std::cin>>tile5[i]; std::cin>>tile6[i]; std::cin>>tile7[i]; std::cin>>tile8[i]; std::cin>>tile9[i]; std::cin>>tile10[i]; std::cin>>tile11[i]; std::cin>>tile12[i]; std::cin>>tile13[i]; std::cin>>tile14[i]; } int out[16]; comp4vs4(tile1,tile2,tile3,tile4,tile5,tile6,tile7,tile8,tile9, tile10,tile11,tile12,tile13,tile14,out); for(int i=0;i<16;i++) std::cout<<out[i]; return 0; } how godbolt outputs: main: // character limit 30k vpxor xmm0, xmm0, xmm0 vmovdqa xmm3, XMMWORD PTR .LC0[rip] lea rax, [rsp+240] vpxor xmm4, xmm4, xmm4 vmovdqa XMMWORD PTR [rsp+224], xmm0 vmovdqa XMMWORD PTR [rsp+240], xmm0 vmovdqa XMMWORD PTR [rsp+256], xmm0 vmovdqa XMMWORD PTR [rsp+272], xmm0 vpcmpeqd xmm0, xmm0, xmm0 vmovdqa xmm7, xmm0 vmovdqa xmm6, xmm0 vmovdqa xmm5, xmm0 vpgatherdd xmm2, DWORD PTR [rsp+16+xmm4*4], xmm7 vmovdqa xmm4, XMMWORD PTR .LC1[rip] vpgatherdd xmm1, DWORD PTR [rdx+xmm3*4], xmm6 vmovdqa xmm7, xmm0 vmovdqa xmm6, xmm0 vpcomltd xmm1, xmm1, xmm2 vpand xmm1, xmm1, xmm4 vmovdqa XMMWORD PTR [rsp+224], xmm1 vpgatherdd xmm1, DWORD PTR [rdx+xmm3*4], xmm6 vpgatherdd xmm2, DWORD PTR [rsp+16+xmm4*4], xmm7 vmovdqa xmm6, xmm0 vmovdqa xmm7, xmm0 vpcomltd xmm1, xmm1, xmm2 vpand xmm1, xmm1, xmm4 vmovdqa XMMWORD PTR [rsp+240], xmm1 vpgatherdd xmm1, DWORD PTR [rdx+xmm3*4], xmm5 vmovdqa xmm5, XMMWORD PTR .LC2[rip] vpgatherdd xmm2, DWORD PTR [rsp+16+xmm5*4], xmm6 vmovdqa xmm5, XMMWORD PTR .LC3[rip] vmovdqa xmm6, xmm0 vpcomltd xmm1, xmm1, xmm2 vpand xmm1, xmm1, xmm4 vmovdqa XMMWORD PTR [rsp+256], xmm1 vpgatherdd xmm1, DWORD PTR [rdx+xmm3*4], xmm7 vpgatherdd xmm0, DWORD PTR [rsp+16+xmm5*4], xmm6 vmovdqa xmm7, XMMWORD PTR .LC4[rip] vpxor xmm6, xmm6, xmm6 lea rdx, [rsp+304] vpcomltd xmm0, xmm1, xmm0 vpand xmm0, xmm0, xmm4 vmovdqa XMMWORD PTR [rsp+272], xmm0 .L3: vmovdqa xmm0, XMMWORD PTR [rax-16] vmovdqa 
xmm2, xmm3 prefetcht0 [rax] add rax, 16 vpaddd xmm3, xmm3, xmm7 vpsrad xmm8, xmm2, 2 vpand xmm2, xmm2, xmm5 vpcomneqd xmm1, xmm0, xmm6 vmovaps xmm0, xmm1 vmovaps xmm11, xmm1 vmovaps xmm12, xmm1 vmovaps xmm13, xmm1 vgatherdps xmm11, DWORD PTR [rsp+144+xmm8*4], xmm0 vmovaps xmm14, xmm1 vmovaps xmm0, xmm1 vmovaps xmm10, xmm1 vmovaps xmm9, xmm1 vgatherdps xmm10, DWORD PTR [rsp+128+xmm2*4], xmm13 vgatherdps xmm0, DWORD PTR [r13+0+xmm8*4], xmm12 vgatherdps xmm9, DWORD PTR [rsp+32+xmm2*4], xmm14 vcmpleps xmm0, xmm0, xmm10 vcmpleps xmm9, xmm9, xmm11 vpand xmm0, xmm0, xmm9 vpand xmm1, xmm0, xmm1 vmovaps xmm0, xmm1 vmovaps xmm11, xmm1 vmovaps xmm15, xmm1 vmovaps xmm10, xmm1 vgatherdps xmm11, DWORD PTR [r15+xmm8*4], xmm0 vmovaps xmm12, xmm1 vmovaps xmm0, xmm1 vmovaps xmm9, xmm1 vmovaps xmm13, xmm1 vgatherdps xmm10, DWORD PTR [r12+xmm2*4], xmm12 vgatherdps xmm0, DWORD PTR [rsp+80+xmm8*4], xmm15 vgatherdps xmm9, DWORD PTR [rsp+64+xmm2*4], xmm13 vcmpleps xmm0, xmm0, xmm10 vcmpleps xmm9, xmm9, xmm11 vpand xmm0, xmm0, xmm9 vpand xmm0, xmm0, xmm1 vmovaps xmm1, xmm0 vmovaps xmm10, xmm0 vmovaps xmm9, xmm0 vmovaps xmm14, xmm0 vgatherdps xmm10, DWORD PTR [rsp+208+xmm8*4], xmm1 vmovaps xmm1, xmm0 vgatherdps xmm9, DWORD PTR [r14+xmm8*4], xmm1 vmovaps xmm1, xmm0 vmovaps xmm8, xmm0 vgatherdps xmm8, DWORD PTR [rsp+192+xmm2*4], xmm1 vmovaps xmm1, xmm0 vgatherdps xmm1, DWORD PTR [rsp+96+xmm2*4], xmm14 vcmpleps xmm2, xmm9, xmm8 vcmpleps xmm1, xmm1, xmm10 vpand xmm1, xmm1, xmm2 vpand xmm1, xmm1, xmm4 vpand xmm0, xmm0, xmm1 vmovdqa XMMWORD PTR [rax-32], xmm0 cmp rdx, rax jne .L3 vmovdqa xmm5, XMMWORD PTR [rsp+224] vmovdqa xmm7, XMMWORD PTR [rsp+240] vmovdqa xmm4, XMMWORD PTR [rsp+256] lea rbx, [rsp+288] lea r12, [rsp+352] vmovdqa XMMWORD PTR [rsp+288], xmm5 vmovdqa xmm5, XMMWORD PTR [rsp+272] vmovdqa XMMWORD PTR [rsp+304], xmm7 vmovdqa XMMWORD PTR [rsp+320], xmm4 vmovdqa XMMWORD PTR [rsp+336], xmm5 .L4: // character limit 30k it has ~110 lines of vector instructions. Despite having less instructions than first version, it runs at half performance (at least on bdver1 compiler flag). Is it because of "AND" and division operations for indexing? Also, parameters using the restrict keyword are pointing to same memory occasionally. Could this be a problem for performance? If it helps, here are performance-test source codes on some online-service with avx512-cpu (maximum 32 AABBs per leaf node): Unrolled version: https://rextester.com/YKDN52107 Readable version: https://rextester.com/TAFR72415 (slow) A bit less performance difference when 128 AABBs per leaf node(tested in godbolt server): Unrolled: https://godbolt.org/z/rx13zorjr Readable: https://godbolt.org/z/e1cfbEPKn
Why is my SSE code slower than native C++ code?
First of all, I am new to SSE. I decided to accelerate my code, but it seems, that it works slower, then my native code. This is an example, that calculates the sum of squares. On my Intel i7-6700HQ, it takes 0.43s for native code and 0.52 for SSE. So, where is a bottleneck? inline float squared_sum(const float x, const float y) { return x * x + y * y; } #define USE_SIMD void calculations() { high_resolution_clock::time_point t1, t2; int result_v = 0; t1 = high_resolution_clock::now(); alignas(16) float data_x[4]; alignas(16) float data_y[4]; alignas(16) float result[4]; __m128 v_x, v_y, v_res; for (int y = 0; y < 5120; y++) { data_y[0] = y; data_y[1] = y + 1; data_y[2] = y + 2; data_y[3] = y + 3; for (int x = 0; x < 5120; x++) { data_x[0] = x; data_x[1] = x + 1; data_x[2] = x + 2; data_x[3] = x + 3; #ifdef USE_SIMD v_x = _mm_load_ps(data_x); v_y = _mm_load_ps(data_y); v_x = _mm_mul_ps(v_x, v_x); v_y = _mm_mul_ps(v_y, v_y); v_res = _mm_add_ps(v_x, v_y); _mm_store_ps(result, v_res); #else result[0] = squared_sum(data_x[0], data_y[0]); result[1] = squared_sum(data_x[1], data_y[1]); result[2] = squared_sum(data_x[2], data_y[2]); result[3] = squared_sum(data_x[3], data_y[3]); #endif result_v += (int)(result[0] + result[1] + result[2] + result[3]); } } t2 = high_resolution_clock::now(); duration<double> time_span1 = duration_cast<duration<double>>(t2 - t1); std::cout << "Exec time:\t" << time_span1.count() << " s\n"; } UPDATE: fixed code according to comments. I am using Visual Studio 2017. Compiled for x64. Optimization: Maximum Optimization (Favor Speed) (/O2); Inline Function Expansion: Any Suitable (/Ob2); Favor Size or Speed: Favor fast code (/Ot); Omit Frame Pointers: Yes (/Oy) Conclusion Compilers generate already optimized code, so nowadays it is hard to accelerate it even more. The one thing you can do, to accelerate code more, is parallelization. Thanks for the answers. They mainly the same, so I accept Søren V. Poulsen answer because it was the first.
Modern compilers are incredible machines and will already use SIMD instructions if possible (and with the correct compilation flags). One general strategy to determine what the compiler is doing is to look at the disassembly of your code. If you don't want to do it on your own machine, you can use an online service like Godbolt: https://gcc.godbolt.org/z/T6GooQ. One tip is to avoid atomics for storing intermediate results like you are doing here. Atomic values are used to ensure synchronization between threads, and this may come at a very high computational cost, relatively speaking.
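If you would rather look at the disassembly locally instead of on Godbolt, a command along these lines (gcc shown; the flags and file name are only an example) writes Intel-syntax assembly to a file:
g++ -O3 -S -masm=intel calculations.cpp -o calculations.s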
Looking through the assembly for the compiler's code based (without your SIMD stuff), calculations(): pxor xmm2, xmm2 xor edx, edx movdqa xmm0, XMMWORD PTR .LC0[rip] movdqa xmm11, XMMWORD PTR .LC1[rip] movdqa xmm9, XMMWORD PTR .LC2[rip] movdqa xmm8, XMMWORD PTR .LC3[rip] movdqa xmm7, XMMWORD PTR .LC4[rip] .L4: movdqa xmm5, xmm0 movdqa xmm4, xmm0 cvtdq2ps xmm6, xmm0 movdqa xmm10, xmm0 paddd xmm0, xmm7 cvtdq2ps xmm3, xmm0 paddd xmm5, xmm9 paddd xmm4, xmm8 cvtdq2ps xmm5, xmm5 cvtdq2ps xmm4, xmm4 mulps xmm6, xmm6 mov eax, 5120 paddd xmm10, xmm11 mulps xmm5, xmm5 mulps xmm4, xmm4 mulps xmm3, xmm3 pxor xmm12, xmm12 .L2: movdqa xmm1, xmm12 cvtdq2ps xmm14, xmm12 mulps xmm14, xmm14 movdqa xmm13, xmm12 paddd xmm12, xmm7 cvtdq2ps xmm12, xmm12 paddd xmm1, xmm9 cvtdq2ps xmm0, xmm1 mulps xmm0, xmm0 paddd xmm13, xmm8 cvtdq2ps xmm13, xmm13 sub eax, 1 mulps xmm13, xmm13 addps xmm14, xmm6 mulps xmm12, xmm12 addps xmm0, xmm5 addps xmm13, xmm4 addps xmm12, xmm3 addps xmm0, xmm14 addps xmm0, xmm13 addps xmm0, xmm12 movdqa xmm12, xmm1 cvttps2dq xmm0, xmm0 paddd xmm2, xmm0 jne .L2 add edx, 1 movdqa xmm0, xmm10 cmp edx, 1280 jne .L4 movdqa xmm0, xmm2 psrldq xmm0, 8 paddd xmm2, xmm0 movdqa xmm0, xmm2 psrldq xmm0, 4 paddd xmm2, xmm0 movd eax, xmm2 ret main: xor eax, eax ret _GLOBAL__sub_I_calculations(): sub rsp, 8 mov edi, OFFSET FLAT:_ZStL8__ioinit call std::ios_base::Init::Init() [complete object constructor] mov edx, OFFSET FLAT:__dso_handle mov esi, OFFSET FLAT:_ZStL8__ioinit mov edi, OFFSET FLAT:_ZNSt8ios_base4InitD1Ev add rsp, 8 jmp __cxa_atexit .LC0: .long 0 .long 1 .long 2 .long 3 .LC1: .long 4 .long 4 .long 4 .long 4 .LC2: .long 1 .long 1 .long 1 .long 1 .LC3: .long 2 .long 2 .long 2 .long 2 .LC4: .long 3 .long 3 .long 3 .long 3 Your SIMD code generates: calculations(): pxor xmm5, xmm5 xor eax, eax mov r8d, 1 movabs rdi, -4294967296 cvtsi2ss xmm5, eax .L4: mov r9d, r8d mov esi, 1 movd edx, xmm5 pxor xmm5, xmm5 pxor xmm4, xmm4 mov ecx, edx mov rdx, QWORD PTR [rsp-24] cvtsi2ss xmm5, r8d add r8d, 1 cvtsi2ss xmm4, r8d and rdx, rdi or rdx, rcx pxor xmm2, xmm2 mov edx, edx movd ecx, xmm5 sal rcx, 32 or rdx, rcx mov QWORD PTR [rsp-24], rdx movd edx, xmm4 pxor xmm4, xmm4 mov ecx, edx mov rdx, QWORD PTR [rsp-16] and rdx, rdi or rdx, rcx lea ecx, [r9+2] mov edx, edx cvtsi2ss xmm4, ecx movd ecx, xmm4 sal rcx, 32 or rdx, rcx mov QWORD PTR [rsp-16], rdx movaps xmm4, XMMWORD PTR [rsp-24] mulps xmm4, xmm4 .L2: movd edx, xmm2 mov r10d, esi pxor xmm2, xmm2 pxor xmm7, xmm7 mov ecx, edx mov rdx, QWORD PTR [rsp-40] cvtsi2ss xmm2, esi add esi, 1 and rdx, rdi cvtsi2ss xmm7, esi or rdx, rcx mov ecx, edx movd r11d, xmm2 movd edx, xmm7 sal r11, 32 or rcx, r11 pxor xmm7, xmm7 mov QWORD PTR [rsp-40], rcx mov ecx, edx mov rdx, QWORD PTR [rsp-32] and rdx, rdi or rdx, rcx lea ecx, [r10+2] mov edx, edx cvtsi2ss xmm7, ecx movd ecx, xmm7 sal rcx, 32 or rdx, rcx mov QWORD PTR [rsp-32], rdx movaps xmm0, XMMWORD PTR [rsp-40] mulps xmm0, xmm0 addps xmm0, xmm4 movaps xmm3, xmm0 movaps xmm1, xmm0 shufps xmm3, xmm0, 85 addss xmm1, xmm3 movaps xmm3, xmm0 unpckhps xmm3, xmm0 shufps xmm0, xmm0, 255 addss xmm1, xmm3 addss xmm0, xmm1 cvttss2si edx, xmm0 add eax, edx cmp r10d, 5120 jne .L2 cmp r9d, 5120 jne .L4 rep ret main: xor eax, eax ret _GLOBAL__sub_I_calculations(): sub rsp, 8 mov edi, OFFSET FLAT:_ZStL8__ioinit call std::ios_base::Init::Init() [complete object constructor] mov edx, OFFSET FLAT:__dso_handle mov esi, OFFSET FLAT:_ZStL8__ioinit mov edi, OFFSET FLAT:_ZNSt8ios_base4InitD1Ev add rsp, 8 jmp __cxa_atexit Note that the compiler's 
version is using cvtdq2ps, paddd, cvtdq2ps, mulps, addps, and cvttps2dq. All of these are SIMD instructions. By combining them effectively, the compiler generates fast code. In contrast, your code generates a lot of add, and, cvtsi2ss, lea, mov, movd, or, pxor, and sal instructions, which are not SIMD instructions. I suspect the compiler does a better job of dealing with data type conversion and data rearrangement than you do, and that this allows it to arrange its math more effectively.
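As a sketch of what that could look like in the intrinsic version (the function name and shape are made up, not code from the question): building the four x values as packed integers and converting them in one instruction avoids the four scalar conversions through a temporary array:
#include <emmintrin.h>

// One packed int->float conversion (cvtdq2ps) replaces four scalar cvtsi2ss round trips.
__m128 squared_sum_vec(int x, __m128 y_squared)
{
    __m128i xi = _mm_set_epi32(x + 3, x + 2, x + 1, x);  // lanes hold x, x+1, x+2, x+3
    __m128  xf = _mm_cvtepi32_ps(xi);                    // packed int -> float
    return _mm_add_ps(_mm_mul_ps(xf, xf), y_squared);    // x*x + y*y per lane
}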
How to disable vectorization in clang++?
Consider the following small search function: template <uint32_t N> int32_t countsearch(const uint32_t *base, uint32_t needle) { uint32_t count = 0; #pragma clang loop vectorize(disable) for (const uint32_t *probe = base; probe < base + N; probe++) { if (*probe < needle) count++; } return count; } At -O2 or higher, clang vectorizes this search, e.g,. resulting in code like this (for 10 elements): int countsearch<10u>(unsigned int const*, unsigned int): # #int countsearch<10u>(unsigned int const*, unsigned int) vmovd xmm0, esi vpbroadcastd ymm0, xmm0 vpbroadcastd ymm1, dword ptr [rip + .LCPI0_0] # ymm1 = [2147483648,2147483648,2147483648,2147483648,2147483648,2147483648,2147483648,2147483648] vpxor ymm2, ymm1, ymmword ptr [rdi] vpxor ymm0, ymm0, ymm1 vpcmpgtd ymm0, ymm0, ymm2 cmp dword ptr [rdi + 32], esi vpsrld ymm1, ymm0, 31 vextracti128 xmm1, ymm1, 1 vpsubd ymm0, ymm1, ymm0 vpshufd xmm1, xmm0, 78 # xmm1 = xmm0[2,3,0,1] vpaddd ymm0, ymm0, ymm1 vphaddd ymm0, ymm0, ymm0 vmovd eax, xmm0 adc eax, 0 cmp dword ptr [rdi + 36], esi adc eax, 0 vzeroupper ret How can I disable this vectorization on the command line or using a #pragma in the code? I tried the following command line arguments, none of which prevented the vectorization: -disable-loop-vectorization -disable-vectorization -fno-vectorize -fno-tree-vectorize I also tried #pragma clang loop vectorize(disable) above the loop as you seen in the code above, without luck.
Turn off SLP vectorization:
clang++ -O2 -fno-slp-vectorize
(Godbolt link)
Using the XMM0 register and memory fetches (C++ code) is twice as fast as ASM using only XMM registers - why?
I'm trying to implement some inline assembler (in Visual Studio 2012 C++ code) to take advantage of SSE. I want to add 7 numbers for 1e9 times so i placed them from RAM to xmm0 to xmm6 registers of CPU. when i do it with inline assembly in visual studio 2012 with this code: the C++ code: for(int i=0;i<count;i++) resVal+=val1+val2+val3+val4+val5+val6+val7; my ASM code: int count=1000000000; double resVal=0.0; //placing values to register __asm{ movsd xmm0,val1;placing var1 in xmm0 register movsd xmm1,val2 movsd xmm2,val3 movsd xmm3,val4 movsd xmm4,val5 movsd xmm5,val6 movsd xmm6,val7 pxor xmm7,xmm7;//turns xmm7 to zero } for(int i=0;i<count;i++) { __asm { addsd xmm7,xmm0;//+=var1 addsd xmm7,xmm1;//+=var2 addsd xmm7,xmm2; addsd xmm7,xmm3; addsd xmm7,xmm4; addsd xmm7,xmm5; addsd xmm7,xmm6;//+=var7 } } __asm { movsd resVal,xmm7;//placing xmm7 into resVal } and this is the dis assembled code from C++ compiler for the code 'resVal+=val1+val2+val3+val4+val5+val6+val7': movsd xmm0,mmword ptr [val1] addsd xmm0,mmword ptr [val2] addsd xmm0,mmword ptr [val3] addsd xmm0,mmword ptr [val4] addsd xmm0,mmword ptr [val5] addsd xmm0,mmword ptr [val6] addsd xmm0,mmword ptr [val7] addsd xmm0,mmword ptr [resVal] movsd mmword ptr [resVal],xmm0 As is visible the compiler uses just one xmm0 register and for other times it is fetching values from RAM. Answer of both codes (my ASM code and c++ code) is same but the c++ code takes about half the time of my asm code to execute! I was readed about CPU registers that working with them is much faster than memory. I dont think this ratio be true. Why the asm version have lower performance of C++ code?
Once the data is in the cache (which it will be the case after the first loop, if it's not there already), it makes little difference if you use memory or register. A floating point add will take a little longer than single cycle in the first place. The final store to resVal "unties" the xmm0 register to allow the register to be freely "renamed", which allows more of the loops to be run in parallel. This is a typical case of "unless you are absolutely sure, leave writing code to the compiler". The last bullet above explains why the code is faster than code where every step of the loop depends on a previously calculated result. In the compiler generated code, the loop can do the equivalent of: movsd xmm0,mmword ptr [val1] addsd xmm0,mmword ptr [val2] addsd xmm0,mmword ptr [val3] addsd xmm0,mmword ptr [val4] addsd xmm0,mmword ptr [val5] addsd xmm0,mmword ptr [val6] addsd xmm0,mmword ptr [val7] addsd xmm0,mmword ptr [resVal] movsd mmword ptr [resVal],xmm0 movsd xmm1,mmword ptr [val1] addsd xmm1,mmword ptr [val2] addsd xmm1,mmword ptr [val3] addsd xmm1,mmword ptr [val4] addsd xmm1,mmword ptr [val5] addsd xmm1,mmword ptr [val6] addsd xmm1,mmword ptr [val7] addsd xmm1,mmword ptr [resVal] movsd mmword ptr [resVal],xmm1 Now, as you can see, we could "mingle" these two "threads": movsd xmm0,mmword ptr [val1] movsd xmm1,mmword ptr [val1] addsd xmm0,mmword ptr [val2] addsd xmm1,mmword ptr [val2] addsd xmm0,mmword ptr [val3] addsd xmm1,mmword ptr [val3] addsd xmm0,mmword ptr [val4] addsd xmm1,mmword ptr [val4] addsd xmm0,mmword ptr [val5] addsd xmm1,mmword ptr [val5] addsd xmm0,mmword ptr [val6] addsd xmm1,mmword ptr [val6] addsd xmm0,mmword ptr [val7] addsd xmm1,mmword ptr [val7] addsd xmm0,mmword ptr [resVal] movsd mmword ptr [resVal],xmm0 // Here we have to wait for resval to be uppdated! addsd xmm1,mmword ptr [resVal] movsd mmword ptr [resVal],xmm1 I'm not suggesting it is quite that much out of order execution, but I can certainly see how the loop can be executed faster that your loop. You can probably achieve the same thing in your assembler code if you had a spare register [in x86_64 you do have another 8 registers, although you can't use inline assembler in x86_64...] (Note that register renaming is different from my "threaded" loop, which is using two different registers - but the effect is roughly the same, the loop can continue after it hits the "resVal" update without having to wait for the result to be updated)
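A rough C++-level illustration of the same point (the function and names are made up for this sketch): two independent accumulators break the single dependency chain, so consecutive additions can overlap:
double sum_two_chains(const double* vals, int count)
{
    double acc0 = 0.0, acc1 = 0.0;   // two independent dependency chains
    int i = 0;
    for (; i + 1 < count; i += 2)
    {
        acc0 += vals[i];             // chain 1
        acc1 += vals[i + 1];         // chain 2, does not wait on chain 1
    }
    if (i < count)
        acc0 += vals[i];             // leftover element if count is odd
    return acc0 + acc1;              // fold the chains once at the end
}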
It may be useful not to use _asm, but intrinsic functions and intrinsic types like __m128i or __m128d, which represent SSE registers. See immintrin.h; it defines these types and many SSE functions. You can find a good description and specification for them here: http://software.intel.com/sites/landingpage/IntrinsicsGuide/
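As a minimal sketch of that approach (the function is an assumption for illustration, not code from the question): summing values through an __m128d accumulator and letting the compiler choose the registers:
#include <emmintrin.h>   // SSE2 intrinsics: __m128d, _mm_add_pd, ...

// Sum seven doubles, two lanes at a time.
double sum7(const double vals[7])
{
    __m128d acc = _mm_setzero_pd();
    acc = _mm_add_pd(acc, _mm_loadu_pd(vals));       // accumulate vals[0], vals[1]
    acc = _mm_add_pd(acc, _mm_loadu_pd(vals + 2));   // accumulate vals[2], vals[3]
    acc = _mm_add_pd(acc, _mm_loadu_pd(vals + 4));   // accumulate vals[4], vals[5]
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1] + vals[6];            // fold the two lanes, add the odd element
}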
SSE2 - 16-byte aligned dynamic allocation of memory
EDIT: This is a followup to SSE2 Compiler Error This is the real bug I experienced before and have reproduced below by changing the _mm_malloc statement as Michael Burr suggested: Unhandled exception at 0x00415116 in SO.exe: 0xC0000005: Access violation reading location 0xffffffff. At line label: movdqa xmm0, xmmword ptr [t1+eax] I'm trying to dynamically allocate t1 and t2 and according to this tutorial, I've used _mm_malloc: #include <emmintrin.h> int main(int argc, char* argv[]) { int *t1, *t2; const int n = 100000; t1 = (int*)_mm_malloc(n*sizeof(int),16); t2 = (int*)_mm_malloc(n*sizeof(int),16); __m128i mul1, mul2; for (int j = 0; j < n; j++) { t1[j] = j; t2[j] = (j+1); } // set temporary variables to random values _asm { mov eax, 0 label: movdqa xmm0, xmmword ptr [t1+eax] movdqa xmm1, xmmword ptr [t2+eax] pmuludq xmm0, xmm1 movdqa mul1, xmm0 movdqa xmm0, xmmword ptr [t1+eax] pshufd xmm0, xmm0, 05fh pshufd xmm1, xmm1, 05fh pmuludq xmm0, xmm1 movdqa mul2, xmm0 add eax, 16 cmp eax, 100000 jnge label } _mm_free(t1); _mm_free(t2); return 0; }
I think the 2nd problem is that you're reading at an offset from the pointer variable (not an offset from what the pointer points to). Change: label: movdqa xmm0, xmmword ptr [t1+eax] To something like: mov ebx, [t1] label: movdqa xmm0, xmmword ptr [ebx+eax] And similarly for your accesses through the t2 pointer. This might be even better (though I haven't had an opportunity to test it, so it might not even work): _asm { mov eax, [t1] mov ebx, [t1] lea ecx, [eax + (100000*4)] label: movdqa xmm0, xmmword ptr [eax] movdqa xmm1, xmmword ptr [ebx] pmuludq xmm0, xmm1 movdqa mul1, xmm0 movdqa xmm0, xmmword ptr [eax] pshufd xmm0, xmm0, 05fh pshufd xmm1, xmm1, 05fh pmuludq xmm0, xmm1 movdqa mul2, xmm0 add eax, 16 add ebx, 16 cmp eax, ecx jnge label }
You're not allocating enough memory; it should be:
t1 = (int*)_mm_malloc(n * sizeof(int), 16);
t2 = (int*)_mm_malloc(n * sizeof(int), 16);
Perhaps:
t1 = (int*)_mm_malloc(n * sizeof(int), 16);