Bug in VC++ 14.0 (2015) compiler? - c++
I've been running into some issues that occur only in Release x86 builds, not in Release x64 or in any Debug build. I managed to reproduce the bug using the following code:
#include <stdio.h>
#include <iostream>
using namespace std;
struct WMatrix {
float _11, _12, _13, _14;
float _21, _22, _23, _24;
float _31, _32, _33, _34;
float _41, _42, _43, _44;
WMatrix(float f11, float f12, float f13, float f14,
float f21, float f22, float f23, float f24,
float f31, float f32, float f33, float f34,
float f41, float f42, float f43, float f44) :
_11(f11), _12(f12), _13(f13), _14(f14),
_21(f21), _22(f22), _23(f23), _24(f24),
_31(f31), _32(f32), _33(f33), _34(f34),
_41(f41), _42(f42), _43(f43), _44(f44) {
}
};
void printmtx(WMatrix m1) {
char str[256];
sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._11, m1._12, m1._13, m1._14);
cout << str << "\n";
sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._21, m1._22, m1._23, m1._24);
cout << str << "\n";
sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._31, m1._32, m1._33, m1._34);
cout << str << "\n";
sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._41, m1._42, m1._43, m1._44);
cout << str << "\n";
}
WMatrix mul1(WMatrix m, float f) {
WMatrix out = m;
for (unsigned int i = 0; i < 4; i++) {
for (unsigned int j = 0; j < 4; j++) {
unsigned int idx = i * 4 + j; // critical code
*(&out._11 + idx) *= f; // critical code
}
}
return out;
}
WMatrix mul2(WMatrix m, float f) {
WMatrix out = m;
unsigned int idx2 = 0;
for (unsigned int i = 0; i < 4; i++) {
for (unsigned int j = 0; j < 4; j++) {
unsigned int idx = i * 4 + j; // critical code
bool b = idx == idx2; // critical code
*(&out._11 + idx) *= f; // critical code
idx2++;
}
}
return out;
}
int main() {
WMatrix m1(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
WMatrix m2 = mul1(m1, 0.5f);
WMatrix m3 = mul2(m1, 0.5f);
printmtx(m1);
cout << "\n";
printmtx(m2);
cout << "\n";
printmtx(m3);
int x;
cin >> x;
}
In the above code, mul2 works, but mul1 does not. Both functions simply try to iterate over the floats in the WMatrix and multiply each one by f, but the way mul1 indexes (i*4+j) somehow produces incorrect results. The only thing mul2 does differently is compare the index against a second counter before using it, and then it works (there are many other ways of tinkering with the index that also make it work). Notice that if you remove the line "bool b = idx == idx2", then mul2 breaks as well.
Here is the output:
1.000, 2.000, 3.000, 4.000
5.000, 6.000, 7.000, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000
0.500, 0.500, 0.375, 0.250
0.625, 1.500, 3.500, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000
0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000
Correct output should be...
1.000, 2.000, 3.000, 4.000
5.000, 6.000, 7.000, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000
0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000
0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000
Am I missing something? Or is it actually a bug in the compiler?
This afflicts only the 32-bit compiler; x86-64 builds are not affected, regardless of optimization settings. However, you see the problem manifest in 32-bit builds whether optimizing for speed (/O2) or size (/O1). As you mentioned, it works as expected in debugging builds with optimization disabled.
Wimmel's suggestion of changing the packing, accurate though it is, does not change the behavior. (The code below assumes the packing is correctly set to 1 for WMatrix.)
I can't reproduce it in VS 2010, but I can in VS 2013 and 2015. I don't have 2012 installed. That's good enough, though, to allow us to analyze the difference between the object code produced by the two compilers.
Here is the code for mul1 from VS 2010 (the "working" code):
(Actually, in many cases, the compiler inlined the code from this function at the call site. But the compiler will still output disassembly files containing the code it generated for the individual functions prior to inlining. That's what we're looking at here, because it is less cluttered. The behavior of the code is entirely equivalent whether it's been inlined or not.)
PUBLIC mul1
_TEXT SEGMENT
_m$ = 8 ; size = 64
_f$ = 72 ; size = 4
mul1 PROC
___$ReturnUdt$ = eax
push esi
push edi
; WMatrix out = m;
mov ecx, 16 ; 00000010H
lea esi, DWORD PTR _m$[esp+4]
mov edi, eax
rep movsd
; for (unsigned int i = 0; i < 4; i++)
; {
; for (unsigned int j = 0; j < 4; j++)
; {
; unsigned int idx = i * 4 + j; // critical code
; *(&out._11 + idx) *= f; // critical code
movss xmm0, DWORD PTR [eax]
cvtps2pd xmm1, xmm0
movss xmm0, DWORD PTR _f$[esp+4]
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax], xmm1
movss xmm1, DWORD PTR [eax+4]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+4], xmm1
movss xmm1, DWORD PTR [eax+8]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+8], xmm1
movss xmm1, DWORD PTR [eax+12]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+12], xmm1
movss xmm2, DWORD PTR [eax+16]
cvtps2pd xmm2, xmm2
cvtps2pd xmm1, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+16], xmm1
movss xmm1, DWORD PTR [eax+20]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+20], xmm1
movss xmm1, DWORD PTR [eax+24]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+24], xmm1
movss xmm1, DWORD PTR [eax+28]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+28], xmm1
movss xmm1, DWORD PTR [eax+32]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+32], xmm1
movss xmm1, DWORD PTR [eax+36]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+36], xmm1
movss xmm2, DWORD PTR [eax+40]
cvtps2pd xmm2, xmm2
cvtps2pd xmm1, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+40], xmm1
movss xmm1, DWORD PTR [eax+44]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+44], xmm1
movss xmm2, DWORD PTR [eax+48]
cvtps2pd xmm1, xmm0
cvtps2pd xmm2, xmm2
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+48], xmm1
movss xmm1, DWORD PTR [eax+52]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
movss DWORD PTR [eax+52], xmm1
movss xmm1, DWORD PTR [eax+56]
cvtps2pd xmm1, xmm1
cvtps2pd xmm2, xmm0
mulsd xmm1, xmm2
cvtpd2ps xmm1, xmm1
cvtps2pd xmm0, xmm0
movss DWORD PTR [eax+56], xmm1
movss xmm1, DWORD PTR [eax+60]
cvtps2pd xmm1, xmm1
mulsd xmm1, xmm0
pop edi
cvtpd2ps xmm0, xmm1
movss DWORD PTR [eax+60], xmm0
pop esi
; return out;
ret 0
mul1 ENDP
Compare that to the code for mul1 generated by VS 2015:
mul1 PROC
_m$ = 8 ; size = 64
; ___$ReturnUdt$ = ecx
; _f$ = xmm2s
; WMatrix out = m;
movups xmm0, XMMWORD PTR _m$[esp-4]
; for (unsigned int i = 0; i < 4; i++)
xor eax, eax
movaps xmm1, xmm2
movups XMMWORD PTR [ecx], xmm0
movups xmm0, XMMWORD PTR _m$[esp+12]
shufps xmm1, xmm1, 0
movups XMMWORD PTR [ecx+16], xmm0
movups xmm0, XMMWORD PTR _m$[esp+28]
movups XMMWORD PTR [ecx+32], xmm0
movups xmm0, XMMWORD PTR _m$[esp+44]
movups XMMWORD PTR [ecx+48], xmm0
npad 4
$LL4@mul1:
; for (unsigned int j = 0; j < 4; j++)
; {
; unsigned int idx = i * 4 + j; // critical code
; *(&out._11 + idx) *= f; // critical code
movups xmm0, XMMWORD PTR [ecx+eax*4]
mulps xmm0, xmm1
movups XMMWORD PTR [ecx+eax*4], xmm0
inc eax
cmp eax, 4
jb SHORT $LL4@mul1
; return out;
mov eax, ecx
ret 0
?mul1@@YA?AUWMatrix@@U1@M@Z ENDP ; mul1
_TEXT ENDS
It is immediately obvious how much shorter the code is. Apparently the optimizer got a lot smarter between VS 2010 and VS 2015. Unfortunately, sometimes the source of the optimizer's "smarts" is the exploitation of bugs in your code.
Looking at the code that matches up with the loops, you can see that VS 2010 is unrolling the loops. All of the computations are done inline so that there are no branches. This is kind of what you'd expect for loops with upper and lower bounds that are known at compile time and, as in this case, reasonably small.
What happened in VS 2015? Well, it didn't unroll anything. There are 5 lines of code, and then a conditional jump JB back to the top of the loop sequence. That alone doesn't tell you much. What does look highly suspicious is that it only loops 4 times (see the cmp eax, 4 statement that sets flags right before doing the jb, effectively continuing the loop as long as the counter is less than 4). Well, that might be okay if it had merged the two loops into one. Let's see what it's doing inside of the loop:
$LL4@mul1:
movups xmm0, XMMWORD PTR [ecx+eax*4] ; load a packed unaligned value into XMM0
mulps xmm0, xmm1 ; do a packed multiplication of XMM0 by XMM1,
; storing the result in XMM0
movups XMMWORD PTR [ecx+eax*4], xmm0 ; store the result of the previous multiplication
; back into the memory location that we
; initially loaded from
inc eax ; one iteration done, increment loop counter
cmp eax, 4 ; see how many loops we've done
jb $LL4@mul1 ; keep looping if < 4 iterations
The code reads a value from memory (an XMM-sized value from the location determined by ecx + eax * 4) into XMM0, multiplies it by a value in XMM1 (which was set outside the loop, based on the f parameter), and then stores the result back into the original memory location.
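To see what that addressing does to the data, the loop can be emulated in portable C++. This is only a sketch: `scale_windows`, `one_to_sixteen`, and the `stride` parameter are names made up for illustration, and it assumes that each movups/mulps/movups triple scales a window of four consecutive floats. With eax incrementing by 1 and a scale factor of 4 bytes, the start of the window moves by one float per iteration, which corresponds to `stride == 1` here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Emulates a four-iteration loop in which each iteration scales a window of
// four consecutive floats and the window start advances by `stride` floats.
// (Hypothetical helper for illustration; not compiler output.)
std::array<float, 16> scale_windows(std::array<float, 16> v, float f,
                                    std::size_t stride) {
    for (std::size_t k = 0; k < 4; ++k) {
        const std::size_t start = k * stride;
        assert(start + 4 <= v.size());  // the window must stay inside the data
        for (std::size_t lane = 0; lane < 4; ++lane) {
            v[start + lane] *= f;       // one lane of the packed multiply
        }
    }
    return v;
}

// Builds the test matrix from the question as a flat array: 1, 2, ..., 16.
std::array<float, 16> one_to_sixteen() {
    std::array<float, 16> v{};
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] = static_cast<float>(i + 1);
    }
    return v;
}
```

Calling `scale_windows(one_to_sixteen(), 0.5f, 1)` yields 0.5, 0.5, 0.375, 0.25, 0.625, 1.5, 3.5, 8 followed by the untouched 9 through 16, which is the incorrect second matrix shown in the question. With a stride of 4, every element is halved exactly once.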
Compare that to the code for the corresponding loop in mul2:
$LL4@mul2:
lea eax, DWORD PTR [eax+16]
movups xmm0, XMMWORD PTR [eax-24]
mulps xmm0, xmm2
movups XMMWORD PTR [eax-24], xmm0
sub ecx, 1
jne $LL4@mul2
Aside from a different loop control sequence (this sets ECX to 4 outside of the loop, subtracts 1 each time through, and keeps looping as long as ECX != 0), the big difference here is the actual XMM values that it manipulates in memory. Instead of loading from [ecx+eax*4], it loads from [eax-24] (after having previously added 16 to EAX). In other words, the address in mul2 advances by 16 bytes per iteration, one full row of four floats, whereas the address in mul1 advances by only 4 bytes.
What's different about mul2? You had added code to track a separate index in idx2, incrementing it each time through the loop. Now, this alone would not be enough. If you comment out the assignment to the bool variable b, mul1 and mul2 result in identical object code. Clearly without the comparison of idx to idx2, the compiler is able to deduce that idx2 is completely unused, and therefore eliminate it, turning mul2 into mul1. But with that comparison, the compiler apparently becomes unable to eliminate idx2, and its presence ever so slightly changes what optimizations are deemed possible for the function, resulting in the output discrepancy.
Now the question turns to why this is happening. Is it an optimizer bug, as you first suspected? Well, no. As some of the commenters have mentioned, it should never be your first instinct to blame the compiler or optimizer. Always assume that the bug is in your code unless you can prove otherwise. That proof would always involve looking at the disassembly, and preferably referencing the relevant portions of the language standard if you really want to be taken seriously.
In this case, Mysticial has already nailed the problem. Your code exhibits undefined behavior when it does *(&out._11 + idx). Pointer arithmetic is only defined within an array, and &out._11 points to a single float member, so stepping past it relies on assumptions about the layout of the WMatrix struct in memory that you cannot legally make, even after explicitly setting the packing.
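For illustration, the layout assumption can at least be checked at compile time. The sketch below uses a hypothetical stand-in struct named `Matrix16` (it has the same sixteen members as WMatrix but is not the type from the question), and `layout_is_contiguous` is a made-up helper. Passing these checks only shows that the floats happen to be contiguous with a given compiler and settings; it does not make dereferencing `&m._11 + idx` defined for any nonzero idx.

```cpp
#include <cstddef>

// Hypothetical stand-in with the same sixteen float members as WMatrix.
struct Matrix16 {
    float _11, _12, _13, _14;
    float _21, _22, _23, _24;
    float _31, _32, _33, _34;
    float _41, _42, _43, _44;
};

// True when the members sit back to back with no padding at the sampled
// positions and the struct is exactly sixteen floats in size.
constexpr bool layout_is_contiguous() {
    return sizeof(Matrix16) == 16 * sizeof(float)
        && offsetof(Matrix16, _12) == 1 * sizeof(float)
        && offsetof(Matrix16, _21) == 4 * sizeof(float)
        && offsetof(Matrix16, _44) == 15 * sizeof(float);
}

// Fails the build if the compiler lays the struct out differently.
static_assert(layout_is_contiguous(), "Matrix16 has unexpected padding");
```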
This is why undefined behavior is evil: it results in code that seems to work sometimes and fails at other times. It is very sensitive to compiler flags, especially optimizations, and also to the target platform (as we saw at the top of this answer). mul2 only works by accident. Both mul1 and mul2 are wrong. Unfortunately, the bug is in your code. Worse, the compiler didn't issue a warning that might have alerted you to your use of undefined behavior.
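One way to sidestep the problem is to keep the elements in a real array, so that indexing them in a loop is ordinary array access. The following is a minimal sketch under that assumption; `Mat4`, `at`, and `scaled` are hypothetical names, not part of the code in the question.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical replacement for WMatrix: the elements live in an actual
// 4x4 array, so iterating over them does not rely on struct layout.
struct Mat4 {
    float m[4][4];

    // Bounds-checked element access (the check is active in debug builds).
    float& at(std::size_t row, std::size_t col) {
        assert(row < 4 && col < 4);
        return m[row][col];
    }
};

// Returns a copy of `in` with every element multiplied by f.
Mat4 scaled(Mat4 in, float f) {
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            in.at(i, j) *= f;
        }
    }
    return in;
}
```

If the named members (_11, _12, and so on) are needed by other code, small accessor functions that return references into the array would be one option; that part is left out of the sketch.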
If we look at the generated code, the problem is fairly clear. Ignoring a few bits and pieces that aren't related to the problem at hand, mul1 produces code like this:
movss xmm1, DWORD PTR _f$[esp-4] ; load the scalar f into xmm1
; ...
shufps xmm1, xmm1, 0 ; broadcast f across all four floats of xmm1
; ...
for ecx = 0 to 3 {
movups xmm0, XMMWORD PTR [dest+ecx*4] ; load 4 floats starting at float index ecx
mulps xmm0, xmm1 ; multiply each of them by f
movups XMMWORD PTR [dest+ecx*4], xmm0 ; store result back to dest
}
So the multiplier is fine: xmm1 holds f in all four lanes. The problem is the address. ecx advances by 1 per iteration and is scaled by 4 bytes, so each iteration starts one float after the previous one instead of four floats (16 bytes) after it. The four 16-byte windows overlap: elements 0 through 6 get multiplied by f between one and four times, and elements 7 through 15 are never touched, which matches the output in the question.
Although it's impossible to confirm exactly how it happened (without looking through the compiler's source code), this certainly fits with @Mysticial's guess about how the problem arose.
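For comparison, here is roughly what a correct hand-vectorized version of the loop looks like with SSE intrinsics. It is a sketch, not what either compiler emits: `scale16_sse` is a made-up name, it assumes an x86 or x86-64 target, and it takes a plain float array rather than a WMatrix. The detail that matters is the address, which moves by four floats per iteration.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics; x86 and x86-64 targets only

// Scales 16 contiguous floats in place, one row of four at a time.
void scale16_sse(float* data, float f) {
    assert(data != nullptr);
    const __m128 factor = _mm_set1_ps(f);  // broadcast f to all four lanes
    for (int row = 0; row < 4; ++row) {
        float* p = data + 4 * row;         // advance four floats (16 bytes) per row
        __m128 v = _mm_loadu_ps(p);        // unaligned load, like movups
        v = _mm_mul_ps(v, factor);         // packed multiply, like mulps
        _mm_storeu_ps(p, v);               // unaligned store back to the same row
    }
}
```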
Related
Loop unroll issue with Visual Studio compiler
I have some simple setup, where I noticed that VS compiler seems not smart enough to unroll loop, but other compilers like clang or gcc do so. Do I miss some optimization flag for VS?
#include <cstddef>
struct A {
    double data[4];
    double *begin() { return data; }
    double *end() { return data + 4; }
    double const *begin() const { return data; }
    double const *end() const { return data + 4; }
};
double sum_index(A const &a) {
    double ret = 0;
    for(std::size_t i = 0; i < 4; ++i) {
        ret += a.data[i];
    }
    return ret;
}
double sum_iter(A const &a) {
    double ret = 0;
    for(auto const &v : a) {
        ret += v;
    }
    return ret;
}
I used https://godbolt.org/ compiler explorer to generate assembler code.
gcc 11.2 with -O3:
sum_index(A const&):
pxor xmm0, xmm0
addsd xmm0, QWORD PTR [rdi]
addsd xmm0, QWORD PTR [rdi+8]
addsd xmm0, QWORD PTR [rdi+16]
addsd xmm0, QWORD PTR [rdi+24]
ret
sum_iter(A const&):
movsd xmm1, QWORD PTR [rdi]
addsd xmm1, QWORD PTR .LC0[rip]
movsd xmm0, QWORD PTR [rdi+8]
addsd xmm1, xmm0
movupd xmm0, XMMWORD PTR [rdi+16]
addsd xmm1, xmm0
unpckhpd xmm0, xmm0
addsd xmm0, xmm1
ret
.LC0:
.long 0
.long 0
clang 13.0.1 with -O3:
sum_index(A const&): # @sum_index(A const&)
xorpd xmm0, xmm0
addsd xmm0, qword ptr [rdi]
addsd xmm0, qword ptr [rdi + 8]
addsd xmm0, qword ptr [rdi + 16]
addsd xmm0, qword ptr [rdi + 24]
ret
sum_iter(A const&): # @sum_iter(A const&)
xorpd xmm0, xmm0
addsd xmm0, qword ptr [rdi]
addsd xmm0, qword ptr [rdi + 8]
addsd xmm0, qword ptr [rdi + 16]
addsd xmm0, qword ptr [rdi + 24]
ret
MSVC 19.30 with /O2 (there is no /O3?):
this$ = 8
double const * A::begin(void)const PROC ; A::begin, COMDAT
mov rax, rcx
ret 0
double const * A::begin(void)const ENDP ; A::begin
this$ = 8
double const * A::end(void)const PROC ; A::end, COMDAT
lea rax, QWORD PTR [rcx+32]
ret 0
double const * A::end(void)const ENDP ; A::end
a$ = 8
double sum_index(A const &) PROC ; sum_index, COMDAT
movsd xmm0, QWORD PTR [rcx]
xorps xmm1, xmm1
addsd xmm0, xmm1
addsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rcx+16]
addsd xmm0, QWORD PTR [rcx+24]
ret 0
double sum_index(A const &) ENDP ; sum_index
a$ = 8
double sum_iter(A const &) PROC ; sum_iter, COMDAT
lea rax, QWORD PTR [rcx+32]
xorps xmm0, xmm0
cmp rcx, rax
je SHORT $LN12@sum_iter
npad 4
$LL8@sum_iter:
addsd xmm0, QWORD PTR [rcx]
add rcx, 8
cmp rcx, rax
jne SHORT $LL8@sum_iter
$LN12@sum_iter:
ret 0
double sum_iter(A const &) ENDP ; sum_iter
Obviously there is problem with unrolling the loop for MSVC. Is there some additional optimization flag I have to set? Thanks for help!
Is clang really adding vectors optimally in this C/C++ example?
I have the following C/C++ code: #define SIZE 2 typedef struct vec { float data[SIZE]; } vec; vec add(vec a, vec b) { vec result; for (size_t i = 0; i < SIZE; ++i) { result.data[i] = a.data[i] + b.data[i]; } return result; } I was wondering how clang would optimize this vector addition and the compiler output surprised me, as it looks quite unoptimal. This is at -O3 and with -march=skylake. (Godbolt with clang 10.1) add(vec, vec): vaddss xmm2, xmm0, xmm1 # res[0] = a[0] + b[0] vmovss dword ptr [rsp - 8], xmm2 # mem[1] = res[0] vmovshdup xmm0, xmm0 # a[0] = a[1] vmovshdup xmm1, xmm1 # b[0] = b[1] vaddss xmm0, xmm0, xmm1 # a[0] = a[0] + b[0] vmovss dword ptr [rsp - 4], xmm0 # mem[0] = a[0] vmovsd xmm0, qword ptr [rsp - 8] # xmm0 = mem[0],mem[1],zero,zero ret From the looks of it, a and b are stored in xmm0 and xmm1 respectively. However, only the lowest single-precision float in these registers is being used for addition. This leads to two separate additions. Why isn't vaddps used instead, which would allow for adding both values simultaneously? The only thing I could come up with is that clang tries to preserve the higher two floats in the xmm registers. This is why I also tried increasing SIZE to 4, but now I get: add(vec, vec): vaddps xmm0, xmm0, xmm2 vaddps xmm1, xmm1, xmm3 vmovlhps xmm0, xmm0, xmm1 vmovaps xmmword ptr [rsp - 24], xmm0 vmovsd xmm0, qword ptr [rsp - 24] vmovsd xmm1, qword ptr [rsp - 16] ret So for whatever reason, clang now doesn't even use the highest two floats and spreads the vectors between xmm0 to xmm3. An xmm register is 128 bits large, so it should be able to fit all four floats. Then this code would be much simpler and only a single addition would be necessary. (See Compiler Explorer)
Why my SSE code is slower than native C++ code?
First of all, I am new to SSE. I decided to accelerate my code, but it seems, that it works slower, then my native code. This is an example, that calculates the sum of squares. On my Intel i7-6700HQ, it takes 0.43s for native code and 0.52 for SSE. So, where is a bottleneck? inline float squared_sum(const float x, const float y) { return x * x + y * y; } #define USE_SIMD void calculations() { high_resolution_clock::time_point t1, t2; int result_v = 0; t1 = high_resolution_clock::now(); alignas(16) float data_x[4]; alignas(16) float data_y[4]; alignas(16) float result[4]; __m128 v_x, v_y, v_res; for (int y = 0; y < 5120; y++) { data_y[0] = y; data_y[1] = y + 1; data_y[2] = y + 2; data_y[3] = y + 3; for (int x = 0; x < 5120; x++) { data_x[0] = x; data_x[1] = x + 1; data_x[2] = x + 2; data_x[3] = x + 3; #ifdef USE_SIMD v_x = _mm_load_ps(data_x); v_y = _mm_load_ps(data_y); v_x = _mm_mul_ps(v_x, v_x); v_y = _mm_mul_ps(v_y, v_y); v_res = _mm_add_ps(v_x, v_y); _mm_store_ps(result, v_res); #else result[0] = squared_sum(data_x[0], data_y[0]); result[1] = squared_sum(data_x[1], data_y[1]); result[2] = squared_sum(data_x[2], data_y[2]); result[3] = squared_sum(data_x[3], data_y[3]); #endif result_v += (int)(result[0] + result[1] + result[2] + result[3]); } } t2 = high_resolution_clock::now(); duration<double> time_span1 = duration_cast<duration<double>>(t2 - t1); std::cout << "Exec time:\t" << time_span1.count() << " s\n"; } UPDATE: fixed code according to comments. I am using Visual Studio 2017. Compiled for x64. Optimization: Maximum Optimization (Favor Speed) (/O2); Inline Function Expansion: Any Suitable (/Ob2); Favor Size or Speed: Favor fast code (/Ot); Omit Frame Pointers: Yes (/Oy) Conclusion Compilers generate already optimized code, so nowadays it is hard to accelerate it even more. The one thing you can do, to accelerate code more, is parallelization. Thanks for the answers. They mainly the same, so I accept Søren V. Poulsen answer because it was the first.
Modern compilers are incredible machines and will already use SIMD instructions where possible (given the correct compilation flags). One general strategy for determining what the compiler is doing is to look at the disassembly of your code. If you don't want to do it on your own machine, you can use an online service like Godbolt: https://gcc.godbolt.org/z/T6GooQ. One tip is to avoid atomic for storing intermediate results like you are doing here. Atomic values are used to ensure synchronization between threads, and this may come at a very high computational cost, relatively speaking.
Looking through the assembly for the compiler's code based (without your SIMD stuff), calculations(): pxor xmm2, xmm2 xor edx, edx movdqa xmm0, XMMWORD PTR .LC0[rip] movdqa xmm11, XMMWORD PTR .LC1[rip] movdqa xmm9, XMMWORD PTR .LC2[rip] movdqa xmm8, XMMWORD PTR .LC3[rip] movdqa xmm7, XMMWORD PTR .LC4[rip] .L4: movdqa xmm5, xmm0 movdqa xmm4, xmm0 cvtdq2ps xmm6, xmm0 movdqa xmm10, xmm0 paddd xmm0, xmm7 cvtdq2ps xmm3, xmm0 paddd xmm5, xmm9 paddd xmm4, xmm8 cvtdq2ps xmm5, xmm5 cvtdq2ps xmm4, xmm4 mulps xmm6, xmm6 mov eax, 5120 paddd xmm10, xmm11 mulps xmm5, xmm5 mulps xmm4, xmm4 mulps xmm3, xmm3 pxor xmm12, xmm12 .L2: movdqa xmm1, xmm12 cvtdq2ps xmm14, xmm12 mulps xmm14, xmm14 movdqa xmm13, xmm12 paddd xmm12, xmm7 cvtdq2ps xmm12, xmm12 paddd xmm1, xmm9 cvtdq2ps xmm0, xmm1 mulps xmm0, xmm0 paddd xmm13, xmm8 cvtdq2ps xmm13, xmm13 sub eax, 1 mulps xmm13, xmm13 addps xmm14, xmm6 mulps xmm12, xmm12 addps xmm0, xmm5 addps xmm13, xmm4 addps xmm12, xmm3 addps xmm0, xmm14 addps xmm0, xmm13 addps xmm0, xmm12 movdqa xmm12, xmm1 cvttps2dq xmm0, xmm0 paddd xmm2, xmm0 jne .L2 add edx, 1 movdqa xmm0, xmm10 cmp edx, 1280 jne .L4 movdqa xmm0, xmm2 psrldq xmm0, 8 paddd xmm2, xmm0 movdqa xmm0, xmm2 psrldq xmm0, 4 paddd xmm2, xmm0 movd eax, xmm2 ret main: xor eax, eax ret _GLOBAL__sub_I_calculations(): sub rsp, 8 mov edi, OFFSET FLAT:_ZStL8__ioinit call std::ios_base::Init::Init() [complete object constructor] mov edx, OFFSET FLAT:__dso_handle mov esi, OFFSET FLAT:_ZStL8__ioinit mov edi, OFFSET FLAT:_ZNSt8ios_base4InitD1Ev add rsp, 8 jmp __cxa_atexit .LC0: .long 0 .long 1 .long 2 .long 3 .LC1: .long 4 .long 4 .long 4 .long 4 .LC2: .long 1 .long 1 .long 1 .long 1 .LC3: .long 2 .long 2 .long 2 .long 2 .LC4: .long 3 .long 3 .long 3 .long 3 Your SIMD code generates: calculations(): pxor xmm5, xmm5 xor eax, eax mov r8d, 1 movabs rdi, -4294967296 cvtsi2ss xmm5, eax .L4: mov r9d, r8d mov esi, 1 movd edx, xmm5 pxor xmm5, xmm5 pxor xmm4, xmm4 mov ecx, edx mov rdx, QWORD PTR [rsp-24] cvtsi2ss xmm5, 
r8d add r8d, 1 cvtsi2ss xmm4, r8d and rdx, rdi or rdx, rcx pxor xmm2, xmm2 mov edx, edx movd ecx, xmm5 sal rcx, 32 or rdx, rcx mov QWORD PTR [rsp-24], rdx movd edx, xmm4 pxor xmm4, xmm4 mov ecx, edx mov rdx, QWORD PTR [rsp-16] and rdx, rdi or rdx, rcx lea ecx, [r9+2] mov edx, edx cvtsi2ss xmm4, ecx movd ecx, xmm4 sal rcx, 32 or rdx, rcx mov QWORD PTR [rsp-16], rdx movaps xmm4, XMMWORD PTR [rsp-24] mulps xmm4, xmm4 .L2: movd edx, xmm2 mov r10d, esi pxor xmm2, xmm2 pxor xmm7, xmm7 mov ecx, edx mov rdx, QWORD PTR [rsp-40] cvtsi2ss xmm2, esi add esi, 1 and rdx, rdi cvtsi2ss xmm7, esi or rdx, rcx mov ecx, edx movd r11d, xmm2 movd edx, xmm7 sal r11, 32 or rcx, r11 pxor xmm7, xmm7 mov QWORD PTR [rsp-40], rcx mov ecx, edx mov rdx, QWORD PTR [rsp-32] and rdx, rdi or rdx, rcx lea ecx, [r10+2] mov edx, edx cvtsi2ss xmm7, ecx movd ecx, xmm7 sal rcx, 32 or rdx, rcx mov QWORD PTR [rsp-32], rdx movaps xmm0, XMMWORD PTR [rsp-40] mulps xmm0, xmm0 addps xmm0, xmm4 movaps xmm3, xmm0 movaps xmm1, xmm0 shufps xmm3, xmm0, 85 addss xmm1, xmm3 movaps xmm3, xmm0 unpckhps xmm3, xmm0 shufps xmm0, xmm0, 255 addss xmm1, xmm3 addss xmm0, xmm1 cvttss2si edx, xmm0 add eax, edx cmp r10d, 5120 jne .L2 cmp r9d, 5120 jne .L4 rep ret main: xor eax, eax ret _GLOBAL__sub_I_calculations(): sub rsp, 8 mov edi, OFFSET FLAT:_ZStL8__ioinit call std::ios_base::Init::Init() [complete object constructor] mov edx, OFFSET FLAT:__dso_handle mov esi, OFFSET FLAT:_ZStL8__ioinit mov edi, OFFSET FLAT:_ZNSt8ios_base4InitD1Ev add rsp, 8 jmp __cxa_atexit Note that the compiler's version is using cvtdq2ps, paddd, cvtdq2ps, mulps, addps, and cvttps2dq. All of these are SIMD instructions. By combining them effectively, the compiler generates fast code. In constrast, your code generates a lot of add, and, cvtsi2ss, lea, mov, movd, or, pxor, sal, which are not SIMD instructions. 
I suspect the compiler does a better job of dealing with data type conversion and data rearrangement than you do, and that this allows it to arrange its math more effectively.
GCC std::sin vectorization bug?
The following code (with -O3 -ffast-math):
#include <cmath>
float a[4];
void sin1() {
    for(unsigned i = 0; i < 4; i++) a[i] = sinf(a[i]);
}
compiles to a vectorized version of sinf (_ZGVbN4v_sinf):
sin1():
sub rsp, 8
movaps xmm0, XMMWORD PTR a[rip]
call _ZGVbN4v_sinf
movaps XMMWORD PTR a[rip], xmm0
add rsp, 8
ret
But when I use the C++ version of sinf (std::sin), no vectorization occurs:
void sin2() {
    for(unsigned i = 0; i < 4; i++) a[i] = std::sin(a[i]);
}
sin2():
sub rsp, 8
movss xmm0, DWORD PTR a[rip]
call sinf
movss DWORD PTR a[rip], xmm0
movss xmm0, DWORD PTR a[rip+4]
call sinf
movss DWORD PTR a[rip+4], xmm0
movss xmm0, DWORD PTR a[rip+8]
call sinf
movss DWORD PTR a[rip+8], xmm0
movss xmm0, DWORD PTR a[rip+12]
call sinf
movss DWORD PTR a[rip+12], xmm0
add rsp, 8
ret
Compiler Explorer Code
SSE2 - 16-byte aligned dynamic allocation of memory
EDIT: This is a followup to SSE2 Compiler Error This is the real bug I experienced before and have reproduced below by changing the _mm_malloc statement as Michael Burr suggested: Unhandled exception at 0x00415116 in SO.exe: 0xC0000005: Access violation reading location 0xffffffff. At line label: movdqa xmm0, xmmword ptr [t1+eax] I'm trying to dynamically allocate t1 and t2 and according to this tutorial, I've used _mm_malloc: #include <emmintrin.h> int main(int argc, char* argv[]) { int *t1, *t2; const int n = 100000; t1 = (int*)_mm_malloc(n*sizeof(int),16); t2 = (int*)_mm_malloc(n*sizeof(int),16); __m128i mul1, mul2; for (int j = 0; j < n; j++) { t1[j] = j; t2[j] = (j+1); } // set temporary variables to random values _asm { mov eax, 0 label: movdqa xmm0, xmmword ptr [t1+eax] movdqa xmm1, xmmword ptr [t2+eax] pmuludq xmm0, xmm1 movdqa mul1, xmm0 movdqa xmm0, xmmword ptr [t1+eax] pshufd xmm0, xmm0, 05fh pshufd xmm1, xmm1, 05fh pmuludq xmm0, xmm1 movdqa mul2, xmm0 add eax, 16 cmp eax, 100000 jnge label } _mm_free(t1); _mm_free(t2); return 0; }
I think the 2nd problem is that you're reading at an offset from the pointer variable (not an offset from what the pointer points to). Change:
label:
movdqa xmm0, xmmword ptr [t1+eax]
To something like:
mov ebx, [t1]
label:
movdqa xmm0, xmmword ptr [ebx+eax]
And similarly for your accesses through the t2 pointer. This might be even better (though I haven't had an opportunity to test it, so it might not even work):
_asm {
mov eax, [t1]
mov ebx, [t2]
lea ecx, [eax + (100000*4)]
label:
movdqa xmm0, xmmword ptr [eax]
movdqa xmm1, xmmword ptr [ebx]
pmuludq xmm0, xmm1
movdqa mul1, xmm0
movdqa xmm0, xmmword ptr [eax]
pshufd xmm0, xmm0, 05fh
pshufd xmm1, xmm1, 05fh
pmuludq xmm0, xmm1
movdqa mul2, xmm0
add eax, 16
add ebx, 16
cmp eax, ecx
jnge label
}
You're not allocating enough memory: t1 = (int*)_mm_malloc(n * sizeof( int),16); t2 = (int*)_mm_malloc(n * sizeof( int),16);
Perhaps: t1 = (int*)_mm_malloc(n*sizeof(int),16);