I know there is a similar question about this: constexpr performing worse at runtime.
But my case is a lot simpler than that one, and the answers were not enough for me.
I'm just learning about constexpr in C++11 and I wrote some code to compare its efficiency, and for some reason using constexpr makes my code run more than 4 times slower!
By the way, I'm using exactly the same example as on this site: https://www.embarcados.com.br/introducao-ao-cpp11/ (it's in Portuguese, but you can see the example code about constexpr). I've already tried other expressions and the results are similar.
#include <iostream>
#include <ctime>
using namespace std;

constexpr double divideC(double num){
return (2.0 * num + 10.0) / 0.8;
}
#define SIZE 1000
int main(int argc, char const *argv[])
{
// Get number of iterations from user
unsigned long long count;
cin >> count;
double values[SIZE];
// Testing normal expression
clock_t time1 = clock();
for (int i = 0; i < count; i++)
{
values[i%SIZE] = (2.0 * 3.0 + 10.0) / 0.8;
}
time1 = clock() - time1;
cout << "Time1: " << float(time1)/float(CLOCKS_PER_SEC) << " seconds" << endl;
// Testing constexpr
clock_t time2 = clock();
for (int i = 0; i < count; i++)
{
values[i%SIZE] = divideC( 3.0 );
}
time2 = clock() - time2;
cout << "Time2: " << float(time2)/float(CLOCKS_PER_SEC) << " seconds" << endl;
return 0;
}
Input given:
9999999999
Output:
> Time1: 5.768 seconds
> Time2: 27.259 seconds
Can someone tell me the reason for this? Since constexpr calculations should run at compile time, this code is supposed to run faster, not slower.
I'm using msbuild version 16.6.0.22303 to compile the Visual Studio project generated by the following CMake code:
cmake_minimum_required(VERSION 3.1.3)
project(C++11Tests)
add_executable(Cpp11Tests main.cpp)
set_property(TARGET Cpp11Tests PROPERTY CXX_STANDARD_REQUIRED ON)
set_property(TARGET Cpp11Tests PROPERTY CXX_STANDARD 11)
Without optimizations, the compiler keeps the call to divideC, so that version is slower.
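Note that in C++11 a constexpr function is only guaranteed to be evaluated at compile time when it is used in a constant expression. A minimal sketch (the variable name precomputed is my own) that forces compile-time evaluation even in an unoptimized build:
// Initializing a constexpr variable requires a constant expression,
// so divideC(3.0) is evaluated by the compiler here, not at runtime.
constexpr double precomputed = divideC(3.0);
for (unsigned long long i = 0; i < count; i++)
{
    values[i % SIZE] = precomputed;
}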
With optimizations on, any decent compiler can see that - for the given code - everything related to values can be optimized away without any side effects. So the code as shown can never give a meaningful measurement of the difference between values[i%SIZE] = (2.0 * 3.0 + 10.0) / 0.8; and values[i%SIZE] = divideC( 3.0 );
With -O1, any decent compiler will create something like this:
for (int i = 0; i < count; i++)
{
values[i%SIZE] = (2.0 * 3.0 + 10.0) / 0.8;
}
results in:
mov rdx, QWORD PTR [rsp+8]
test rdx, rdx
je .L2
mov eax, 0
.L3:
add eax, 1
cmp edx, eax
jne .L3
.L2:
and
for (int i = 0; i < count; i++)
{
values[i%SIZE] = divideC( 3.0 );
}
results in:
mov rdx, QWORD PTR [rsp+8]
test rdx, rdx
je .L4
mov eax, 0
.L5:
add eax, 1
cmp edx, eax
jne .L5
.L4:
So both result in identical machine code, containing only the counting of the loop and nothing else. As soon as you turn on optimizations, you will only measure the loop, and nothing related to constexpr.
With -O2 even the loop is optimized away, and you would only measure:
clock_t time1 = clock();
time1 = clock() - time1;
cout << "Time1: " << float(time1)/float(CLOCKS_PER_SEC) << " seconds" << endl;
Related
I am testing the performance of the C++ standard library algorithms and encountered a weird thing.
Here is my code to compare the performance of std::count vs plain for loop:
#include <algorithm>
#include <vector>
#include <iostream>
#include <chrono>
using namespace std::chrono;
int my_count(const std::vector<int>& v, int val) {
int num = 0;
for (int i: v) {
if (i == val)
num++;
}
return num;
}
int main()
{
int total_count = 0;
std::vector<int> v;
v.resize(100000000);
// Fill vector
for (int i = 0; i < v.size(); i++) {
v[i] = i % 10000;
}
int val = 1;
{
auto start = high_resolution_clock::now();
total_count += std::count(v.begin(), v.end(), val);
auto stop = high_resolution_clock::now();
std::cout << "std::count time: " << duration_cast<microseconds>(stop - start).count() << std::endl;
}
{
auto start = high_resolution_clock::now();
total_count += my_count(v, val);
auto stop = high_resolution_clock::now();
std::cout << "my_count time: " << duration_cast<microseconds>(stop - start).count() << std::endl;
}
// We need this so the compiler does not prune the code above
std::cout << "Total items: " << total_count << std::endl;
}
With MinGW I get this:
std::count time: 65827
my_count time: 64861
And with MSVC I get a pretty weird result:
std::count time: 65532
my_count time: 28584
The MinGW result seems reasonable since, as far as I know, the STL count function is roughly equal to a plain for loop, but the MSVC result seems weird: why is the plain for loop more than 2x faster than std::count?
These results are reproducible on my machine - it's not something that occurs once; it occurs every time I run the code. I even tried changing the function order and running multiple for loops to avoid caching or branch prediction bias, but I still get the same result.
Is there any reason for this?
This is because MSVC vectorizes your manually written code, but is unable to do the same for std::count.
This is how vectorized code looks:
movdqa xmm5, XMMWORD PTR __xmm@00000001000000010000000100000001
and rcx, -8
xorps xmm3, xmm3
xorps xmm2, xmm2
npad 3
$LL4@my_count:
movdqu xmm1, XMMWORD PTR [rax]
add r8, 8
movdqa xmm0, xmm5
paddd xmm0, xmm3
pcmpeqd xmm1, xmm4
pand xmm0, xmm1
pandn xmm1, xmm3
movdqa xmm3, xmm0
movdqa xmm0, xmm5
por xmm3, xmm1
paddd xmm0, xmm2
movdqu xmm1, XMMWORD PTR [rax+16]
add rax, 32 ; 00000020H
pcmpeqd xmm1, xmm4
pand xmm0, xmm1
pandn xmm1, xmm2
movdqa xmm2, xmm0
por xmm2, xmm1
cmp r8, rcx
jne SHORT $LL4@my_count
You can see how it loads four 1s into the xmm5 register at the beginning. This value is used to maintain 4 separate counters that track the results for the 1st, 2nd, 3rd and 4th DWORDs. Once counting is done, those 4 values are added together to form the result of the function.
The issue with the MSVC vectorizer seems to lie in the fact that the counter, data type and argument type need to be "compatible":
The return type should match the data type in size
The argument type should be equal or smaller in size than the data type
If any of those constraints is not met, the code is not vectorized. This makes sense: if your data type is 32 bits wide, you have to operate on 32-bit counters to make them work together, so if your return type is 64 bits wide instead, some additional manipulation is required (which GCC is able to do, but this still slows down std::count compared to the manually written loop).
This is a case where the manually written loop should be preferred, as a subtle difference in semantics (the int return type) makes it easier to vectorize (even for GCC, which generates shorter code).
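For illustration, here is a sketch (my own, not from the original posts) of how the accumulator width affects this; with a 32-bit counter the loop maps directly onto packed 32-bit SIMD counters, while a 64-bit counter (like std::count's difference_type, typically ptrdiff_t) breaks the size match:
// 32-bit accumulator matches the 32-bit element type: MSVC can vectorize.
int count32(const std::vector<int>& v, int val) {
    int num = 0;
    for (int i : v)
        num += (i == val);
    return num;
}

// 64-bit accumulator is wider than the element type: the size-match
// constraint above is violated, and MSVC falls back to scalar code.
long long count64(const std::vector<int>& v, int val) {
    long long num = 0;
    for (int i : v)
        num += (i == val);
    return num;
}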
Well, that seems to be an iterator issue.
I've made an extended test:
#include <algorithm>
#include <vector>
#include <iostream>
#include <chrono>
using namespace std::chrono;
int std_count(const std::vector<int>& v, int val) {
return std::count(v.begin(), v.end(), val);
}
int my_count_for(const std::vector<int>& v, int val) {
int num = 0;
for (int i = 0; i < v.size(); i++) {
if (v[i] == val) {
num++;
}
}
return num;
}
int my_count_for_in(const std::vector<int>& v, int val) {
int num = 0;
for (int i : v) {
if (i == val) {
num++;
}
}
return num;
}
int my_count_iter(const std::vector<int>& v, int val) {
int num = 0;
for (auto i = v.begin(); i < v.end(); i++) {
if (*i == val) {
num++;
}
}
return num;
}
int main()
{
std::vector<int> v;
v.resize(1000000);
// Fill vector
for (int i = 0; i < v.size(); i++) {
v[i] = i % 10000;
}
int val = 1;
int num_iters = 1000;
int total_count = 0;
for (int a = 0; a < 3; a++) {
{
auto start = high_resolution_clock::now();
for (int i = 0; i < num_iters; i++) {
total_count += std_count(v, val);
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << "std::count time: " << duration.count() << std::endl;
}
{
auto start = high_resolution_clock::now();
for (int i = 0; i < num_iters; i++) {
total_count += my_count_for(v, val);
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << "my_count_for time: " << duration.count() << std::endl;
}
{
auto start = high_resolution_clock::now();
for (int i = 0; i < num_iters; i++) {
total_count += my_count_for_in(v, val);
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << "my_count_for_in time: " << duration.count() << std::endl;
}
{
auto start = high_resolution_clock::now();
for (int i = 0; i < num_iters; i++) {
total_count += my_count_iter(v, val);
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << "my_count_iter time: " << duration.count() << std::endl;
}
std::cout << std::endl;
}
std::cout << total_count << std::endl;
std::cin >> total_count;
}
And here's what I get:
std::count time: 679683
my_count_for time: 235269
my_count_for_in time: 228185
my_count_iter time: 650714
std::count time: 656192
my_count_for time: 231248
my_count_for_in time: 231050
my_count_iter time: 652598
std::count time: 660295
my_count_for time: 238812
my_count_for_in time: 225893
my_count_iter time: 648812
It still seems quite weird that the STL function is not the fastest way to solve the task. If someone knows the detailed answer, please share it with me.
I was working on code to find prime numbers, and during my work I became curious how exactly the % operation in C++ works at a low level.
First, I wrote some code to compare the elapsed time of the '%' operator and the '>>' operator:
#include <iostream>
#include <chrono>
#include <unistd.h>
#include <stdlib.h>
#include <time.h>
using namespace std;
bool remainder1(int x);
bool remainder2(int y);
void timeCompare(bool(*f)(int), bool(*g)(int));
// I want to check which one is faster: x % 128 vs. (x >> 7) & 1
int main()
{
srand(time(NULL));
for (int i = 0; i < 10; i++) {
timeCompare(remainder1, remainder2);
}
return 0;
}
// % 128 operation
bool remainder1(int x) {
if (x % 128) return true;
else return false;
}
// shift version: tests bit 7 only (note: not equivalent to x % 128 != 0)
bool remainder2(int x) {
if ((x >> 7) & 1) return true;
else return false;
}
void timeCompare(bool(*f)(int), bool(*g)(int)) {
srand(time(NULL));
auto start = chrono::steady_clock::now();
for (int i = 0; i < 10000000; i++) {
int x = rand();
f(x);
}
auto end = chrono::steady_clock::now();
cout << "Elapsed time in nanoseconds : "
<< chrono::duration_cast<chrono::nanoseconds>(end - start).count()
<< " ns";
auto start2 = chrono::steady_clock::now();
for (int i = 0; i < 10000000; i++) {
int x = rand();
g(x);
}
auto end2 = chrono::steady_clock::now();
cout << " Vs. "
<< chrono::duration_cast<chrono::nanoseconds>(end2 - start2).count()
<< " ns" << endl;
}
And the output is this:
Elapsed time in nanoseconds : 166158000 ns Vs. 218736000 ns
Elapsed time in nanoseconds : 151776000 ns Vs. 214823000 ns
Elapsed time in nanoseconds : 162193000 ns Vs. 217248000 ns
Elapsed time in nanoseconds : 151338000 ns Vs. 211793000 ns
Elapsed time in nanoseconds : 150346000 ns Vs. 211295000 ns
Elapsed time in nanoseconds : 155799000 ns Vs. 215265000 ns
Elapsed time in nanoseconds : 148801000 ns Vs. 212839000 ns
Elapsed time in nanoseconds : 149813000 ns Vs. 226175000 ns
Elapsed time in nanoseconds : 152324000 ns Vs. 213338000 ns
Elapsed time in nanoseconds : 149353000 ns Vs. 216809000 ns
So it seems like the shift operation is slower at finding the remainder. I guessed the reason is that the shift version needs one more comparison than the '%' version... Am I correct?
I really want to know how '%' works at a lower level!
I really want to know how '%' works at a lower level!
If you're asking how it is implemented, then the answer is that chances are the CPU you're using has a single instruction that computes the modulo (%). For example, take this C++ code:
int main()
{
int x = 100;
int mod = x % 128;
int shift = x >> 7;
return 0;
}
The generated x86 assembly code (Clang 6.0.0) for it is:
main:
push rbp
mov rbp, rsp
xor eax, eax
mov ecx, 128
mov dword ptr [rbp - 4], 0
mov dword ptr [rbp - 8], 100
mov edx, dword ptr [rbp - 8] # Start of modulo boilerplate
mov dword ptr [rbp - 20], eax
mov eax, edx
cdq
idiv ecx # Modulo CPU instruction
mov dword ptr [rbp - 12], edx # End of modulo sequence
mov ecx, dword ptr [rbp - 8] # Start of shift boilerplate
sar ecx, 7 # Shift CPU instruction
mov dword ptr [rbp - 16], ecx # End of shift sequence
mov ecx, dword ptr [rbp - 20]
mov eax, ecx
pop rbp
ret
The idiv instruction is called Signed Divide; it places the quotient in EAX/RAX and the remainder in EDX/RDX (for x86/x64 respectively).
I guessed the reason is that the shift version needs one more comparison
than the '%' version... Am I correct?
No comparisons are being done in this case, since it's a single instruction.
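As a side note, the idiv only appears here because x is signed, so the compiler has to handle negative values. With unsigned operands, a power-of-two modulo reduces to a single bitwise AND; a small sketch (my own example, not from the question):
// For unsigned x, x % 128 is identical to x & 127, so optimizing
// compilers emit an AND (or TEST) instruction instead of a division.
bool remainder_unsigned(unsigned x) {
    return (x % 128u) != 0;
}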
I have the following AVX and Native codes:
__forceinline double dotProduct_2(const double* u, const double* v)
{
_mm256_zeroupper();
__m256d xy = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));
__m256d temp = _mm256_hadd_pd(xy, xy);
__m128d dotproduct = _mm_add_pd(_mm256_extractf128_pd(temp, 0), _mm256_extractf128_pd(temp, 1));
return dotproduct.m128d_f64[0];
}
__forceinline double dotProduct_1(const D3& a, const D3& b)
{
return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
And respective test scripts:
std::cout << res_1 << " " << res_2 << " " << res_3 << '\n';
{
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < (1 << 30); ++i)
{
zx_1 += dotProduct_1(aVx[i % 10000], aVx[(i + 1) % 10000]);
}
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
std::cout << "NAIVE : " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
}
{
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < (1 << 30); ++i)
{
zx_2 += dotProduct_2(&aVx[i % 10000][0], &aVx[(i + 1) % 10000][0]);
}
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
std::cout << "AVX : " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
}
std::cout << math::min2(zx_1, zx_2) << " " << zx_1 << " " << zx_2;
Well, all of the data is aligned by 32 (D3 with __declspec... and the aVx array with _mm_malloc()...).
And, as I can see, the naive variant is equal to or faster than the AVX variant. I can't understand why; is this normal behaviour? I thought AVX was "super fast"... If not, how can I optimize it? I compile it with MSVC 2015 (x64) with /arch:AVX. My hardware is an Intel i7-4750HQ (Haswell).
Simple profiling with basic loops isn't a great idea - it usually just means you are memory bandwidth limited, so the tests end up coming out at about the same speed (memory is typically slower than the CPU, and that's basically all you are testing here).
As others have said, your code example isn't great, because you are constantly crossing the lanes (which I assume is just to find the fastest dot product, and not specifically because the sum of all the dot products is the desired result?). To be honest, if you really need a fast dot product (for AoS data as presented here), I think I would prefer to replace the VHADDPD with a VADDPD + VPERMILPD (trading an additional instruction for twice the throughput and a lower latency):
double dotProduct_3(const double* u, const double* v)
{
__m256d dp = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));
__m128d a = _mm256_extractf128_pd(dp, 0);
__m128d b = _mm256_extractf128_pd(dp, 1);
__m128d c = _mm_add_pd(a, b);
__m128d yy = _mm_unpackhi_pd(c, c);
__m128d dotproduct = _mm_add_pd(c, yy);
return _mm_cvtsd_f64(dotproduct);
}
asm:
dotProduct_3(double const*, double const*):
vmovapd ymm0,YMMWORD PTR [rsi]
vmulpd ymm0,ymm0,YMMWORD PTR [rdi]
vextractf128 xmm1,ymm0,0x1
vaddpd xmm0,xmm1,xmm0
vpermilpd xmm1,xmm0,0x3
vaddpd xmm0,xmm1,xmm0
vzeroupper
ret
Generally speaking, if you are using horizontal adds, you're doing it wrong! Whilst a 256-bit register may seem ideal for a Vector4d, it's not actually a particularly great representation (especially if you consider that AVX512 is now available!). A very similar question to this came up recently: For C++ Vector3 utility class implementations, is array faster than struct and class?
If you want performance, then structure-of-arrays is the best way to go.
struct HybridVec4SOA
{
__m256d x;
__m256d y;
__m256d z;
__m256d w;
};
__m256d dot(const HybridVec4SOA& a, const HybridVec4SOA& b)
{
return _mm256_fmadd_pd(a.w, b.w,
_mm256_fmadd_pd(a.z, b.z,
_mm256_fmadd_pd(a.y, b.y,
_mm256_mul_pd(a.x, b.x))));
}
asm:
dot(HybridVec4SOA const&, HybridVec4SOA const&):
vmovapd ymm1,YMMWORD PTR [rdi+0x20]
vmovapd ymm2,YMMWORD PTR [rdi+0x40]
vmovapd ymm3,YMMWORD PTR [rdi+0x60]
vmovapd ymm0,YMMWORD PTR [rsi]
vmulpd ymm0,ymm0,YMMWORD PTR [rdi]
vfmadd231pd ymm0,ymm1,YMMWORD PTR [rsi+0x20]
vfmadd231pd ymm0,ymm2,YMMWORD PTR [rsi+0x40]
vfmadd231pd ymm0,ymm3,YMMWORD PTR [rsi+0x60]
ret
If you compare the latencies (and more importantly throughput) of load/mul/fmadd compared to hadd and extract, and then consider that the SOA version is computing 4 dot products at a time (instead of 1), you'll start to understand why it's the way to go...
You add too much overhead with the vzeroupper and hadd instructions. A good way to write it is to do all the multiplies in a loop and aggregate the result just once at the end. Imagine you unroll the original loop 4 times and use 4 accumulators:
for(i=0; i < (1<<30); i+=4) {
s0 += a[i+0] * b[i+0];
s1 += a[i+1] * b[i+1];
s2 += a[i+2] * b[i+2];
s3 += a[i+3] * b[i+3];
}
return s0+s1+s2+s3;
And now just replace the unrolled loop with SIMD mul and add (or even an FMA intrinsic, if available):
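A sketch of that idea with AVX intrinsics (my own, assuming n is a multiple of 4 and both arrays are 32-byte aligned):
#include <immintrin.h>
#include <cstddef>

double dot(const double* a, const double* b, std::size_t n)
{
    __m256d acc = _mm256_setzero_pd(); // one vector accumulator
    for (std::size_t i = 0; i < n; i += 4)
    {
        __m256d prod = _mm256_mul_pd(_mm256_load_pd(a + i), _mm256_load_pd(b + i));
        acc = _mm256_add_pd(acc, prod); // or a single _mm256_fmadd_pd with FMA
    }
    // Aggregate the four lanes just once, after the loop:
    __m128d lo  = _mm256_castpd256_pd128(acc);
    __m128d hi  = _mm256_extractf128_pd(acc, 1);
    __m128d sum = _mm_add_pd(lo, hi);
    sum = _mm_add_sd(sum, _mm_unpackhi_pd(sum, sum));
    return _mm_cvtsd_f64(sum);
}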
The following code shows a big performance difference of the two versions of min_3 on my machine (Windows 7, VC++ 2015, release).
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
template <typename X>
const X& max_3_left( const X& a, const X& b, const X& c )
{
return std::max( std::max( a, b ), c );
}
template <typename X>
const X& max_3_right( const X& a, const X& b, const X& c )
{
return std::max( a, std::max( b, c ) );
}
int main()
{
std::random_device r;
std::default_random_engine e1( r() );
std::uniform_int_distribution<int> uniform_dist( 1, 6 );
std::vector<int> numbers;
for ( int i = 0; i < 1000; ++i )
numbers.push_back( uniform_dist( e1 ) );
auto start1 = std::chrono::high_resolution_clock::now();
int sum1 = 0;
for ( int i = 0; i < 1000; ++i )
for ( int j = 0; j < 1000; ++j )
for ( int k = 0; k < 1000; ++k )
sum1 += max_3_left( numbers[i], numbers[j], numbers[k] );
auto finish1 = std::chrono::high_resolution_clock::now();
std::cout << "left " << sum1 << " " <<
std::chrono::duration_cast<std::chrono::microseconds>(finish1 - start1).count()
<< " us" << std::endl;
auto start2 = std::chrono::high_resolution_clock::now();
int sum2 = 0;
for ( int i = 0; i < 1000; ++i )
for ( int j = 0; j < 1000; ++j )
for ( int k = 0; k < 1000; ++k )
sum2 += max_3_right( numbers[i], numbers[j], numbers[k] );
auto finish2 = std::chrono::high_resolution_clock::now();
std::cout << "right " << sum2 << " " <<
std::chrono::duration_cast<std::chrono::microseconds>(finish2 - start2).count()
<< " us" << std::endl;
}
Output:
left 739861041 796056 us
right 739861041 1442495 us
On ideone the difference is smaller but still not negligible.
Why does this difference exist?
gcc and clang (and presumably MSVC) fail to realize that max is an associative operation like addition. v[i] max (v[j] max v[k]) (max_3_right) is the same as (v[i] max v[j]) max v[k] (max_3_left). I'm writing max as an infix operator to point out the similarity with + and other associative operations.
Since v[k] is the only input that's changing inside the inner loop, it's obviously a big win to hoist the (v[i] max v[j]) out of the inner loop.
To see what's actually going on, we have to look at the asm, as always. To make it easy to find the asm for the loops, I split them out into separate functions. (Making one template function that takes the max3 function as a parameter would be more C++-like.) This has the added advantage of taking the code we want optimized out of main, which gcc marks as "cold", disabling some optimizations.
#include <algorithm>
#include <vector>
#define SIZE 1000
int sum_maxright(const std::vector<int> &v) {
int sum = 0;
for ( int i = 0; i < SIZE; ++i )
for ( int j = 0; j < SIZE; ++j )
for ( int k = 0; k < SIZE; ++k )
sum += max_3_right( v[i], v[j], v[k] );
return sum;
}
The inner-most loop of that compiles to (gcc 5.3 targeting the x86-64 Linux ABI with -std=gnu++11 -fverbose-asm -O3 -fno-tree-vectorize -fno-unroll-loops -march=haswell, with some hand annotations):
## from outer loops: rdx points to v[k] (starting at v.begin()). r8 is v.end(). (r10 is v.begin)
## edi is v[i], esi is v[j]
## eax is sum
## inner loop. See the full asm on godbolt.org, link below
.L10:
cmp DWORD PTR [rdx], esi # MEM[base: _65, offset: 0], D.92793
mov ecx, esi # D.92793, D.92793
cmovge ecx, DWORD PTR [rdx] # ecx = max(v[j], v[k])
cmp ecx, edi # D.92793, D.92793
cmovl ecx, edi # ecx = max(ecx, v[i])
add rdx, 4 # pointer increment
add eax, ecx # sum, D.92793
cmp rdx, r8 # ivtmp.253, D.92795
jne .L10 #,
Clang 3.8 makes similar code for the max_3_right loop, with two cmov instructions inside the inner loop. (Use the compiler dropdown in the Godbolt Compiler Explorer to see.)
gcc and clang both optimize the way you'd expect for the max_3_left loop, hoisting everything but a single cmov out of the inner loop.
## register allocation is slightly different here:
## esi = max(v[i], v[j]). rdi = v.end()
.L2:
cmp DWORD PTR [rdx], ecx # MEM[base: _65, offset: 0], D.92761
mov esi, ecx # D.92761, D.92761
cmovge esi, DWORD PTR [rdx] # MEM[base: _65, offset: 0],, D.92761
add rdx, 4 # ivtmp.226,
add eax, esi # sum, D.92761
cmp rdx, rdi # ivtmp.226, D.92762
jne .L2 #,
So there's much less going on in this loop. (On Intel pre-Broadwell, cmov is a 2-uop instruction, so one fewer cmov is a big deal.)
BTW, cache prefetching effects can't possibly explain this:
The inner loop accesses numbers[k] sequentially. The repeated accesses to numbers[i] and numbers[j] are hoisted out of the inner loop by any decent compiler, and wouldn't confuse modern prefetchers even if they weren't.
Intel's optimization manual says that up to 32 streams of prefetch patterns can be detected and maintained (with a limit of one forward and one backward per 4k page), for Sandybridge-family microarchitectures (section 2.3.5.4 Data Prefetching).
The OP completely failed to say anything about what hardware he ran this microbenchmark on, but since real compilers hoist the other loads, leaving only the most trivial access pattern, it hardly matters.
One vector of 1000 ints (4 B each) takes only 4 kiB. This means the whole array easily fits in L1D cache, so there's no need for any kind of prefetching in the first place; it all stays hot in L1 cache pretty much the entire time.
As molbdnilo pointed out, the problem could be with the order of loops. When calculating sum1, the code can be rewritten as:
for ( int i = 0; i < 1000; ++i )
for ( int j = 0; j < 1000; ++j ) {
auto temp = std::max(numbers[i], numbers[j]);
for ( int k = 0; k < 1000; ++k )
sum1 += std::max(temp, numbers[k]);
}
The same cannot be applied to the calculation of sum2. However, when I reordered the loops of the second calculation as:
for ( int j = 0; j < 1000; ++j )
for ( int k = 0; k < 1000; ++k )
for ( int i = 0; i < 1000; ++i )
sum2 += ...;
I got the same times for both calculations. (Moreover, both calculations are much faster with -O3 than with -O2; the former seemingly turns on vectorization, according to the disassembled output.)
This is related to data cache prefetching at the hardware level.
If you use the left-associative version, the elements of the array are used/loaded in the sequence expected by the CPU cache, with less latency.
The right-associative version breaks the prediction and generates more cache misses, hence the slower performance.
I'm computing the mean and variance of an array using SSE intrinsics. Basically, this is the summation of the values and their squares, which can be illustrated by the following program:
int main( int argc, const char* argv[] )
{
union u
{
__m128 m;
float f[4];
} x;
// Allocate memory and initialize data: [1,2,3,...stSize+1]
const size_t stSize = 1024;
float *pData = (float*) _aligned_malloc(stSize*sizeof(float), 32);
for ( size_t s = 0; s < stSize; ++s ) {
pData[s] = s+1;
}
// Sum and sum of squares
{
// Accumlation using SSE intrinsics
__m128 mEX = _mm_set_ps1(0.f);
__m128 mEXX = _mm_set_ps1(0.f);
for ( size_t s = 0; s < stSize; s+=4 )
{
__m128 m = _mm_load_ps(pData + s);
mEX = _mm_add_ps(mEX, m);
mEXX = _mm_add_ps(mEXX, _mm_mul_ps(m,m));
}
// Final reduction
x.m = mEX;
double dEX = x.f[0] + x.f[1] + x.f[2] + x.f[3];
x.m = mEXX;
double dEXX = x.f[0] + x.f[1] + x.f[2] + x.f[3];
std::cout << "Sum expected: " << (stSize * stSize + stSize) / 2 << std::endl;
std::cout << "EX: " << dEX << std::endl;
std::cout << "Sum of squares expected: " << 1.0/6.0 * stSize * (stSize + 1) * (2 * stSize + 1) << std::endl;
std::cout << "EXX: " << dEXX << std::endl;
}
// Clean up
_aligned_free(pData);
}
Now when I compile and run the program in Debug mode I get the following (and correct) output:
Sum expected: 524800
EX: 524800
Sum of squares expected: 3.58438e+008
EXX: 3.58438e+008
However, compiling and running the program in Release mode produces the following (and wrong) results:
Sum expected: 524800
EX: 524800
Sum of squares expected: 3.58438e+008
EXX: 3.49272e+012
Changing the order of accumulation, i.e. updating EXX before EX, makes the results OK:
Sum expected: 524800
EX: 524800
Sum of squares expected: 3.58438e+008
EXX: 3.58438e+008
This looks like a 'counterproductive' compiler optimization. Why is the order of execution relevant? Is this a known bug?
EDIT:
I just looked at the assembler output. Here is what I get (only the relevant parts).
For the release build with /arch:AVX compiler flag we have:
; 69 : // Second test: sum and sum of squares
; 70 : {
; 71 : __m128 mEX = _mm_set_ps1(0.f);
vmovaps xmm1, XMMWORD PTR __xmm@0
mov ecx, 256 ; 00000100H
; 72 : __m128 mEXX = _mm_set_ps1(0.f);
vmovaps xmm2, xmm1
npad 12
$LL3@main:
; 73 : for ( size_t s = 0; s < stSize; s+=4 )
; 74 : {
; 75 : __m128 m = _mm_load_ps(pData + s);
vmovaps xmm0, xmm1
; 76 : mEX = _mm_add_ps(mEX, m);
vaddps xmm1, xmm1, XMMWORD PTR [rax]
add rax, 16
; 77 : mEXX = _mm_add_ps(mEXX, _mm_mul_ps(m,m));
vmulps xmm0, xmm0, xmm0
vaddps xmm2, xmm0, xmm2
dec rcx
jne SHORT $LL3@main
This is clearly wrong, as it (1) saves the accumulated EX result (xmm1) in xmm0, (2) accumulates EX with the current value (XMMWORD PTR [rax]), and (3) accumulates into EXX (xmm2) the square of the accumulated EX result previously saved in xmm0.
In contrast, the version without /arch:AVX looks fine and behaves as expected:
; 69 : // Second test: sum and sum of squares
; 70 : {
; 71 : __m128 mEX = _mm_set_ps1(0.f);
movaps xmm1, XMMWORD PTR __xmm@0
mov ecx, 256 ; 00000100H
; 72 : __m128 mEXX = _mm_set_ps1(0.f);
movaps xmm2, xmm1
npad 10
$LL3@main:
; 73 : for ( size_t s = 0; s < stSize; s+=4 )
; 74 : {
; 75 : __m128 m = _mm_load_ps(pData + s);
movaps xmm0, XMMWORD PTR [rax]
add rax, 16
dec rcx
; 76 : mEX = _mm_add_ps(mEX, m);
addps xmm1, xmm0
; 77 : mEXX = _mm_add_ps(mEXX, _mm_mul_ps(m,m));
mulps xmm0, xmm0
addps xmm2, xmm0
jne SHORT $LL3@main
This really looks like a bug. Can anyone confirm or refute this issue with a different compiler version? (I currently do not have permission to update the compiler.)
Instead of manually performing the horizontal addition, I'd recommend using the corresponding SSE3 intrinsic, _mm_hadd_ps:
// Final reduction
__m128 sum1 = _mm_hadd_ps(mEX, mEXX);
// == {EX[0]+EX[1], EX[2]+EX[3], EXX[0]+EXX[1], EXX[2]+EXX[3]}
// final sum and conversion to double:
__m128d sum2 = _mm_cvtps_pd(_mm_hadd_ps(sum1, sum1));
// result vector:
double dEX_EXX[2]; // (I don't know MSVC syntax for stack aligned arrays)
// store register to stack: (should be _mm_store_pd, if the array is aligned)
_mm_storeu_pd(dEX_EXX, sum2);
std::cout << "EX: " << dEX_EXX[0] << "\nEXX: " << dEX_EXX[1] << std::endl;