This question already has answers here:
Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
I have a strange issue with some SSE2 and AVX code I have been working on. I am building my application using GCC with runtime CPU feature detection. The object files are built with separate flags for each CPU feature, for example:
g++ -c -o ConvertSamples_SSE.o ConvertSamples_SSE.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse
g++ -c -o ConvertSamples_SSE2.o ConvertSamples_SSE2.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse2
g++ -c -o ConvertSamples_AVX.o ConvertSamples_AVX.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -mavx
When I first launch the program, the SSE2 routines perform as expected, with a nice speed boost over the non-SSE routines (around 100% faster). After I run any AVX routine, the exact same SSE2 routine now runs much slower.
Could someone please explain what the cause of this may be?
Before the AVX routine runs, all the tests are around 80-130% faster than FPU math, as can be seen here; after the AVX routine runs, the SSE routines are much slower.
If I skip the AVX test routines I never see this performance loss.
Here is my SSE2 routine:
void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
  static float ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
  static __m128 mul = _mm_set_ps1(ratio);
  unsigned int i;
  for (i = 0; i < samples - 3; i += 4, in += 4, out += 4)
  {
    __m128i con = _mm_cvtps_epi32(_mm_mul_ps(_mm_load_ps(in), mul));
    out[0] = ((int16_t*)&con)[0];
    out[1] = ((int16_t*)&con)[2];
    out[2] = ((int16_t*)&con)[4];
    out[3] = ((int16_t*)&con)[6];
  }
  for (; i < samples; ++i, ++in, ++out)
    *out = (int16_t)lrint(*in * ratio);
}
And the AVX version of the same.
void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
  static float ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
  static __m256 mul = _mm256_set1_ps(ratio);
  unsigned int i;
  for (i = 0; i < samples - 7; i += 8, in += 8, out += 8)
  {
    __m256i con = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_load_ps(in), mul));
    out[0] = ((int16_t*)&con)[0];
    out[1] = ((int16_t*)&con)[2];
    out[2] = ((int16_t*)&con)[4];
    out[3] = ((int16_t*)&con)[6];
    out[4] = ((int16_t*)&con)[8];
    out[5] = ((int16_t*)&con)[10];
    out[6] = ((int16_t*)&con)[12];
    out[7] = ((int16_t*)&con)[14];
  }
  for (; i < samples; ++i, ++in, ++out)
    *out = (int16_t)lrint(*in * ratio);
}
I have also run this through valgrind which detects no errors.
Mixing AVX code and legacy SSE code incurs a performance penalty. The most reasonable solution is to execute the VZEROALL instruction after an AVX segment of code, especially just before executing SSE code.
As per Intel's diagram, the penalty when transitioning into or out of state C (legacy SSE with the upper halves of the AVX registers saved) is on the order of 100 clock cycles. The other transitions cost only 1 cycle.
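As an illustration, a minimal wrapper along those lines (the names here are invented for the example, and it must be compiled with -mavx so the intrinsic is available) clears the upper YMM state after the AVX routine and before any legacy SSE code runs:

#include <immintrin.h>
#include <stdint.h>

// Invented wrapper for illustration: run an AVX routine, then clear the upper
// halves of the YMM registers so that legacy SSE code executed afterwards does
// not pay the state-transition penalty.
void RunAvxThenSse(const float *in, int16_t *out, unsigned int samples,
                   void (*avxRoutine)(const float *, int16_t *, unsigned int),
                   void (*sseRoutine)(const float *, int16_t *, unsigned int))
{
  avxRoutine(in, out, samples);
  _mm256_zeroupper(); // emits VZEROUPPER; _mm256_zeroall() (VZEROALL) also clears the upper state
  sseRoutine(in, out, samples);
}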
References:
Intel: Avoiding AVX-SSE Transition Penalties
Intel® AVX State Transitions: Migrating SSE Code to AVX
Related
I've hit a piece of extremely bizarre performance behavior in a piece of C/C++ code, as suggested in the title, which I have no idea how to explain.
Here's an as-close-as-I've-found-to-minimal working example [EDIT: see below for a shorter one]:
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
  cdbl ff[pp], gg[pp];
  for(int ii = 0; ii < pp; ii++) {
    ff[ii] = gg[ii] = 1.0;
  }
  for(int it = 0; it < 1000; it++) {
    cdbl dual[pp];
    for(int ii = 0; ii < pp; ii++) {
      dual[ii] = 0.0;
    }
    for(int h1 = 0; h1 < pp; h1++) {
      for(int h2 = 0; h2 < pp; h2++) {
        cdbl avg_right = 0.0;
        for(int xx = 0; xx < pp; xx++) {
          int c00 = xx, c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
              c11 = (xx + h1 + h2) % pp;
          avg_right += ff[c00] * conj(ff[c01]) * conj(ff[c10]) * gg[c11];
        }
        avg_right /= static_cast<cdbl>(pp);
        for(int xx = 0; xx < pp; xx++) {
          int c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
              c11 = (xx + h1 + h2) % pp;
          dual[xx] += conj(ff[c01]) * conj(ff[c10]) * ff[c11] * conj(avg_right);
        }
      }
    }
    for(int ii = 0; ii < pp; ii++) {
      dual[ii] = conj(dual[ii]) / static_cast<double>(pp*pp);
    }
    for(int ii = 0; ii < pp; ii++) {
      gg[ii] = dual[ii];
    }
#ifdef I_WANT_THIS_TO_RUN_REALLY_FAST
    printf("%.15lf\n", gg[0].real());
#else // I_WANT_THIS_TO_RUN_REALLY_SLOWLY
#endif
  }
  printf("%.15lf\n", gg[0].real());
  return 0;
}
Here are the results of running this on my system:
me@mine $ g++ -o test.elf test.cc -Wall -Wextra -O2
me@mine $ time ./test.elf > /dev/null
real 0m7.329s
user 0m7.328s
sys 0m0.000s
me@mine $ g++ -o test.elf test.cc -Wall -Wextra -O2 -DI_WANT_THIS_TO_RUN_REALLY_FAST
me@mine $ time ./test.elf > /dev/null
real 0m0.492s
user 0m0.490s
sys 0m0.001s
me@mine $ g++ --version
g++ (Gentoo 4.9.4 p1.0, pie-0.6.4) 4.9.4 [snip]
It's not terribly important what this code computes: it's just a tonne of complex arithmetic on arrays of length 29. It's been "simplified" from a much larger tonne of complex arithmetic that I care about.
So, the behavior seems to be, as claimed in the title: if I put this print statement back in, the code gets a lot faster.
I've played around a bit: e.g., printing a constant string doesn't give the speedup, but printing the clock time does. There's a pretty clear threshold: the code is either fast or slow.
I considered the possibility that some bizarre compiler optimization either does or doesn't kick in, maybe depending on whether the code does or doesn't have side effects. But, if so it's pretty subtle: when I looked at the disassembled binaries, they're seemingly identical except that one has an extra print statement in and they use different interchangeable registers. I may (must?) have missed something important.
I'm at a total loss to explain what on earth could be causing this. Worse, it does actually affect my life because I'm running related code a lot, and going round inserting extra print statements does not feel like a good solution.
Any plausible theories would be very welcome. Responses along the lines of "your computer's broken" are acceptable if you can explain how that might explain anything.
UPDATE: with apologies for the increasing length of the question, I've shrunk the example to
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
  cdbl ff[pp];
  cdbl blah = 0.0;
  for(int ii = 0; ii < pp; ii++) {
    ff[ii] = 1.0;
  }
  for(int it = 0; it < 1000; it++) {
    cdbl xx = 0.0;
    for(int kk = 0; kk < 100; kk++) {
      for(int ii = 0; ii < pp; ii++) {
        for(int jj = 0; jj < pp; jj++) {
          xx += conj(ff[ii]) * conj(ff[jj]) * ff[ii];
        }
      }
    }
    blah += xx;
    printf("%.15lf\n", blah.real());
  }
  printf("%.15lf\n", blah.real());
  return 0;
}
I could make it even smaller but already the machine code is manageable. If I change five bytes of the binary corresponding to the callq instruction for that first printf, to 0x90, the execution goes from fast to slow.
The compiled code is very heavy with function calls to __muldc3(). I feel it must be to do with how the Broadwell architecture does or doesn't handle these jumps well: both versions run the same number of instructions so it's a difference in instructions / cycle (about 0.16 vs about 2.8).
Also, compiling -static makes things fast again.
Further shameless update: I'm conscious I'm the only one who can play with this, so here are some more observations:
It seems like calling any library function — including some foolish ones I made up that do nothing — for the first time, puts the execution into slow state. A subsequent call to printf, fprintf or sprintf somehow clears the state and execution is fast again. So, importantly the first time __muldc3() is called we go into slow state, and the next {,f,s}printf resets everything.
Once a library function has been called once, and the state has been reset, that function becomes free and you can use it as much as you like without changing the state.
So, e.g.:
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;

int main() {
  complex<double> foo = 0.0;
  foo += foo * foo; // 1
  char str[10];
  sprintf(str, "%c\n", 'c');
  //fflush(stdout); // 2
  for(int it = 0; it < 100000000; it++) {
    foo += foo * foo;
  }
  return (foo.real() > 10.0);
}
is fast, but commenting out line 1 or uncommenting line 2 makes it slow again.
It must be relevant that the first time a library call is run the "trampoline" in the PLT is initialized to point to the shared library. So, maybe somehow this dynamic loading code leaves the processor frontend in a bad place until it's "rescued".
For the record, I finally figured this out.
It turns out this is to do with AVX–SSE transition penalties. To quote this exposition from Intel:
When using Intel® AVX instructions, it is important to know that mixing 256-bit Intel® AVX instructions with legacy (non VEX-encoded) Intel® SSE instructions may result in penalties that could impact performance. 256-bit Intel® AVX instructions operate on the 256-bit YMM registers which are 256-bit extensions of the existing 128-bit XMM registers. 128-bit Intel® AVX instructions operate on the lower 128 bits of the YMM registers and zero the upper 128 bits. However, legacy Intel® SSE instructions operate on the XMM registers and have no knowledge of the upper 128 bits of the YMM registers. Because of this, the hardware saves the contents of the upper 128 bits of the YMM registers when transitioning from 256-bit Intel® AVX to legacy Intel® SSE, and then restores these values when transitioning back from Intel® SSE to Intel® AVX (256-bit or 128-bit). The save and restore operations both cause a penalty that amounts to several tens of clock cycles for each operation.
The compiled version of my main loops above includes legacy SSE instructions (movapd and friends, I think), whereas the implementation of __muldc3 in libgcc_s uses a lot of fancy AVX instructions (vmovapd, vmulsd etc.).
This is the ultimate cause of the slowdown.
Indeed, Intel performance diagnostics show that this AVX/SSE switching happens almost exactly once each way per call of __muldc3 (in the last version of the code posted above):
$ perf stat -e cpu/event=0xc1,umask=0x08/ -e cpu/event=0xc1,umask=0x10/ ./slow.elf
Performance counter stats for './slow.elf':
100,000,064 cpu/event=0xc1,umask=0x08/
100,000,118 cpu/event=0xc1,umask=0x10/
(event codes taken from table 19.5 of another Intel manual).
That leaves the question of why the slowdown turns on when you call a library function for the first time, and turns off again when you call printf, sprintf or whatever. The clue is in the first document again:
When it is not possible to remove the transitions, it is often possible to avoid the penalty by explicitly zeroing the upper 128-bits of the YMM registers, in which case the hardware does not save these values.
I think the full story is therefore as follows. When you call a library function for the first time, the trampoline code in ld-linux-x86-64.so that sets up the PLT leaves the upper bits of the YMM registers in a non-zero state. When you call sprintf, among other things, it zeroes out the upper bits of the YMM registers (whether by chance or design, I'm not sure).
Replacing the sprintf call with asm("vzeroupper") — which instructs the processor explicitly to zero these high bits — has the same effect.
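In other words, the last test case above can be written like this (the same experiment, just with the sprintf swapped for the inline asm), and it stays fast:

#include <complex>
using namespace std;

int main() {
  complex<double> foo = 0.0;
  foo += foo * foo;  // first call into __muldc3; the lazy PLT resolution leaves the upper YMM bits dirty
  asm("vzeroupper"); // explicitly zero the upper halves, standing in for the sprintf call
  for(int it = 0; it < 100000000; it++) {
    foo += foo * foo;
  }
  return (foo.real() > 10.0);
}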
The effect can be eliminated by adding -mavx or -march=native to the compile flags, which is how the rest of the system was built. Why this doesn't happen by default is just a mystery of my system I guess.
I'm not quite sure what we learn here, but there it is.
Before I elaborate on the specifics, I have the following function. Let _e and _w be arrays of equal size, and let _stepSize be of float type.
void GradientDescent::backUpWeights(FLOAT tdError) {
  AI::FLOAT multiplier = _stepSize * tdError;
  for (UINT i = 0; i < n; i++){
    _w[i] += _e[i]*multiplier;
  }
  // Assumed that the tilecode ensure that _w.size() or _e.size() is even.
}
This function is well and fine, but if the CPU supports the relevant intrinsics (SSE4a, in this example), the function below allows me to shave seconds off for the same input, even with the -O3 gcc flag already included for both and an extra -msse4a added for this one.
void GradientDescent::backUpWeights(FLOAT tdError) {
  AI::FLOAT multiplier = _stepSize * tdError;
  __m128d multSSE = _mm_set_pd(multiplier, multiplier);
  __m128d* eSSE = (__m128d*)_e;
  __m128d* wSSE = (__m128d*)_w;
  size_t n = getSize()>>1;
  for (UINT i = 0; i < n; i++){
    wSSE[i] = _mm_add_pd(wSSE[i], _mm_mul_pd(multSSE, eSSE[i]));
  }
  // Assumed that the tilecode ensure that _w.size() or _e.size() is even.
}
Problem:
My problem now is that I want something like this:
void GradientDescent::backUpWeights(FLOAT tdError) {
  AI::FLOAT multiplier = _stepSize * tdError;
#ifdef _mssa4a_defined_
  __m128d multSSE = _mm_set_pd(multiplier, multiplier);
  __m128d* eSSE = (__m128d*)_e;
  __m128d* wSSE = (__m128d*)_w;
  size_t n = getSize()>>1;
  for (UINT i = 0; i < n; i++){
    wSSE[i] = _mm_add_pd(wSSE[i], _mm_mul_pd(multSSE, eSSE[i]));
  }
#else // No intrinsic
  for (UINT i = 0; i < n; i++){
    _w[i] += _e[i]*multiplier;
  }
#endif
  // Assumed that the tilecode ensure that _w.size() or _e.size() is even.
}
Thus, if I pass -msse4a to gcc when compiling this code, it will compile the code in the first branch. And of course, my plan is to implement this for all the instruction sets, not just SSE4A as above.
GCC, ICC (on Linux), and Clang have the following compile options with corresponding defines
options define
-mfma __FMA__
-mavx2 __AVX2__
-mavx __AVX__
-msse4.2 __SSE4_2__
-msse4.1 __SSE4_1__
-mssse3 __SSSE3__
-msse3 __SSE3__
-msse2 __SSE2__
-m64 __SSE2__
-msse __SSE__
Options and defines in GCC and Clang but not in ICC:
-msse4a __SSE4A__
-mfma4 __FMA4__
-mxop __XOP__
AVX512 options which are defined in recent versions of GCC, Clang, and ICC
-mavx512f __AVX512F__ //foundation instructions
-mavx512pf __AVX512PF__ //pre-fetch instructions
-mavx512er __AVX512ER__ //exponential and reciprocal instructions
-mavx512cd __AVX512CD__ //conflict detection instructions
AVX512 options which will likely be in GCC, Clang, and ICC soon (if not already):
-mavx512bw __AVX512BW__ //byte and word instructions
-mavx512dq __AVX512DQ__ //doubleword and quadword Instructions
-mavx512vl __AVX512VL__ //vector length extensions
Note that many of these switches enable several more: e.g., -mfma enables and defines AVX2, AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE.
I'm not 100% sure what the ICC compiler option for AVX512 is. It could be -xMIC-AVX512 instead of -mavx512f.
MSVC only appears to define __AVX__ and __AVX2__.
In your case, your code appears to be using only SSE2, so if you compile in 64-bit mode (the default in a 64-bit user space, or explicitly with -m64), then __SSE2__ is defined. But since you used -msse4a, __SSE4A__ will be defined as well.
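For example, a self-contained sketch of that kernel keyed off these predefined macros (a free function with invented names and signature, not your actual GradientDescent interface, and with a scalar tail added so odd sizes work) could look like this:

#include <emmintrin.h> // SSE2 intrinsics; the vector code below only needs SSE2
#include <cstddef>

void backUpWeights(double *w, const double *e, std::size_t size, double multiplier)
{
#ifdef __SSE2__                  // with -msse4a, __SSE4A__ would be defined as well
  __m128d multSSE = _mm_set1_pd(multiplier);
  std::size_t i = 0;
  for (; i + 1 < size; i += 2) { // two doubles per 128-bit register
    __m128d wv = _mm_loadu_pd(w + i);
    __m128d ev = _mm_loadu_pd(e + i);
    _mm_storeu_pd(w + i, _mm_add_pd(wv, _mm_mul_pd(multSSE, ev)));
  }
  for (; i < size; ++i)          // scalar tail for odd sizes
    w[i] += e[i] * multiplier;
#else                            // no SSE2: plain scalar loop
  for (std::size_t i = 0; i < size; ++i)
    w[i] += e[i] * multiplier;
#endif
}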
Note that enabling an instruction set at compile time is not the same as determining whether it is available at run time. If you want your code to work on multiple instruction sets, then I suggest a CPU dispatcher.
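A very simple dispatcher along those lines, using GCC's __builtin_cpu_supports (available in GCC 4.8+ and recent Clang; the kernel bodies here are toy stand-ins, not your real code), could be sketched like this:

#include <cstddef>

// Toy stand-ins for the real implementations; in practice each would live in
// its own translation unit compiled with the matching -m flags.
static void backUpWeights_scalar(double *w, const double *e, std::size_t n, double m)
{
  for (std::size_t i = 0; i < n; ++i) w[i] += e[i] * m;
}

static void backUpWeights_sse2(double *w, const double *e, std::size_t n, double m)
{
  // The SSE2 body from the sketch above would go here.
  for (std::size_t i = 0; i < n; ++i) w[i] += e[i] * m;
}

typedef void (*Kernel)(double *, const double *, std::size_t, double);

static Kernel pickKernel()
{
  __builtin_cpu_init(); // needed here because this runs during static initialization, before main()
  return __builtin_cpu_supports("sse2") ? backUpWeights_sse2 : backUpWeights_scalar;
}

static const Kernel backUpWeights_best = pickKernel();

void backUpWeights(double *w, const double *e, std::size_t n, double m)
{
  backUpWeights_best(w, e, n, m); // one indirect call per invocation; the choice is made once
}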
I later learned that there is no way to do this. A simple and elegant workaround, though, is the following. For the x86-64 platform with the SSE4a intrinsics, use the make target rule below (assuming that intrinsic sources are stored in src/intrinsic/ and builds (.o files) go in build/):
CXX=g++ -O3
CXXFLAGS=-std=c++14 -Wunused
CPPFLAGS=
CPP_INTRINSIC_FLAG:=-ffast-math
INTRINSIC_OBJECT := $(patsubst src/intrinsic/%.cpp,build/%.o,$(wildcard src/intrinsic/*.cpp))

x86-64-sse4: $(eval CPP_INTRINSIC_FLAG+=-msse4a -DSSE4A) $(INTRINSIC_OBJECT)

# Intrinsic objects
build/%.o: src/intrinsic/%.cpp
	$(CXX) $(CXXFLAGS) -c $(CPPFLAGS) $(CPP_INTRINSIC_FLAG) $(INCLUDE_PATHS) $^ -o $@
This question already has answers here:
What's the difference between -O3 and (-O2 + flags that man gcc says -O3 adds to -O2)?
Here's the function I'm looking at:
template <uint8_t Size>
inline uint64_t parseUnsigned( const char (&buf)[Size] )
{
  uint64_t val = 0;
  for (uint8_t i = 0; i < Size; ++i)
    if (buf[i] != ' ')
      val = (val * 10) + (buf[i] - '0');
  return val;
}
I have a test harness which passes in all possible numbers with Size=5, left-padded with spaces. I'm using GCC 4.7.2. When I run the program under callgrind after compiling with -O3 I get:
I refs: 7,154,919
When I compile with -O2 I get:
I refs: 9,001,570
OK, so -O3 improves the performance (and I confirmed that some of the improvement comes from the above function, not just the test harness). But I don't want to completely switch from -O2 to -O3, I want to find out which specific option(s) to add. So I consult man g++ to get the list of options it says are added by -O3:
-fgcse-after-reload [enabled]
-finline-functions [enabled]
-fipa-cp-clone [enabled]
-fpredictive-commoning [enabled]
-ftree-loop-distribute-patterns [enabled]
-ftree-vectorize [enabled]
-funswitch-loops [enabled]
So I compile again with -O2 followed by all of the above options. But this gives me even worse performance than plain -O2:
I refs: 9,546,017
I discovered that adding -ftree-vectorize to -O2 is responsible for this performance degradation. But I can't figure out how to match the -O3 performance with any combination of options. How can I do this?
In case you want to try it yourself, here's the test harness (put the above parseUnsigned() definition under the #includes):
#include <cmath>
#include <stdint.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

template <uint8_t Size>
inline void increment( char (&buf)[Size] )
{
  for (uint8_t i = Size - 1; i < 255; --i)
  {
    if (buf[i] == ' ')
    {
      buf[i] = '1';
      break;
    }
    ++buf[i];
    if (buf[i] > '9')
      buf[i] -= 10;
    else
      break;
  }
}

int main()
{
  char str[5];
  memset(str, ' ', sizeof(str));
  unsigned max = std::pow(10, sizeof(str));
  for (unsigned ii = 0; ii < max; ++ii)
  {
    uint64_t result = parseUnsigned(str);
    if (result != ii)
    {
      // str is not null-terminated, so use a precision ("%.*s") to bound the output
      printf("parseUnsigned(%.*s) from %u: %lu\n", (int)sizeof(str), str, ii, result);
      abort();
    }
    increment(str);
  }
}
A very similar question was already answered here: https://stackoverflow.com/a/6454659/483486
I've copied the relevant text underneath.
UPDATE: There are questions about it in GCC WIKI:
"Is -O1 (-O2,-O3 or -Os) equivalent to individual -foptimization options?"
No. First, individual optimization options (-f*) do not enable optimization, an option -Os or -Ox with x > 0 is required. Second, the -Ox flags enable many optimizations that are not controlled by any individual -f* option. There are no plans to add individual options for controlling all these optimizations.
"What specific flags are enabled by -O1 (-O2, -O3 or -Os)?"
Varies by platform and GCC version. You can get GCC to tell you what flags it enables by doing this:
touch empty.c
gcc -O1 -S -fverbose-asm empty.c
cat empty.s
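To see exactly what changes between two levels, you can also ask GCC to dump its optimizer state at each level with -Q --help=optimizers and diff the results (output file names here are arbitrary):

g++ -O2 -Q --help=optimizers > O2-opts.txt
g++ -O3 -Q --help=optimizers > O3-opts.txt
diff O2-opts.txt O3-opts.txt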
In general, I assume that the STL implementation of any algorithm is at least as efficient as anything I can come up with (with the additional benefit of being error free). However, I came to wonder whether the STL's focus on iterators might be harmful in some situations.
Let's assume I want to calculate the inner product of two fixed-size arrays. My naive implementation would look like this:
std::array<double, 100000> v1;
std::array<double, 100000> v2;
//fill with arbitrary numbers
double sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
  sum += v1[i] * v2[i];
}
As the number of iterations and the memory layout are known during compile time and all operations can directly be mapped to native processor instructions, the compiler should easily be able to generate the "optimal" machine code from this (loop unrolling, vectorization / FMA instructions ...).
The STL version
double sum = std::inner_product(cbegin(v1), cend(v1), cbegin(v2), 0.0);
on the other hand adds some additional indirections, and even if everything is inlined, the compiler still has to deduce that it is working on a contiguous memory region and where this region lies. While this is certainly possible in principle, I wonder whether the typical C++ compiler will actually do it.
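For reference, the algorithm the standard describes for std::inner_product is essentially just the naive loop; a simplified sketch (not the actual libstdc++ source) looks like this, so after inlining the question is whether the optimizer treats the iterator form as well as the indexed form:

template <class InputIt1, class InputIt2, class T>
T inner_product_sketch(InputIt1 first1, InputIt1 last1, InputIt2 first2, T init)
{
  for (; first1 != last1; ++first1, ++first2)
    init = init + *first1 * *first2; // strict left-to-right accumulation order
  return init;
}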
So my question is: do you think there can be a performance benefit in implementing standard algorithms that operate on fixed-size arrays on my own, or will the STL version always outperform a manual implementation?
As suggested, I did some measurements. For the code below, compiled with VS2013 for x64 in release mode and executed on a Win8.1 machine with an i7-2640M, the algorithm version is consistently slower by about 20% (15.6-15.7s vs 12.9-13.1s). The relative difference also stays roughly constant over two orders of magnitude for N and REPS.
So I guess the answer is: Using standard library algorithms CAN hurt performance.
It would still be interesting to know whether this is a general problem or whether it is specific to my platform, compiler and benchmark. You are welcome to post your own results or comment on the benchmark.
#include <iostream>
#include <numeric>
#include <array>
#include <chrono>
#include <cstdlib>

#define USE_STD_ALGORITHM

using namespace std;
using namespace std::chrono;

static const size_t N = 10000000; //size of the arrays
static const size_t REPS = 1000;  //number of repetitions

array<double, N> a1;
array<double, N> a2;

int main(){
  srand(10);
  for (size_t i = 0; i < N; ++i) {
    a1[i] = static_cast<double>(rand())*0.01;
    a2[i] = static_cast<double>(rand())*0.01;
  }
  double res = 0.0;
  auto start = high_resolution_clock::now();
  for (size_t z = 0; z < REPS; z++) {
#ifdef USE_STD_ALGORITHM
    res = std::inner_product(a1.begin(), a1.end(), a2.begin(), res);
#else
    for (size_t t = 0; t < N; ++t) {
      res += a1[t] * a2[t];
    }
#endif
  }
  auto end = high_resolution_clock::now();
  std::cout << res << " "; // <-- necessary, so that loop isn't optimized away
  std::cout << duration_cast<milliseconds>(end - start).count() << " ms" << std::endl;
}
/*
* Update: Results (ubuntu 14.04 , haswell)
* STL: algorithm
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3551 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3567 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9378 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8505 ms
*
* loop:
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3543 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3551 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9613 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8642 ms
*/
EDIT:
I did a quick check with g++-4.9.2 and clang++-3.5 with -O3 and -std=c++11 on a Fedora 21 VirtualBox VM on the same machine, and apparently those compilers don't have the same problem (the time is almost the same for both versions). However, gcc's version is about twice as fast as clang's (7.5s vs 14s).
I am considering the following C++ program:
#include <iostream>
#include <limits>

int main(int argc, char **argv) {
  unsigned int sum = 0;
  for (unsigned int i = 1; i < std::numeric_limits<unsigned int>::max(); ++i) {
    double f = static_cast<double>(i);
    unsigned int t = static_cast<unsigned int>(f);
    sum += (t % 2);
  }
  std::cout << sum << std::endl;
  return 0;
}
I use the gcc / g++ compiler, g++ -v gives gcc version 4.7.2 20130108 [gcc-4_7-branch revision 195012] (SUSE Linux).
I am running openSUSE 12.3 (x86_64) and have an Intel(R) Core(TM) i7-3520M CPU.
Running
g++ -O3 test.C -o test_64_opt
g++ -O0 test.C -o test_64_no_opt
g++ -m32 -O3 test.C -o test_32_opt
g++ -m32 -O0 test.C -o test_32_no_opt
time ./test_64_opt
time ./test_64_no_opt
time ./test_32_opt
time ./test_32_no_opt
yields
2147483647
real 0m4.920s
user 0m4.904s
sys 0m0.001s
2147483647
real 0m16.918s
user 0m16.851s
sys 0m0.019s
2147483647
real 0m37.422s
user 0m37.308s
sys 0m0.000s
2147483647
real 0m57.973s
user 0m57.790s
sys 0m0.011s
Using float instead of double, the optimized 64 bit variant even finishes in 2.4 seconds, while the other running times stay roughly the same. However, with float I get different outputs depending on optimization, probably due to the higher processor-internal precision.
I know 64 bit may have faster math, but we have a factor of 7 (and nearly 15 with floats) here.
I would appreciate an explanation of these running time discrepancies.
The problem isn't 32bit vs 64bit, it's the lack of SSE and SSE2. When compiling for 64bit, gcc assumes it can use SSE and SSE2 since all available x86_64 processors have it.
Compile your 32bit version with -msse -msse2 and the runtime difference nearly disappears.
My benchmark results for completeness:
-O3 -m32 -msse -msse2 4.678s
-O3 (64bit) 4.524s
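For reference, the 32-bit compile command from the question with these flags added would be something like this (output name chosen arbitrarily):

g++ -m32 -O3 -msse -msse2 test.C -o test_32_opt_sse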