how do i write assembly in C++ for adding (time test)

how do i write assembly in C++ for adding (time test) - c++

I want to time how much slower it is if i were to do a simple operation like 1+1 and plus(int l,int r) which does l+r and throws an exception on overflow heres some example code with _C and _V being carry and overflow. The exception code can be written differently if you like.
How do i write it so i can quickly test for carry/overflow and throw an exception if it is true?
I never did jumps (or even some of the basics) in assembly so i a bit clueless even after googling.
This should work in x32 comps. Currently i am running on a Intel Core Duo which has the x86, x86-64 set
unsigned int plus(unsigned int l, unsigned int r){
unsigned int v = l+r;
if (!_C) return v;
throw 1;
}
int plus(int l, int r){
int v = l+r;
if (!_V) return v;
throw 1;
}

Do you want C/C++ code to perform these operations, or do you want to know how to do it in x86 assembly language?
In C/C++, determining the carry is easy:
int _C = (v < l); // "v < r" works too
The overflow is a bit more complicated. Normally overflow is flagged when the two operands have the same sign yet the result has a different sign. On two's complement architectures such as x86, this can be written as:
int _V = ((l ^ r) >= 0) && ((l ^ v) < 0);
The MSB (sign bit) of l ^ r will be 0 if and only if the signs agree, and similarly l ^ v will have a nonzero sign bit (=value less than zero) if and only if l and v have opposite signs.
If you want to write it in assembly, you just do the add and use a jc or jo respectively to jump to the carry/overflow handler. However, you can't easily throw C++ exceptions from assembly code. The easiest way to do this is probably to write a simple one-line function in C++ that throws the exception and call that from your assembly code. The final asm code will look something like this:
; Assuming eax=l, ebx=r
add eax, ebx
jc .have_carry
; Continue computation here...
.have_carry:
call throw_overflow_exception
with the following C++ helper function defined somewhere:
extern "C" void throw_overflow_exception()
{
throw 1; // or some other exception
}
You need the extern C to disable C++ name mangling for this function. There are other conventions to (e.g. some compilers add an underscore before or after C function names) - this depends on the compiler used and the architecture though.

Shift bit both inputs right by one. Add them both then test the left-most bit.
l>>1;
r>>1;
int result = l+r;
if (result>>((sizeof(int)*8)-1)) { /* handle overflow */ }

Here is the test code i end up using
-edit- I tested on VS 2010 outside of the IDE and in GCC. gcc is a lot faster as VC optimizes the variables poorly. (VS c++ moves the register back into V when assembly is used. It doesnt need to do that.) gcc shows little between the 3 test. The results varied from each run so often that they all looked like it was the same test. I couldnt tell. VS showed checking C flag without asm about 3x slower while with asm 6x slower.
#include <stdio.h>
#include <intrin.h>
int main(int argc, char*argv[])
{
try{
for(int n=0; n<10; n++){
volatile unsigned int v, r;
sscanf("0", "%d", &v);
sscanf("0", "%d", &r);
__int64 time = 0xFFFFFFFF;
__int64 start = __rdtsc();
for(int i=0; i<100000000; i++)
{
v=v+v;
#if 1
__asm jc e
continue;
e:
throw 1;
#endif
}
__int64 end = __rdtsc();
time = end - start;
printf("time: %I64d\n", time/10000000);
}
}
catch(int v)
{
printf("exception");
}
return 0;
}

Related

Can not flip sign

I found a weird bug that happens when i try to flip the sign of the number -9223372036854775808, which does simply nothing.
I get the same number back or at least that's what the debugger shows me.
Is there a way to solve this without branching?
#define I64_MAX 9223372036854775807LL
#define I64_MIN (-I64_MAX-1)
// -9223372036854775808 (can not be a constant in code as it will turn to ull)
using i64 = long long int;
int main()
{
i64 i = I64_MIN;
i = -i;
printf("%lld",i);
return 0;
}
Does the same thing with i32,i16,i8.
EDIT:
Current Fix:
// use template??
c8* szi32(i32 num,c8* in)
{
u32 number = S(u32,num);
if(num < 0)
{
in[0] = '-';
return SerializeU32(number,&in[1]);
}
else
{
return SerializeU32(number,in);
}
}

You can't do it in a completely portable way. Rather than dealing with int64_t, let us consider int8_t. The principle is almost exactly the same, but the numbers are much easier to deal with. I8_MAX will be 127, and I8_MIN will be -128. Negating I8_MIN will give 128, and there is no way to store that in int8_t.
Unless you have strong evidence that this is a bottleneck, then the right answer is:
constexpr int8_t negate(int8_t i) {
return (i==I8_MIN) ? I8_MAX : -i;
}
If you do have such evidence, then you will need to investigate some platform dependent code - perhaps a compiler intrinsic of some sort, perhaps some clever bit-twiddling which avoids a conditional jump.
Edit: Possible branchless bit-twiddling
constexpr int8_t negate(int8_t i) {
const auto ui = static_cast<uint8_t>(i);
// This will calculate the two's complement negative of ui.
const uint8_t minus_ui = ~ui+1;
// This will have the top bit set if, and only if, i was I8_MIN
const uint8_t top_bit = ui & minus_ui;
// Need to get top_bit into the 1 bit. Either use a compiler intrinsic rotate:
const int8_t bottom_bit = static_cast<int8_t>(rotate_left(top_bit)) & 1;
// -or- hope that your implementation does something sensible when you
// shift a negative number (most do).
const int8_t arithmetic_shifted = static_cast<int8_t>(top_bit) >> 7;
const int8_t bottom_bit = arithmetic_shifted & 1;
// Either way, at this point, bottom_bit is 1 if and only if i was
// I8_MIN, otherwise it is zero.
return -(i+bottom_bit);
}
You would need to profile to determine whether that is actually faster. Another option would be to shift top_bit into the carry bit, and use add-with-carry (adding a constant zero), or write it in assembler, and use an appropriate conditionally executed instruction.

Adding a print statement speeds up code by an order of magnitude

I've hit a piece of extremely bizarre performance behavior in a piece of C/C++ code, as suggested in the title, which I have no idea how to explain.
Here's an as-close-as-I've-found-to-minimal working example [EDIT: see below for a shorter one]:
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;
const int pp = 29;
typedef complex<double> cdbl;
int main() {
cdbl ff[pp], gg[pp];
for(int ii = 0; ii < pp; ii++) {
ff[ii] = gg[ii] = 1.0;
}
for(int it = 0; it < 1000; it++) {
cdbl dual[pp];
for(int ii = 0; ii < pp; ii++) {
dual[ii] = 0.0;
}
for(int h1 = 0; h1 < pp; h1 ++) {
for(int h2 = 0; h2 < pp; h2 ++) {
cdbl avg_right = 0.0;
for(int xx = 0; xx < pp; xx ++) {
int c00 = xx, c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
c11 = (xx + h1 + h2) % pp;
avg_right += ff[c00] * conj(ff[c01]) * conj(ff[c10]) * gg[c11];
}
avg_right /= static_cast<cdbl>(pp);
for(int xx = 0; xx < pp; xx ++) {
int c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
c11 = (xx + h1 + h2) % pp;
dual[xx] += conj(ff[c01]) * conj(ff[c10]) * ff[c11] * conj(avg_right);
}
}
}
for(int ii = 0; ii < pp; ii++) {
dual[ii] = conj(dual[ii]) / static_cast<double>(pp*pp);
}
for(int ii = 0; ii < pp; ii++) {
gg[ii] = dual[ii];
}
#ifdef I_WANT_THIS_TO_RUN_REALLY_FAST
printf("%.15lf\n", gg[0].real());
#else // I_WANT_THIS_TO_RUN_REALLY_SLOWLY
#endif
}
printf("%.15lf\n", gg[0].real());
return 0;
}
Here are the results of running this on my system:
me#mine $ g++ -o test.elf test.cc -Wall -Wextra -O2
me#mine $ time ./test.elf > /dev/null
real 0m7.329s
user 0m7.328s
sys 0m0.000s
me#mine $ g++ -o test.elf test.cc -Wall -Wextra -O2 -DI_WANT_THIS_TO_RUN_REALLY_FAST
me#mine $ time ./test.elf > /dev/null
real 0m0.492s
user 0m0.490s
sys 0m0.001s
me#mine $ g++ --version
g++ (Gentoo 4.9.4 p1.0, pie-0.6.4) 4.9.4 [snip]
It's not terribly important what this code computes: it's just a tonne of complex arithmetic on arrays of length 29. It's been "simplified" from a much larger tonne of complex arithmetic that I care about.
So, the behavior seems to be, as claimed in the title: if I put this print statement back in, the code gets a lot faster.
I've played around a bit: e.g, printing a constant string doesn't give the speedup, but printing the clock time does. There's a pretty clear threshold: the code is either fast or slow.
I considered the possibility that some bizarre compiler optimization either does or doesn't kick in, maybe depending on whether the code does or doesn't have side effects. But, if so it's pretty subtle: when I looked at the disassembled binaries, they're seemingly identical except that one has an extra print statement in and they use different interchangeable registers. I may (must?) have missed something important.
I'm at a total loss to explain what an earth could be causing this. Worse, it does actually affect my life because I'm running related code a lot, and going round inserting extra print statements does not feel like a good solution.
Any plausible theories would be very welcome. Responses along the lines of "your computer's broken" are acceptable if you can explain how that might explain anything.
UPDATE: with apologies for the increasing length of the question, I've shrunk the example to
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;
const int pp = 29;
typedef complex<double> cdbl;
int main() {
cdbl ff[pp];
cdbl blah = 0.0;
for(int ii = 0; ii < pp; ii++) {
ff[ii] = 1.0;
}
for(int it = 0; it < 1000; it++) {
cdbl xx = 0.0;
for(int kk = 0; kk < 100; kk++) {
for(int ii = 0; ii < pp; ii++) {
for(int jj = 0; jj < pp; jj++) {
xx += conj(ff[ii]) * conj(ff[jj]) * ff[ii];
}
}
}
blah += xx;
printf("%.15lf\n", blah.real());
}
printf("%.15lf\n", blah.real());
return 0;
}
I could make it even smaller but already the machine code is manageable. If I change five bytes of the binary corresponding to the callq instruction for that first printf, to 0x90, the execution goes from fast to slow.
The compiled code is very heavy with function calls to __muldc3(). I feel it must be to do with how the Broadwell architecture does or doesn't handle these jumps well: both versions run the same number of instructions so it's a difference in instructions / cycle (about 0.16 vs about 2.8).
Also, compiling -static makes things fast again.
Further shameless update: I'm conscious I'm the only one who can play with this, so here are some more observations:
It seems like calling any library function — including some foolish ones I made up that do nothing — for the first time, puts the execution into slow state. A subsequent call to printf, fprintf or sprintf somehow clears the state and execution is fast again. So, importantly the first time __muldc3() is called we go into slow state, and the next {,f,s}printf resets everything.
Once a library function has been called once, and the state has been reset, that function becomes free and you can use it as much as you like without changing the state.
So, e.g.:
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;
int main() {
complex<double> foo = 0.0;
foo += foo * foo; // 1
char str[10];
sprintf(str, "%c\n", 'c');
//fflush(stdout); // 2
for(int it = 0; it < 100000000; it++) {
foo += foo * foo;
}
return (foo.real() > 10.0);
}
is fast, but commenting out line 1 or uncommenting line 2 makes it slow again.
It must be relevant that the first time a library call is run the "trampoline" in the PLT is initialized to point to the shared library. So, maybe somehow this dynamic loading code leaves the processor frontend in a bad place until it's "rescued".

For the record, I finally figured this out.
It turns out this is to do with AVX–SSE transition penalties. To quote this exposition from Intel:
When using Intel® AVX instructions, it is important to know that mixing 256-bit Intel® AVX instructions with legacy (non VEX-encoded) Intel® SSE instructions may result in penalties that could impact performance. 256-bit Intel® AVX instructions operate on the 256-bit YMM registers which are 256-bit extensions of the existing 128-bit XMM registers. 128-bit Intel® AVX instructions operate on the lower 128 bits of the YMM registers and zero the upper 128 bits. However, legacy Intel® SSE instructions operate on the XMM registers and have no knowledge of the upper 128 bits of the YMM registers. Because of this, the hardware saves the contents of the upper 128 bits of the YMM registers when transitioning from 256-bit Intel® AVX to legacy Intel® SSE, and then restores these values when transitioning back from Intel® SSE to Intel® AVX (256-bit or 128-bit). The save and restore operations both cause a penalty that amounts to several tens of clock cycles for each operation.
The compiled version of my main loops above includes legacy SSE instructions (movapd and friends, I think), whereas the implementation of __muldc3 in libgcc_s uses a lot of fancy AVX instructions (vmovapd, vmulsd etc.).
This is the ultimate cause of the slowdown.
Indeed, Intel performance diagnostics show that this AVX/SSE switching happens almost exactly once each way per call of `__muldc3' (in the last version of the code posted above):
$ perf stat -e cpu/event=0xc1,umask=0x08/ -e cpu/event=0xc1,umask=0x10/ ./slow.elf
Performance counter stats for './slow.elf':
100,000,064 cpu/event=0xc1,umask=0x08/
100,000,118 cpu/event=0xc1,umask=0x10/
(event codes taken from table 19.5 of another Intel manual).
That leaves the question of why the slowdown turns on when you call a library function for the first time, and turns off again when you call printf, sprintf or whatever. The clue is in the first document again:
When it is not possible to remove the transitions, it is often possible to avoid the penalty by explicitly zeroing the upper 128-bits of the YMM registers, in which case the hardware does not save these values.
I think the full story is therefore as follows. When you call a library function for the first time, the trampoline code in ld-linux-x86-64.so that sets up the PLT leaves the upper bits of the MMY registers in a non-zero state. When you call sprintf among other things it zeros out the upper bits of the MMY registers (whether by chance or design, I'm not sure).
Replacing the sprintf call with asm("vzeroupper") — which instructs the processor explicitly to zero these high bits — has the same effect.
The effect can be eliminated by adding -mavx or -march=native to the compile flags, which is how the rest of the system was built. Why this doesn't happen by default is just a mystery of my system I guess.
I'm not quite sure what we learn here, but there it is.

bit test and set (BTS) on a tbb atomic variable

I want to do bitTestAndSet on a tbb atomic variable.
atomic.h from tbb does not seem to have any bit operations.
If I treat the tbb atomic variable as a normal pointer and do __sync_or_and_fetch gcc compiler doesn't allow that.
Is there a workaround for this?
Related question:
assembly intrinsic for bit test and set (BTS)

A compare_and_swap loop can be used, like this:
// Atomically perform i|=j. Return previous value of i.
int bitTestAndSet( tbb::atomic<int>& i, int j ) {
int o = i; // Atomic read (o = "old value")
while( (o|j)!=o ) { // Loop exits if another thread sets the bits
int k = o;
o = i.compare_and_swap(k|j,k);
if( o==k ) break; // Successful swap
}
return o;
}
Note that if the while condition succeeds on the first try, there will be only an acquire fence, not a full fence. Whether that matters depends on context.
If there is risk of high contention, then some sort of backoff scheme should be be used in the loop. TBB uses a class atomic_backoff for contention management internally, but it's not currently part of the public TBB API.
There is a second way, if portability is not a concern and you are willing to exploit the undocumented fact that the layout of a tbb::atomic and T are the same on x86 platforms. In that case, just operate on the tbb::atomic using assembly code. The program below demonstrates this technique:
#include <tbb/tbb.h>
#include <cstdio>
inline int SetBit(int array[], int bit) {
int x=1, y=0;
asm("bts %2,%0\ncmovc %3,%1" : "+m" (*array), "+r"(y) : "r" (bit), "r"(x));
return y;
}
tbb::atomic<int> Flags;
volatile int Result;
int main() {
for( int i=0; i<16; ++i ) {
int k = i*i%32;
std::printf("bit at %2d was %d. Flags=%8x\n", k, SetBit((int*)&Flags,k), +Flags);
}
}

Poor performance of vector<bool> in 64-bit target with VS2012

Benchmarking this class:
struct Sieve {
std::vector<bool> isPrime;
Sieve (int n = 1) {
isPrime.assign (n+1, true);
isPrime[0] = isPrime[1] = false;
for (int i = 2; i <= (int)sqrt((double)n); ++i)
if (isPrime[i])
for (int j = i*i; j <= n; j += i)
isPrime[j] = false;
}
};
I'm getting over 3 times worse performance (CPU time) with 64-bit binary vs. 32-bit version (release build) when calling a constructor for a large number, e.g.
Sieve s(100000000);
I tested sizeof(bool) and it is 1 for both versions.
When I substitute vector<bool> with vector<char> performance becomes the same for 64-bit and 32-bit versions. Why is that?
Here are the run times for S(100000000) (release mode, 32-bit first, 64-bit second)):
vector<bool> 0.97s 3.12s
vector<char> 0.99s 0.99s
vector<int> 1.57s 1.59s
I also did a sanity test with VS2010 (prompted by Wouter Huysentruit's response), which produced 0.98s 0.88s. So there is something wrong with VS2012 implementation.
I submitted a bug report to Microsoft Connect
EDIT
Many answers below comment on deficiencies of using int for indexing. This may be true, but even the Great Wizard himself is using a standard for (int i = 0; i < v.size(); ++i) in his books, so such a pattern should not incur a significant performance penalty. Additionally, this issue was raised during Going Native 2013 conference and the presiding group of C++ gurus commented on their early recommendations of using size_t for indexing and as a return type of size() as a mistake. They said: "we are sorry, we were young..."
The title of this question could be rephrased to: Over 3 times performance drop on this code when upgrading from VS2010 to VS2012.
EDIT
I made a crude attempt at finding memory alignment of indexes i and j and discovered that this instrumented version:
struct Sieve {
vector<bool> isPrime;
Sieve (int n = 1) {
isPrime.assign (n+1, true);
isPrime[0] = isPrime[1] = false;
for (int i = 2; i <= sqrt((double)n); ++i) {
if (i == 17) cout << ((int)&i)%16 << endl;
if (isPrime[i])
for (int j = i*i; j <= n; j += i) {
if (j == 4) cout << ((int)&j)%16 << endl;
isPrime[j] = false;
}
}
}
};
auto-magically runs fast now (only 10% slower than 32-bit version). This and VS2010 performance makes it hard to accept a theory of optimizer having inherent problems dealing with int indexes instead of size_t.

std::vector<bool> is not directly at fault here. The performance difference is ultimately caused by your use of the signed 32-bit int type in your loops and some rather poor register allocation by the compiler. Consider, for example, your innermost loop:
for (int j = i*i; j <= n; j += i)
isPrime[j] = false;
Here, j is a 32-bit signed integer. When it is used in isPrime[j], however, it must be promoted (and sign-extended) to a 64-bit integer, in order to perform the subscript computation. The compiler can't just treat j as a 64-bit value, because that would change the behavior of the loop (e.g. if n is negative). The compiler also can't perform the index computation using the 32-bit quantity j, because that would change the behavior of that expression (e.g. if j is negative).
So, the compiler needs to generate code for the loop using a 32-bit j then it must generate code to convert that j to a 64-bit integer for the subscript computation. It has to do the same thing for the outer loop with i. Unfortunately, it looks like the compiler allocates registers rather poorly in this case(*)--it starts spilling temporaries to the stack, causing the performance hit you see.
If you change your program to use size_t everywhere (which is 32-bit on x86 and 64-bit on x64), you will observe that the performance is on par with x86, because the generated code needs only work with values of one size:
Sieve (size_t n = 1)
{
isPrime.assign (n+1, true);
isPrime[0] = isPrime[1] = false;
for (size_t i = 2; i <= static_cast<size_t>(sqrt((double)n)); ++i)
if (isPrime[i])
for (size_t j = i*i; j <= n; j += i)
isPrime[j] = false;
}
You should do this anyway, because mixing signed and unsigned types, especially when those types are of different widths, is perilous and can lead to unexpected errors.
Note that using std::vector<char> also "solves" the problem, but for a different reason: the subscript computation required for accessing an element of a std::vector<char> is substantially simpler than that for accessing an element of std::vector<bool>. The optimizer is able to generate better code for the simpler computations.
(*) I don't work on code generation, and I'm hardly an expert in either assembly or low-level performance optimization, but from looking at the generated code, and given that it is reported here that Visual C++ 2010 generates better code, I'd guess that there are opportunities for improvement in the compiler. I'll make sure the Connect bug you opened gets forwarded on to the compiler team so they can take a look.

I've tested this with vector<bool> in VS2010: 32-bit needs 1452ms while 64-bit needs 1264ms to complete on a i3.
The same test in VS2012 (on i7 this time) needs 700ms (32-bit) and 2730ms (64-bit), so there is something wrong with the compiler in VS2012. Maybe you can report this test case as a bug to Microsoft.
UPDATE
The problem is that the VS2012 compiler uses a temporary stack variable for a part of the code in the inner for-loop when using int as iterator. The assembly parts listed below are part of the code inside <vector>, in the += operator of the std::vector<bool>::iterator.
size_t as iterator
When using size_t as iterator, a part of the code looks like this:
or rax, -1
sub rax, rdx
shr rax, 5
lea rax, QWORD PTR [rax*4+4]
sub r8, rax
Here, all instructions use CPU registers which are very fast.
int as iterator
When using int as iterator, that same part looks like this:
or rcx, -1
sub rcx, r8
shr rcx, 5
shl rcx, 2
mov rax, -4
sub rax, rcx
mov rdx, QWORD PTR _Tmp$6[rsp]
add rdx, rax
Here you see the _Tmp$6 stack variable being used, which causes the slowdown.
Point compiler into the right direction
The funny part is that you can point the compiler into the right direction by using the vector<bool>::iterator directly.
struct Sieve {
std::vector<bool> isPrime;
Sieve (int n = 1) {
isPrime.assign(n + 1, true);
std::vector<bool>::iterator it1 = isPrime.begin();
std::vector<bool>::iterator end = it1 + n;
*it1++ = false;
*it1++ = false;
for (int i = 2; i <= (int)sqrt((double)n); ++it1, ++i)
if (*it1)
for (std::vector<bool>::iterator it2 = isPrime.begin() + i * i; it2 <= end; it2 += i)
*it2 = false;
}
};

vector<bool> is a very special container that's specialized to use 1 bit per item rather than providing normal container semantics. I suspect that the bit manipulation logic is much more expensive when compiling 64 bits (either it still uses 32 bit chunks to hold the bits or some other reason). vector<char> behaves just like a normal vector so there's no special logic.
You could also use deque<bool> which doesn't have the specialization.

Executable runs faster on Wine than Windows -- why?

Solution: Apparently the culprit was the use of floor(), the performance of which turns out to be OS-dependent in glibc.
This is a followup question to an earlier one: Same program faster on Linux than Windows -- why?
I have a small C++ program, that, when compiled with nuwen gcc 4.6.1, runs much faster on Wine than Windows XP (on the same computer). The question: why does this happen?
The timings are ~15.8 and 25.9 seconds, for Wine and Windows respectively. Note that I'm talking about the same executable, not only the same C++ program.
The source code is at the end of the post. The compiled executable is here (if you trust me enough).
This particular program does nothing useful, it is just a minimal example boiled down from a larger program I have. Please see this other question for some more precise benchmarking of the original program (important!!) and the most common possibilities ruled out (such as other programs hogging the CPU on Windows, process startup penalty, difference in system calls such as memory allocation). Also note that while here I used rand() for simplicity, in the original I used my own RNG which I know does no heap-allocation.
The reason I opened a new question on the topic is that now I can post an actual simplified code example for reproducing the phenomenon.
The code:
#include <cstdlib>
#include <cmath>
int irand(int top) {
return int(std::floor((std::rand() / (RAND_MAX + 1.0)) * top));
}
template<typename T>
class Vector {
T *vec;
const int sz;
public:
Vector(int n) : sz(n) {
vec = new T[sz];
}
~Vector() {
delete [] vec;
}
int size() const { return sz; }
const T & operator [] (int i) const { return vec[i]; }
T & operator [] (int i) { return vec[i]; }
};
int main() {
const int tmax = 20000; // increase this to make it run longer
const int m = 10000;
Vector<int> vec(150);
for (int i=0; i < vec.size(); ++i)
vec[i] = 0;
// main loop
for (int t=0; t < tmax; ++t)
for (int j=0; j < m; ++j) {
int s = irand(100) + 1;
vec[s] += 1;
}
return 0;
}
UPDATE
It seems that if I replace irand() above with something deterministic such as
int irand(int top) {
static int c = 0;
return (c++) % top;
}
then the timing difference disappears. I'd like to note though that in my original program I used a different RNG, not the system rand(). I'm digging into the source of that now.
UPDATE 2
Now I replaced the irand() function with an equivalent of what I had in the original program. It is a bit lengthy (the algorithm is from Numerical Recipes), but the point was to show that no system libraries are being called explictly (except possibly through floor()). Yet the timing difference is still there!
Perhaps floor() could be to blame? Or the compiler generates calls to something else?
class ran1 {
static const int table_len = 32;
static const int int_max = (1u << 31) - 1;
int idum;
int next;
int *shuffle_table;
void propagate() {
const int int_quo = 1277731;
int k = idum/int_quo;
idum = 16807*(idum - k*int_quo) - 2836*k;
if (idum < 0)
idum += int_max;
}
public:
ran1() {
shuffle_table = new int[table_len];
seedrand(54321);
}
~ran1() {
delete [] shuffle_table;
}
void seedrand(int seed) {
idum = seed;
for (int i = table_len-1; i >= 0; i--) {
propagate();
shuffle_table[i] = idum;
}
next = idum;
}
double frand() {
int i = next/(1 + (int_max-1)/table_len);
next = shuffle_table[i];
propagate();
shuffle_table[i] = idum;
return next/(int_max + 1.0);
}
} rng;
int irand(int top) {
return int(std::floor(rng.frand() * top));
}

edit: It turned out that the culprit was floor() and not rand() as I suspected - see
the update at the top of the OP's question.
The run time of your program is dominated by the calls to rand().
I therefore think that rand() is the culprit. I suspect that the underlying function is provided by the WINE/Windows runtime, and the two implementations have different performance characteristics.
The easiest way to test this hypothesis would be to simply call rand() in a loop, and time the same executable in both environments.
edit I've had a look at the WINE source code, and here is its implementation of rand():
/*********************************************************************
* rand (MSVCRT.#)
*/
int CDECL MSVCRT_rand(void)
{
thread_data_t *data = msvcrt_get_thread_data();
/* this is the algorithm used by MSVC, according to
* http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators */
data->random_seed = data->random_seed * 214013 + 2531011;
return (data->random_seed >> 16) & MSVCRT_RAND_MAX;
}
I don't have access to Microsoft's source code to compare, but it wouldn't surprise me if the difference in performance was in the getting of thread-local data rather than in the RNG itself.

Wikipedia says:
Wine is a compatibility layer not an emulator. It duplicates functions
of a Windows computer by providing alternative implementations of the
DLLs that Windows programs call,[citation needed] and a process to
substitute for the Windows NT kernel. This method of duplication
differs from other methods that might also be considered emulation,
where Windows programs run in a virtual machine.[2] Wine is
predominantly written using black-box testing reverse-engineering, to
avoid copyright issues.
This implies that the developers of wine could replace an api call with anything at all to as long as the end result was the same as you would get with a native windows call. And I suppose they weren't constrained by needing to make it compatible with the rest of Windows.

From what I can tell, the C standard libraries used WILL be different in the two different scenarios. This affects the rand() call as well as floor().
From the mingw site... MinGW compilers provide access to the functionality of the Microsoft C runtime and some language-specific runtimes. Running under XP, this will use the Microsoft libraries. Seems straightforward.
However, the model under wine is much more complex. According to this diagram, the operating system's libc comes into play. This could be the difference between the two.

While Wine is basically Windows, you're still comparing apples to oranges. As well, not only is it apples/oranges, the underlying vehicles hauling those apples and oranges around are completely different.
In short, your question could trivially be rephrased as "this code runs faster on Mac OSX than it does on Windows" and get the same answer.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js