Executable runs faster on Wine than Windows -- why? - c++

Solution: Apparently the culprit was the use of floor(), the performance of which turns out to be OS-dependent in glibc.
This is a followup question to an earlier one: Same program faster on Linux than Windows -- why?
I have a small C++ program that, when compiled with nuwen gcc 4.6.1, runs much faster on Wine than on Windows XP (on the same computer). The question: why does this happen?
The timings are ~15.8 and 25.9 seconds, for Wine and Windows respectively. Note that I'm talking about the same executable, not only the same C++ program.
The source code is at the end of the post. The compiled executable is here (if you trust me enough).
This particular program does nothing useful, it is just a minimal example boiled down from a larger program I have. Please see this other question for some more precise benchmarking of the original program (important!!) and the most common possibilities ruled out (such as other programs hogging the CPU on Windows, process startup penalty, difference in system calls such as memory allocation). Also note that while here I used rand() for simplicity, in the original I used my own RNG which I know does no heap-allocation.
The reason I opened a new question on the topic is that now I can post an actual simplified code example for reproducing the phenomenon.
The code:
#include <cstdlib>
#include <cmath>

int irand(int top) {
    return int(std::floor((std::rand() / (RAND_MAX + 1.0)) * top));
}

template<typename T>
class Vector {
    T *vec;
    const int sz;
public:
    Vector(int n) : sz(n) {
        vec = new T[sz];
    }
    ~Vector() {
        delete [] vec;
    }
    int size() const { return sz; }
    const T & operator [] (int i) const { return vec[i]; }
    T & operator [] (int i) { return vec[i]; }
};

int main() {
    const int tmax = 20000; // increase this to make it run longer
    const int m = 10000;
    Vector<int> vec(150);
    for (int i=0; i < vec.size(); ++i)
        vec[i] = 0;
    // main loop
    for (int t=0; t < tmax; ++t)
        for (int j=0; j < m; ++j) {
            int s = irand(100) + 1;
            vec[s] += 1;
        }
    return 0;
}
UPDATE
It seems that if I replace irand() above with something deterministic such as
int irand(int top) {
    static int c = 0;
    return (c++) % top;
}
then the timing difference disappears. I'd like to note though that in my original program I used a different RNG, not the system rand(). I'm digging into the source of that now.
UPDATE 2
Now I replaced the irand() function with an equivalent of what I had in the original program. It is a bit lengthy (the algorithm is from Numerical Recipes), but the point was to show that no system libraries are being called explicitly (except possibly through floor()). Yet the timing difference is still there!
Perhaps floor() could be to blame? Or the compiler generates calls to something else?
class ran1 {
    static const int table_len = 32;
    static const int int_max = (1u << 31) - 1;
    int idum;
    int next;
    int *shuffle_table;
    // Park-Miller "minimal standard" step, using Schrage's trick to avoid overflow
    void propagate() {
        const int int_quo = 127773;
        int k = idum/int_quo;
        idum = 16807*(idum - k*int_quo) - 2836*k;
        if (idum < 0)
            idum += int_max;
    }
public:
    ran1() {
        shuffle_table = new int[table_len];
        seedrand(54321);
    }
    ~ran1() {
        delete [] shuffle_table;
    }
    void seedrand(int seed) {
        idum = seed;
        for (int i = table_len-1; i >= 0; i--) {
            propagate();
            shuffle_table[i] = idum;
        }
        next = idum;
    }
    // Bays-Durham shuffle of the output sequence
    double frand() {
        int i = next/(1 + (int_max-1)/table_len);
        next = shuffle_table[i];
        propagate();
        shuffle_table[i] = idum;
        return next/(int_max + 1.0);
    }
} rng;

int irand(int top) {
    return int(std::floor(rng.frand() * top));
}
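If it helps, a loop like the following could be timed on both systems to isolate floor() (this is only a rough sketch I am adding for illustration, not code from the original program):

#include <cmath>
#include <cstdio>
#include <ctime>

int main() {
    // Accumulate into a volatile so the floor() calls cannot be optimized away.
    volatile double sink = 0.0;
    std::clock_t start = std::clock();
    for (int i = 0; i < 100000000; ++i)
        sink = sink + std::floor(i * 0.37);
    std::clock_t end = std::clock();
    std::printf("%.2f s (checksum %f)\n",
                double(end - start) / CLOCKS_PER_SEC, double(sink));
    return 0;
}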

edit: It turned out that the culprit was floor() and not rand() as I suspected - see
the update at the top of the OP's question.
The run time of your program is dominated by the calls to rand().
I therefore think that rand() is the culprit. I suspect that the underlying function is provided by the WINE/Windows runtime, and the two implementations have different performance characteristics.
The easiest way to test this hypothesis would be to simply call rand() in a loop, and time the same executable in both environments.
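For example, something along these lines would do (a sketch of such a test, not code taken from the question):

#include <cstdlib>
#include <cstdio>
#include <ctime>

int main() {
    // Sum the results so the compiler cannot drop the rand() calls.
    unsigned long long sum = 0;
    std::clock_t start = std::clock();
    for (int i = 0; i < 200000000; ++i)
        sum += std::rand();
    std::clock_t end = std::clock();
    std::printf("%.2f s (checksum %.0f)\n",
                double(end - start) / CLOCKS_PER_SEC, double(sum));
    return 0;
}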
edit: I've had a look at the Wine source code, and here is its implementation of rand():
/*********************************************************************
 *      rand (MSVCRT.#)
 */
int CDECL MSVCRT_rand(void)
{
    thread_data_t *data = msvcrt_get_thread_data();

    /* this is the algorithm used by MSVC, according to
     * http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators */
    data->random_seed = data->random_seed * 214013 + 2531011;
    return (data->random_seed >> 16) & MSVCRT_RAND_MAX;
}
I don't have access to Microsoft's source code to compare, but it wouldn't surprise me if the difference in performance was in the getting of thread-local data rather than in the RNG itself.

Wikipedia says:
Wine is a compatibility layer not an emulator. It duplicates functions
of a Windows computer by providing alternative implementations of the
DLLs that Windows programs call,[citation needed] and a process to
substitute for the Windows NT kernel. This method of duplication
differs from other methods that might also be considered emulation,
where Windows programs run in a virtual machine.[2] Wine is
predominantly written using black-box testing reverse-engineering, to
avoid copyright issues.
This implies that the developers of Wine could replace an API call with anything at all, as long as the end result was the same as you would get with a native Windows call. And I suppose they weren't constrained by needing to make it compatible with the rest of Windows.

From what I can tell, the C standard libraries used WILL be different in the two different scenarios. This affects the rand() call as well as floor().
From the mingw site... MinGW compilers provide access to the functionality of the Microsoft C runtime and some language-specific runtimes. Running under XP, this will use the Microsoft libraries. Seems straightforward.
However, the model under wine is much more complex. According to this diagram, the operating system's libc comes into play. This could be the difference between the two.

While Wine is basically Windows, you're still comparing apples to oranges. As well, not only is it apples/oranges, the underlying vehicles hauling those apples and oranges around are completely different.
In short, your question could trivially be rephrased as "this code runs faster on Mac OSX than it does on Windows" and get the same answer.

Related

(Why) is the std::binomial_distribution biased for large probabilities p and slow for small n?

I want to generate binomially distributed random numbers in C++. Speed is a major concern. Not knowing a lot about random number generators, I use the standard library's tools. My code looks something like this:
#include <random>

static std::random_device random_dev;
static std::mt19937 random_generator{random_dev()};
std::binomial_distribution<int> binomial_generator;

void RandomInit(int s) {
    //I create the generator object here to save time. Does this make sense?
    binomial_generator = std::binomial_distribution<int>(1, 0.5);
    random_generator.seed(s);
}

int binomrand(int n, double p) {
    binomial_generator.param(std::binomial_distribution<int>::param_type(n, p));
    return binomial_generator(random_generator);
}
To test my implementation, I have built a Cython wrapper and then executed and timed the function from within Python. For reference, I have also implemented a "stupid" binomial distribution, which just returns the sum of Bernoulli trials.
int binomrand2(int n, double p) {
    int result = 0;
    for (int i = 0; i<n; i++) {
        if (_Random() < p) //_Random is a thoroughly tested custom random number generator on U[0,1)
            result++;
    }
    return result;
}
Timing showed that the latter implementation is about 50% faster than the former if n < 25. Furthermore, for p = 0.95, the former yielded significantly biased results (the mean over 1000000 trials for n = 40 was 38.23037; the standard deviation is 0.0014; the result was reproducible with different seeds).
Is this a (known) issue with the standard library's functions or is my implementation wrong? What could I do to achieve my goal of obtaining accurate results with high efficiency?
The parameter n will mostly be below 100 and smaller values will occur more frequently.
I am open to suggestions outside the realm of the standard library, but I may not be able to use external software libraries.
I am using the VC 2019 compiler on 64bit Windows.
Edit
I have also tested the bias without using python:
double binomrandTest(int n, double p, long long N) {
    long long result = 0;
    for (long long i = 0; i<N; i++) {
        result += binomrand(n, p);
    }
    return ((double) result) / ((double) N);
}
The result remained biased (38.228045 for the parameters above, where something like 38.000507 would be expected).
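For reference, an independent sanity check of the mean built directly from std::bernoulli_distribution (a standalone sketch, separate from the code above) should land very close to the expected 38.0:

#include <random>
#include <cstdio>

// Sum of n Bernoulli(p) trials; slow, but unbiased by construction.
int binomrand_ref(int n, double p, std::mt19937 &gen) {
    std::bernoulli_distribution coin(p);
    int result = 0;
    for (int i = 0; i < n; ++i)
        result += coin(gen);
    return result;
}

int main() {
    std::mt19937 gen(12345);
    const long long N = 1000000;
    long long sum = 0;
    for (long long i = 0; i < N; ++i)
        sum += binomrand_ref(40, 0.95, gen);
    std::printf("mean = %f (expected 38.0)\n", double(sum) / double(N));
    return 0;
}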

Generating a random bit stream in Rcpp efficiently

I have an auxiliary function in the R package I'm currently building named rbinom01. Note that it calls random(3).
int rbinom01(int size) {
    if (!size) {
        return 0;
    }
    int64_t result = 0;
    while (size >= 32) {
        result += __builtin_popcount(random());
        size -= 32;
    }
    result += __builtin_popcount(random() & ~(LONG_MAX << size));
    return result;
}
When running R CMD check on my_package, I got the following warning:
* checking compiled code ... NOTE
File ‘my_package/libs/my_package.so’:
  Found ‘_random’, possibly from ‘random’ (C)
    Object: ‘my_function.o’
Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor use Fortran I/O
nor system RNGs.
See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.
I headed to the documentation, and it says I can use one of the *_rand functions, along with a family of distribution functions. Well, that's cool, but my package simply needs a stream of random bits rather than a random double. The easiest way I can get one is by using random(3) or maybe reading from /dev/urandom, but that makes my package "unportable".
This post suggests using sample, but unfortunately it doesn't fit my use case. For my application, generating random bits is critical to performance, so I don't want to waste any time calling unif_rand, multiplying the result by N and rounding it. Anyway, the reason I'm using C++ is to exploit bit-level parallelism.
Surely I can hand-roll my own PRNG or copy and paste the code of a state-of-the-art PRNG like xoshiro256**, but before doing that I would like to see if there are any easier alternatives.
Incidentally, could someone please link me to a nice, short tutorial on Rcpp? Writing R Extensions is comprehensive and awesome, but it would take me weeks to finish. I'm looking for a more concise version, preferably one that is more informative than a call to Rcpp.package.skeleton.
As suggested by @Ralf Stubner's answer, I have rewritten the original code as follows. However, I'm getting the same result every time. How can I seed it properly and at the same time keep my code "portable"?
int rbinom01(int size) {
    dqrng::xoshiro256plus rng;
    if (!size) {
        return 0;
    }
    int result = 0;
    while (size >= 64) {
        result += __builtin_popcountll(rng());
        Rcout << sizeof(rng()) << std::endl;
        size -= 64;
    }
    result += __builtin_popcountll(rng() & ((1LLU << size) - 1));
    return result;
}
There are different R packages that make PRNGs available as C++ header-only libraries:
BH: Everything from boost.random
sitmo: Various Threefry versions
dqrng: PCG family, xoshiro256+ and xoroshiro128+
...
You can make use of any of these by adding LinkingTo to your package's DESCRIPTION. Typically these PRNGs are modeled after the C++11 random header, which means you have to control their life-cycle and seeding yourself. In a single-threaded environment I like to use anonymous namespaces for life-cycle control, e.g.:
#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
#include <xoshiro.h>
// [[Rcpp::plugins(cpp11)]]

namespace {
dqrng::xoshiro256plus rng{};
}

// [[Rcpp::export]]
void set_seed(int seed) {
    rng.seed(seed);
}

// [[Rcpp::export]]
int rbinom01(int size) {
    if (!size) {
        return 0;
    }
    int result = 0;
    while (size >= 64) {
        result += __builtin_popcountll(rng());
        size -= 64;
    }
    result += __builtin_popcountll(rng() & ((1LLU << size) - 1));
    return result;
}

/*** R
set_seed(42)
rbinom01(10)
rbinom01(10)
rbinom01(10)
*/
However, using runif isn't all bad, and it is certainly faster than accessing /dev/urandom. In dqrng there is a convenient wrapper for this.
As for tutorials: Besides WRE the Rcpp package vignette is a must read. R Packages by Hadley Wickham also has a chapter on "compiled code" if you want to go the devtools-way.

Why is my C++ code three times slower than the C equivalent on LeetCode? [closed]

I've been doing some of the LeetCode problems, and I notice that the C solutions are a couple of times faster than the exact same thing in C++. For example:
Updated with a couple of simpler examples:
Given a sorted array and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order. You may assume no duplicates in the array. (Link to question on LeetCode)
My solution in C, runs in 3 ms:
int searchInsert(int A[], int n, int target) {
    int left = 0;
    int right = n;
    int mid = 0;
    while (left<right) {
        mid = (left + right) / 2;
        if (A[mid]<target) {
            left = mid + 1;
        }
        else if (A[mid]>target) {
            right = mid;
        }
        else {
            return mid;
        }
    }
    return left;
}
My other C++ solution, exactly the same but as a member function of the Solution class, runs in 13 ms:
class Solution {
public:
    int searchInsert(int A[], int n, int target) {
        int left = 0;
        int right = n;
        int mid = 0;
        while (left<right) {
            mid = (left + right) / 2;
            if (A[mid]<target) {
                left = mid + 1;
            }
            else if (A[mid]>target) {
                right = mid;
            }
            else {
                return mid;
            }
        }
        return left;
    }
};
Even simpler example:
Reverse the digits of an integer. Return 0 if the result will overflow. (Link to question on LeetCode)
The C version runs in 6 ms:
int reverse(int x) {
    long rev = x % 10;
    x /= 10;
    while (x != 0) {
        rev *= 10L;
        rev += x % 10;
        x /= 10;
        if (rev>(-1U >> 1) || rev < (1 << 31)) {
            return 0;
        }
    }
    return rev;
}
And the C++ version is exactly the same but as a member function of the Solution class, and runs in 19 ms:
class Solution {
public:
    int reverse(int x) {
        long rev = x % 10;
        x /= 10;
        while (x != 0) {
            rev *= 10L;
            rev += x % 10;
            x /= 10;
            if (rev>(-1U >> 1) || rev < (1 << 31)) {
                return 0;
            }
        }
        return rev;
    }
};
I see how there would be considerable overhead from using vector of vector as a 2D array in the original example if the LeetCode testing system doesn't compile the code with optimisation enabled. But the simpler examples above shouldn't suffer that issue because the data structures are pretty raw, especially in the second case where all you have is long or integer arithmetics. That's still slower by a factor of three.
I'm starting to think that there might be something odd happening with the way LeetCode do the benchmarking in general because even in the C version of the integer reversing problem you get a huge bump in running time from just replacing the line
if (rev>(-1U >> 1) || rev < (1 << 31)) {
with
if (rev>INT_MAX || rev < INT_MIN) {
Now, I suppose having to #include<limits.h> might have something to do with that but it seems a bit extreme that this simple change bumps the execution time from just 6 ms to 19 ms.
Lately I've been seeing the vector<vector<int>> suggestion a lot for doing 2d arrays in C++, and I've been pointing out to people why this really isn't a good idea. It's a handy trick to know when slapping together temporary code, but there's (almost) never any reason to ever use it for real code. The right thing to do is to use a class that wraps a contiguous block of memory.
So my first reaction might be to point to this as a possible source for the disparity. However you're also using int** in the C version, which is generally a sign of the exact same problem as vector<vector<int>>.
So instead I decided to just compare the two solutions.
http://coliru.stacked-crooked.com/a/fa8441cc5baa0391
6468424
6588511
That's the time taken by the 'C version' vs the 'C++ version' in nanoseconds.
My results don't show anything like the disparity you describe. Then it occurred to me to check a common mistake people make when benchmarking
http://coliru.stacked-crooked.com/a/e57d791876b9252b
18386695
42400612
Notice that the -O3 flag from the first example has become -O0, which disables optimization.
Conclusion: you're probably comparing unoptimized executables.
C++ supports building rich abstractions that don't require overhead, but eliminating the overhead does require certain code transformations that play havoc with the 'debuggability' of code.
That means debug builds avoid those transformations and therefore C++ debug builds are often slower than debug builds of C style code because C style code just doesn't use much abstraction. Seeing a 130% slowdown such as the above is not at all surprising when timing, for example, machine code that uses function calls in place of simple store instructions.
Some code really needs optimizations in order to have reasonable performance even for debugging, so compilers often offer a mode that applies some optimizations which don't cause too much trouble for debuggers. Clang and gcc use -O1 for this, and you can see that even this level of optimization essentially eliminates the gap in this program between C style code and the more C++ style code:
http://coliru.stacked-crooked.com/a/13967ebcfcfa4073
8389992
8196935
Update:
In those later examples optimization shouldn't make a difference, since the C++ is not using any abstraction beyond what the C version is doing. I'm guessing that the explanation for this is that the examples are being compiled with different compilers or with some other different compiler options. Without knowing how the compilation is done I would say it makes no sense to compare these runtime numbers; LeetCode is clearly not producing an apples to apples comparison.
You are using vector of vector in your C++ code snippet. Vectors are sequence containers in C++ that behave like arrays but can change in size. Instead of vector<vector<int>>, using statically allocated arrays would be better. You may use your own Array class with operator [] overloaded as well, but vector has more overhead because it dynamically resizes when you add more elements than its original capacity. In C++ you can also pass by reference to further reduce your time compared with C. C++ should run even faster if written well.
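For illustration, a minimal wrapper over one contiguous block (the class name and interface here are made up for this sketch) avoids the per-row allocations of vector<vector<int>>:

#include <cstddef>
#include <vector>

// All elements live in one contiguous block; element (r, c) sits at r * cols + c.
class Grid {
    std::vector<int> data;
    int cols;
public:
    Grid(int rows, int cols) : data(static_cast<std::size_t>(rows) * cols), cols(cols) {}
    int& operator()(int r, int c) { return data[static_cast<std::size_t>(r) * cols + c]; }
    const int& operator()(int r, int c) const { return data[static_cast<std::size_t>(r) * cols + c]; }
};

// Usage: Grid g(100, 100); g(3, 5) = 42;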

Suggestion for chkstk.asm stackoverflow exception in C++ with Visual Studio

I am working with an implementation of merge sort. I am trying it with C++ in Visual Studio 2010 (MSVC). But when I used an array of 300000 integers for timing, it showed an unhandled stack overflow exception and took me to a read-only file named "chkstk.asm". I reduced the size to 200000 and it worked. The same code also worked in the C-Free 4 editor (MinGW 2.95) without any problem with an array size of 400000. Do you have any suggestion to get the code working in Visual Studio?
Maybe the recursion in the merge sort is causing the problem.
Problem solved. Thanks to Kotti for supplying the code. I spotted the problem while comparing my code with his. The problem was not too much recursion. I was actually using a normal C++ array, which was stored on the stack, so the program ran out of stack space. I just changed it to a dynamically allocated array with new/delete and it worked.
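In other words, the change amounted to something like this (a sketch of the idea, not the actual program):

#include <cstddef>

int main() {
    const std::size_t n = 300000;
    // int data[300000];     // automatic storage: ~1.2 MB, more than MSVC's default 1 MB stack
    int *data = new int[n];  // heap storage: fine at this size
    for (std::size_t i = 0; i < n; ++i)
        data[i] = static_cast<int>(i);
    // ... merge sort would run on data here ...
    delete [] data;
    return 0;
}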
I'm not exactly sure, but this may be a particular problem of your implementation of your merge sort (that causes the stack overflow). There are plenty of good implementations (use Google); the following works on VS2008 with array size = 2000000.
(You could try it in VS2010)
#include <cstdlib>
#include <memory.h>

// Mix two sorted tables in one and split the result into these two tables.
void Mix(int* tab1, int *tab2, int count1, int count2)
{
    int i,i1,i2;
    i = i1 = i2 = 0;
    int * temp = (int *)malloc(sizeof(int)*(count1+count2));
    while((i1<count1) && (i2<count2))
    {
        while((i1<count1) && (*(tab1+i1)<=*(tab2+i2)))
        {
            *(temp+i++) = *(tab1+i1);
            i1++;
        }
        if (i1<count1)
        {
            while((i2<count2) && (*(tab2+i2)<=*(tab1+i1)))
            {
                *(temp+i++) = *(tab2+i2);
                i2++;
            }
        }
    }
    memcpy(temp+i,tab1+i1,(count1-i1)*sizeof(int));
    memcpy(tab1,temp,count1*sizeof(int));
    memcpy(temp+i,tab2+i2,(count2-i2)*sizeof(int));
    memcpy(tab2,temp+count1,count2*sizeof(int));
    free(temp);
}

void MergeSort(int *tab,int count) {
    if (count == 1) return;
    MergeSort(tab, count/2);
    MergeSort(tab + count/2, (count + 1) /2);
    Mix(tab, tab + count / 2, count / 2, (count + 1) / 2);
}

int main() {
    const size_t size = 2000000;
    int* array = (int*)malloc(sizeof(int) * size);
    for (size_t i = 0; i < size; ++i) {
        array[i] = rand() % 5000;
    }
    MergeSort(array, size);
    free(array);
    return 0;
}
My guess is that you've got so much recursion that you're just running out of stack space. You can increase your stack size with the linker's /F command line option. But, if you keep hitting stack size limits you probably want to refactor the recursion out of your algorithm.
_chkstk() refers to "check stack". This happens on Windows by default. It can be disabled with the /Gs- option, or by allocating a reasonably high size like /Gs1000000. The other way is to disable this function using:
#pragma check_stack(off) // place at top header to cover all the functions
Official documentation.
Reference.

Performance problems when scaling MSVC 2005's operator<< across threads

When looking at some of our logging, I noticed in the profiler that we were spending a lot of time in operator<< formatting ints and such. It looks like there is a shared lock that is used whenever ostream::operator<< is called to format an int (and presumably doubles). Upon further investigation I've narrowed it down to this example:
Loop1 that uses ostringstream to do the formatting:
DWORD WINAPI doWork1(void* param)
{
    int nTimes = *static_cast<int*>(param);
    for (int i = 0; i < nTimes; ++i)
    {
        ostringstream out;
        out << "[0";
        for (int j = 1; j < 100; ++j)
            out << ", " << j;
        out << "]\n";
    }
    return 0;
}
Loop2 that uses the same ostringstream for everything but the int formatting, which is done with itoa:
DWORD WINAPI doWork2(void* param)
{
    int nTimes = *static_cast<int*>(param);
    for (int i = 0; i < nTimes; ++i)
    {
        ostringstream out;
        char buffer[13];
        out << "[0";
        for (int j = 1; j < 100; ++j)
        {
            _itoa_s(j, buffer, 10);
            out << ", " << buffer;
        }
        out << "]\n";
    }
    return 0;
}
For my test I ran each loop a number of times with 1, 2, 3 and 4 threads (I have a 4 core machine). The number of trials is constant. Here is the output:
doWork1: all ostringstream
n Total
1 557
2 8092
3 15916
4 15501
doWork2: use itoa
n Total
1 200
2 112
3 100
4 105
As you can see, the performance when using ostringstream is abysmal. It gets 30 times worse when adding more threads whereas the itoa gets about 2 times faster.
One idea is to use _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) as recommended by M$ in this article. That doesn't seem to help me. Here's another user who seem to be having a similar issue.
We need to be able to format ints in several threads running in parallel for our application. Given this issue we either need to figure out how to make this work or find another formatting solution. I may code up a simple class with operator<< overloaded for the integral and floating types and then have a templated version that just calls operator<< on the underlying stream. A bit ugly, but I think I can make it work, though maybe not for user defined operator<<(ostream&,T) because it's not an ostream.
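Roughly what I have in mind is something like this (an untested sketch; the class name is made up), which keeps int and double conversions out of the stream's numeric formatting and so should avoid the shared lock:

#include <cstdlib>
#include <cstdio>
#include <sstream>
#include <string>

// Thin formatting wrapper: ints and doubles are converted with the CRT's
// secure conversion routines, everything else falls through to the
// underlying ostringstream.
class FastFormatter {
    std::ostringstream out;
public:
    FastFormatter& operator<<(int v) {
        char buf[16];
        _itoa_s(v, buf, sizeof(buf), 10);   // MSVC-specific, as in doWork2 above
        out << buf;
        return *this;
    }
    FastFormatter& operator<<(double v) {
        char buf[32];
        sprintf_s(buf, sizeof(buf), "%g", v);
        out << buf;
        return *this;
    }
    template <typename T>
    FastFormatter& operator<<(const T& v) { // everything else: use the stream
        out << v;
        return *this;
    }
    std::string str() const { return out.str(); }
};

// Usage: FastFormatter out; out << "[0"; out << ", " << 42; ... out.str();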
I should also make clear that this is being built with Microsoft Visual Studio 2005. And I believe this limitation comes from their implementation of the standard library.
If Visual Studio 2005's standard library implementation has bugs, why not try other implementations? Like:
STLport
Apache C++ Standard Library (STDCXX)
or even Dinkumware, upon which the Visual Studio 2005 standard library is based; maybe they have fixed the problem since 2005.
Edit: The other user you mentioned used Visual Studio 2008 SP1, which means that probably Dinkumware has not fixed this issue.
Doesn't surprise me, MS has put "global" locks on a fair few shared resources - the biggest headache for us was the BSTR memory lock a few years back.
The best thing you can do is copy the code and replace the ostream lock and shared conversion memory with your own class. I have done that where I write the stream using a printf-style logging system (ie I had to use a printf logger, and wrapped it with my stream operators). Once you've compiled that into your app you should be as fast as itoa. When I'm in the office I'll grab some of the code and paste it for you.
EDIT:
as promised:
CLogger& operator<<(long l)
{
    if (m_LoggingLevel < m_levelFilter)
        return *this;
    // 33 is the max length of data returned from _ltot
    resize(33);
    _ltot(l, buffer+m_length, m_base);
    m_length += (long)_tcslen(buffer+m_length);
    return *this;
};

static CLogger& hex(CLogger& c)
{
    c.m_base = 16;
    return c;
};

void resize(long extra)
{
    if (extra + m_length > m_size)
    {
        // resize buffer to fit.
        TCHAR* old_buffer = buffer;
        m_size += extra;
        buffer = (TCHAR*)malloc(m_size*sizeof(TCHAR));
        _tcsncpy(buffer, old_buffer, m_length+1);
        free(old_buffer);
    }
}

static CLogger& endl(CLogger& c)
{
    if (c.m_length == 0 && c.m_LoggingLevel < c.m_levelFilter)
        return c;
    c.Write();
    return c;
};
Sorry I can't let you have all of it, but those 3 methods show the basics - I allocate a buffer, resize it if needed (m_size is buffer size, m_length is current text length) and keep it for the duration of the logging object. The buffer contents get written to file (or OutputDebugString, or a listbox) in the endl method. I also have a logging 'level' to restrict output at runtime. So you just replace your calls to ostringstream with this, and the Write() method pumps the buffer to a file and clears the length. Hope this helps.
The problem could be memory allocation: malloc, which "new" uses, has an internal lock. You can see it if you step into it. Try using a thread-local allocator and see if the bad performance disappears.
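A quick way to test that hypothesis (a sketch in the same style as doWork1/doWork2 above) is to hammer the allocator alone from 1 to 4 threads and see whether it scales as badly as the ostringstream loop does:

#include <cstdlib>
#include <windows.h>

// Each thread repeatedly allocates and frees a small block; if the total run
// time grows sharply as threads are added, the heap lock is part of the problem.
DWORD WINAPI doWork3(void* param)
{
    int nTimes = *static_cast<int*>(param);
    for (int i = 0; i < nTimes; ++i)
    {
        void* p = std::malloc(64);
        if (p == NULL)
            return 1;
        std::free(p);
    }
    return 0;
}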