Generating a random bit stream in Rcpp efficiently - c++

I have an auxiliary function in the R package I'm currently building named rbinom01. Note that it calls random(3).
int rbinom01(int size) {
if (!size) {
return 0;
}
int64_t result = 0;
while (size >= 32) {
result += __builtin_popcount(random());
size -= 32;
}
result += __builtin_popcount(random() & ~(LONG_MAX << size));
return result;
}
When R CMD check my_package, I got the following warning:
* checking compiled code ... NOTE
File ‘ my_package/libs/my_package.so’:
Found ‘_random’, possibly from ‘random’ (C)
Object: ‘ my_function.o’
Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor use Fortran I/O
nor system RNGs.
See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.
I headed to the Document, and it says I can use one of the *_rand function, along with a family of distribution functions. Well that's cool, but my package simply needs a stream of random bits rather than a random double. The easiest way I can have it is by using random(3) or maybe reading from /dev/urandom, but that makes my package "unportable".
This post suggests using sample, but unfortunately it doesn't fit into my use case. For my application, generating random bits is apparently critical to the performance, so I don't want it waste any time calling unif_rand, multiply the result by N and round it. Anyway, the reason I'm using C++ is to exploit bit-level parallelism.
Surely I can hand-roll my own PRNG or copy and paste the code of a state-of-the-art PRNG like xoshiro256**, but before doing that I would like to see if there are any easier alternatives.
Incidentally, could someone please link a nice short tutorial of Rcpp to me? Writing R Extensions is comprehensive and awesome but it would take me weeks to finish. I'm looking for a more concise version, but preferably it should be more informative than a call to Rcpp.package.skeleton.
As suggested by #Ralf Stubner's answer, I have re-wrote the original code as follow. However, I'm getting the same result every time. How can I seed it properly and at the same time keep my code "portable"?
int rbinom01(int size) {
dqrng::xoshiro256plus rng;
if (!size) {
return 0;
}
int result = 0;
while (size >= 64) {
result += __builtin_popcountll(rng());
Rcout << sizeof(rng()) << std::endl;
size -= 64;
}
result += __builtin_popcountll(rng() & ((1LLU << size) - 1));
return result;
}

There are different R packages that make PRNGs available as C++ header only libraries:
BH: Everything from boost.random
sitmo: Various Threefry versions
dqrng: PCG family, xoshiro256+ and xoroshiro128+
...
You can make use of any of these by adding LinkingTo to your package's DECRIPTION. Typically these PRNGs are modeled after the C++11 random header, which means you have to control their life-cycle and seeding yourself. In a single-threaded environment I like to use anonymous namespaces for life-cycle control, e.g.:
#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
#include <xoshiro.h>
// [[Rcpp::plugins(cpp11)]]
namespace {
dqrng::xoshiro256plus rng{};
}
// [[Rcpp::export]]
void set_seed(int seed) {
rng.seed(seed);
}
// [[Rcpp::export]]
int rbinom01(int size) {
if (!size) {
return 0;
}
int result = 0;
while (size >= 64) {
result += __builtin_popcountll(rng());
size -= 64;
}
result += __builtin_popcountll(rng() & ((1LLU << size) - 1));
return result;
}
/*** R
set_seed(42)
rbinom01(10)
rbinom01(10)
rbinom01(10)
*/
However, using runif isn't all bad and certainly faster than accessing /dev/urandom. In dqrng there is a convenient wrapper for this.
As for tutorials: Besides WRE the Rcpp package vignette is a must read. R Packages by Hadley Wickham also has a chapter on "compiled code" if you want to go the devtools-way.

Related

(Why) is the std::binomial_distribution biased for large probabilities p and slow for small n?

I want to generate binomially distributed random numbers in c++. Speed is a major concern. Not knowing a lot about random number generators, I use the standard libraries' tools. My code looks like something below:
#include <random>
static std::random_device random_dev;
static std::mt19937 random_generator{random_dev()};
std::binomial_distribution<int> binomial_generator;
void RandomInit(int s) {
//I create the generator object here to save time. Does this make sense?
binomial_generator = std::binomial_distribution<int>(1, 0.5);
random_generator.seed(s);
}
int binomrand(int n, double p) {
binomial_generator.param(std::binomial_distribution<int>::param_type(n, p));
return binomial_generator(random_generator);
}
To test my implementation, I have built a cython wrapper and then executed and timed the function from within python. For reference I have also implemented a "stupid" binomial distribution, which just returns the sum of Bernoulli trials.
int binomrand2(int n, double p) {
int result = 0;
for (int i = 0; i<n; i++) {
if (_Random() < p) //_Random is a thoroughly tested custom random number generator on U[0,1)
result++;
}
return result;
}
Timing showed that the latter implementation is about 50% faster than the former if n < 25. Furthermore, for p = 0.95, the former yielded significantly biased results (the mean over 1000000 trials for n = 40 was 38.23037; standard deviation is 0.0014; the result was reproducable with different seeds).
Is this a (known) issue with the standard library's functions or is my implementation wrong? What could I do to achieve my goal of obtaining accurate results with high efficiency?
The parameter n will mostly be below 100 and smaller values will occur more frequently.
I am open to suggestions outside the realm of the standard library, but I may not be able to use external software libraries.
I am using the VC 2019 compiler on 64bit Windows.
Edit
I have also tested the bias without using python:
double binomrandTest(int n, double p, long long N) {
long long result = 0;
for (long long i = 0; i<N; i++) {
result += binomrand(n, p);
}
return ((double) result) / ((double) N);
}
The result remained biased (38.228045 for the parameters above, where something like 38.000507 would be expected).

Truly asynchronous file IO in C++

I have a super fast M.2 drive. How fast is it? It doesn’t matter because I cannot utilize this speed anyway. That’s why I’m asking this question.
I have an app that needs a real lot of memory. So much that it won’t fit in RAM. Fortunately it is not needed all at once. Instead it is used to save intermediate results from computations.
Unfortunately the application is not able to write and reads this data fast enough. I tried using multiple reader and writer threads but it only made it worse (later I read that it is because of this).
So my question is: Is it possible to have truly asynchronous file IO in C++ to fully exploit those advertised gigabytes per second? If it is than how (in a cross platform way)?
You could also recommend a library that’s good with tasks like that if you know one because I believe that there is no point in reinventing the wheel.
Edit:
Here is code that shows how I do file IO in my program. It isn't from the mentioned program because it wouldn't be that minimal. This one ilustrates the problem nevertheless. Do not mind Windows.h. It is used only to set thread affinity. In the actual program I also set affinity , so that's why I included it.
#include <fstream>
#include <thread>
#include <memory>
#include <string>
#include <Windows.h> // for SetThreadAffinityMask()
void stress_write(unsigned bytes, int num)
{
std::ofstream out("temp" + std::to_string(num));
for (unsigned i = 0; i < bytes; ++i)
{
out << char(i);
}
}
void lock_thread(unsigned core_idx)
{
SetThreadAffinityMask(GetCurrentThread(), 1LL << core_idx);
}
int main()
{
std::ios_base::sync_with_stdio(false);
lock_thread(0);
auto worker_count = std::thread::hardware_concurrency() - 1;
std::unique_ptr<std::thread[]> threads = std::make_unique<std::thread[]>(worker_count); // faster than std::vector
for (int i = 0; i < worker_count; ++i)
{
threads[i] = std::thread(
[](unsigned idx) {
lock_thread(idx);
stress_write(1'000'000'000, idx);
},
i + 1
);
}
stress_write(1'000'000'000, 0);
for (int i = 0; i < worker_count; ++i)
{
threads[i].join();
}
}
As you can see its just plain old fstream. On my machine this uses 100% CPU, but only 7-9% disk (around 190MB/s). I am wondering if it could be increased.
The easiest thing to get (up to) a 10x speed up is to change this:
void stress_write(unsigned bytes, int num)
{
std::ofstream out("temp" + std::to_string(num));
for (unsigned i = 0; i < bytes; ++i)
{
out << char(i);
}
}
to this:
void stress_write(unsigned bytes, int num)
{
constexpr auto chunk_size = (1u << 12u); // tune as needed
std::ofstream out("temp" + std::to_string(num));
for (unsigned chunk = 0; chunk < (bytes+chunk_size-1)/chunk_size; ++chunk)
{
char chunk_buff[chunk_size];
auto count = (std::min)( bytes - chunk_size*chunk, chunk_size );
for (unsigned j = 0; j < count; ++j)
{
unsigned i = j + chunk_size*chunk;
chunk_buff[j] = char(i); // processing
}
out.write( chunk_buff, count );
}
}
where we group writes up to 4096 bytes before sending to the std ofstream.
The streaming operations have a number of annoying, hard for compilers to elide, virtual calls that dominate performance when you are writing only a handful of bytes at a time.
By chunking data into larger pieces we make the vtable lookups rare enough that they no longer dominate.
See this SO post for more details asto why.
To get the last iota of performance, you may have to use something like boost.asio or access your platforms raw async file io libraries.
But when you are working at < 10% of the drive bandwidth while railing your CPU, aim at low hanging fruit first.
Chunking the I/O is indeed the most important optimization here and should suffice in most cases. However, the direct answer to the exact question asked about asynchronous IO is the following.
Boost::Asio added support for file operations in version 1.21.0. The interface is similar to the rest of Asio.
First, we need to create an object representing a file. The most common use cases would use either a random_access_file or a stream_file. In case of this example code, a streaming file is enough.
Reading is done through async_read_some, but the usual async_read helper function can be used to read a specific number of bytes at once.
If the operating system supports that, these operations do indeed run in the background and use little processor time. Both Windows and Linux do support this.

Why do std::string operations perform poorly?

I made a test to compare string operations in several languages for choosing a language for the server-side application. The results seemed normal until I finally tried C++, which surprised me a lot. So I wonder if I had missed any optimization and come here for help.
The test are mainly intensive string operations, including concatenate and searching. The test is performed on Ubuntu 11.10 amd64, with GCC's version 4.6.1. The machine is Dell Optiplex 960, with 4G RAM, and Quad-core CPU.
in Python (2.7.2):
def test():
x = ""
limit = 102 * 1024
while len(x) < limit:
x += "X"
if x.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) > 0:
print("Oh my god, this is impossible!")
print("x's length is : %d" % len(x))
test()
which gives result:
x's length is : 104448
real 0m8.799s
user 0m8.769s
sys 0m0.008s
in Java (OpenJDK-7):
public class test {
public static void main(String[] args) {
int x = 0;
int limit = 102 * 1024;
String s="";
for (; s.length() < limit;) {
s += "X";
if (s.indexOf("ABCDEFGHIJKLMNOPQRSTUVWXYZ") > 0)
System.out.printf("Find!\n");
}
System.out.printf("x's length = %d\n", s.length());
}
}
which gives result:
x's length = 104448
real 0m50.436s
user 0m50.431s
sys 0m0.488s
in Javascript (Nodejs 0.6.3)
function test()
{
var x = "";
var limit = 102 * 1024;
while (x.length < limit) {
x += "X";
if (x.indexOf("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) > 0)
console.log("OK");
}
console.log("x's length = " + x.length);
}();
which gives result:
x's length = 104448
real 0m3.115s
user 0m3.084s
sys 0m0.048s
in C++ (g++ -Ofast)
It's not surprising that Nodejs performas better than Python or Java. But I expected libstdc++ would give much better performance than Nodejs, whose result really suprised me.
#include <iostream>
#include <string>
using namespace std;
void test()
{
int x = 0;
int limit = 102 * 1024;
string s("");
for (; s.size() < limit;) {
s += "X";
if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != string::npos)
cout << "Find!" << endl;
}
cout << "x's length = " << s.size() << endl;
}
int main()
{
test();
}
which gives result:
x length = 104448
real 0m5.905s
user 0m5.900s
sys 0m0.000s
Brief Summary
OK, now let's see the summary:
javascript on Nodejs(V8): 3.1s
Python on CPython 2.7.2 : 8.8s
C++ with libstdc++: 5.9s
Java on OpenJDK 7: 50.4s
Surprisingly! I tried "-O2, -O3" in C++ but noting helped. C++ seems about only 50% performance of javascript in V8, and even poor than CPython. Could anyone explain to me if I had missed some optimization in GCC or is this just the case? Thank you a lot.
It's not that std::string performs poorly (as much as I dislike C++), it's that string handling is so heavily optimized for those other languages.
Your comparisons of string performance are misleading, and presumptuous if they are intended to represent more than just that.
I know for a fact that Python string objects are completely implemented in C, and indeed on Python 2.7, numerous optimizations exist due to the lack of separation between unicode strings and bytes. If you ran this test on Python 3.x you will find it considerably slower.
Javascript has numerous heavily optimized implementations. It's to be expected that string handling is excellent here.
Your Java result may be due to improper string handling, or some other poor case. I expect that a Java expert could step in and fix this test with a few changes.
As for your C++ example, I'd expect performance to slightly exceed the Python version. It does the same operations, with less interpreter overhead. This is reflected in your results. Preceding the test with s.reserve(limit); would remove reallocation overhead.
I'll repeat that you're only testing a single facet of the languages' implementations. The results for this test do not reflect the overall language speed.
I've provided a C version to show how silly such pissing contests can be:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
void test()
{
int limit = 102 * 1024;
char s[limit];
size_t size = 0;
while (size < limit) {
s[size++] = 'X';
if (memmem(s, size, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 26)) {
fprintf(stderr, "zomg\n");
return;
}
}
printf("x's length = %zu\n", size);
}
int main()
{
test();
return 0;
}
Timing:
matt#stanley:~/Desktop$ time ./smash
x's length = 104448
real 0m0.681s
user 0m0.680s
sys 0m0.000s
So I went and played a bit with this on ideone.org.
Here a slightly modified version of your original C++ program, but with the appending in the loop eliminated, so it only measures the call to std::string::find(). Note that I had to cut the number of iterations to ~40%, otherwise ideone.org would kill the process.
#include <iostream>
#include <string>
int main()
{
const std::string::size_type limit = 42 * 1024;
unsigned int found = 0;
//std::string s;
std::string s(limit, 'X');
for (std::string::size_type i = 0; i < limit; ++i) {
//s += 'X';
if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != std::string::npos)
++found;
}
if(found > 0)
std::cout << "Found " << found << " times!\n";
std::cout << "x's length = " << s.size() << '\n';
return 0;
}
My results at ideone.org are time: 3.37s. (Of course, this is highly questionably, but indulge me for a moment and wait for the other result.)
Now we take this code and swap the commented lines, to test appending, rather than finding. Note that, this time, I had increased the number of iterations tenfold in trying to see any time result at all.
#include <iostream>
#include <string>
int main()
{
const std::string::size_type limit = 1020 * 1024;
unsigned int found = 0;
std::string s;
//std::string s(limit, 'X');
for (std::string::size_type i = 0; i < limit; ++i) {
s += 'X';
//if (s.find("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0) != std::string::npos)
// ++found;
}
if(found > 0)
std::cout << "Found " << found << " times!\n";
std::cout << "x's length = " << s.size() << '\n';
return 0;
}
My results at ideone.org, despite the tenfold increase in iterations, are time: 0s.
My conclusion: This benchmark is, in C++, highly dominated by the searching operation, the appending of the character in the loop has no influence on the result at all. Was that really your intention?
The idiomatic C++ solution would be:
#include <iostream>
#include <string>
#include <algorithm>
int main()
{
const int limit = 102 * 1024;
std::string s;
s.reserve(limit);
const std::string pattern("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
for (int i = 0; i < limit; ++i) {
s += 'X';
if (std::search(s.begin(), s.end(), pattern.begin(), pattern.end()) != s.end())
std::cout << "Omg Wtf found!";
}
std::cout << "X's length = " << s.size();
return 0;
}
I could speed this up considerably by putting the string on the stack, and using memmem -- but there seems to be no need. Running on my machine, this is over 10x the speed of the python solution already..
[On my laptop]
time ./test
X's length = 104448
real 0m2.055s
user 0m2.049s
sys 0m0.001s
That is the most obvious one: please try to do s.reserve(limit); before main loop.
Documentation is here.
I should mention that direct usage of standard classes in C++ in the same way you are used to do it in Java or Python will often give you sub-par performance if you are unaware of what is done behind the desk. There is no magical performance in language itself, it just gives you right tools.
My first thought is that there isn't a problem.
C++ gives second-best performance, nearly ten times faster than Java. Maybe all but Java are running close to the best performance achievable for that functionality, and you should be looking at how to fix the Java issue (hint - StringBuilder).
In the C++ case, there are some things to try to improve performance a bit. In particular...
s += 'X'; rather than s += "X";
Declare string searchpattern ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"); outside the loop, and pass this for the find calls. An std::string instance knows it's own length, whereas a C string requires a linear-time check to determine that, and this may (or may not) be relevant to std::string::find performance.
Try using std::stringstream, for a similar reason to why you should be using StringBuilder for Java, though most likely the repeated conversions back to string will create more problems.
Overall, the result isn't too surprising though. JavaScript, with a good JIT compiler, may be able to optimise a little better than C++ static compilation is allowed to in this case.
With enough work, you should always be able to optimise C++ better than JavaScript, but there will always be cases where that doesn't just naturally happen and where it may take a fair bit of knowledge and effort to achieve that.
What you are missing here is the inherent complexity of the find search.
You are executing the search 102 * 1024 (104 448) times. A naive search algorithm will, each time, try to match the pattern starting from the first character, then the second, etc...
Therefore, you have a string that is going from length 1 to N, and at each step you search the pattern against this string, which is a linear operation in C++. That is N * (N+1) / 2 = 5 454 744 576 comparisons. I am not as surprised as you are that this would take some time...
Let us verify the hypothesis by using the overload of find that searches for a single A:
Original: 6.94938e+06 ms
Char : 2.10709e+06 ms
About 3 times faster, so we are within the same order of magnitude. Therefore the use of a full string is not really interesting.
Conclusion ? Maybe that find could be optimized a bit. But the problem is not worth it.
Note: and to those who tout Boyer Moore, I am afraid that the needle is too small, so it won't help much. May cut an order of magnitude (26 characters), but no more.
For C++, try to use std::string for "ABCDEFGHIJKLMNOPQRSTUVWXYZ" - in my implementation string::find(const charT* s, size_type pos = 0) const calculates length of string argument.
I just tested the C++ example myself. If I remove the the call to std::sting::find, the program terminates in no time. Thus the allocations during string concatenation is no problem here.
If I add a variable sdt::string abc = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" and replace the occurence of "ABC...XYZ" in the call of std::string::find, the program needs almost the same time to finish as the original example. This again shows that allocation as well as computing the string's length does not add much to the runtime.
Therefore, it seems that the string search algorithm used by libstdc++ is not as fast for your example as the search algorithms of javascript or python. Maybe you want to try C++ again with your own string search algorithm which fits your purpose better.
C/C++ language are not easy and take years make fast programs.
with strncmp(3) version modified from c version:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
void test()
{
int limit = 102 * 1024;
char s[limit];
size_t size = 0;
while (size < limit) {
s[size++] = 'X';
if (!strncmp(s, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 26)) {
fprintf(stderr, "zomg\n");
return;
}
}
printf("x's length = %zu\n", size);
}
int main()
{
test();
return 0;
}
Your test code is checking a pathological scenario of excessive string concatenation. (The string-search part of the test could have probably been omitted, I bet you it contributes almost nothing to the final results.) Excessive string concatenation is a pitfall that most languages warn very strongly against, and provide very well known alternatives for, (i.e. StringBuilder,) so what you are essentially testing here is how badly these languages fail under scenarios of perfectly expected failure. That's pointless.
An example of a similarly pointless test would be to compare the performance of various languages when throwing and catching an exception in a tight loop. All languages warn that exception throwing and catching is abysmally slow. They do not specify how slow, they just warn you not to expect anything. Therefore, to go ahead and test precisely that, would be pointless.
So, it would make a lot more sense to repeat your test substituting the mindless string concatenation part (s += "X") with whatever construct is offered by each one of these languages precisely for avoiding string concatenation. (Such as class StringBuilder.)
As mentioned by sbi, the test case is dominated by the search operation.
I was curious how fast the text allocation compares between C++ and Javascript.
System: Raspberry Pi 2, g++ 4.6.3, node v0.12.0, g++ -std=c++0x -O2 perf.cpp
C++ : 770ms
C++ without reserve: 1196ms
Javascript: 2310ms
C++
#include <iostream>
#include <string>
#include <chrono>
using namespace std;
using namespace std::chrono;
void test()
{
high_resolution_clock::time_point t1 = high_resolution_clock::now();
int x = 0;
int limit = 1024 * 1024 * 100;
string s("");
s.reserve(1024 * 1024 * 101);
for(int i=0; s.size()< limit; i++){
s += "SUPER NICE TEST TEXT";
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();
cout << duration << endl;
}
int main()
{
test();
}
JavaScript
function test()
{
var time = process.hrtime();
var x = "";
var limit = 1024 * 1024 * 100;
for(var i=0; x.length < limit; i++){
x += "SUPER NICE TEST TEXT";
}
var diff = process.hrtime(time);
console.log('benchmark took %d ms', diff[0] * 1e3 + diff[1] / 1e6 );
}
test();
It seems that in nodejs there are better algorithms for substring search. You can implement it by yourself and try it out.

Executable runs faster on Wine than Windows -- why?

Solution: Apparently the culprit was the use of floor(), the performance of which turns out to be OS-dependent in glibc.
This is a followup question to an earlier one: Same program faster on Linux than Windows -- why?
I have a small C++ program, that, when compiled with nuwen gcc 4.6.1, runs much faster on Wine than Windows XP (on the same computer). The question: why does this happen?
The timings are ~15.8 and 25.9 seconds, for Wine and Windows respectively. Note that I'm talking about the same executable, not only the same C++ program.
The source code is at the end of the post. The compiled executable is here (if you trust me enough).
This particular program does nothing useful, it is just a minimal example boiled down from a larger program I have. Please see this other question for some more precise benchmarking of the original program (important!!) and the most common possibilities ruled out (such as other programs hogging the CPU on Windows, process startup penalty, difference in system calls such as memory allocation). Also note that while here I used rand() for simplicity, in the original I used my own RNG which I know does no heap-allocation.
The reason I opened a new question on the topic is that now I can post an actual simplified code example for reproducing the phenomenon.
The code:
#include <cstdlib>
#include <cmath>
int irand(int top) {
return int(std::floor((std::rand() / (RAND_MAX + 1.0)) * top));
}
template<typename T>
class Vector {
T *vec;
const int sz;
public:
Vector(int n) : sz(n) {
vec = new T[sz];
}
~Vector() {
delete [] vec;
}
int size() const { return sz; }
const T & operator [] (int i) const { return vec[i]; }
T & operator [] (int i) { return vec[i]; }
};
int main() {
const int tmax = 20000; // increase this to make it run longer
const int m = 10000;
Vector<int> vec(150);
for (int i=0; i < vec.size(); ++i)
vec[i] = 0;
// main loop
for (int t=0; t < tmax; ++t)
for (int j=0; j < m; ++j) {
int s = irand(100) + 1;
vec[s] += 1;
}
return 0;
}
UPDATE
It seems that if I replace irand() above with something deterministic such as
int irand(int top) {
static int c = 0;
return (c++) % top;
}
then the timing difference disappears. I'd like to note though that in my original program I used a different RNG, not the system rand(). I'm digging into the source of that now.
UPDATE 2
Now I replaced the irand() function with an equivalent of what I had in the original program. It is a bit lengthy (the algorithm is from Numerical Recipes), but the point was to show that no system libraries are being called explictly (except possibly through floor()). Yet the timing difference is still there!
Perhaps floor() could be to blame? Or the compiler generates calls to something else?
class ran1 {
static const int table_len = 32;
static const int int_max = (1u << 31) - 1;
int idum;
int next;
int *shuffle_table;
void propagate() {
const int int_quo = 1277731;
int k = idum/int_quo;
idum = 16807*(idum - k*int_quo) - 2836*k;
if (idum < 0)
idum += int_max;
}
public:
ran1() {
shuffle_table = new int[table_len];
seedrand(54321);
}
~ran1() {
delete [] shuffle_table;
}
void seedrand(int seed) {
idum = seed;
for (int i = table_len-1; i >= 0; i--) {
propagate();
shuffle_table[i] = idum;
}
next = idum;
}
double frand() {
int i = next/(1 + (int_max-1)/table_len);
next = shuffle_table[i];
propagate();
shuffle_table[i] = idum;
return next/(int_max + 1.0);
}
} rng;
int irand(int top) {
return int(std::floor(rng.frand() * top));
}
edit: It turned out that the culprit was floor() and not rand() as I suspected - see
the update at the top of the OP's question.
The run time of your program is dominated by the calls to rand().
I therefore think that rand() is the culprit. I suspect that the underlying function is provided by the WINE/Windows runtime, and the two implementations have different performance characteristics.
The easiest way to test this hypothesis would be to simply call rand() in a loop, and time the same executable in both environments.
edit I've had a look at the WINE source code, and here is its implementation of rand():
/*********************************************************************
* rand (MSVCRT.#)
*/
int CDECL MSVCRT_rand(void)
{
thread_data_t *data = msvcrt_get_thread_data();
/* this is the algorithm used by MSVC, according to
* http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators */
data->random_seed = data->random_seed * 214013 + 2531011;
return (data->random_seed >> 16) & MSVCRT_RAND_MAX;
}
I don't have access to Microsoft's source code to compare, but it wouldn't surprise me if the difference in performance was in the getting of thread-local data rather than in the RNG itself.
Wikipedia says:
Wine is a compatibility layer not an emulator. It duplicates functions
of a Windows computer by providing alternative implementations of the
DLLs that Windows programs call,[citation needed] and a process to
substitute for the Windows NT kernel. This method of duplication
differs from other methods that might also be considered emulation,
where Windows programs run in a virtual machine.[2] Wine is
predominantly written using black-box testing reverse-engineering, to
avoid copyright issues.
This implies that the developers of wine could replace an api call with anything at all to as long as the end result was the same as you would get with a native windows call. And I suppose they weren't constrained by needing to make it compatible with the rest of Windows.
From what I can tell, the C standard libraries used WILL be different in the two different scenarios. This affects the rand() call as well as floor().
From the mingw site... MinGW compilers provide access to the functionality of the Microsoft C runtime and some language-specific runtimes. Running under XP, this will use the Microsoft libraries. Seems straightforward.
However, the model under wine is much more complex. According to this diagram, the operating system's libc comes into play. This could be the difference between the two.
While Wine is basically Windows, you're still comparing apples to oranges. As well, not only is it apples/oranges, the underlying vehicles hauling those apples and oranges around are completely different.
In short, your question could trivially be rephrased as "this code runs faster on Mac OSX than it does on Windows" and get the same answer.

Performance problems when scaling MSVC 2005's operator<< accross threads

When looking at some of our logging I've noticed in the profiler that we were spending a lot of time in the operator<< formatting ints and such. It looks like there is a shared lock that is used whenever ostream::operator<< is called when formatting an int(and presumably doubles). Upon further investigation I've narrowed it down to this example:
Loop1 that uses ostringstream to do the formatting:
DWORD WINAPI doWork1(void* param)
{
int nTimes = *static_cast<int*>(param);
for (int i = 0; i < nTimes; ++i)
{
ostringstream out;
out << "[0";
for (int j = 1; j < 100; ++j)
out << ", " << j;
out << "]\n";
}
return 0;
}
Loop2 that uses the same ostringstream to do everything but the int format, that is done with itoa:
DWORD WINAPI doWork2(void* param)
{
int nTimes = *static_cast<int*>(param);
for (int i = 0; i < nTimes; ++i)
{
ostringstream out;
char buffer[13];
out << "[0";
for (int j = 1; j < 100; ++j)
{
_itoa_s(j, buffer, 10);
out << ", " << buffer;
}
out << "]\n";
}
return 0;
}
For my test I ran each loop a number of times with 1, 2, 3 and 4 threads (I have a 4 core machine). The number of trials is constant. Here is the output:
doWork1: all ostringstream
n Total
1 557
2 8092
3 15916
4 15501
doWork2: use itoa
n Total
1 200
2 112
3 100
4 105
As you can see, the performance when using ostringstream is abysmal. It gets 30 times worse when adding more threads whereas the itoa gets about 2 times faster.
One idea is to use _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) as recommended by M$ in this article. That doesn't seem to help me. Here's another user who seem to be having a similar issue.
We need to be able to format ints in several threads running in parallel for our application. Given this issue we either need to figure out how to make this work or find another formatting solution. I may code up a simple class with operator<< overloaded for the integral and floating types and then have a templated version that just calls operator<< on the underlying stream. A bit ugly, but I think I can make it work, though maybe not for user defined operator<<(ostream&,T) because it's not an ostream.
I should also make clear that this is being built with Microsoft Visual Studio 2005. And I believe this limitation comes from their implementation of the standard library.
If the Visual Studio 2005's standard library implementation has bugs why not try other implementations? Like:
STLport
Apache C++ Standard Library (STDCXX)
or even Dinkumware upon which Visual Studio 2005 standard library is based on, maybe the have fixed the problem since 2005.
Edit: The other user you mentioned used Visual Studio 2008 SP1, which means that probably Dinkumware has not fixed this issue.
Doesn't surprise me, MS has put "global" locks on a fair few shared resources - the biggest headache for us was the BSTR memory lock a few years back.
The best thing you can do is copy the code and replace the ostream lock and shared conversion memory with your own class. I have done that where I write the stream using a printf-style logging system (ie I had to use a printf logger, and wrapped it with my stream operators). Once you've compiled that into your app you should be as fast as itoa. When I'm in the office I'll grab some of the code and paste it for you.
EDIT:
as promised:
CLogger& operator<<(long l)
{
if (m_LoggingLevel < m_levelFilter)
return *this;
// 33 is the max length of data returned from _ltot
resize(33);
_ltot(l, buffer+m_length, m_base);
m_length += (long)_tcslen(buffer+m_length);
return *this;
};
static CLogger& hex(CLogger& c)
{
c.m_base = 16;
return c;
};
void resize(long extra)
{
if (extra + m_length > m_size)
{
// resize buffer to fit.
TCHAR* old_buffer = buffer;
m_size += extra;
buffer = (TCHAR*)malloc(m_size*sizeof(TCHAR));
_tcsncpy(buffer, old_buffer, m_length+1);
free(old_buffer);
}
}
static CLogger& endl(CLogger& c)
{
if (c.m_length == 0 && c.m_LoggingLevel < c.m_levelFilter)
return c;
c.Write();
return c;
};
Sorry I can't let you have all of it, but those 3 methods show the basics - I allocate a buffer, resize it if needed (m_size is buffer size, m_length is current text length) and keep it for the duration of the logging object. The buffer contents get written to file (or OutputDebugString, or a listbox) in the endl method. I also have a logging 'level' to restrict output at runtime. So you just replace your calls to ostringstream with this, and the Write() method pumps the buffer to a file and clears the length. Hope this helps.
The problem could be memory allocation. malloc which "new" uses has an internal lock. You can see it if you step into it. Try to use a thread local allocator and see if the bad performance disappears.