I set up the following race condition to generate some random bits. However, as far as I can tell, the output is NOT random. I want to understand why (for learning purposes). Here is my code:
#include <iostream>
#include <vector>
#include <atomic>
#include <thread>
#include <cmath>
using namespace std;
void compute_entropy(const vector<bool> &randoms) {
    int n0 = 0, n1 = 0;
    for (bool x : randoms) {
        if (!x) n0++;
        else n1++;
    }
    double f0 = n0 / ((double)n0 + n1), f1 = n1 / ((double)n0 + n1);
    double entropy = -f0 * log2(f0) - f1 * log2(f1);
    for (int i = 0; i < min((int)randoms.size(), 100); ++i)
        cout << randoms[i];
    cout << endl;
    cout << endl;
    cout << f0 << " " << f1 << " " << endl;
    cout << entropy << endl;
    return;
}
int main() {
    const int N = 1e7;
    bool x = false;
    atomic<bool> finish1(false), finish2(false);
    vector<bool> randoms;
    thread t1([&]() {
        for (int i = 0; !finish1; ++i)
            x = false;
    });
    thread t2([&]() {
        for (int i = 0; !finish2; ++i)
            x = true;
    });
    thread t3([&]() {
        for (int i = 0; i < N; ++i)
            randoms.push_back(x);
        finish1 = finish2 = true;
    });
    t3.join();
    t1.join();
    t2.join();
    compute_entropy(randoms);
    return 0;
}
I compile and run it like this:
$ g++ -std=c++14 threads.cpp -o threads -lpthread
$ ./threads
0101001011000111110100101101111101100100010001111000111110001001010100011101110011011000010100001110
0.473792 0.526208
0.998017
No matter how many times I run it, the results are skewed.
With 10 million numbers, the results from a proper random number generator are as one would expect:
>>> np.mean(np.random.randint(0, 2, int(1e7)))
0.5003456
>>> np.mean(np.random.randint(0, 2, int(1e7)))
0.4997095
Why is the output from race conditions not random?
There is no guarantee that a race condition will produce random output. It is not guaranteed to be purely random, nor even pseudo-random of any quality.
as far as I can tell, the output is NOT random.
There exists no test that can definitively prove randomness.
There are tests that can show that a sequence probably does not contain certain specific patterns, and thus a sequence that passes multiple such tests is probably random. However, as far as I can tell, you haven't performed such a test. You seem to be measuring whether the distribution of the output is even, which is a separate property from randomness. As such, your conclusion that the output isn't random is not based on a relevant measurement.
Furthermore, your program has a data race. As such, the behaviour of the entire program is undefined, and there is no guarantee that the program will behave as one might otherwise reasonably expect.
Related
The code below was taken from an example compiled with g++, where the multi-threaded version was twice as fast as the single-threaded one.
I'm running it in Visual Studio 2019 and the results are the opposite: the single-threaded version is twice as fast as the multi-threaded one.
#include<thread>
#include<iostream>
#include<chrono>
using namespace std;
using ll = long long;
ll odd, even;
void par(const ll ini, const ll fim)
{
for (auto i = ini; i <= fim; i++)
if (!(i & 1))
even += i;
}
void impar(const ll ini, const ll fim)
{
for (auto i = ini; i <= fim; i++)
if (i & 1)
odd += i;
}
int main()
{
const ll start = 0;
const ll end = 190000000;
/* SINGLE THREADED */
auto start_single = chrono::high_resolution_clock::now();
par(start, end);
impar(start, end);
auto end_single = chrono::high_resolution_clock::now();
auto single_duration = chrono::duration_cast<chrono::microseconds>(end_single - start_single).count();
cout << "SINGLE THREADED\nEven sum: " << even << "\nOdd sum: " << odd << "\nTime: " << single_duration << "ms\n\n\n";
/* END OF SINGLE*/
/* MULTI THREADED */
even = odd = 0;
auto start_multi= chrono::high_resolution_clock::now();
thread t(par, start, end);
thread t2(impar, start, end);
t.join();
t2.join();
auto end_multi = chrono::high_resolution_clock::now();
auto multi_duration = chrono::duration_cast<chrono::microseconds>(end_multi - start_multi).count();
cout << "MULTI THREADED\nEven sum: " << even << "\nOdd sum: " << odd << "\nTime: " << multi_duration << "ms\n";
/*END OF MULTI*/
cout << "\n\nIs multi faster than single? => " << boolalpha << (multi_duration < single_duration) << '\n';
}
However, if I make a small modification to my functions, as shown below:
void par(const ll ini, const ll fim)
{
ll temp = 0;
for (auto i = ini; i <= fim; i++)
if (!(i & 1))
temp += i;
even = temp;
}
void impar(const ll ini, const ll fim)
{
ll temp = 0;
for (auto i = ini; i <= fim; i++)
if (i & 1)
temp += i;
odd = temp;
}
The multi-threaded version performs better. I would like to know what leads to this behavior (what possible differences in implementation explain it).
Also, I compiled with gcc on www.onlinegdb.com and the results are similar to Visual Studio's on my machine.
You are a victim of false sharing.
odd and even reside next to each other in memory, and accessing them from two threads leads to cache-line contention (a.k.a. false sharing).
You can fix it by spacing them 64 bytes apart, so that they end up in different cache lines, for example like this:
alignas(64) ll odd, even;
With that change I get good speedup with 2 threads:
SINGLE THREADED
Even sum: 9025000095000000
Odd sum: 9025000000000000
Time: 825954ms
MULTI THREADED
Even sum: 9025000095000000
Odd sum: 9025000000000000
Time: 532420ms
As for G++ performance - it might be performing the optimization you made manually for you. MSVC is more careful when it comes to optimizing global variables.
My first attempt at creating a header file. The solution is nonsense and nothing more than practice. It receives two numbers from the main file and is supposed to return a random entry from the vector. When I call it from a loop in the main file, it increments by 3 instead of randomly. (Diagnosed by returning the value of getEntry.) The Randomizer code works correctly if I pull it out of the header file and run it directly as a program.
int RandomNumber::Randomizer(int a, int b) {
    std::vector<int> vecArray{};
    int range = (b - a) + 1;
    time_t nTime;
    srand((unsigned)time(&nTime));
    for (int i = a - 1; i < b + 1; i++) {
        vecArray.push_back(i);
    }
    int getEntry = rand() % range + 1;
    int returnValue = vecArray[getEntry];
    vecArray.clear();
    return returnValue;
}
From what I read, header files should generally not contain function and variable definitions. I suspect Rand, being a function, is the source of the problem.
How, if possible, can I get my header file to create random numbers?
#include <iostream>
#include <random>

void random() {
    double rangeMin = 1;
    double rangeMax = 10;
    size_t numSamples = 10;
    thread_local std::mt19937 mt(std::random_device{}());
    std::uniform_real_distribution<double> dist(rangeMin, rangeMax);
    for (size_t i = 1; i <= numSamples; ++i) {
        std::cout << dist(mt) << std::endl;
    }
}
This approach lets you generate random numbers between two bounds; note that you have to include <random>.
There are many cases where you will want to generate a random number. There are really two functions you will need to know about for random number generation. The first is rand(); this function returns only a pseudo-random number. The way to fix this is to first call the srand() function.
Here is an example:
#include <iostream>
#include <ctime>
#include <cstdlib>
using namespace std;
int main() {
    int i, j;
    srand((unsigned)time(NULL));
    for (i = 0; i < 10; i++) {
        j = rand();
        cout << " Random Number : " << j << endl;
    }
    return 0;
}
In srand((unsigned)time(NULL)), pass NULL to time() for the default setting instead of supplying a value of your own.
I hope I answered your question! Have a nice day!
Ted Lyngmo gave me the idea that fixed the problem: using <random> works correctly in a header file.
I removed/changed the following:
time_t nTime;
srand((unsigned)time(&nTime));
int getEntry = rand() % range + 1;
and replaced them with:
std::random_device rd;
std::mt19937 gen(rd());
int getEntry = gen() % range + 1;
Issue resolved. Thank you everybody for your suggestions and comments!
As an experiment, I removed the vector and focused on the randomizer `srand(T)`, where `T` is the system time `volatile time_t T = time(NULL)`. It turns out the system time does NOT change while the program runs (execution is simply too fast).
The function `rand()` generates a pseudo-random integer using a linear congruential generator: essentially it multiplies the seed by a larger unsigned integer and truncates the result to the finite bits of the `seed`. The randomizer `srand(T)` initializes the seed from the system time, or from any fixed number such as `srand(12345);`. A given seed produces a fixed sequence of random numbers. Without a call to `srand(T)`, the seed is determined by whatever garbage happens to be in memory at startup. The seed then advances on every call to `rand()`.
In your code, you call the randomizer `srand(T)`, resetting the seed to the system time, on every call. But the system time hasn't changed, so you keep resetting the `seed` to the same number.
Run this test.
#include <cstdlib>
#include <iostream>
#include <ctime>
int Randomizer(int a, int b) {
    volatile time_t T = time(NULL);
    std::cout << "time = " << T << std::endl;
    srand(T);
    std::cout << "rand() = " << rand() << std::endl;
    return rand();
}

int main()
{
    int n1 = 1, n2 = 8;
    for (int i = 0; i < 5; ++i)
    {
        std::cout << Randomizer(n1, n2) << std::endl;
    }
}
The seed is reset to the system time, which does not change during execution. Thus it produces the same random number every time.
$ ./a.exe
time = 1608049336
rand() = 9468
15874
time = 1608049336
rand() = 9468
15874
time = 1608049336
rand() = 9468
15874
time = 1608049336
rand() = 9468
15874
time = 1608049336
rand() = 9468
15874
To see the system time change, we add a pause in main():
int main()
{
int n1 = 1, n2 = 8;
for(int i=0; i<5; ++i)
{
std::cout << Randomizer(n1, n2) << std::endl;
system("pause");
}
}
We can observe the system time moving on...
$ ./a.exe
time = 1608050805
rand() = 14265
11107
Press any key to continue . . .
time = 1608050809
rand() = 14279
21332
Press any key to continue . . .
time = 1608050815
rand() = 14298
20287
Press any key to continue . . .
Because the system times are not very different, the first numbers produced by the congruential sequence rand() are also rather close to each other, but the numbers that follow in each sequence will look random. The principle for a linear congruential generator is: once you have set the seed, don't change it, unless you are starting another independent series of random numbers. Therefore, call the srand(T) function just once, in main() or somewhere else that executes only once.
int main()
{
srand(time(NULL)); // >>>> just for this once <<<<
int n1 = 1, n2 = 8;
for(int i=0; i<5; ++i)
{
std::cout << Randomizer(n1, n2) << std::endl;
}
}
I'm trying to optimize my code using multithreading, and not only is the program not twice as fast as it should be on this dual-core computer, it is MUCH SLOWER. I just want to know whether I'm doing something wrong or whether it's normal that multithreading doesn't help in this case. I made this recreation of how I used multithreading, and on my computer the parallel version takes 4 times as long as the serial version:
#include <iostream>
#include <numeric> // for std::accumulate
#include <random>
#include <thread>
#include <chrono>
using namespace std;
default_random_engine ran;
inline bool get(){
return ran() % 3;
}
void normal_serie(unsigned repetitions, unsigned &result){
for (unsigned i = 0; i < repetitions; ++i)
result += get();
}
unsigned parallel_series(unsigned repetitions){
const unsigned hardware_threads = std::thread::hardware_concurrency();
cout << "Threads in this computer: " << hardware_threads << endl;
const unsigned threads_number = (hardware_threads != 0) ? hardware_threads : 2;
const unsigned its_per_thread = repetitions / threads_number;
unsigned *results = new unsigned[threads_number]();
std::thread *threads = new std::thread[threads_number - 1];
for (unsigned i = 0; i < threads_number - 1; ++i)
threads[i] = std::thread(normal_serie, its_per_thread, std::ref(results[i]));
normal_serie(its_per_thread, results[threads_number - 1]);
for (unsigned i = 0; i < threads_number - 1; ++i)
threads[i].join();
auto result = std::accumulate(results, results + threads_number, 0);
delete[] results;
delete[] threads;
return result;
}
int main()
{
constexpr unsigned repetitions = 100000000;
auto to = std::chrono::high_resolution_clock::now();
cout << parallel_series(repetitions) << endl;
auto tf = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(tf - to).count();
cout << "Parallel duration: " << duration << "ms" << endl;
to = std::chrono::high_resolution_clock::now();
unsigned r = 0;
normal_serie(repetitions, r);
cout << r << endl;
tf = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(tf - to).count();
cout << "Normal duration: " << duration << "ms" << endl;
return 0;
}
Things that I already know but left out to keep this code shorter:
I should set a max_iterations_per_thread, because you don't want to do just 10 iterations per thread; but in this case we are doing one hundred million iterations, so that is not going to happen.
The number of iterations must be divisible by the number of threads; otherwise the code will not do the full amount of work.
This is the output that I get on my computer:
Threads in this computer: 2
66665160
Parallel duration: 4545ms
66664432
Normal duration: 1019ms
(Partially solved by making these changes:)
inline bool get(default_random_engine &ran){
return ran() % 3;
}
void normal_serie(unsigned repetitions, unsigned &result){
default_random_engine eng;
unsigned saver_result = 0;
for (unsigned i = 0; i < repetitions; ++i)
saver_result += get(eng);
result += saver_result;
}
All your threads are tripping over each other fighting for access to ran which can only perform one operation at a time because it only has one state and each operation advances its state. There is no point in running operations in parallel if the vast majority of each operation involves a choke point that cannot support any concurrency.
All elements of results are likely to share a cache line, which means there is lots of inter-core communication going on.
Try modifying normal_serie to accumulate into a local variable and only write it to results in the end.
I've been using OpenMP with Visual Studio 2010 for quite some time by now, but today I've encountered yet another baffling quirk of VS. After cutting off all the possible suspects, I was left with the program below.
It simply counts in a cycle and sometimes makes some calculation and churns out counters.
#include "stdafx.h"
#include "omp.h"
#include <string>
#include <iostream>
#include <time.h>
int _tmain(int argc, _TCHAR* argv[])
{
int count = 0;
double a = 1;
double b = 2;
double c = 3, mean_tau = 1, r_w = 1, weights = 1, r0 = 1, tau = 1, sq_tau = 1,
r_sw = 1;
#pragma omp parallel num_threads(3) shared(count)
{
int tid = omp_get_thread_num();
int pers_count = 0;
std::string someline;
for (int i = 0; i < 100000; i++)
{
pers_count++;
#pragma omp critical
{
count++;
if ((count%10000 == 0))
{
sq_tau = (r_sw / weights) * pow( 1/ r0 * tau, 2);
std::cout << count << " " << pers_count << std::endl;
}
}
}
}
std::getchar();
return 0;
}
Now, if I compile it with optimisation disabled (/Od), it works just as it should, spitting out the shared counter alongside each thread's private counter (which is roughly three times smaller), something along the lines of
10000 3890
20000 6523
...
300000 100000
If I turn on the optimisation (I tried all options, but for clarity's sake let's say /O2), however, for some reason the shared count seems to become private, as I start getting something like
10000 10000
10000 10000
10000 10000
...
60000 60000
50000 50000
...
100000 100000
And now that I've encountered this quirk, somehow everything that was working before gets rebuilt into an incorrect version, even if I don't change a thing. What could be the cause of this, and what can I do? Thanks.
I don't know why the shared count is behaving this way, but I can provide a workaround (assuming you only use atomic operations on the shared variable):
#pragma omp critical
{
#pragma omp atomic
count++;
if ((count%10000 == 0))
{
sq_tau = (r_sw / weights) * pow( 1/ r0 * tau, 2);
std::cout << count << " " << pers_count << std::endl;
}
}
I have written the following codes in R and C++ which perform the same algorithm:
a) To simulate the random variable X 500 times. (X has value 0.9 with prob 0.5 and 1.1 with prob 0.5)
b) Multiply these 500 simulated values together to get a value. Save that value in a container
c) Repeat 10000000 times such that the container has 10000000 values
R:
ptm <- proc.time()
steps <- 500
MCsize <- 10000000
a <- rbinom(MCsize,steps,0.5)
b <- rep(500,times=MCsize) - a
result <- rep(1.1,times=MCsize)^a*rep(0.9,times=MCsize)^b
proc.time()-ptm
C++
#include <numeric>
#include <vector>
#include <iostream>
#include <random>
#include <thread>
#include <mutex>
#include <cmath>
#include <algorithm>
#include <chrono>
const size_t MCsize = 10000000;
std::mutex mutex1;
std::mutex mutex2;
unsigned seed_;
std::vector<double> cache;
void generatereturns(size_t steps, int RUNS){
mutex2.lock();
// setting seed
try{
std::mt19937 tmpgenerator(seed_);
seed_ = tmpgenerator();
std::cout << "SEED : " << seed_ << std::endl;
}catch(int exception){
mutex2.unlock();
}
mutex2.unlock();
// Creating generator
std::binomial_distribution<int> distribution(steps,0.5);
std::mt19937 generator(seed_);
for(int i = 0; i!= RUNS; ++i){
double power;
double returns;
power = distribution(generator);
returns = pow(0.9,power) * pow(1.1,(double)steps - power);
std::lock_guard<std::mutex> guard(mutex1);
cache.push_back(returns);
}
}
int main(){
std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
size_t steps = 500;
seed_ = 777;
unsigned concurentThreadsSupported = std::max(std::thread::hardware_concurrency(),(unsigned)1);
int remainder = MCsize % concurentThreadsSupported;
std::vector<std::thread> threads;
// starting sub-thread simulations
if(concurentThreadsSupported != 1){
for(int i = 0 ; i != concurentThreadsSupported - 1; ++i){
if(remainder != 0){
threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported + 1));
remainder--;
}else{
threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported));
}
}
}
//starting main thread simulation
if(remainder != 0){
generatereturns(steps, MCsize / concurentThreadsSupported + 1);
remainder--;
}else{
generatereturns(steps, MCsize / concurentThreadsSupported);
}
for (auto& th : threads) th.join();
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now() ;
typedef std::chrono::duration<int,std::milli> millisecs_t ;
millisecs_t duration( std::chrono::duration_cast<millisecs_t>(end-start) ) ;
std::cout << "Time elapsed : " << duration.count() << " milliseconds.\n" ;
return 0;
}
I can't understand why my R code is so much faster than my C++ code (3.29 s vs 12 s), even though I used four threads in the C++ code. Can anyone enlighten me, please? How should I improve my C++ code to make it run faster?
EDIT:
Thanks for all the advice! I reserved capacity for my vectors and reduced the amount of locking in my code. The crucial update in the generatereturns() function is :
std::vector<double> cache(MCsize);
std::vector<double>::iterator currit = cache.begin();
//.....
// Creating generator
std::binomial_distribution<int> distribution(steps,0.5);
std::mt19937 generator(seed_);
std::vector<double> tmpvec(RUNS);
for(int i = 0; i!= RUNS; ++i){
double power;
double returns;
power = distribution(generator);
returns = pow(0.9,power) * pow(1.1,(double)steps - power);
tmpvec[i] = returns;
}
std::lock_guard<std::mutex> guard(mutex1);
std::move(tmpvec.begin(),tmpvec.end(),currit);
currit += RUNS;
Instead of locking on every iteration, I create a temporary vector and then use std::move to shift the elements of that tmpvec into cache. The elapsed time is now down to 1.9 seconds.
First of all, are you running it in release mode?
Switching from debug to release reduced the running time from ~15s to ~4.5s on my laptop (windows 7, i5 3210M).
Also, reducing the number of threads to 2 instead of 4 in my case (I just have 2 cores but with hyperthreading) further reduced the running time to ~2.4s.
Changing the variable power to int (as jimifiki also suggested) also offered a slight boost, reducing the time to ~2.3s.
I really enjoyed your question and I tried the code at home. I tried to change the random number generator, my implementation of std::binomial_distribution requires on average about 9.6 calls of generator().
I know the question is more about comparing R with C++ performance, but since you ask "How should I improve my C++ code to make it run faster?", I'll insist on the pow optimization. You can easily avoid half of the calls by precomputing either 0.9^steps or 1.1^steps before the for loop. This makes your code run a bit faster:
double power1 = pow(0.9, steps);
double ratio = 1.1 / 0.9;
for(int i = 0; i != RUNS; ++i){
    ...
    returns = power1 * pow(ratio, (double)steps - power);
Analogously you can improve the R code:
...
ratio <- 1.1/0.9
pow1 <- 0.9^steps
result <- pow1 * ratio^a
...
Probably doesn't help you that much, but
start by using pow(double,int) when your exponent is an int.
int power;
returns = pow(0.9,power) * pow(1.1,(int)steps - power);
Can you see any improvement?