Why does the VC++ compiler cause this statistical pattern?

I'm running the following program:
#include <iostream>
#include <vector>
#include <cmath>
#include <cstdlib>
#include <chrono>
using namespace std;

const int N = 200;        // Number of tests.
const int M = 2000000;    // Number of pseudo-random values generated per test.
const int VALS = 2;       // Number of possible values (values from 0 to VALS-1).
const int ESP = M / VALS; // Expected number of appearances of each value per test.

int main() {
    for (int i = 0; i < N; ++i) {
        unsigned seed = chrono::system_clock::now().time_since_epoch().count();
        srand(seed);
        vector<int> hist(VALS, 0);
        for (int j = 0; j < M; ++j) ++hist[rand() % VALS];
        int Y = 0;
        for (int j = 0; j < VALS; ++j) Y += abs(hist[j] - ESP);
        cout << Y << endl;
    }
}
This program performs N tests. In each test we generate M numbers between 0 and VALS-1 while counting their appearances in a histogram. Finally, we accumulate in Y the errors, i.e. the differences between each histogram count and the expected count. Since the numbers are generated pseudo-randomly, each value should ideally appear M/VALS times per test.
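In other words, assuming an ideal generator: since VALS = 2, Y = 2 * |hist[0] - M/2|, and hist[0] would follow a Binomial(M, 1/2) distribution, which for large M is approximately Normal(M/2, M/4). So Y, scaled by sqrt(M), should follow a half-normal distribution.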
After running my program I analysed the resulting data (i.e., the 200 values of Y) and I realised that some things were happening that I cannot explain. I saw that, if the program is compiled with VC++ and given some N and VALS (N = 200 and VALS = 2 in this case), we get different data patterns for different values of M. For some tests the resulting data follows a normal distribution, and for some tests it doesn't. Moreover, these two kinds of results seem to alternate as M (the number of pseudo-random values generated in each test) increases:
M = 10K: the data is not normal (histogram plot omitted).
M = 100K: the data is normal (histogram plot omitted).
And so on, alternating as M grows (further plots omitted).
As you can see, depending on the value of M the resulting data either follows a normal distribution or follows a non-normal distribution (bimodal, "dog food", or roughly uniform) in which more extreme values of Y have a greater presence.
This diversity of results doesn't occur if we compile the program with other C++ compilers (gcc and clang). In that case, it looks like we always obtain a half-normal distribution of Y values (plot omitted).
What are your thoughts on this? What is the explanation?
I carried out the tests through this online compiler: http://rextester.com/l/cpp_online_compiler_visual

The program will generate poorly distributed random numbers (neither uniform nor independent), for several reasons:
The function rand is a notoriously poor generator.
The use of the remainder operator % to bring the numbers into range effectively discards all but the low-order bits (see the short demo after this list).
The RNG is re-seeded every time through the loop.
[edit] I just noticed const int ESP = M / VALS;. You want a floating-point number there instead.
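To see the low-order-bits point concretely, here is a minimal illustration (my own sketch, not part of the original program): for non-negative x, x % 2 is exactly the lowest bit x & 1, so with VALS = 2 the whole test reduces to the parity bits of the rand() stream.

#include <cstdlib>
#include <iostream>

int main() {
    for (int i = 0; i < 8; ++i) {
        int x = rand(); // rand() returns a non-negative int
        // x % 2 and x & 1 print the same bit:
        std::cout << x << " -> " << (x % 2) << " == " << (x & 1) << '\n';
    }
}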
Try the code below and report back. Using the new <random> facilities is a little tedious, so many people write some small library code to simplify their use.
#include <iostream>
#include <vector>
#include <cmath>
#include <random>
#include <chrono>
using namespace std;

const int N = 200;     // Number of tests.
const int M = 2000000; // Number of pseudo-random values generated per test.
const int VALS = 2;    // Number of possible values (values from 0 to VALS-1).
const double ESP = (1.0 * M) / VALS; // Expected number of appearances of each value per test.

static std::default_random_engine engine;

static void seed() {
    std::random_device rd;
    engine.seed(rd());
}

// Returns a value in [lo, hi); uniform_int_distribution's bounds are
// inclusive, hence the hi - 1.
static int rand_int(int lo, int hi) {
    std::uniform_int_distribution<int> dist(lo, hi - 1);
    return dist(engine);
}

int main() {
    seed();
    for (int i = 0; i < N; ++i) {
        vector<int> hist(VALS, 0);
        for (int j = 0; j < M; ++j) ++hist[rand_int(0, VALS)];
        double Y = 0; // double, since ESP is now a floating-point value
        for (int j = 0; j < VALS; ++j) Y += abs(hist[j] - ESP);
        cout << Y << endl;
    }
}


Is accessing container element time-consuming?

I want to compute the GCD of pairs of integers and save the counts. I find that the time-consuming part is not calculating the GCD but saving the results to the map. Am I using std::map in a bad way?
#include <map>
#include <iostream>
#include <chrono>
#include "timer.h"
using namespace std;

int gcd(int a, int b)
{
    int temp;
    while (b != 0)
    {
        temp = a % b;
        a = b;
        b = temp;
    }
    return a;
}

int main() {
    map<int, int> res;
    {
        Timer timer;
        for (int i = 1; i < 10000; i++)
        {
            for (int j = 2; j < 10000; j++)
                res[gcd(i, j)]++;
        }
    }
    {
        Timer timer;
        for (int i = 1; i < 10000; i++)
        {
            for (int j = 2; j < 10000; j++)
                gcd(i, j);
        }
    }
}
6627099us(6627.1ms)
0us(0ms)
You should use a real benchmarking library to test this kind of code. In your particular case, the second loop, where you discard the results of gcd, was probably optimized away entirely. With quick-bench I see little difference between running just the algorithm and storing the results in a std::map or std::unordered_map. I used randomized integers for testing, which may not be ideal for a GCD benchmark, but you can try other approaches.
Code under benchmark without storage:
constexpr int N = 10000;
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> distrib(1, N);
benchmark::DoNotOptimize(gcd(distrib(gen), distrib(gen)));
and with storage:
benchmark::DoNotOptimize(res[gcd(distrib(gen), distrib(gen))]++);
Results: the quick-bench chart is omitted here; both versions ran at roughly the same speed.
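For reference, those snippets are quick-bench fragments rather than complete programs; a self-contained Google Benchmark harness around the first one might look like this (a sketch; the function name and setup placement are my own choices):

#include <benchmark/benchmark.h>
#include <random>

static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

static void BM_gcd(benchmark::State& state) {
    constexpr int N = 10000;
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<> distrib(1, N);
    for (auto _ : state) // only this loop is measured
        benchmark::DoNotOptimize(gcd(distrib(gen), distrib(gen)));
}
BENCHMARK(BM_gcd);

BENCHMARK_MAIN();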
You are using std::map correctly. However, you are using an inefficient container for your problem. Given that the possible values of gcd(x,y) are bounded by N, a std::vector would be the most efficient container to store the results.
Specifically,
int main() {
    const int N = 10'000;
    std::vector<int> res(N, 0); // initialize to N elements with value 0.
    ...
}
Using parallelism can speed up the program even further. Each thread would have its own std::vector for computing local results. Once a thread is finished, its results would be added to the shared result vector in a thread-safe manner (e.g. using std::mutex), as in the sketch below.
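A minimal sketch of that idea (the thread count and the way the work is split are illustrative choices, not from the original answer):

#include <mutex>
#include <thread>
#include <vector>

int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

int main() {
    const int N = 10'000;
    const int num_threads = 4;
    std::vector<int> res(N + 1, 0); // shared results
    std::mutex res_mutex;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            std::vector<int> local(N + 1, 0); // thread-local histogram
            for (int i = 1 + t; i < N; i += num_threads)
                for (int j = 2; j < N; ++j)
                    ++local[gcd(i, j)];
            std::lock_guard<std::mutex> lock(res_mutex); // thread-safe merge
            for (int k = 0; k <= N; ++k) res[k] += local[k];
        });
    }
    for (auto& w : workers) w.join();
}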

Minimum number of steps to equalize distinct character of the string [duplicate]

This question already has answers here:
How to find the minimum number of operation(s) to make the string balanced?
(5 answers)
Closed 1 year ago.
I'm trying to write a program that asks for user input of strings; my job is to print out the minimum number of steps required to equalize the frequencies of the distinct characters of each string.
Example
Input
6
aba
abba
abbc
abbbc
codedigger
codealittle
Output
1
0
1
2
2
3
Here is my program:
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <cstdlib>
#include <unordered_map>
using namespace std;

int main()
{
    unordered_map<char, int> m;
    vector<int> vec1, vec2;
    string s;
    int n;
    cin >> n;
    cin.ignore();
    for (int i = 0; i < n; ++i)
    {
        m.clear();
        vec1.clear();
        getline(cin, s);
        for (int i = 0; i < s.size(); i++)
            m[s[i]]++;
        for (auto itr : m)
            vec1.push_back(itr.second);
        sort(vec1.begin(), vec1.end());
        int mid = vec1[vec1.size() / 2];
        int ans = 0;
        for (auto itr : vec1)
            ans += abs(mid - itr);
        vec2.push_back(ans);
    }
    for (int i = 0; i < vec2.size(); ++i)
        cout << vec2[i] << endl;
}
What I tried to do for each test case:
Use an unordered_map to count the frequency of each character of the string.
Push the mapped values (the frequencies) to a vector.
Sort the vector in ascending order.
Take the middle element of the vector as the target, to equalize the distinct characters with as few steps as possible.
Add to the result the absolute difference between the middle element and each element.
Push the result to another vector and print it.
But my result is wrong at test case number 5:
1
0
1
2
3 // The actual result is 2
3
I don't understand why I get the wrong result. Can anyone help me with this? Thanks for your help!
The issue is that your algorithm is not finding the optimal number of steps.
Consider the string you obtained an incorrect answer for: codedigger. It has 4 letters of frequency 1 (coir) and 3 letters of frequency 2 (ddeegg).
The optimal way is not to change one occurrence of each frequency-2 letter into some new character (not present in the string) so that every letter ends up with frequency 1. From my understanding, your implementation is counting the number of steps that this strategy would require.
Instead, consider this:
c[o]dedigge[r]
If I replace o with c and r with i, I obtain:
ccdediggei
which already has equalized character frequencies. You will note that I only performed 2 edits.
So without giving you a solution, I believe this might still answer your question? Perhaps with this in mind, you can come up with a different algorithm that is able to find the optimal number of edits.
Your code correctly measures the frequency of each letter, which is the important information.
But then there were two main issues:
The target value (the final equalized frequency) is not necessarily the median value. In particular, this value must divide the total number of letters.
For a given target frequency, your calculation of the number of steps is not correct. You must take care not to count the same mutation twice. Moreover, the general formula differs depending on whether the final number of distinct letters is equal to, less than, or greater than the original number of distinct letters.
The following code focuses on correctness rather than efficiency. It considers all the possible values of the target frequency, i.e. all the divisors of the total number of letters.
If efficiency were really a concern (it is not mentioned in the post), one could for example observe that the best target value is unlikely to be very far from the initial average frequency.
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <numeric>
#include <cstdlib>
#include <unordered_map>

// Calculates the number of steps for a given target frequency.
// This code assumes that the frequencies are sorted in descending order.
int n_steps(std::vector<int>& freq, int target, int nchar) {
    int sum = 0;
    int n = freq.size();
    int m = nchar / target; // new number of distinct characters
    int imax = std::min(n, m);
    for (int i = 0; i < imax; ++i) {
        sum += std::abs(freq[i] - target);
    }
    for (int i = imax; i < n; ++i) {
        sum += freq[i];
    }
    if (m > n) sum += m - n;
    sum /= 2;
    return sum;
}

int main() {
    std::unordered_map<char, int> m;
    std::vector<int> vec1, vec2;
    std::string s;
    int n;
    std::cin >> n;
    std::cin.ignore();
    for (int i = 0; i < n; ++i)
    {
        m.clear();
        vec1.clear();
        //getline(cin, s);
        std::cin >> s;
        for (int i = 0; i < s.size(); i++)
            m[s[i]]++;
        for (auto itr : m)
            vec1.push_back(itr.second);
        sort(vec1.begin(), vec1.end(), std::greater<int>());
        int nchar = s.size();
        int n_min_oper = nchar + 1;
        for (int target = 1; target <= nchar; ++target) {
            if (nchar % target) continue;
            int n_oper = n_steps(vec1, target, nchar);
            if (n_oper < n_min_oper) n_min_oper = n_oper;
        }
        vec2.push_back(n_min_oper);
    }
    for (int i = 0; i < vec2.size(); ++i)
        std::cout << vec2[i] << std::endl;
}

linear search for number vector in c++

I am trying to output 9 random non-repeating numbers. This is what I've been trying to do:
#include <iostream>
#include <cstdlib>
#include <vector>
#include <ctime>
using namespace std;

int main() {
    srand(time(0));
    vector<int> v;
    for (int i = 0; i < 4; i++) {
        v.push_back(rand() % 10);
    }
    for (int j = 0; j < 4; j++) {
        for (int m = j + 1; m < 4; m++) {
            while (v[j] == v[m]) {
                v[m] = rand() % 10;
            }
        }
        cout << v[j];
    }
}
However, I often get repeating numbers. Any help would be appreciated. Thank you.
With a true random number generator, the probability of drawing a particular number is not conditional on any previous numbers drawn. I'm sure you've attained the same number twice when rolling dice, for example.
rand(), which roughly approximates a true generator, will therefore sometimes give you back the same number, perhaps even consecutively; your use of % 10 further exacerbates this.
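A quick back-of-the-envelope check, assuming rand() % 10 were perfectly uniform: with the 4 draws in your code, P(all distinct) = (10 * 9 * 8 * 7) / 10^4 ≈ 0.504, so roughly every other run contains a repeat even with a perfect generator. With 9 draws from 10 values it is 10!/10^9 ≈ 0.0036, so repeats are almost certain.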
If you don't want repeats, then instantiate a vector containing all the numbers you might potentially want, then shuffle them. std::shuffle can help you do that.
See http://en.cppreference.com/w/cpp/algorithm/random_shuffle
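A minimal sketch of that approach (assuming the goal is 9 distinct digits from 0-9, as stated at the top of the question):

#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::vector<int> v(10);
    std::iota(v.begin(), v.end(), 0); // fill with 0, 1, ..., 9
    std::mt19937 gen(std::random_device{}());
    std::shuffle(v.begin(), v.end(), gen);
    for (int i = 0; i < 9; ++i) // the first 9 shuffled digits: no repeats possible
        std::cout << v[i];
    std::cout << '\n';
}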
When j=0, you'll be checking it with m={1, 2, 3}.
But when j=1, you'll be checking it with just m={2, 3}: you are never checking it against index 0 again, and that is where you might be getting repetitions.
Also, note that to reduce the chance of repeated numbers you could increase the range of the random values, say to 100.
Please look at the following code, which produces distinct random values by checking each candidate against the values already used, stored in a std::set:
#include <iostream>
#include <vector>
#include <set>
#include <cstdlib>
#include <ctime>

int main() {
    srand(time(0)); // seed once so runs differ
    int n = 4;
    std::vector<int> values(n);
    std::set<int> used_values;
    for (int i = 0; i < n; i++) {
        int temp = rand() % 10;
        while (used_values.find(temp) != used_values.end())
            temp = rand() % 10;
        used_values.insert(temp); // remember the value so it cannot repeat
        values[i] = temp;
    }
    for (int i = 0; i < n; i++)
        std::cout << values[i] << std::endl;
    return 0;
}

Arriving at a close approximation of the probability using code

I was given a math question on probability. It goes like this:
There are 1000 lotteries and each has 1000 tickets. You decide to buy 1 ticket per lottery. What is the probability that you win at least one lottery?
I was able to do it mathematically on paper (arriving at 1 - (999/1000)^1000), but then the idea of carrying out a large number of iterations of the random experiment on my computer occurred to me. So I typed some code, several versions of it in fact, and all of them malfunction.
Code 1:
#include <iostream>
#include <cstdlib>
#include <ctime>
using namespace std;

int main() {
    int p2 = 0;
    int p1 = 0;
    srand(time(NULL));
    for (int i = 0; i < 100000; i++) {
        for (int j = 0; j < 1000; j++) {
            int s = 0;
            int x = rand() % 1000;
            int y = rand() % 1000;
            if (x == y)
                s = 1;
            p1 += s;
        }
        if (p1 > 0)
            p2++;
    }
    cout << "The final probability is = " << (p2 / 100000);
    return 0;
}
Code 2:
#include <iostream>
#include <cstdlib>
#include <ctime>
using namespace std;

int main() {
    int p2 = 0;
    int p1 = 0;
    for (int i = 0; i < 100000; i++) {
        for (int j = 0; j < 1000; j++) {
            int s = 0;
            srand(time(NULL));
            int x = rand() % 1000;
            srand(time(NULL));
            int y = rand() % 1000;
            if (x == y)
                s = 1;
            p1 += s;
        }
        if (p1 > 0)
            p2++;
    }
    cout << "The final probability is = " << (p2 / 100000);
    return 0;
}
Code 3 (referred from some advanced text, but I don't understand most of it):
#include <iostream>
#include <random>
using namespace std;

int main() {
    int p2 = 0;
    int p1 = 0;
    random_device rd;
    mt19937 gen(rd());
    for (int i = 0; i < 100000; i++) {
        for (int j = 0; j < 1000; j++) {
            uniform_int_distribution<> dis(1, 1000);
            int s = 0;
            int x = dis(gen);
            int y = dis(gen);
            if (x == y)
                s = 1;
            p1 += s;
        }
        if (p1 > 0)
            p2++;
    }
    cout << "The final probability is = " << (p2 / 100000);
    return 0;
}
Now, all of these codes output the same text:
The final probability is = 1
Process finished with exit code 0
It seems that the rand() function has been outputting the same value over all the 100000 iterations of the loop. I haven't been able to fix this.
I also tried using randomize() function instead of the srand() function, but it doesn't seem to work and gives weird errors like:
error: ‘randomize’ was not declared in this scope
randomize();
^
I think that randomize() is not part of standard C++; it was a compiler-specific extension (e.g. in old Borland compilers).
I know that I am wrong on many levels. I would really appreciate if you could patiently explain me my mistakes and let me know some possible corrections.
You should reset your count (p1) at the beginning of the outer loop. Also, be aware of the final integer division p2/100000: any value of p2 < 100000 results in 0.
Look at this modified version of your code:
#include <iostream>
#include <cmath>
#include <random>

int main()
{
    const int number_of_tests = 100000;
    const int lotteries = 1000;
    const int tickets_per_lottery = 1000;

    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> lottery(1, tickets_per_lottery);

    int winning_cases = 0;
    for (int i = 0; i < number_of_tests; ++i)
    {
        int wins = 0; // <- reset when each test starts
        for (int j = 0; j < lotteries; ++j)
        {
            int my_ticket = lottery(gen);
            int winner = lottery(gen);
            if (my_ticket == winner)
                ++wins;
        }
        if (wins > 0)
            ++winning_cases;
    }

    // Use floating-point types to perform these calculations.
    double expected = 1.0 - std::pow((lotteries - 1.0) / lotteries, lotteries);
    double probability = static_cast<double>(winning_cases) / number_of_tests;
    std::cout << "Expected: " << expected
              << "\nCalculated: " << probability << '\n';
    return 0;
}
A typical run would output something like:
Expected: 0.632305
Calculated: 0.63125
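As a quick consistency check (my own arithmetic, not part of the original answer): with p ≈ 0.632 and 100000 trials, the standard error of the estimate is sqrt(p * (1 - p) / 100000) ≈ 0.0015, so the observed deviation of about 0.0011 is well within one standard error.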
Only seed the pseudorandom number generator with srand once, at the beginning of your program. When you seed it over and over again you reset the pseudorandom number generator to the same initial state. time has a granularity measured in seconds, by default, and odds are you are performing all 1000 iterations, or most of them, within a single second.
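A minimal demonstration of that pitfall (my own sketch; the fixed seed 42 stands in for repeated time(NULL) calls that return the same second):

#include <cstdlib>
#include <iostream>

int main() {
    for (int i = 0; i < 3; ++i) {
        srand(42); // re-seeding with the same value restarts the sequence
        std::cout << rand() << '\n'; // prints the same number every iteration
    }
}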
See this answer to someone else's question for a general description of how pseudorandom number generators work.
This means that you should create one instance of a PRNG in your program and seed it one time. Don't do either of those tasks inside loops, or inside functions that get called multiple times, unless you really know what you're doing and are deliberately applying correlation-induction strategies such as common random numbers or antithetic variates to achieve variance reduction.

c++ random engines not really random

I'm playing a bit with C++ random engines, and something upsets me.
Having noticed that the values I was getting were roughly of the same order of magnitude, I did the following test:
#include <random>
#include <functional>
#include <cmath>
#include <cstdint>
#include <iostream>

int main()
{
    auto res = std::random_device()();
    std::ranlux24 generator(res);
    std::uniform_int_distribution<uint32_t> distribution;
    auto roll = std::bind(distribution, generator);
    for (int j = 0; j < 30; ++j)
    {
        double ssum = 0;
        for (int i = 0; i < 300; ++i)
        {
            ssum += std::log10(roll());
        }
        std::cout << ssum / 300. << std::endl;
    }
    return 0;
}
and the values I printed were all about 9.2, looking more like a normal distribution, whatever engine I used.
Is there something I have not understood correctly?
Thanks,
Guillaume
Having noticed that the values I had were roughly of the same order
This is exactly what you'd expect with a uniform random number generator. There are 9 times as many integers in the range [10^(n-1),10^n) as there are in the range [0,10^(n-1)).
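To make that quantitative (a back-of-the-envelope sketch of mine, assuming the draws are effectively uniform over the full 32-bit range): for X uniform on (0, N], the mean of log10(X) is log10(N) - 1/ln(10). With N = 2^32 that gives about 9.633 - 0.434 ≈ 9.20, matching the observed 9.2; and since each printed value is an average of 300 samples, the central limit theorem makes those averages cluster tightly and look normal.

#include <cmath>
#include <iostream>

int main() {
    const double N = std::pow(2.0, 32); // range of a 32-bit uniform draw
    // Mean of log10(X) for X uniform on (0, N]: log10(N) - 1/ln(10)
    std::cout << std::log10(N) - 1.0 / std::log(10.0) << '\n'; // ~9.1987
}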