Random generation with TRNG - C++

The following code generates random numbers for a Monte Carlo simulation. I need to get exactly the same sum on every run, but that does not happen even though I have fixed the seed. I would appreciate it if anyone could point out the problem with this code.
#include <cmath>
#include <random>
#include <iostream>
#include <chrono>
#include <cfloat>
#include <iomanip>
#include <cstdlib>
#include <omp.h>
#include <trng/yarn2.hpp>
#include <trng/mt19937_64.hpp>
#include <trng/uniform01_dist.hpp>

using namespace std;
using namespace chrono;

const double landa = 1;
const double exact_solution = landa / (pow(landa, 2) + 1);

double function(double x) {
    return cos(x) / landa;
}

int main() {
    int rank;
    const int N = 1000000;
    double sum = 0.0;

    trng::yarn2 r[6];
    for (int i = 0; i < 6; i++) {
        r[i].seed(0);
    }
    for (int i = 0; i < 6; i++) {
        r[i].split(6, i);
    }
    trng::uniform01_dist<double> u;

    auto start = high_resolution_clock::now();
    #pragma omp parallel num_threads(6)
    {
        rank = omp_get_thread_num();
        #pragma omp for reduction (+: sum)
        for (int i = 0; i < N; ++i) {
            //double x = distribution(g);
            double x = u(r[rank]);
            x = (-1.0 / landa) * log(1.0 - x);
            sum = sum + function(x);
        }
    }
    double app = sum / static_cast<double>(N);
    auto end = high_resolution_clock::now();
    auto diff = duration_cast<milliseconds>(end - start);
    cout << "Approximation is: " << setprecision(17) << app << "\t" << "Time: "
         << setprecision(17) << diff.count() << " Error: " << (app - exact_solution) << endl;
    return 0;
}

TL;DR The problem is two-fold:
1. Floating-point addition is not associative;
2. You are generating different random numbers for each thread.
I need to get exactly the same sum on every run, but that does not happen even though I have fixed the seed. I would appreciate it if anyone could point out the problem with this code.
First, you have a race condition on rank = omp_get_thread_num(); the variable rank is shared among all threads. To fix that, declare rank inside the parallel region, which makes it private to each thread:
#pragma omp parallel num_threads(6)
{
int rank=omp_get_thread_num();
...
}
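Putting that fix in context, the parallel region from the question could look like this (a sketch; it only removes the data race on rank and does not by itself make the sum reproducible):

#pragma omp parallel num_threads(6)
{
    int rank = omp_get_thread_num();            // private to each thread

    #pragma omp for reduction (+: sum)
    for (int i = 0; i < N; ++i) {
        double x = u(r[rank]);                  // draw from this thread's stream
        x = (-1.0 / landa) * log(1.0 - x);      // inverse-transform sampling
        sum = sum + function(x);
    }
}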
In your code, you should not expect the value of sum to be the same for different numbers of threads. Why? Because you are adding doubles in parallel:
double sum = 0.0;
...
#pragma omp for reduction (+: sum)
for (int i = 0; i < N; ++i) {
    //double x = distribution(g);
    double x = u(r[rank]);
    x = (-1.0 / landa) * log(1.0 - x);
    sum = sum + function(x);
}
and from What Every Computer Scientist Should Know about Floating
Point Arithmetic one can read:
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the
expression (x+y)+z has a totally different answer than x+(y+z) when
x = 1e30, y = -1e30 and z = 1 (it is 1 in the former case, 0 in the
latter).
Hence, floating-point addition is not associative, which is why you may get different values of sum for different numbers of threads.
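As a quick illustration (my own example, simply mirroring the quoted values):

#include <iostream>

int main() {
    double x = 1e30, y = -1e30, z = 1.0;
    std::cout << (x + y) + z << "\n";   // prints 1
    std::cout << x + (y + z) << "\n";   // prints 0
}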
Second, you are generating different random values per thread:
for (int i = 0; i < 6; i++) {
    r[i].split(6, i);
}
Consequently, for different numbers of threads the variable sum gets different results as well.
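One way to make the draws, and hence the sum, independent of the number of threads is to fix the number of streams and bind them to iterations rather than to threads. A sketch of that idea (my own, not part of the original answer), reusing the 6 leapfrogged streams already created in the question:

double partial[6] = {0.0};                         // one serial accumulator per stream
#pragma omp parallel for num_threads(6)
for (int s = 0; s < 6; ++s) {
    for (int i = s; i < N; i += 6) {               // iteration i always uses stream i % 6
        double x = u(r[s]);
        x = (-1.0 / landa) * log(1.0 - x);
        partial[s] += function(x);
    }
}
double sum = 0.0;
for (int s = 0; s < 6; ++s) sum += partial[s];     // fixed summation order

Each stream and each partial sum is consumed and accumulated serially, and the final additions always happen in the same order, so the result no longer depends on how many workers OpenMP uses.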
As kindly pointed out by jérôme-richard in the comments:
Note that a more precise algorithm like Kahan summation can significantly reduce the rounding issue while still being relatively fast.
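For reference, here is a minimal sketch of Kahan (compensated) summation; the helper name kahan_sum is my own, not something from the comment:

// Compensated (Kahan) summation: carries a correction term so that small
// addends are not lost when added to a large running total.
double kahan_sum(const double* values, int n) {
    double sum = 0.0;
    double c = 0.0;                  // running compensation for lost low-order bits
    for (int i = 0; i < n; ++i) {
        double y = values[i] - c;    // apply the correction
        double t = sum + y;          // low-order bits of y may be lost here
        c = (t - sum) - y;           // recover what was lost
        sum = t;
    }
    return sum;
}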

Related

Performance of pow(x,3.0f) vs x*x*x?

The following program...
int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}
...takes about 900ms to complete on my machine. Whereas...
#include <cmath>

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += std::pow(x, 3.0f);
    }
    return t;
}
...takes about 6600ms to complete.
I'm kind of surprised that the optimizer doesn't inline the std::pow call so that the two programs produce the same code and have identical performance.
Any insights? How do you account for the 5x performance difference?
For reference I'm using gcc -O3 on Linux x86
Update: (C Version)
int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}
...takes about 900ms to complete on my machine. Whereas...
#include <math.h>

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += powf(x, 3.0f);
    }
    return t;
}
...takes about 6600ms to complete.
Update 2
The following program:
#include <math.h>

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += __builtin_powif(x, 3.0f);
    }
    return t;
}
runs in 900ms like the first program.
Why isn't pow being inlined to __builtin_powif?
Update 3:
With -ffast-math the following program:
#include <math.h>
#include <iostream>

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += powf(x, 3.0f);
    }
    std::cout << t;
}
runs in 227ms (as does the x*x*x version). That's about 200 picoseconds per iteration. With -fopt-info it reports "optimized: loop vectorized using 16 byte vectors" and "optimized: loop with 2 iterations completely unrolled", so I guess that means it's doing iterations in batches of 4 for SSE and doing 2 iterations at once by pipelining (for a total of 8 iterations at once), or something like that?
The doc page about gcc builtins is explicit (emphasis mine):
Built-in Function: double __builtin_powi (double, int)
Returns the first argument raised to the power of the second. Unlike the pow function no guarantees about precision and rounding are made.
Built-in Function: float __builtin_powif (float, int)
Similar to __builtin_powi, except the argument and return types are float.
As __builtin_powif performs the same as a mere product, the additional time must be spent on the checks required by pow for its guarantees about precision and rounding.
Assuming your compiler chose to just call pow from the shared library, like https://godbolt.org/z/re3baK (without -ffast-math):
I did not take a look at how pow(float, float) is implemented, but I see some points.
x*x*x is inlined, while pow cannot be since it lives in a shared library, so there is a function-call overhead difference.
Is the exponent 3.0 a compile-time constant? If the compiler knows something is constant, it is likely to generate more efficient code.
x*x*x: just generates assembly for two float multiplications.
pow: has to handle arbitrary exponent values, so it is probably more general code (less efficient, possibly containing loops).
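If the precision and rounding guarantees of pow are not needed for an integer exponent, one alternative (my own sketch, not from the answers above; the helper name ipow is made up) is a hand-rolled integer power that the compiler is free to inline:

// Exponentiation by squaring for small non-negative integer exponents.
// Unlike std::pow, it makes no precision or rounding guarantees.
inline float ipow(float base, unsigned exp) {
    float result = 1.0f;
    while (exp) {
        if (exp & 1u) result *= base;   // multiply in the current bit
        base *= base;                   // square for the next bit
        exp >>= 1u;
    }
    return result;
}

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += ipow(x, 3);                // same value as x*x*x up to rounding
    }
    return t;
}

With optimization enabled and a constant exponent, this should reduce to a couple of multiplications per call, much like the x*x*x version.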

Non-uniform FFT forward and backward test in 1D

I am learning to use a C++ library to perform non-uniform FFTs (NUFFT). The library provides 3 types of NUFFT.
Type 1: forward transform from a non-uniform x grid to a uniform k-space grid.
Type 2: backward transform from a uniform k-space grid to a non-uniform x grid.
Type 3: from non-uniform to non-uniform.
I tested the library in 1D by performing NUFFT on a test function sin(x) from -pi to pi using the Type 1 NUFFT, transforming it back using the Type 2 NUFFT, and comparing the output with sin(x). At first I tested it on a uniform x grid, which gives a very small error. Unfortunately, the error is very large when the test is done on a non-uniform x grid.
Two possibilities:
1. My implementation of NUFFT is incorrect, but the implementation is rather simple, so I doubt this is the case.
2. The author mentions that Type 2 is NOT the inverse of Type 1, so I believe that might be the problem. Since I am not an expert in NUFFT, I wonder if there is an alternative way to perform a forward/backward test with NUFFT?
My purpose is to develop an FFT Poisson solver on an irregular mesh, so I need to perform NUFFT forward and backward, and it is therefore important to overcome this problem. Besides using FINUFFT, any other suggestion is also welcome.
Thank you for reading.
The code is here for those who are interested.
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <complex>
#include <fftw3.h>
#include <functional>
#include "finufft/src/finufft.h"

using namespace std;

int main()
{
    double pi = 3.14159265359;
    int N = 128*2;
    int i;
    double X[N];
    double L = 2*pi;
    double dx = L/(N);
    nufft_opts opts; finufft_default_opts(&opts);
    complex<double> R = complex<double>(1.0,0.0); // the real unit
    complex<double> in1[N], out1[N], out2[N];

    for(i = 0; i < N; i++) {
        //X[i] = -(L/2) + i*dx ;                   // uniform grid
        X[i] = -(L/2) + pow(double(i)/N,7.0)*L;    // non-uniform grid
        in1[i] = sin(X[i])*R ;
    }

    int ier  = finufft1d1(N,X,in1,-1,1e-10,N,out1,opts); // type-1 NUFFT
    int ier2 = finufft1d2(N,X,out2,+1,1e-10,N,out1,opts); // type-2 NUFFT

    // checking the error
    double erl1 = 0.;
    for ( i = 0; i < N; i++) {
        erl1 += fabs( in1[i].real() - out2[i].real()/(N))*dx;
    }
    std::cout<< erl1 <<" " << ier << " "<< ier2<< std::endl ; // error
    return 0;
}
For some reason, the developer made an update on their page which answers exactly my question: https://finufft.readthedocs.io/en/latest/examples.html#periodic-poisson-solve-on-non-cartesian-quadrature-grid. In brief, their NUFFT code is NOT well suited to a fully adaptive scheme, but I will still provide an answer and code here for completeness.
There are two ingredients missing in my code.
(1) I need to multiply the function, sin(x), by a weight before using NUFFT. The weight comes from the determinant of the Jacobian in their 2D example, so for a 1D example the weight is simply the derivative of the non-uniform coordinate with respect to the uniform coordinate, dX/dksi.
(2) Nk must be smaller than N.
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <complex>
#include <fftw3.h>
#include <functional>
#include "finufft/src/finufft.h"

using namespace std;

int main()
{
    double pi = 3.14159265359;
    int N = 128*2;
    int Nk = 32; // number of modes, smaller than N
    int i;
    double X[N], ksi[N], dX[N];
    double L = 2*pi;
    double dx = L/(N);
    nufft_opts opts; finufft_default_opts(&opts);
    complex<double> R = complex<double>(1.0,0.0); // the real unit
    complex<double> in1[N], out1[N], out2[N];

    for(i = 0; i < N; i++) {
        ksi[i] = -(L/2) + i*dx ;                      // uniform grid
        X[i]   = -(L/2) + pow(double(i)/(N-1),6)*L;   // non-uniform grid
    }

    // dX = dX/dksi, the 1D Jacobian weight; use your own derivative code here
    // (a simple finite difference is shown as a placeholder)
    for(i = 0; i < N; i++) {
        int ip = (i == N-1) ? i : i+1;
        int im = (i == 0)   ? i : i-1;
        dX[i] = (X[ip] - X[im]) / (ksi[ip] - ksi[im]);
    }

    for(i = 0; i < N; i++) {
        in1[i] = sin(X[i]) * dX[i] * R ; // applying the weight
    }

    int ier  = finufft1d1(N,X,in1,-1,1e-10,Nk,out1,opts); // type-1 NUFFT
    int ier2 = finufft1d2(N,X,out2,+1,1e-10,Nk,out1,opts); // type-2 NUFFT

    // checking the error
    double erl1 = 0.;
    for ( i = 0; i < N; i++) {
        erl1 += fabs( in1[i].real() - out2[i].real()/(N))*dx;
    }
    std::cout<< erl1 <<" " << ier << " "<< ier2<< std::endl ; // error
    return 0;
}

Why does the VC++ compiler cause this statistical pattern?

I'm running the following program:
#include <iostream>
#include <vector>
#include <cmath>
#include <cstdlib>
#include <chrono>

using namespace std;

const int N = 200;        // Number of tests.
const int M = 2000000;    // Number of pseudo-random values generated per test.
const int VALS = 2;       // Number of possible values (values from 0 to VALS-1).
const int ESP = M / VALS; // Expected number of appearances of each value per test.

int main() {
    for (int i = 0; i < N; ++i) {
        unsigned seed = chrono::system_clock::now().time_since_epoch().count();
        srand(seed);
        vector<int> hist(VALS, 0);
        for (int j = 0; j < M; ++j) ++hist[rand() % VALS];
        int Y = 0;
        for (int j = 0; j < VALS; ++j) Y += abs(hist[j] - ESP);
        cout << Y << endl;
    }
}
This program performs N tests. In each test we generate M numbers between 0 and VALS-1 while we keep counting their appearances in a histogram. Finally, we accumulate in Y the errors, which correspond to the difference between each value of the histogram and the expected value. Since the numbers are generated randomly, each of them would ideally appear M/VALS times per test.
After running my program I analysed the resulting data (i.e., the 200 values of Y) and realised that some things were happening which I cannot explain. I saw that, if the program is compiled with VC++ and given some N and VALS (N = 200 and VALS = 2 in this case), we get different data patterns for different values of M. For some tests the resulting data follows a normal distribution, and for some tests it doesn't. Moreover, the type of result seems to alternate as M (the number of pseudo-random values generated in each test) increases:
[Histograms omitted: with M = 10K the data is not normal; with M = 100K the data is normal; and so on.]
As you can see, depending on the value of M, the resulting data either follows a normal distribution or follows a non-normal distribution (bimodal, dog food, or kind of uniform) in which more extreme values of Y have a greater presence.
This diversity of results does not occur if we compile the program with other C++ compilers (gcc and clang). In that case, it looks like we always obtain a half-normal distribution of Y values.
What are your thoughts on this? What is the explanation?
I carried out the tests through this online compiler: http://rextester.com/l/cpp_online_compiler_visual
The program generates poorly distributed random numbers (neither uniform nor independent).
The function rand is a notoriously poor one.
The use of the remainder operator % to bring the numbers into range effectively discards all but the low-order bits.
The RNG is re-seeded every time through the loop.
[edit] I just noticed const int ESP = M / VALS;. You want a floating-point number instead.
Try the code below and report back. Using the new <random> is a little tedious. Many people write some small library code to simplify its use.
#include <iostream>
#include <vector>
#include <cmath>
#include <random>
#include <chrono>

using namespace std;

const int N = 200;               // Number of tests.
const int M = 2000000;           // Number of pseudo-random values generated per test.
const int VALS = 2;              // Number of possible values (values from 0 to VALS-1).
const double ESP = (1.0*M)/VALS; // Expected number of appearances of each value per test.

static std::default_random_engine engine;

static void seed() {
    std::random_device rd;
    engine.seed(rd());
}

static int rand_int(int lo, int hi) {
    std::uniform_int_distribution<int> dist(lo, hi - 1);
    return dist(engine);
}

int main() {
    seed();
    for (int i = 0; i < N; ++i) {
        vector<int> hist(VALS, 0);
        for (int j = 0; j < M; ++j) ++hist[rand_int(0, VALS)];
        double Y = 0; // double, since ESP may be fractional
        for (int j = 0; j < VALS; ++j) Y += abs(hist[j] - ESP);
        cout << Y << endl;
    }
}

Cilk Plus code result depends on number of workers

I have a small piece of code that I would like to parallelize as I upscale. I've been using cilk_for from Cilk Plus to run the multithreading. The trouble is that I get a different result depending on the number of workers.
I've read that this might be due to a race condition, but I'm not sure what specifically about the code causes it or how to ameliorate it. Also, I realize that long and __float128 are overkill for this problem, but they might be necessary when upscaling.
Code:
#include <assert.h>
#include "cilk/cilk.h"
#include <cstring>
#include <iostream>
#include <math.h>
#include <stdio.h>
#include <string>
#include <vector>

using namespace std;

__float128 direct(const vector<double>& Rpct, const vector<unsigned>& values, double Rbase, double toWin) {
    unsigned count = Rpct.size();
    __float128 sumProb = 0.0;
    __float128 rProb = 0.0;
    long nCombo = static_cast<long>(pow(2, count));
    // for (long j = 0; j < nCombo; ++j) { //over every combination
    cilk_for (long j = 0; j < nCombo; ++j) { //over every combination
        vector<unsigned> binary;
        __float128 prob = 1.0;
        unsigned point = Rbase;
        for (unsigned i = 0; i < count; ++i) { //over all the individual events
            long exp = static_cast<long>(pow(2, count-i-1));
            bool odd = (j/exp) % 2;
            if (odd) {
                binary.push_back(1);
                point += values[i];
                prob *= static_cast<__float128>(Rpct[i]);
            } else {
                binary.push_back(0);
                prob *= static_cast<__float128>(1.0 - Rpct[i]);
            }
        }
        sumProb += prob;
        if (point >= toWin) rProb += prob;
        assert(sumProb >= rProb);
    }
    //print sumProb
    cout << " sumProb = " << (double)sumProb << endl;
    assert( fabs(1.0 - sumProb) < 0.01);
    return rProb;
}

int main(int argc, char *argv[]) {
    vector<double> Rpct;
    vector<unsigned> value;
    value.assign(20,1);
    Rpct.assign(20,0.25);
    unsigned Rbase = 22;
    unsigned win = 30;
    __float128 rProb = direct(Rpct, value, Rbase, win);
    cout << (double)rProb << endl;
    return 0;
}
Sample output for export CILK_NWORKERS=1 && ./code.exe:
sumProb = 1
0.101812
Sample output for export CILK_NWORKERS=4 && ./code.exe:
sumProb = 0.948159
Assertion failed: (fabs(1.0 - sumProb) < 0.01), function direct, file code.c, line 61.
Abort trap: 6
It is because of a race condition. cilk_for is an implementation of the parallel-for pattern. If you want to use a parallel for, the iterations must be independent (independent data). This is very important. You have to use Cilk reducers for your case: https://www.cilkplus.org/tutorial-cilk-plus-reducers
To clarify, there is at least one race on sumProb. Each of the parallel workers will do a read/modify/write on that location. As sribin mentioned above, solving problems like this is what reducers are for.
It's entirely possible that there's more than one race in your program. The only way to be sure is to run it under a race detector, since finding races is one of the things that computers are much better at than humans. A free possibility is the Cilkscreen race detector, available from the cilkplus.org website. Unfortunately it doesn't support gcc/g++.
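A minimal self-contained sketch of the reducer pattern (my own example, not from the answers above; it uses double rather than __float128 for simplicity and assumes the Cilk Plus reducer headers are available):

#include <iostream>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

int main() {
    // Each worker gets its own private view of the reducer and the runtime
    // combines the views, so there is no read/modify/write race on sum.
    cilk::reducer_opadd<double> sum(0.0);

    cilk_for (long j = 0; j < 1000000; ++j) {
        sum += 1.0 / (j + 1.0);
    }

    std::cout << sum.get_value() << std::endl;
    return 0;
}

In the question's direct(), sumProb and rProb would each become such a reducer and be read with get_value() after the cilk_for; the asserts inside the loop would also have to go, since a partial per-worker view is not meaningful there.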

Rounding errors giving incorrect results in DFT?

I have been beating my head against the wall on this DFT. It should print out: 8,0,0,0,0,0,0,0 but instead I get 8 and then very, very tiny numbers. Are these rounding errors? Is there anything I can do? My radix-2 FFT gives correct results; it seems silly that a DFT could not also work.
I started with complex numbers, so I know there is a good bit missing; I tried to strip it down to illustrate the problem.
#include <cstdlib>
#include <math.h>
#include <iostream>
#include <complex>
#include <cassert>

#define SIZE 8
#define M_PI 3.14159265358979323846

void fft(const double src[], double dst[], const unsigned int n)
{
    for(int i=0; i < SIZE; i++)
    {
        const double ph = -(2*M_PI) / n;
        const int gid = i;
        double res = 0.0f;
        for (int k = 0; k < n; k++) {
            double t = src[k];
            const double val = ph * k * gid;
            double cs = cos(val);
            double sn = sin(val);
            res += ((t * cs) - (t * sn));
            int a = 1;
        }
        dst[i] = res;
        std::cout << dst[i] << std::endl;
    }
}

int main(void)
{
    double array1[SIZE];
    double array2[SIZE];
    for(int i=0; i < SIZE; i++){
        array1[i] = 1;
        array2[i] = 0;
    }
    fft(array1, array2, SIZE);
    return 666;
}
An FFT can actually produce more accurate results than a straight DFT calculation, as the fewer arithmetic ops usually allow fewer opportunities for arithmetic quantization errors to accumulate. There's a paper by one of the FFTW authors on this topic.
Since the DFT/FFT deal with a transcendental basis function, the results will never (except perhaps in a few special cases, or by lucky accident) be exactly correct using any non-symbolic and finite computer number format. So values very close (within a few LSB) to zero should simply be ignored as noise, or considered to be the same as zero.
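As a concrete (hypothetical) way to apply that advice, one can clamp bins whose magnitude is within a few scaled ULPs of zero before printing; the helper name clamp_to_zero and the tolerance factor below are my own choices, not a universal rule:

#include <cmath>
#include <cfloat>

// Treat DFT outputs that are negligibly small compared to machine precision
// (scaled by the transform length) as exactly zero.
double clamp_to_zero(double value, unsigned n)
{
    const double threshold = 8.0 * n * DBL_EPSILON;
    return (std::fabs(value) < threshold) ? 0.0 : value;
}

In the question's fft this could be used as dst[i] = clamp_to_zero(res, n); just before printing, which turns the 1e-15-ish values into the expected zeros.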