How to compute (negative binomial) distribution PDF and CDF in C++? - c++

The C++ standard library has many distributions that are apparently used to generate pseudo-random variables; see e.g. the code below, which generates and outputs some negative-binomially distributed numbers.
Now this might mean that internally there is code that computes the CDF and/or PDF of the negative binomial distribution, i.e. the probability that the random variable takes on a certain value, e.g. 6. Is there a way to output that probability? If yes, how? I know I could write my own code for that, but I'd rather not if there is some way to get the probabilities from std.
If possible, same question for other distributions, e.g. CDF of gamma distribution.
#include <iostream>
#include <random>

int main()
{
    std::negative_binomial_distribution<int> negBin{ 5, 0.5 }; // negative binomial distribution
    std::mt19937 RNG(260783);                                  // random generator
    for (size_t i = 0; i < 4; i++)
    {
        std::cout << negBin(RNG) << std::endl;
    }
    return 0;
}

The standard doesn't specify how an implementation should implement the distribution, other than that sampling from it should take an amortised constant number of samples from the Generator.
None of its members provides the CDF or the PDF.

might mean that internally, there is code that computes the CDF and or PDF of the negative binomial distribution, i.e. the probability that the random variable takes on a certain value, e.g. 6. Is there a way to output that probability?
In general, no; sampling does not require knowing the PDF and/or CDF. See e.g. the Marsaglia method(s) for sampling normally distributed random variables.
I would suggest taking a look at the GNU Scientific Library (GSL); it has sampling methods as well as the PDF and CDF for the negative binomial distribution:
https://www.gnu.org/software/gsl/manual/html_node/The-Negative-Binomial-Distribution.html

Related

How can I get a random uint or the last digit of float in HLSL/GLSL?

I just need a random uint, ideally ranging from 0-6, but there is no enumeration type in OpenGL. I learned that I can get a random float ranging from 0-1 from the code below:
frac(sin(dot(uv, float2(12.9898, 78.233))) * 43758.5453123)
I tried to do 1/above and take floor(), but it doesn't work. So how can I get a random int? Or is there a way to get the last digit of the float (which would presumably still be random)?
First, let's define what we mean by "random". In the context of this answer, a "random" variable is a variable whose values are unpredictable. That is, there is no function that determines/computes an outcome for the random variable when being evaluated (with any possible inputs). Or at least, no such function has been found (yet).
Obviously, when we are talking about computing here, there is no such thing as a true random variable as described above, because anything we do in computing (and by extension in a shader) is necessarily bound to the set of functions that are computable.
Your proposed function in the question:
f(uv) = frac(sin(dot(uv, float2(12.9898, 78.233))) * 43758.5453123)
is just a computable function. It takes as input a vector uv, which itself is a deterministic/computable value - such as derived from a built-in or custom varying variable giving you the "coordinates" of the current fragment.
After evaluation, the function's result is itself computable/deterministic and happens to be the value the input vector uv maps to. Setting aside different IEEE 754 rules and precisions (which may vary between GPUs, such as desktop versus mobile ones), the function itself is purely deterministic/computable and therefore does not give you a random value.
We humans may think that the output is random because we lack intuition for the functions used to compute the result, such that when we "see" a number 0.623513632 followed by another number 0.9734126 for only slight variations in the input vector, we might conclude that "yeah, that looks pretty random", when in fact it obviously isn't. It is just what that function computed, given two input values.
So, when you already have a deterministic function like the above and want to obtain values in the integer range [0, 6] from it as a GLSL uint, you can simply scale the output of said function by multiplying the function's result by 7.0 and truncating the result:
g(uv) = uint(f(uv) * 7.0)
If you want to obtain true random numbers drawn from a random variable (for which no deterministic function has been found yet), you can obtain such values from a physical entropy source, such as atmospheric noise (e.g. via random.org), and use them as an input to your shader (such as via textures or buffer objects).
But, from a computational perspective, a shader is just a function taking in values (ints, floats, ...) and computing (by means of computable functions) a deterministic result.
All we can do is to shuffle/scramble/diffuse the input bits in such a way, that the result "looks" like random to us. We then call these "pseudo-random" values.
Taking this a step further, we could now ask the question of the distribution quality of the obtained pseudo-random values. This has two qualities:
how evenly distributed are the pseudo-random values over their domain/interval? I.e. do all possible values have the same probability of occurring? Or do you even want uniformly distributed values, or should the values follow another distribution (like a Gaussian)?
how well are two values drawn from two sequential input values spaced apart? I.e. what is the frequency spectrum of the pseudo-random values?
There are different (deterministic) algorithms/functions depending on which distribution and which frequency spectrum your values should have. But first, you should define an answer to the two questions for your use-case.
And by the way, the commonly used function in your question to obtain pseudo-random numbers in a shader has a terrible distribution quality.
Last but not least, it should also be mentioned that true randomness (i.e. non-determinism), like when you do use an entropy source as input values, is oftentimes an undesirable property in computation, because it:
makes it difficult to repeat the same computation/output when needed, which various algorithms rely on (for example, in path tracing)
makes it difficult to reproduce/debug/inspect your function for a particular run when every following execution/run will yield a different output

Boost Mersenne Twister / 53 bit precision double random value

The Boost library has a Mersenne Twister random number generator, and using the Boost Random library I can convert that into a double value.
boost::random::mt19937 rng; // produces randomness out of thin air
// see pseudo-random number generators
boost::random::uniform_real_distribution<> dblvals(0,1);
// distribution that maps to 0..1
// see random number distributions
double x = dblvals(rng); // get the number
Internally it looks like it is using an acceptance / rejection method to generate the random number.
Since the underlying integer used to create the double is 32-bits, I think this means I get a random number with 32-bits resolution, in other words 32-bits worth of randomness.
The original mt19937ar.c had a function called genrand_res53() which generated a random number with 53-bit resolution (using two 32-bit integers). Is there a way to do this in Boost?
If you have to use Boost you can use boost::random::mt19937_64 to get 64 bits of randomness. If you have access to C++11 or higher you can also use std::mt19937_64, which will also give you 64 random bits.
I will note that, per Boost's listing, boost::random::mt19937_64 runs about 2.5 times slower than boost::random::mt19937, and that is probably mirrored in its standard equivalent. If speed is a factor then this could come into play.
Similarly to what C++ now offers (since C++11), there is mt19937_64 in Boost; take a look here.

Y-Axis Units on pnorm command?

When one generates a graph using the pnorm command it generates a graph with units:
Y Axis: Normal F[(Variable Name-m)/s]
X Axis: P[i] = i/(N+1)
The X-Axis seems reasonable to calculate by hand. I am confused as to what the units of the Y-Axis mean?
How does Normal F[(Variable Name-m)/s] break down? Does m represent the mean and s the standard deviation? If so, what does the function Normal F() represent?
This is a query about the underlying statistics.
F (usually better set in italics) is standard statistical notation for the cumulative distribution function, often abbreviated to distribution function. That is the probability of being less than any particular value. For a single variable, as here, the function approaches 0 as values decrease towards the minimum of that variable (nothing can be less than the minimum) and 1 as values increase towards its maximum (nothing can be more).
In the case of the normal (Gaussian) distribution in principle any finite value is possible. The distribution function depends on the mean m and standard deviation s, as you surmise, which specify the particular normal distribution being compared with data. So, in words we have "normal distribution function with mean and standard deviation for these data".
All documented:
Stata manual entry for pnorm
Wikipedia on the normal distribution
Wikipedia on P-P plots
FAQ on plotting positions

Random number bigger than 100,000

I'm writing in C/C++ and I want to create a lot of random numbers which are bigger than 100,000. How would I do that? With rand()?
You wouldn't do that with rand, but with a proper random number generator, which comes with newer C++; see e.g. cppreference.com.
#include <random>

const int min = 100000;
const int max = 1000000;
std::default_random_engine generator;
std::uniform_int_distribution<int> distribution(min, max);
int random_int = distribution(generator); // generate a random int uniformly in [min, max]
Don't forget to properly seed your generator.
Above I imply that rand is not a "proper" pseudo-RNG, since it typically comes with a number of shortcomings. At best, it lacks abstraction, so picking from a different distribution becomes hard and error-prone (search the web for e.g. "random range modulus"). Also, replacing the underlying engine used to generate the random numbers is AFAIK impossible by design. In less optimal cases, rand as a pseudo-RNG doesn't provide long enough sequence lengths for many/most use cases. With TR1/C++11, generating high-quality random numbers is easy enough to always use the proper solution, so that one doesn't need to worry about the quality of the used pseudo-RNG when obscure bugs show up. Microsoft's Stephan T. Lavavej ("STL") gave a nice summary talk on the topic at GoingNative 2013.
// Initialize rand()'s sequence. A typical seed value is the return value of time()
srand(someSeedValue);
//...
long range = 150000; // 100000 + range is the maximum value you allow
// Divide before scaling to avoid integer overflow in rand() * range
long number = 100000 + (long)(((double)rand() / RAND_MAX) * range);
You may need to use something larger than a long int for range and number if (100000 + range) will exceed its max value.
In general you can use a random number generator that goes between 0 and 1, and get any range you want by doing the following transformation:
x' = r x + b
So if you want random numbers between, say, 100,000 and 300,000, and x is your random number between 0 and 1, then you'd set r to be 200,000 and b to be 100,000 and x' will be within the range you want.
If you don't have access to the C++11 built-ins yet, Boost has a number of ready-made generators and distributions in Boost.Random, including specific solutions for your apparent problem space.
I'd echo the comments that clarifying edits to your question would improve the accuracy of answers, e.g. "I need uniformly distributed integers from 100,001 through 1,000,000".

generating a normal distribution on gmp arbitrary precision

So, I'm trying to use GMP for some calculations I'm doing, and at some point I need to generate a pseudo-random number (PRN) from a normal distribution.
Since GMP has a uniform random variable, that already helps a lot. However, I'm finding it difficult to choose which method I should use to generate the normal distribution from a uniform one. In practice, my problem is that GMP only has simple operations, so I cannot use cos or erf evaluations, for instance, since I would have to implement them all by myself.
My question is to what extent I can generate PRNs from a normal distribution with GMP and, if that is very difficult, whether there is any arbitrary-precision library which already has the normal distribution implemented.
As two examples of methods that do not work (retrieved from this question):
The ziggurat algorithm uses evaluation of f, which in this case is a non-integer exponential and thus not supported by GMP.
The Box–Muller transform uses cos and sin, which are not supported by GMP.
The Marsaglia polar method would work, if your library has a ln.
Combine a library able to generate random numbers from an N(0,1) distribution as doubles with the uniform generator of GMP.
For instance, suppose your normal generator produced 0x8.F67E33Ap-1
Probably only a few of those digits are really random, so truncate the number to a fixed number of binary digits (e.g. truncating to 16 bits, 0x8.F67E33Ap-1 => 0x8.F67p-1) and generate a number uniformly in the range [0x8.F67p-1, 0x8.F68p-1).
For a better approximation, instead of using a uniform distribution, you may like to calculate the values of the density function at the interval extremes (double precision is enough here) and generate a random number with the distribution associated to the trapezoid defined by those two values.
Another way to solve the problem is to generate a table of 1000, 10000 or 100000 mpf values where N(x) becomes 1/n, 2/n, etc.; then use the uniform random generator to select one of these intervals and, again, calculate a random number inside the selected interval using a uniform or linear distribution.
I ended up using MPFR, which is essentially GMP with some more functionality. It already has a normal distribution implemented.