How do I most effectively prevent my normally-distributed random variable from being zero? - c++

I'm writing a Monte Carlo algorithm, in which at one point I need to divide by a random variable. More precisely: the random variable is used as a step width for a difference quotient, so I actually first multiply something by the variable and then again divide it out of some locally linear function of this expression. Like
double f(double);
std::tr1::variate_generator<std::tr1::mt19937, std::tr1::normal_distribution<> >
r( std::tr1::mt19937(time(NULL)),
std::tr1::normal_distribution<>(0) );
double h = r();
double a = ( f(x+h) - f(x) ) / h;
This works fine most of the time, but fails when h=0. Mathematically, this is not a concern because in any finite (or, indeed, countable) selection of normally-distributed random variables, all of them will be nonzero with probability 1. But in the digital implementation I will encounter an h==0 every ≈2³² function calls (regardless of the mersenne twister having a period longer than the universe, it still outputs ordinary longs!).
It's pretty simple to avoid this trouble, at the moment I'm doing
double h = r();
while (h==0) h=r();
but I don't consider this particularly elegant. Is there any better way?
The function I'm evaluating is actually not just a simple ℝ->ℝ like f is, but an ℝᵐxℝⁿ -> ℝ in which I calculate the gradient in the ℝᵐ variables while numerically integrating over the ℝⁿ variables. The whole function is superimposed with unpredictable (but "coherent") noise, sometimes with specific (but unknown) outstanding frequencies, that's what gets me into trouble when I try it with fixed values for h.

your way seems elegant enough, maybe a little different:
do {
h = r();
} while (h == 0.0);

The ratio of two independent, zero-mean, normally-distributed random variables follows a Cauchy distribution. The Cauchy distribution is one of those nasty distributions with no finite mean or variance. Very nasty indeed. A Cauchy distribution will make a mess of your Monte Carlo experiment.
In many cases where the ratio of two random variables is computed, the denominator is not normal. People often use a normal distribution to approximate this non-normally distributed random variable because
normal distributions are usually so easy to work with,
usually have such nice mathematical properties,
the normal assumption appears to be more or less correct, and
the real distribution is a bear.
Suppose you are dividing by distance. Distance is semi-positive definite by definition, and is often positive definite as a random variable. So right off the bat distance can never be normally distributed. Nonetheless, people often assume a normal distribution for distance in cases where the mean is much, much larger than the standard deviation. When this normal assumption is made you need to protect against those non-real values. One simple solution is a truncated normal.

If you want to preserve normal distribution you have to either exclude 0 or assign 0 to a new previously non-occurring value. Since the second is most likely not possible in the finite ranges of computer science the first is our only option.

A function (f(x+h)-f(x))/h has a limit as h->0 and therefore if you encounter h==0 you should use that limit. The limit would be f'(x) so if you know the derivative you can use it.
If what you are actually doing is creating a number of discrete points that approximate a normal distribution, and this is good enough for your purposes, create them in such a way that none of them will actually have the value 0.

Depending on what you're trying to compute, perhaps something like this would work:
double h = r();
double a;
if (h != 0)
a = ( f(x+h) - f(x) ) / h;
else
a = 0;
If f is a linear function, this should (I think?) remain continuous at h = 0.
You might also want to instead consider trapping division-by-zero exceptions to avoid the cost of the branch. Note that this may or may not have a detrimental effect on performance - benchmark both ways!
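On Linux, you will need to build the file that contains your potential division by zero with -fnon-call-exceptions, and install a SIGFPE handler. Note that a floating-point division by zero only raises SIGFPE if the trap has been enabled (for example with glibc's feenableexcept(FE_DIVBYZERO)); by default it quietly produces inf or NaN: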
struct fp_exception { };
void sigfpe(int) {
signal(SIGFPE, sigfpe);
throw fp_exception();
}
void setup() {
signal(SIGFPE, sigfpe);
}
// Later...
try {
run_one_monte_carlo_trial();
} catch (fp_exception &) {
// skip this trial
}
On Windows, use SEH:
__try
{
run_one_monte_carlo_trial();
}
__except(GetExceptionCode() == EXCEPTION_INT_DIVIDE_BY_ZERO ?
EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH)
{
// skip this trial
}
This has the advantage of potentially having less effect on the fast path. There is no branch, although there may be some adjustment of exception handler records. On Linux, there may be a small performance hit due to the compiler generating more conservative code for -fnon-call-exceptions. This is less likely to be a problem if the code compiled under -fnon-call-exceptions does not allocate any automatic (stack) C++ objects. It's also worth noting that this makes the case in which division by zero does happen VERY expensive.

Related

How to sample from a normal distribution restricted to a certain interval, C++ implementation?

With this function I can sample from a normal distribution. I was wondering how I could sample efficiently from a normal distribution restricted to a certain interval [a,b]. My trivial approach would be to sample from the normal distribution and then keep the value if it belongs to the interval, otherwise re-sample. However, this would probably discard many values before I get a suitable one.
I could also approximate the normal distribution using a triangular distribution, however I don't think this would be accurate enough.
I could also try to work on the cumulative function, but probably this would be slow as well. Is there any efficient approach to the problem?
Thx
I'm assuming you know how to transform to and from standard normal with shifting by μ and scaling by σ.
Option 1, as you said, is acceptance/rejection. Generate normals as usual, reject them if they're outside the range [a, b]. It's not as inefficient as you might think. If p = P{a < Z < b}, then the number of trials required follows a geometric distribution with parameter p and the expected number of attempts before accepting a value is 1/p.
Option 2 is to use the inverse of the normal CDF (the quantile function), such as the one in boost. Calculate lo = Φ(a) and hi = Φ(b), the probabilities of your normal being below a and b, respectively. Then generate U distributed uniformly between lo and hi, and crank the resulting set of U's through the inverse CDF and rescale to get outcomes with the desired truncated distribution.
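A minimal sketch of Option 1 (acceptance/rejection), using only the standard <random> header. The helper name sample_truncated_normal and its parameter order are made up for illustration; they are not from any library.

```cpp
#include <cassert>
#include <random>

// Draw one sample from N(mu, sigma^2) restricted to [a, b] by
// acceptance/rejection: sample the untruncated normal and retry until
// the value lands inside the interval.
// sample_truncated_normal is a hypothetical helper name.
template <class Engine>
double sample_truncated_normal(Engine& gen, double mu, double sigma,
                               double a, double b) {
    assert(a < b && sigma > 0.0);
    std::normal_distribution<double> dist(mu, sigma);
    double x;
    do {
        x = dist(gen);
    } while (x < a || x > b);
    return x;
}
```

With p = P{a < X < b}, the loop runs 1/p times on average, so this is only a poor choice when [a, b] sits far out in a tail or is very narrow.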
The normal distribution is an integral, see the formula:
std::cout << "riemann_midpnt_sum = " << 1 / (sqrt(2*PI)) * riemann_mid_point_sum(fctn, -1, 1.0, 100) << '\n';
// where fctn is the function inside the integral
double fctn(double x) {
return exp(-(x*x)/2);
}
output: "riemann_midpnt_sum = 0.682698"
This calculates the probability that a standard normal variable lies between -1 and 1.
This uses a Riemann sum to approximate the integral. You can take the Riemann sum from here
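The riemann_mid_point_sum function used above is not shown in the answer. One possible implementation, assuming its arguments are (function, lower bound, upper bound, number of subintervals), might look like this; normal_probability is a hypothetical wrapper added for convenience.

```cpp
#include <cmath>

// Integrand of the standard normal density without the 1/sqrt(2*pi) factor.
double fctn(double x) {
    return std::exp(-(x * x) / 2.0);
}

// Midpoint Riemann sum of f over [a, b] using n equal subintervals.
// The argument order is an assumption based on the call shown above.
double riemann_mid_point_sum(double (*f)(double), double a, double b, int n) {
    const double width = (b - a) / n;
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += f(a + (i + 0.5) * width);  // sample at the middle of each slice
    }
    return sum * width;
}

// P(lo < Z < hi) for a standard normal Z, approximated numerically.
// normal_probability is a hypothetical wrapper name.
double normal_probability(double lo, double hi, int n) {
    const double pi = std::acos(-1.0);
    return riemann_mid_point_sum(fctn, lo, hi, n) / std::sqrt(2.0 * pi);
}
```

Note that this evaluates probabilities; it does not by itself produce samples from the truncated distribution.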
You could have a look at the implementation of the normal dist function in your standard library (e.g., https://gcc.gnu.org/onlinedocs/gcc-4.6.3/libstdc++/api/a00277.html), and figure out a way to re-implement this with your constraint.
It might be tricky to understand the template-heavy library code, but if you really need speed then the trivial approach is not well suited, particularly if your interval is quite small.

How to generate a massive amount of high quality Random Numbers?

I'm working on a random walk simulation of particles moving in a lattice. For that reason I must create a massive amount of random numbers, about 10^12 and above. Currently I'm using the possibilities C++11 provides with <random>. When profiling my program, I see that a major share of the time is spent in <random>. The vast majority of those numbers are between 0 and 1, evenly distributed. Here and there I need a number from a binomial distribution. But the focus lies on the 0..1 numbers.
The question is: What can I do to reduce the CPU time needed to generate these numbers and what would the impact be on their quality?
As you can see, I tried different engines, but that had no big effect on CPU time. Further, what is the difference between my uniform01(gen) and generate_canonical<double,numeric_limits<double>::digits>(gen) anyhow?
Edit: Reading through the answers I conclude that there is no single ideal solution for my problem. Thus I decided to first make my program capable of multithreading and run multiple RNGs in different threads (each seeded with one random_device number plus a per-thread increment). For the time being this seems to be the most unavoidable step (multithreading would be required anyhow). As a further step, depending on exact requirements, I am considering switching to the suggested Intel RNG or to Thrust. This means that my RNG implementation should not be too complex, which it currently is not. But for now I would like to focus on the physical correctness of my model and not on programming details; those come as soon as the output of my program is physically correct.
Thrust
Concerning Intel RNG
Here is what I do currently:
class Generator {
public:
Generator();
virtual ~Generator();
double rand01(); //random number [0,1)
int binomial(int n, double p); //binomial distribution with n samples with probability p
private:
std::random_device randev; //seed
/*Engines*/
std::mt19937_64 gen;
//std::mt19937 gen;
//std::default_random_engine gen;
/*Distributions*/
std::uniform_real_distribution<double> uniform01;
std::binomial_distribution<> binomialdist;
};
Generator::Generator() : randev(), gen(randev()), uniform01(0.,1.), binomialdist(1,1.) {
}
Generator::~Generator() { }
double Generator::rand01() {
    //return uniform01(gen);
    return std::generate_canonical<double,std::numeric_limits<double>::digits>(gen);
}
// The member function is binomial(); the distribution member is binomialdist.
int Generator::binomial(int n, double p) {
    binomialdist.param(std::binomial_distribution<>::param_type(n,p));
    return binomialdist(gen);
}
You can pre-process random numbers and use them when you need.
If you need true random numbers I suggest you use a service like http://www.random.org/, which derives its random numbers from ambient environmental noise rather than from some algorithm.
And, speaking about random numbers, you must also check this:
If you need a massive amount of random numbers, and I mean MASSIVE, do a careful search on the internet for IBM's floating point random number generator, published maybe ten years ago. You'll have to buy either a PowerPC machine, or a newer Intel machine with fused multiply-add. They achieved random numbers at a rate of one per cycle per core. So if you bought a new Mac Pro, you could achieve probably 50 billion random numbers per second.
Perhaps instead of using a CPU you could use a GPU to generate many numbers concurrently?
Efficient Random Number Generation and Application Using CUDA
On my i3, the following program runs in about five seconds:
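#include <cstdio>
#include <random>
std::mt19937_64 foo;
double drand() {
    // Reading a union member other than the one last written is a common
    // compiler extension, not strictly portable C++.
    union {
        double d;
        long long l;
    } x;
    x.d = 1.0;
    x.l |= foo() & ((1LL<<52)-1); // fill the 52 mantissa bits, giving [1,2)
    return x.d-1;
}
int main() {
    double d = 0;
    for (int i = 0; i < 1e9; i++)
        d += drand();
    printf("%g\n", d);
}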
whereas replacing the drand() call with the following results in a program that runs in about ten seconds:
double drand2() {
return std::generate_canonical<double,
std::numeric_limits<double>::digits>(foo);
}
Using the following instead of drand() also results in a program that runs in about ten seconds:
std::uniform_real_distribution<double> uni;
double drand3() {
return uni(foo);
}
Perhaps the hacky drand() above suits your purposes better than the standard solutions..
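The same bit trick can be written without the union by copying the bits with std::memcpy, which is well defined in standard C++. This is only a sketch: it assumes IEEE 754 binary64 doubles and a 64-bit engine such as std::mt19937_64, and the names bits_to_unit_double and drand_memcpy are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <random>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "this sketch assumes 64-bit doubles");

// Build a double in [1, 2) from 52 random mantissa bits, then subtract 1
// to get a value in [0, 1). Assumes IEEE 754 binary64 layout.
// bits_to_unit_double is a hypothetical helper name.
double bits_to_unit_double(std::uint64_t random_bits) {
    const std::uint64_t one_bits = 0x3FF0000000000000ULL;  // bit pattern of 1.0
    const std::uint64_t mantissa_mask = (1ULL << 52) - 1;  // low 52 bits
    const std::uint64_t bits = one_bits | (random_bits & mantissa_mask);
    double d;
    std::memcpy(&d, &bits, sizeof d);  // copy the bit pattern into a double
    return d - 1.0;
}

// Convenience wrapper for any engine that returns 64 random bits per call.
template <class Engine>
double drand_memcpy(Engine& gen) {
    return bits_to_unit_double(gen());
}
```

Like the union version, this yields only 52 random bits per value, so results are multiples of 2^-52; whether that is acceptable depends on the simulation.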
Task Definition
OP asks to get answer for both the
1. Speed of generation -- assuming a set of 10^12 random numbers to be "massive"
and
2. Quality of generator -- with a weak assumption that numbers just evenly distributed over some range of values are also random
However, there are more cardinal aspects to be addressed and successfully solved for the real system:
A. Define, whether your system simulation needs to be provided with a guarantee of a repeatability of the sequence of the random numbers for future re-runs of an experiment.
If this is not the case, re-runs of the simulated experiment will yield principally different results, and the randomizer process ( or pre-randomizer and randomized-selector ) need not worry about a re-entrant, stateful mode of operation and will get a much simpler implementation.
B. Define to what level you need to prove the quality of randomness of the generated random numbers ( or whether the generated sets of random numbers have to follow some specific law of statistical theory ( some known synthetic distributions or truly random with an utmost Kolmogorov complexity of the resulting set of random numbers )). One need not be an NSA expert to state that numerical generation of true-random sequences is a very hard issue and has its computational costs associated with production of high-randomness products.
Hyper-chaotic and true-random sequences are computationally extremely expensive. Using low- or poor-randomness generators is not an option for randomness-quality sensitive applications ( whatever the marketing papers may say, no MIL-STD- or NSA-graded system will ever try this compromised quality in environments where the results indeed matter, so why settle for less in scientific simulations? Perhaps not a problem if you do not mind missing so many "unvisited" states of the simulated phenomena ).
C. Verify, how many random numbers does your simulation system need to "consume per [usec]" and whether this design requirement parameter is constant or may get scaled-up by going into multi-threaded, vectorised, Grid-/Cloud-based distributed computation framework.
D. Does your simulation system require to maintain a global or per-thread- or perGrid/CloudNode- individual access management to the pool-of-randomized numbers in case of vectorized or Grid/Cloud-based computational strategy.
Task Solution Approach
Fastest [1] and best [2] solution with [A] and [B] solved and options for [D] is to pre-generate an utmost randomness quality numbers into an adequate access-pool ( and pay an acceptable cost of [C] and [D] on access-policy and access-management controls to re-read from the pool, rather than to re-generate ).
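A rough sketch of the pre-generated pool idea, with a fixed seed so that re-runs are repeatable (point A). Everything here is an assumption for illustration: the class name RandomPool is invented, and std::mt19937_64 merely stands in for whatever higher-quality source would fill the pool in a real system.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hands out numbers from a pre-generated block and refills the block when
// it runs dry. RandomPool is a hypothetical name, not a library class.
class RandomPool {
public:
    RandomPool(std::uint64_t seed, std::size_t capacity)
        : gen_(seed), uniform_(0.0, 1.0), pool_(capacity), next_(capacity) {
        assert(capacity > 0);
    }

    // Return the next pooled value, regenerating the block if exhausted.
    double next() {
        if (next_ == pool_.size()) refill();
        return pool_[next_++];
    }

private:
    void refill() {
        for (double& v : pool_) v = uniform_(gen_);
        next_ = 0;
    }

    std::mt19937_64 gen_;  // stand-in generator; same seed gives same sequence
    std::uniform_real_distribution<double> uniform_;
    std::vector<double> pool_;
    std::size_t next_;     // index of the next unused value
};
```

For per-thread use (point D), one such pool per thread with distinct seeds avoids any locking on access.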

Test of lot of math operations in a class

Is there an easy way of testing functions inside a class for correct results? I have been looking at Google Test unit testing, but it seems aimed more at finding failures in classes and functions than at checking the expected result.
For example, from math theory one knows the square root of every number. Now you want to check a sqrt function, looking for floating point precision errors, and then you also want to check a lot of functions that use floats and look for any precision error. Is there a way to make this easy and fast?
I can think of 2 direct solutions
1)
One of the easiest ways to test the accuracy of mathematical functions is similar to the definition of limits in calculus: take the value to be tested, and also use a value that is "close" on both sides. I have heard of analogies drawn between limit analysis and unit testing, but keep in mind that if you're looking for speed this will not be your best option, that it only works on continuous operations, and that the analogy is for definition work only.
So what you would do is have a "limitDomain" variable defined per function (this is because some operations are more accurate than others; for the reasoning, look up the Taylor approximation of [function]), and then use that as your limiter. Then test low, high, and the value itself, and take the average of all three within a given margin of error:
float testMathOpX(float _input){
    // limitDomainOpX (the per-function step) and OpX (the operation under test) are defined elsewhere
    float low = _input - limitDomainOpX;
    float high = _input + limitDomainOpX;
    low = OpX(low);
    _input = OpX(_input);
    high = OpX(high);
    // doing 3 separate averages with division by 2 means the worst decimal you will have is a trailing 5, or in some cases a trailing 25
    low = (low + _input)/2;
    high = (_input + high)/2;
    _input = (low + high)/2;
    return _input;
}
2)
The other method that I can think of is more of a table-of-values approach: you take the input, check where on the domain of the operation it lies, and if it lies within certain values you use value replacement. The thing to realize is that you need to do a lot of work ahead of time to get this table of values, and then it becomes just domain testing of the value you're taking in, in the form of:
if( (_input > valLow) && (_input < valHigh)){
... replace the value with an empirically found value
}
The problem with this is that you need to find those empirical values.
Do you have requirements on the precision or do you want to find the precision?
If it is the former, then it is not hard to create test cases using any test framework.
y = myfunc(x);
if (y > expected_y + allowed_error || y < expected_y - allowed_error) {
// Test failed
...
}
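The tolerance check above can be wrapped in a small helper and run over inputs whose exact results are known, for example perfect squares for sqrt. This is a sketch only; approx_equal, count_sqrt_failures and the default tolerances are assumptions chosen for illustration, not part of any test framework.

```cpp
#include <algorithm>
#include <cmath>

// True when actual is within rel_tol of expected (relative error), or
// within abs_tol when expected is close to zero.
// approx_equal and its default tolerances are illustrative assumptions.
bool approx_equal(double actual, double expected,
                  double rel_tol = 1e-12, double abs_tol = 1e-15) {
    const double diff = std::fabs(actual - expected);
    return diff <= std::max(abs_tol, rel_tol * std::fabs(expected));
}

// Check std::sqrt against perfect squares r*r for r = 0..max_root, where
// the exact answer r is known, and count how many checks fail.
int count_sqrt_failures(int max_root) {
    int failures = 0;
    for (int r = 0; r <= max_root; ++r) {
        const double x = static_cast<double>(r) * r;
        if (!approx_equal(std::sqrt(x), static_cast<double>(r))) ++failures;
    }
    return failures;
}
```

Inside a framework such as Google Test, each approx_equal call would simply become the condition of an expectation macro.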
Edit:
There are two routes to finding the precision, through testing and through algorithm analysis.
Testing should be straightforward: Compare the output with the correct values (which you have to obtain in some way).
Algorithm analysis is when you calculate the expected size of the error by calculating the error of the algorithm and the error caused by lack of precision in floating point arithmetic.

Predefinition of often used values in computations - does it change anything?

I'm auto generating C code to compute large expressions and try to figure out with simple examples whether it makes sense to predefine certain subparts in separate variables.
As a simple example, say we compute something of the form:
#include <cmath>
double test(double x, double y) {
const double c[9][9] = { ... }; // constants properly initialized, irrelevant
double expr = c[0][0]*x*y
+ c[1][0]*pow(x,2)*y + ... + c[8][0]*pow(x,9)*y
+ c[1][1]*pow(x,2)*pow(y,2) + ... + c[8][1]*pow(x,9)*pow(y,2)
+ ...
with all c[i][j] properly initialized. In reality those expressions contain tens of millions of multiplications and additions.
A colleague now proposed -- to reduce the number of calls to pow() and to cache often needed values in the expressions -- to define every power of x and y in a separate variable, which is no big deal as the code is auto generated anyway, like this:
double xp2 = pow(x,2);
double xp3 = pow(x,3);
double xp4 = pow(x,4);
// ...
// same for pow(y,n)
I think, however, that this is unnecessary, as the compiler should take care of these optimizations.
Unfortunately, I have no experience with reading and interpreting assembly but I think I see that all the calls to pow() are optimized out, is this right? Also, does the compiler cache the values for pow(x,2), pow(x,3), etc?
Thanks in advance for your input!
Using pow with integer arguments... ouch ! Typical implementations of pow are tuned for the general case of floating point arguments, which is why it is usually way slower to write
pow(x, 2) ( = exp(2 * log(x)) )
than
x * x
What I state here is very compiler dependent though. On one hand, some compilers may not even know that pow(x, 2) will yield the same value for a given x (after all, the extern function pow could have side effects), so you don't have any guarantee that common subexpressions will be eliminated. The pow function, on some (many ?) platforms/toolchains, is provided by a library the compiler has no control over.
On other implementations though, the compiler may turn those pow calls into multiplications, or at least into intrinsics, which may in turn specialize for integer exponents. Your mileage will vary.
The first thing I'd do is to replace calls to pow by multiplications. For larger exponents, you may also do, eg.
double x2 = x * x;
double x3 = x * x2;
double x4 = x2 * x2;
Note that (credits to @Stephen Canon) doing repeated multiplications (with the above quick exponentiation scheme) will introduce roundoff error whose magnitude is proportional to the number of multiplications (ie. O(log exponent)). This error is typically tolerable, whereas a good pow implementation is usually accurate to within about one unit of least precision (the language standards do not guarantee this).
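For arbitrary integer exponents, the quick exponentiation scheme mentioned above can be sketched as a small loop. The name ipow is made up for illustration and is not a standard function.

```cpp
// Raise x to a non-negative integer power by repeated squaring, using
// O(log n) multiplications. ipow is a hypothetical helper name.
double ipow(double x, unsigned n) {
    double result = 1.0;
    while (n != 0) {
        if (n & 1u) result *= x;  // this bit of the exponent is set
        x *= x;                   // square the base for the next bit
        n >>= 1;
    }
    return result;
}
```

In auto-generated code with fixed exponents, emitting the multiplications directly (as in the x2, x3, x4 lines) avoids even the loop.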
The compiler may perform common subexpression elimination- remember that it can't guarantee that all functions are re-entrant, but if pow is inlined, then it may well do this.
A good way to compute polynomials is Horner's rule. (eg here) which doesn't require pow() or any extra memory.
Your expression is x*y times a polynomial in y each of whose coefficients is a polynomial in x.
Each of these coefficients can be calculated using Horner with 8 multiplies and additions, and the polynomial in y with 8 more multiplies and additions for a total of 74 multiplies and 72 additions, whereas your sample code looks to me like more than 200 multiplications and more than a hundred calls to pow().
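A sketch of the nested Horner evaluation described above, assuming the expression has the form of a sum over c[i][j] * x^(i+1) * y^(j+1) as in the question's first terms. The function name eval_horner and the use of std::array are illustrative choices, not taken from the question.

```cpp
#include <array>
#include <cstddef>

// Evaluate sum over i, j of c[i][j] * x^(i+1) * y^(j+1) without pow():
// an inner Horner pass in x builds each coefficient of the polynomial in y,
// an outer Horner pass in y combines them, and x*y is factored out.
// eval_horner is a hypothetical name; the index layout is an assumption.
template <std::size_t N>
double eval_horner(const std::array<std::array<double, N>, N>& c,
                   double x, double y) {
    double outer = 0.0;
    for (std::size_t j = N; j-- > 0;) {
        double inner = 0.0;
        for (std::size_t i = N; i-- > 0;) {
            inner = inner * x + c[i][j];  // Horner step in x
        }
        outer = outer * y + inner;        // Horner step in y
    }
    return x * y * outer;
}
```

A code generator could also unroll these loops into straight-line code, which keeps the operation count the same while removing loop overhead.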
pow may be optimized away depending on the toolchain. The only way you can tell is to try it and see.
In the general case, unless the implementation of pow is visible to the compiler as a macro or inline, then the compiler can't cache the result as it doesn't know what side-effects the function may have.
Profile, find out where the bottlenecks are.
If the sub-expressions are used frequently, it may make sense to cache or store the intermediate values. However, accessing these values may take more time than letting the values sit in a data pipeline within the processor. Data fetches outside of the processor are much slower than fetching from its internal data cache.
Also try using Algebra to simplify the mathematical expressions. Perhaps even Linear Algebra to find some more efficient matrix expressions.
You may want to isolate the calculations to expressions involving one variable. Compilers can optimize code better when only one variable is used or changing at a time. For example, substitute the y variable with expressions involving x, if possible. This would reduce to an expression only involving x.
Also search the web for "data driven design" or "data oriented design". These sites show how to optimize code for data centric applications.

A good way to do a fast divide in C++?

Sometimes I see and have used the following variation for a fast divide in C++ with floating point numbers.
// orig loop
double y = 44100.0;
for(int i=0; i<10000; ++i) {
double z = x / y;
}
// alternative
double y = 44100;
double y_div = 1.0 / y;
for(int i=0; i<10000; ++i) {
double z = x * y_div;
}
But someone hinted recently that this might not be the most accurate way.
Any thoughts?
On just about every CPU, a floating point divide is several times as expensive as a floating point multiply, so multiplying by the inverse of your divisor is a good optimization. The downside is that you may lose a very small amount of accuracy: x * (1.0 / y) rounds twice (once for the reciprocal, once for the product), whereas x / y rounds once, so the two results can differ in the last bit. There are also processor-specific effects, eg, on x86 processors using the x87 FPU, 64-bit float operations are actually internally computed using 80 bits in the default FPU mode, and storing the result in a variable will cause those extra precision bits to be rounded off according to your FPU rounding mode (which defaults to nearest). This only really matters if you are chaining many float operations and have to worry about the error accumulation.
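To get a feel for how small the accuracy loss is, one can compare both forms directly. The sketch below reports the worst relative difference in units of machine epsilon; the function name max_reciprocal_error_in_eps and the choice of test inputs x = 1..n are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Largest relative difference between x / y and x * (1.0 / y) for
// x = 1, 2, ..., n, expressed in multiples of machine epsilon.
// max_reciprocal_error_in_eps is a hypothetical helper name.
double max_reciprocal_error_in_eps(double y, int n) {
    const double y_div = 1.0 / y;
    double worst = 0.0;
    for (int i = 1; i <= n; ++i) {
        const double x = static_cast<double>(i);
        const double divided = x / y;        // one rounding
        const double multiplied = x * y_div; // two roundings
        worst = std::max(worst,
                         std::fabs(multiplied - divided) / std::fabs(divided));
    }
    return worst / std::numeric_limits<double>::epsilon();
}
```

With IEEE 754 double arithmetic and round-to-nearest, the result stays below 2, which is far smaller than the quantization error of typical audio data.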
Wikipedia agrees that this can be faster. The linked article also contains several other fast division algorithms that might be of interest.
I would guess that any industrial-strength modern compiler will make that optimization for you if it is going to profit you at all.
Your original
// original loop:
double y = 44100.0;
for(int i=0; i<10000; ++i) {
double z = x / y;
}
can easily be optimized to
// haha:
double y = 44100.0;
double z = x / y;
and the performance is pretty nice. ;-)
EDIT: People keep voting this down, so here's the not so funny version:
If there were a general way to make division faster for all cases, don't you think compiler writers might have happened upon it by now? Of course they would have done. Also, some of the people doing FPU circuits aren't exactly stupid, either.
So the only way you're going to get better performance is to know what specific situation you have at hand and doing optimal code for that. Most likely this is a complete waste of your time, because your program is slow for some other reason such as performing math on loop invariants. Go find a better algorithm instead.
In your example using gcc the division with the options -O3 -ffast-math yields the same code as the multiplication without -ffast-math. (In a testing environment with enough volatiles around that the loop is still there.)
So if you really want to optimise those divisions away and don’t care about the consequences, that’s the way to go. Multiplication seems to be roughly 15 times faster, btw.
multiplication is faster than division so the second method is faster. It might be slightly less accurate but unless you are doing hard core numerics the level of accuracy should be more than enough.
When processing audio, I prefer to use fixed point math instead. I suppose this depends on the level of precision you need. But, let's assume that 16.16 fixed point integers will do (meaning high 16 bits is whole number, low 16 is the fraction). Now, all calculation can be done as simple integer math:
unsigned int y = 44100u << 16; // unsigned literal avoids signed overflow in the shift
unsigned int z = x / (y >> 16); // divisor must be the whole number portion
Or with macros to help:
#define FP_INT(x) ((unsigned int)(x) << 16)
#define FP_MUL(x, y) ((x) * ((y) >> 16))
#define FP_DIV(x, y) ((x) / ((y) >> 16))
unsigned int y = FP_INT(44100);
unsigned int z = FP_DIV(x, y);
Audio, hunh? It's not just 44,100 divisions per second when you have, say, five tracks of audio running at once. Even a simple fader consumes cycles, after all. And that's just for a fairly bare-bones, minimal example -- what if you want to have, say, an eq and a compressor? Maybe a little reverb? Your total math budget, so to speak, gets eaten up quickly. It does make sense to wring out a little extra performance in those cases.
Profilers are good. Profilers are your friend. Profilers deserve blowjobs and pudding. But you already know where the main bottleneck is in audio work -- it's in the loop that processes samples, and the faster you can make that, the happier your users will be. Use everything you can! Multiply by reciprocals, shift bits whenever possible (exp(x+y) = exp(x)*exp(y), after all), use lookup tables, refer to variables by reference instead of values (less pushing/popping on the stack), refactor terms, and so forth. (If you're good, you'll laugh at these elementary optimizations.)
I presume from the original post that x is not a constant shown there but probably data from an array so x[i] is likely to be the source of the data and similarly for the output, it will be stored somewhere in memory.
I suggest that if the loop count really is 10,000 as in the original post, it will make little difference which you use, as the whole loop won't even take a fraction of a millisecond on a modern cpu. If the loop count really is very much higher, perhaps 1,000,000 or more, then I would expect that the cost of memory access would likely make the faster operation completely irrelevant, as it will always be waiting for the data anyway.
I suggest trying both with your code and testing if it actually makes any significant difference in run time and if it doesn't then just write the straightforward division if that's what the algorithm needs.
Here's the problem with doing it with a reciprocal: you still have to do the division before you can actually divide by Y. Unless you're only ever dividing by Y, in which case I suppose this may be useful. This is not very practical since division is done in binary with similar algorithms.
Are you looping 10,000 times simply to make the code take long enough to measure the time easily? Or do you have 10000 numbers to divide by the same number? If the former, put the "y_div = 1.0 / y;" inside the loop, because it's part of the operation.
If the latter, yes, floating point multiplication is generally faster than division. Don't change your code from the obvious to the arcane based on guesses, though. Benchmark first to find slow spots, and then optimize those (and take measurements before and after to make sure your idea actually causes an improvement)
On old CPUs like the 80286, floating point maths was abysmally slow and we employed lots of trickiness to speed things up.
On modern CPUs floating point maths is blindingly fast and optimising compilers can generally do wonders with fine-tuning things.
It is almost never worth the effort to employ little micro-optimisations like that.
Try to make your code simple and idiot-proof. Only if you find a real bottleneck (using a profiler) should you think of optimisations in your floating point calculations.