How approximation search works - c++

[Prologue]
This Q&A is meant to explain more clearly the inner workings of my approximation-search class, which I first published here:
Increasing accuracy of solution of transcendental equation
I have been asked for more detailed info about this a few times already (for various reasons), so I decided to write a Q&A-style topic about it that I can easily reference in the future, without needing to explain it over and over again.
[Question]
How can values/parameters be approximated in the real domain (double) in order to fit polynomials or parametric functions, or to solve (difficult) equations such as transcendental ones?
Restrictions
real domain (double precision)
C++ language
configurable precision of approximation
known interval for search
the fitted value/parameter is not strictly monotonic, or is not a function at all

Approximation search
This is an analogy to binary search, but without its restriction that the searched function/value/parameter must be strictly monotonic, while still sharing the O(log(n)) complexity.
For example, let us assume the following problem:
We have a known function y=f(x) and want to find x0 such that y0=f(x0). This could in principle be done with the inverse function of f, but there are many functions for which we do not know how to compute the inverse. So how do we compute this in such a case?
knowns
y=f(x) - input function
y0 - wanted point y value
a0,a1 - solution x interval range
Unknowns
x0 - wanted point x value, which must be in the range x0 = <a0,a1>
Algorithm
probe some points x(i) = <a0,a1> evenly dispersed along the range with some step da
So for example x(i)=a0+i*da where i={ 0,1,2,3,... }
for each x(i) compute the distance/error ee of y=f(x(i))
This can be computed for example like this: ee=fabs(f(x(i))-y0), but any other metric can be used too.
remember the point aa=x(i) with minimal distance/error ee
stop when x(i)>a1
recursively increase accuracy
so first restrict the range to search only around the found solution, for example:
a0'=aa-da;
a1'=aa+da;
then increase precision of search by lowering search step:
da'=0.1*da;
if da' is not too small and the max recursion count has not been reached, go back to #1
the found solution is in aa
This is what I have in mind:
On the left side the initial search is illustrated (bullets #1,#2,#3,#4). On the right side is the next recursive search (bullet #5). This loops recursively until the desired accuracy is reached (number of recursions). Each recursion increases the accuracy 10 times (0.1*da). The gray vertical lines represent the probed x(i) points.
Here is the C++ source code for this:
//---------------------------------------------------------------------------
//--- approx ver: 1.01 ------------------------------------------------------
//---------------------------------------------------------------------------
#ifndef _approx_h
#define _approx_h
#include <math.h>
//---------------------------------------------------------------------------
class approx
    {
public:
    double a,aa,a0,a1,da,*e,e0;
    int i,n;
    bool done,stop;

    approx()          { a=0.0; aa=0.0; a0=0.0; a1=1.0; da=0.1; e=NULL; e0=-1.0; i=0; n=5; done=true; }
    approx(approx& a) { *this=a; }
    ~approx()         {}
    approx* operator = (const approx *a) { *this=*a; return this; }
    //approx* operator = (const approx &a) { ...copy... return this; }

    void init(double _a0,double _a1,double _da,int _n,double *_e)
        {
        if (_a0<=_a1) { a0=_a0; a1=_a1; }
        else          { a0=_a1; a1=_a0; }
        da=fabs(_da);
        n =_n ;
        e =_e ;
        e0=-1.0;
        i=0; a=a0; aa=a0;
        done=false; stop=false;
        }
    void step()
        {
        if ((e0<0.0)||(e0>*e)) { e0=*e; aa=a; }         // better solution
        if (stop)                                       // increase accuracy
            {
            i++; if (i>=n) { done=true; a=aa; return; } // final solution
            a0=aa-fabs(da);
            a1=aa+fabs(da);
            a=a0; da*=0.1;
            a0+=da; a1-=da;
            stop=false;
            }
        else
            {
            a+=da; if (a>a1) { a=a1; stop=true; }       // next point
            }
        }
    };
//---------------------------------------------------------------------------
#endif
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
This is how to use it:
approx aa;
double ee,x,y,x0,y0=here_your_known_value;
//            a0,  a1,  da, n, ee
for (aa.init(0.0,10.0, 0.1, 6, &ee); !aa.done; aa.step())
    {
    x = aa.a;        // this is x(i)
    y = f(x);        // here compute the y value for whatever you want to fit
    ee = fabs(y-y0); // compute error of solution for the approximation search
    }
In the comment above the for (aa.init(... line the operands are named. a0,a1 is the interval on which x(i) is probed, da is the initial step between the x(i), and n is the number of recursions. So if n=6 and da=0.1, the final max error of the x fit will be ~0.1/10^6=0.0000001. &ee is a pointer to the variable where the actual error is computed. I chose a pointer so there are no collisions when nesting this, and also for speed, since passing a parameter to a heavily used function adds call overhead.
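For completeness, here is a minimal, self-contained sketch (my own example, not part of the original answer) that uses the class to solve the concrete transcendental equation cos(x)=x on <0,2> and reads the found solution from aa.aa; it assumes the class above is saved as approx.h:

#include <math.h>
#include <stdio.h>
#include "approx.h"               // the class above

int main()
    {
    approx aa;
    double ee;
    // find x in <0,2> such that cos(x)-x = 0 (the Dottie number, ~0.7390851)
    for (aa.init(0.0,2.0,0.1,6,&ee); !aa.done; aa.step())
        {
        double x = aa.a;          // current probe x(i)
        ee = fabs(cos(x)-x);      // error metric driving the search
        }
    printf("x0 ~ %.7f\n",aa.aa);  // best found solution (also copied into aa.a when done)
    return 0;
    }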
[notes]
This approximation search can be nested to any dimensionality (but of course you need to be careful about the speed); see some examples:
Approximation of n points to the curve with the best fit
Curve fitting with y points on repeated x positions (Galaxy Spiral arms)
Increasing accuracy of solution of transcendental equation
Find Minimum area ellipse enclosing a set of points in c++
2D TDoA Time Difference of Arrival
3D TDoA Time Difference of Arrival
In the case of a non-function fit, when you need to get "all" the solutions, you can use recursive subdivision of the search interval after a solution is found to check for further solutions. See example:
Given an X co-ordinate, how do I calculate the Y co-ordinate for a point so that it rests on a Bezier Curve
What should you be aware of?
You have to carefully choose the search interval <a0,a1> so that it contains the solution but is not too wide (or the search would be slow). The initial step da is also very important: if it is too big you can miss local min/max solutions, and if it is too small the whole thing gets too slow (especially for nested multidimensional fits).

A combination of the secant (with bracketing, but see the correction at the bottom) and bisection methods is much better (credit for the original graphics of course due to user Spektre in their answer above):
We find root approximations by secants, and keep the root bracketed as in bisection.
Always keep the two edges of the interval such that the delta at one edge is negative and at the other it is positive, so the root is guaranteed to be inside; and instead of halving, use the secant method.
Pseudocode:
Given a function f,
Given two points a, b, such that a < b and sign(f(a)) /= sign(f(b)),
Given tolerance tol,
TO FIND root z of f such that abs(f(z)) < tol -- stop_condition
DO:
x = root of f by linear interpolation of f between a and b
m = midpoint between a and b
if stop_condition holds at x or m, set z and STOP
[a,b] := [a,x,m,b].sort.choose_shortest_interval_with_
_opposite_signs_at_its_ends
This obviously halves the interval [a,b], or does even better, at each iteration; so unless the function is behaving extremely badly (like, say, sin(1/x) near x=0), this will converge very quickly, taking at most two evaluations of f for each iteration step.
And we can detect the badly behaving cases by checking that b-a does not become too small (esp. if we're working with finite precision, as with doubles).
Update: apparently this is actually the double false position method, which is secant with bracketing, as described by the pseudocode above. Augmenting it with the middle point as in bisection ensures convergence even in the most pathological cases.
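For illustration, a minimal C++ sketch of the idea (my own rendering of the pseudocode above, not code from either answer); it assumes f is continuous and that f(a) and f(b) have opposite signs:

#include <cmath>
#include <functional>

// Hybrid of false position (secant with bracketing) and bisection, following the
// pseudocode above: each step looks at the secant root x and the midpoint m, then
// keeps a sub-interval (at most half the original) that still brackets the root.
double find_root(std::function<double(double)> f, double a, double b, double tol)
{
    double fa = f(a), fb = f(b);                   // assumed: opposite signs
    for (int iter = 0; iter < 200; ++iter)
    {
        double x = a - fa*(b - a)/(fb - fa);       // secant / linear interpolation
        double m = 0.5*(a + b);                    // bisection midpoint
        double fx = f(x), fm = f(m);
        if (std::fabs(fx) < tol) return x;         // stop_condition at x
        if (std::fabs(fm) < tol) return m;         // stop_condition at m
        double lo = std::fmin(x, m), hi = std::fmax(x, m);
        double flo = (lo == x) ? fx : fm;
        double fhi = (hi == x) ? fx : fm;
        if      (std::signbit(flo) != std::signbit(fhi)) { a = lo; fa = flo; b = hi; fb = fhi; }
        else if (std::signbit(fa)  != std::signbit(flo)) { b = lo; fb = flo; }
        else                                             { a = hi; fa = fhi; }
        if (b - a < tol) break;                    // interval degenerated (badly behaving f guard)
    }
    return 0.5*(a + b);
}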

Related

Evaluating multiplication with exponential function

I'm trying to come up with a good way to evaluate the following function
double foo(std::vector<double> const& x, double c = 0.95)
{
    auto N = x.size(); // Small power of 2 such as 512 or 1024
    double sum = 0;
    for (auto i = 0; i != N; ++i) {
        sum += (x[i] * pow(c, double(i)/N));
    }
    return sum;
}
My two main concerns with this naive implementation are performance and accuracy. So I suspect that the most trivial improvement would be to reverse the loop order: for (auto i = N-1; i != -1; --i) (The -1 wraps around, this is OK). This improves accuracy by adding smaller terms first.
While this is good for accuracy, it keeps the performance problem of pow. Mathematically, pow(c, double(i)/N) equals pow(c, double(i-1)/N) * pow(c, 1.0/N), and the latter factor is a constant. So in theory we can replace pow with repeated multiplication. While good for performance, this hurts accuracy - errors will accumulate.
I suspect that there's a significantly better algorithm hiding in here. For instance, the fact that N is a power of two means that there is a middle term x[N/2] that's multiplied with sqrt(c). That hints at a recursive solution.
On a somewhat related numerical observation, this looks like a signal multiplication with an exponential, so I naturally think : "FFT, trivial convolution=shift, IFFT", but that seems to offer no real benefit in terms of accuracy or performance.
So, is this a well-known problem with known solutions?
The task is a polynomial evaluation. The method for a single evaluation with the least operation count is the Horner scheme. In general a low operation count will reduce the accumulation of floating point noise.
As the example value c=0.95 is close to 1, any root will be still closer to 1 and thus lose accuracy. Avoid that by computing the difference to 1 directly, z=1-c^(1/n), via
z = -expm1(log(c)/N).
Now you have to evaluate the polynomial
sum of x[i] * (1-z)^i
which can be done by careful modification of the Horner scheme. Instead of
for(i=N; i-->0; ) {
    res = res*(1-z) + x[i];
}
use
for(i=N; i-->0; ) {
    res = (res + x[i]) - res*z;
}
which is mathematically equivalent, but has the loss of digits in 1-z happening as late as possible, without resorting to more involved methods like doubly accurate addition.
In tests, those two methods, contrary to the intent, gave almost the same results. A substantial improvement could be observed by separating the result into its value at c=1 (i.e. z=0) and a multiple of z, as in
double res0 = 0, resz = 0;
int i;
for(i=N; i-->0; ) {
    /* res0 + z*resz = (res0 + z*resz)*(1-z) + x[i]; */
    resz = resz - res0 - z*resz;
    res0 = res0 + x[i];
}
The test case that showed this improvement was for the coefficient sequence of
f(u) = (1-u/N)^(N-2)*(1-u)
where for N=1000 the evaluations result in
c        z=1-c^(1/N)       f(1-z)            diff for 1st proc     diff for 3rd proc
0.950000 0.000051291978909 0.000018898570629 1.33289104579937e-17 4.43845264361253e-19
0.951000 0.000050239954368 0.000018510931892 1.23765066121009e-16 -9.24959978401696e-19
0.952000 0.000049189034371 0.000018123700958 1.67678642238461e-17 -5.38712954453735e-19
0.953000 0.000048139216599 0.000017736876972 -2.86635949350855e-17 -2.37169225231204e-19
...
0.994000 0.000006018054217 0.000002217256601 1.31645860662263e-17 1.15619997300212e-19
0.995000 0.000005012529261 0.000001846785028 -4.15668713370839e-17 -3.5363625547867e-20
0.996000 0.000004008013365 0.000001476685973 8.48811716443534e-17 8.470329472543e-22
0.997000 0.000003004504507 0.000001106958687 1.44711343873661e-17 -2.92226366802734e-20
0.998000 0.000002002000667 0.000000737602425 5.6734266807093e-18 -6.56450534122083e-21
0.999000 0.000001000499833 0.000000368616443 -3.72557383333555e-17 1.47701370177469e-20
Yves' answer inspired me.
It seems that the best approach is to not calculate pow(c, 1.0/N) directly, but indirectly:
cc[0]=c; cc[1]=sqrt(cc[0]), cc[2]=sqrt(cc[1]),... cc[logN] = sqrt(cc[logN-1])
Or in binary,
cc[0]=c, cc[1]=c^0.1, cc[2]=c^0.01, cc[3]=c^0.001, ....
Now if we need x[0b100100] * c^0.100100, we can calculate that as x[0b100100]* c^0.1 * c^0.0001. I don't need to precalculate a table of size N, as geza suggested. A table of size log(N) is probably sufficient, and it can be created by repeatedly taking square roots.
[edit]
As pointed out in a comment thread on another answer, pairwise summation is very effective in keeping errors under control. And it happens to combine extremely nicely with this answer.
We start by observing that we sum
x[0] * c^0.0000000
x[1] * c^0.0000001
x[2] * c^0.0000010
x[3] * c^0.0000011
...
So, we run log(N) iterations. In iteration 1, we add the N/2 pairs x[2i] + x[2i+1]*c^0.0000001 and store the result in x[i]. In iteration 2, we add the pairs x[2i] + x[2i+1]*c^0.0000010 (i.e. with multiplier c^(2/N)), etcetera. The chief difference with normal pairwise summation is that each step is a multiply-and-add.
We see now that in each iteration, we're using the same multiplier pow(c, 2^i/N), which means we only need to calculate log(N) multipliers. It's also quite cache-efficient, as we're doing only contiguous memory access. It also allows for easy SIMD parallelization, especially when you have FMA instructions.
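A minimal sketch of this scheme (my own code, not from the answer); it assumes N = x.size() is a power of two, and it obtains the log2(N) multipliers c^(2^k/N) by repeated square roots of c, as suggested earlier in this answer:

#include <cmath>
#include <vector>

// Pairwise multiply-and-add evaluation of sum_i x[i]*c^(i/N).
// Iteration k combines pairs with the single multiplier m[k] = c^(2^k/N);
// the multipliers are obtained by repeated square roots of c.
double foo_pairwise(std::vector<double> x, double c = 0.95)
{
    std::size_t N = x.size();
    int levels = 0;
    while ((std::size_t(1) << levels) < N) ++levels;   // levels = log2(N)

    std::vector<double> m(levels);
    double r = c;
    for (int k = levels - 1; k >= 0; --k) { r = std::sqrt(r); m[k] = r; }  // m[levels-1]=sqrt(c), m[k]=sqrt(m[k+1])

    for (int k = 0; k < levels; ++k)                   // after the last level, x[0] holds the sum
    {
        std::size_t half = N >> (k + 1);
        for (std::size_t i = 0; i < half; ++i)
            x[i] = x[2*i] + x[2*i + 1]*m[k];           // FMA-friendly, contiguous access
    }
    return N ? x[0] : 0.0;
}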
If N is a power of 2, you can replace the evaluations of the powers by geometric means, using
a^((i+j)/2) = √(a^i · a^j)
and recursively subdivide between c^(N/N) and c^(0/N). With preorder recursion, you can make sure to accumulate by increasing weights.
Anyway, the speedup of sqrt vs. pow might be marginal.
You can also stop recursion at a certain level and continue linearly, with mere products.
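A small sketch of that recursive subdivision (my own illustration with hypothetical names); it assumes N = x.size() is a power of two and accumulates from the smaller weights to the larger ones (for 0 < c < 1):

#include <cmath>
#include <vector>

// Evaluates sum_i x[i]*c^(i/N) on the half-open range [lo,hi), given plo = c^(lo/N)
// and phi = c^(hi/N). Each midpoint power is the geometric mean sqrt(plo*phi),
// so no pow() call is needed at all.
static void accum(const std::vector<double>& x, std::size_t lo, std::size_t hi,
                  double plo, double phi, double& sum)
{
    if (hi - lo == 1) { sum += x[lo]*plo; return; }
    std::size_t mid = (lo + hi)/2;                    // exact: sub-range sizes stay powers of two
    double pmid = std::sqrt(plo*phi);                 // c^(mid/N) as a geometric mean
    accum(x, mid, hi, pmid, phi, sum);                // upper half first: smaller weights for 0<c<1
    accum(x, lo, mid, plo, pmid, sum);                // then the larger weights
}

double foo_recursive(const std::vector<double>& x, double c = 0.95)
{
    double sum = 0.0;
    if (!x.empty()) accum(x, 0, x.size(), 1.0, c, sum);   // c^(0/N) = 1, c^(N/N) = c
    return sum;
}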
You could mix repeated multiplication by pow(c, 1./N) with some explicit pow calls. I.e. every 16th iteration or so do a real pow and otherwise move forward with the multiply. This should yield large performance benefits at negligible accuracy cost.
Depending on how much c varies, you might even be able to precompute and replace all pow calls with a lookup, or just the ones needed in the above method (= smaller lookup table = better caching).
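A short sketch of that mix (my own code; the resync period of 16 is just the example value from this answer):

#include <cmath>
#include <vector>

// Repeated multiplication by step = c^(1/N), resynchronised with an exact pow()
// every 16 iterations so rounding errors cannot accumulate for long.
double foo_resync(const std::vector<double>& x, double c = 0.95)
{
    const std::size_t N = x.size();
    const double step = std::pow(c, 1.0/double(N));
    double w = 1.0, sum = 0.0;
    for (std::size_t i = 0; i < N; ++i)
    {
        if (i % 16 == 0) w = std::pow(c, double(i)/double(N));  // periodic exact resync
        sum += x[i]*w;
        w *= step;                                              // cheap update in between
    }
    return sum;
}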

Optimization accumulation of vectors for monte carlo simulation

I want to optimize the following code:
During a Monte Carlo simulation I accumulate some quantities f(x) (f(x) is expensive to compute) and save them in the array bins after every sampling step.
EDIT: f(x) is not a deterministic function of x (by that I mean it generates pseudo random numbers and uses them to modify the result) and it also depends on previously calculated values f(y).
for(int n=0;n<N;n++)
{
    // compute some values f(x) at points "p"
    for(auto k: p) bins[k] += f(k);
}
p.size() is much smaller than the size of bins, but eventually most elements will be set.
After the simulation I accumulate my final values by doing a weighted sum over bins (g is a lookup in another array):
for(int l=0;l<M;l++)
    for(int k=0;k<bins.size();k++)
        finalResult[l] += g(k,l)*bins[k];
I could of course compute my updated finalResult after every sampling step, this does however slow the program down a lot, due to the loop over M.
I already tried a very basic boost::accumulate, but this did not improve performance (if I stay with this design I will have to use it eventually due to stability, though).
All arrays are of type Eigen::MatrixXd since I need them for BLAS operations.
p.size() < 10^2
N ~ 10^7
M ~ 10^4
bins.size() ~ 10^5
Do you have any suggestions on which techniques could be useful for optimization here?
Try computing f(x) just once for each of the N values (i.e. memoization). So for instance, if N is large (like it is in this situation), try changing your loop to something like the following:
static std::unordered_map<unsigned int, double> memoizedFunction;
for(int n=0;n<N;n++)
{
    // compute some values f(x) at points "p"
    for(auto k: p)
    {
        auto it = memoizedFunction.find( k );
        if (it == memoizedFunction.end())
        {
            it = memoizedFunction.emplace( k, f(k) ).first;  // compute f(k) only on first encounter
        }
        bins[k] += it->second;
    }
}
Alternatively, you could just store the number of times the kth bin has been hit in bins[k] and then at the end go through and compute bins[k] * f(k) for each k.
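A sketch of that alternative (my own code with hypothetical names); note that it only matches the original loop if f(k) can be treated as fixed per k, which the EDIT above says is not strictly the case here:

#include <vector>

// Count hits per bin during sampling and call the expensive f(k) only once per bin at the end.
void accumulate_by_counts(std::vector<double>& bins,
                          const std::vector<std::vector<int>>& samples,  // the points "p" of each step
                          double (*f)(int))
{
    std::vector<long long> hits(bins.size(), 0);
    for (const auto& p : samples)
        for (int k : p) ++hits[k];                             // cheap counting in the hot loop
    for (std::size_t k = 0; k < bins.size(); ++k)
        if (hits[k]) bins[k] += double(hits[k])*f(int(k));     // one expensive evaluation per bin
}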
Just a thought here, but if you could verify that f(x) is a linear transformation, then you could create the matrix A such that [f(x)] = A[x], where [x] is the coordinate vector of x with respect to some basis B.
That could make f(x) easier and faster to compute, especially if x lives in a vector space with a small basis.
However, if converting between coordinates and back is expensive, that could kill any overall benefit (just keep that in mind).
Here are some links that could help explain the matrix representation of linear transformations:
https://math.colorado.edu/~nita/MatrixRepresentations.pdf
https://math.dartmouth.edu/archive/m24w07/public_html/Lecture12.pdf
https://en.wikipedia.org/wiki/Transformation_matrix

summing array of doubles with large value span : proper algorithm

I have an algorithm where I need to sum (a lot of times) double numbers ranging from about 1e-40 to 1e+40.
Array Example (randomly dumped from real application):
-2.06991e-05
7.58132e-06
-3.91367e-06
7.38921e-07
-5.33143e-09
-4.13195e-11
4.01724e-14
6.03221e-17
-4.4202e-20
6.58873
-1.22257
-0.0606178
0.00036508
2.67599e-07
0
-627.061
-59.048
5.92985
0.0885884
0.000276455
-2.02579e-07
It goes without saying that I am aware of the rounding effects this will cause; I am trying to keep them under control: the final result should not have any missing information in the fractional part of the double or, if that is not avoidable, the result should be at least n-digit accurate (with n defined). The end result needs something like 5 digits plus exponent.
After some decent thinking, I ended up with following algorithm :
Sort the array so that the largest absolute value comes first, closest to zero last.
Add everything in a loop
The idea is that in this case, any cancellation of large values (negative and positive) will not impact the smaller values added later.
In short:
(10^40 - 10^40) + 1 = 1 : result is as expected
(1 + 10^40) - 10^40 = 0 : not good
I ended up using std::multiset (a benchmark on my PC gave 20% higher speed with long double compared to normal doubles - I am fine with double resolution) with a custom sort function using std::fabs.
It's still quite slow (it takes 5 seconds to do the whole thing) and I still have this feeling of "you missed something in your algo". Any recommendations:
for speed optimization. Is there a better way to sort the intermediate products ? Sorting a set of 40 intermediate results (typically) takes about 70% of the total execution time.
for missed issues. Is there a chance to still lose critical data (one that should have been in the fractional part of the final result) ?
On a bigger picture, I am implementing real coefficient polynomial classes of pure imaginary variable (electrical impedances : Z(jw)). Z is a big polynom representing a user defined system, with coefficient exponent ranging very far.
The "big" comes from adding things like Zc1 = 1/jC1w to Zc2 = 1/jC2w :
Zc1 + Zc2 = ((C1+C2)(jw) + 0) / (C1C2(jw)^2 + 0(jw) + 0)
In this case, with C1 and C2 in nanofarads (10^-9), C1C2 is already of order 10^-18 (and that is only the start...)
my sort function uses a Manhattan distance of the complex values (because mine are either purely real or purely imaginary):
struct manhattan_complex_distance
{
    bool operator() (std::complex<long double> a, std::complex<long double> b)
    {
        return std::fabs(std::real(a) + std::imag(a)) > std::fabs(std::real(b) + std::imag(b));
    }
};
and my multiset in action:
std::complex<long double> get_value(std::vector<std::complex<long double>>& frequency_vector)
{
    // frequency_vector is precalculated once and for all to have at index n the value (jw)^n.
    std::multiset<std::complex<long double>, manhattan_complex_distance> temp_list;
    for (int i=0; i<m_coeficients.size(); ++i)
    {
        // element of : ℝ * ℂ
        temp_list.insert(m_coeficients[i] * frequency_vector[i]);
    }
    std::complex<long double> ret = 0;
    for (auto i : temp_list)
    {
        // it is VERY important to start adding the big values before adding the small ones.
        // in floating point, 10^60 - 10^60 + 1 = 1; while 1 + 10^60 - 10^60 = 0. Of course you'd expect to get 1, not 0.
        ret += i;
    }
    return ret;
}
The project I have is c++11 enabled (mainly for improvement of the math lib and complex number tools)
ps: I refactored the code to make it easy to read; in reality all the complex and long double names are templated: I can change the polynomial type in no time or use the class for regular polynomials over ℝ.
As GuyGreer suggested, you can use Kahan summation:
double sum = 0.0;
double c = 0.0;
for (double value : values) {
    double y = value - c;
    double t = sum + y;
    c = (t - sum) - y;
    sum = t;
}
EDIT: You should also consider using Horner's method to evaluate the polynomial.
double value = coeffs[degree];
for (auto i = degree; i-- > 0;) {
    value *= x;
    value += coeffs[i];
}
Sorting the data is on the right track. But you definitely should be summing from smallest magnitude to largest, not from largest to smallest. Summing from largest to smallest, by the time you get to the smallest values, aligning the next value with the current sum is liable to cause most or all of its bits to 'fall off the end'. Summing instead from smallest to largest, the smallest values get a chance to accumulate a decent-sized partial sum, and more of their bits will make it into the final result. Combined with Kahan summation, that should yield a fairly accurate sum.
First: have your math keep track of error. Replace your doubles with error-aware types, and when you add or multiply two of them together, also calculate the maximum error.
This is about the only way you can guarantee that your code produces accurate results while being reasonably fast.
Second, don't use a multiset. The associative containers are not for sorting, they are for maintaining a sorted collection, while being able to incrementally add or remove elements from it efficiently.
The ability to add/remove elements incrementally means it is node-based, and node-based means it is slow in general.
If you simply want a sorted collection, start with a vector then std::sort it.
Next, to minimize error, keep a list of positive and negative elements. Start with zero as your sum. Now pick the smallest of either the positive or negative elements such that the total of your sum and that element is closest to zero.
Do so with elements that calculate their error bounds.
At the end, determine if you have 5 digits of precision, or not.
These error-propagating doubles should ideally be used as early in the algorithm as possible.
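A minimal sketch of the positive/negative balancing idea (my own code, without the error-aware type that the full version is assumed to use):

#include <algorithm>
#include <cmath>
#include <vector>

// Keep positives and negatives separately, sorted by magnitude (smallest first),
// and always consume the element that pulls the running sum closest to zero.
double balanced_sum(std::vector<double> values)
{
    std::vector<double> pos, neg;
    for (double v : values) (v >= 0 ? pos : neg).push_back(v);
    auto by_magnitude = [](double a, double b) { return std::fabs(a) < std::fabs(b); };
    std::sort(pos.begin(), pos.end(), by_magnitude);
    std::sort(neg.begin(), neg.end(), by_magnitude);

    double sum = 0.0;
    std::size_t ip = 0, in = 0;
    while (ip < pos.size() || in < neg.size())
    {
        if (in == neg.size())      sum += pos[ip++];
        else if (ip == pos.size()) sum += neg[in++];
        // choose whichever smallest remaining element keeps |sum| closest to zero
        else if (std::fabs(sum + pos[ip]) < std::fabs(sum + neg[in])) sum += pos[ip++];
        else                                                          sum += neg[in++];
    }
    return sum;
}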

Spline options for interpolating between control points on a population curve?

I'm looking to model population curves using an interpolative spline between 7 control points. My problem is that I can't find any grokkable/digestable coding/math resource that compares the pros and cons of various splines in layman's terms.
First, here's a simplified illustration of the population curve in my code:
struct CurvePoint {
public:
    short height;  // the point's height along a population curve, measured as age in years
                   // (can be negative, representing a building trend whose peak/valley will happen in the future)
    double width;  // the (nonnegative) width of the population curve at this point, measured as
                   // number of people **born within a single year**
    // Each CurvePoint represents one "bar" of a population pyramid that's one year tall.
};

class PopulationCurve {
public:
    std::array<CurvePoint, 7> control_points;  // assumes that control_points[i].height < control_points[i + 1].height
                                               // and that control_points[i].width >= 0
    // control_points[0] is the young end of the curve (typically a nonpositive number; see above)
    // control_points[6] is the old end of the curve (typically representing the longevity of the population;
    // if so, control_points[6].width will be 0 since no one is left alive at that point)

    std::vector<CurvePoint> constructCurve() {
        std::vector<CurvePoint> curve;
        short youngest_age = control_points[0].height;
        short oldest_age = control_points[6].height;
        for (auto a = youngest_age; a <= oldest_age; ++a) {
            CurvePoint p;
            p.height = a;
            // p.width = ??? (the interpolated width at age a)
            curve.push_back(p);
        }
        return curve;
    }

    void deconstructCurve(std::vector<CurvePoint> curve) {
        std::array<CurvePoint, 7> sampled_control_points;
        // ??? (turn point samples from the input curve into control points as appropriate)
        control_points = sampled_control_points;
    }
};
The hardcoding of 7 control points is intentional. I'm implementing a choice between two compression schemes: virtually lossless compression of 7 control points in 44 bytes, and lossy compression of 7 control points in 20 bytes (my application is currently more memory/disk-limited than CPU-limited). I don't believe those compression schemes are relevant to the question, but let me know if I need to show their code, especially if there's a good reason I should be considering <7 or >7 control points.
Here are the criteria I'm looking for in a spline, in descending order of importance:
Interpolation between all control points. This is by far the most important criterion; otherwise, I would've used a Bézier curve or b-spline.
Interpolation between the first and last control point only. If all points aren't interpolated between, then only the first and last can be (i.e. what a Bézier curve or b-spline would've got me).
Fast constructability + deconstructability. There's a near-1:1 correlation between constructCurve() and deconstructCurve(); almost every call to construct will eventually be followed up by a call to deconstruct, so I'm only interested in the combined performance and the not the performance of either one individually. That being said, while I'm very interested in memory/disk optimization right now, I am not going to prematurely optimize speed, so this is a consideration only.
Reasonably accurate deconstructability.
Lossless deconstructability if no changes to the curve are made. i.e. If deconstructCurve(constructCurve()); is called, control_points will remain the same.
Prettiness =) (since linear interpolation between control points is the best match for the rest of the criteria...)
I didn't post this question on math since it's not entirely language-agnostic and contains C++ code. I didn't post it on gamedev since it's a question of implementation and not design.
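For reference, a minimal sketch of constructCurve() using plain linear interpolation between consecutive control points, the baseline option the question itself notes as the best match for criteria 1-5 (my own code, hypothetical function name):

#include <array>
#include <vector>

// Linear interpolation of width between consecutive control points; assumes the
// CurvePoint/PopulationCurve declarations above (heights strictly increasing).
std::vector<CurvePoint> constructCurveLinear(const std::array<CurvePoint, 7>& cp)
{
    std::vector<CurvePoint> curve;
    for (std::size_t s = 0; s + 1 < cp.size(); ++s)       // each segment [s, s+1]
    {
        short h0 = cp[s].height, h1 = cp[s + 1].height;
        for (short a = h0; a < h1; ++a)
        {
            double t = double(a - h0) / double(h1 - h0);  // 0..1 along the segment
            CurvePoint p;
            p.height = a;
            p.width  = cp[s].width + t * (cp[s + 1].width - cp[s].width);
            curve.push_back(p);
        }
    }
    curve.push_back(cp.back());                           // include the last control point
    return curve;
}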

Pick a matrix cell according to its probability

I have a 2D matrix of positive real values, stored as follow:
vector<vector<double>> matrix;
Each cell can have a value equal to or greater than 0, and this value represents the relative probability of the cell being chosen. In particular, for example, a cell with a value of 3 has three times the probability of being chosen compared to a cell with value 1.
I need to select N cells of the matrix (0 <= N <= total number of cells) randomly, but according to their probability to be selected.
How can I do that?
The algorithm should be as fast as possible.
I describe two methods, A and B.
A works in time approximately N * number of cells, and uses space O(log number of cells). It is good when N is small.
B works in time approximately (number of cells + N) * O(log number of cells), and uses space O(number of cells). So, it is good when N is large (or even, 'medium') but uses a lot more memory, in practice it might be slower in some regimes for that reason.
Method A:
The first thing you need to do is normalize the entries. (It's not clear to me if you assume they are normalized or not.) That means, sum all the entries and divide by the sum. (This part is potentially slow, so it's better if you assume or require that it already happened.)
Then you sample like this:
Choose a random [i,j] entry of the matrix (by choosing i,j each uniformly randomly from the range of integers 0 to n-1).
Choose a uniformly random real number p in the range [0, 1].
Check if matrix[i][j] > p. If so, return the pair [i][j]. If not, go back to step 1.
Why does this work? The probability that we end at step 3 with any particular output is equal to the probability that [i][j] was selected (this is the same for each entry), times the probability that the number p was small enough. This is proportional to the value matrix[i][j], so the sampling chooses each entry with the correct proportions. It's also possible that at step 3 we go back to the start -- does that bias things? Basically, no. The reason is: suppose we arbitrarily choose a number k and then consider the distribution of the algorithm conditioned on stopping exactly after k rounds. Conditioned on the assumption that we stop at the k'th round, no matter what value of k we choose, the distribution we sample from has to be exactly right by the above argument, since once we eliminate the case that p is too small, the other possibilities all have their proportions correct. Since the distribution is correct for each value of k that we might condition on, and the overall distribution (not conditioned on k) is an average of the distributions for each value of k, the overall distribution is correct as well.
If you want to analyze the number of rounds typically needed in a rigorous way, you can do it by analyzing the probability that we actually stop at step 3 in any particular round. Since the rounds are independent, this is the same for every round, and statistically it means that the number of rounds is geometrically distributed. That means it is tightly concentrated around its mean, and we can determine the mean by knowing that probability.
The probability that we stop at step 3 can be determined by considering the conditional probability that we stop at step 3, given that we chose any particular entry [i][j]. By the formulas for conditional expectation, you get that
Pr[ stop at step 3 ] = sum_{i,j} ( 1/(n^2) * Matrix[i,j] )
Since we assumed the matrix is normalized, this sum reduces to just 1/n^2. So, the expected number of rounds is about n^2 (that is, n^2 up to a constant factor) no matter what the entries in the matrix are. You can't hope to do a lot better than that I think -- that's about the same amount of time it takes to just read all the entries of the matrix, and it's hard to sample from a distribution that you cannot even read all of.
Note: What I described is a way to correctly sample a single element -- to get N elements from one matrix, you can just repeat it N times.
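A short sketch of method A (my own code using std::mt19937; it assumes a rectangular matrix whose entries are nonnegative and have already been normalized to sum to 1, as required above):

#include <random>
#include <utility>
#include <vector>

// Method A: rejection sampling of a single cell, proportional to its (normalized) value.
std::pair<std::size_t, std::size_t>
sample_cell(const std::vector<std::vector<double>>& matrix, std::mt19937& rng)
{
    std::uniform_int_distribution<std::size_t> row(0, matrix.size() - 1);
    std::uniform_int_distribution<std::size_t> col(0, matrix[0].size() - 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    while (true)
    {
        std::size_t i = row(rng), j = col(rng);   // step 1: uniformly random cell
        double p = unit(rng);                     // step 2: uniformly random p in [0,1]
        if (matrix[i][j] > p) return { i, j };    // step 3: accept, otherwise retry
    }
}

To draw N cells, simply call sample_cell N times, as the note above says.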
Method B:
Basically you just want to compute a histogram and sample inversely from it, so that you know you get exactly the right distribution. Computing the histogram is expensive, but once you have it, getting samples is cheap and easy.
In C++ it might look like this:
// Make histogram
typedef unsigned int uint;
typedef std::pair<uint, uint> upair;
typedef std::map<double, upair> histogram_type;
histogram_type histogram;
double cumulative = 0.0;
for (uint i = 0; i < Matrix.size(); ++i) {
    for (uint j = 0; j < Matrix[i].size(); ++j) {
        cumulative += Matrix[i][j];
        histogram[cumulative] = std::make_pair(i, j);
    }
}

std::vector<upair> result;
for (uint k = 0; k < N; ++k) {
    // Do a sample (this should never repeat... if it does not find a lower bound you could also
    // assert false quite reasonably, since it would mean something is wrong with the rand() implementation)
    while (1) {
        double p = cumulative * (rand() / (double)RAND_MAX); // Or, for best results, use std::mt19937 or
                                                             // boost::mt19937 and sample a real in [0,1] here.
        histogram_type::iterator it = histogram.lower_bound(p);
        if (it != histogram.end()) {
            result.push_back(it->second);
            break;
        }
    }
}
return result;
Here the time to make the histogram is something like O(number of cells * log(number of cells)), since inserting into the map takes O(log n). You need an ordered data structure in order to get the cheap O(N * log(number of cells)) lookups later when you do the repeated sampling. Possibly you could choose a more specialized data structure to go faster, but I think there's only limited room for improvement.
Edit: As #Bob__ points out in the comments, in method (B) as written there is potentially going to be some error due to floating point round-off if the matrices are quite large, even using type double, at this line:
cumulative += Matrix[i][j];
The problem is that if cumulative is much larger than Matrix[i][j], beyond what the floating point precision can handle, then each time this statement is executed you may observe significant errors, which accumulate and introduce significant inaccuracy.
As he suggests, if that happens, the most straightforward way to fix it is to sort the values Matrix[i][j] first (smallest first). You could even do this in the general implementation to be safe -- sorting them isn't going to take more time asymptotically than what you already have anyway.