I am looking to generate derangements uniformly at random. In other words: shuffle a vector so that no element stays in its original place.
Requirements:
uniform sampling (each derangement is generated with equal probability)
a practical implementation is faster than the rejection method (i.e. keep generating random permutations until we find a derangement)
None of the answers I have found so far are satisfactory: they either don't sample uniformly (or fail to prove uniformity) or do not make a practical comparison with the rejection method. About 1/e ≈ 37% of permutations are derangements, which gives a clue about what performance one might expect at best relative to the rejection method.
The only reference I found which makes a practical comparison is in this thesis which benchmarks 7.76 s for their proposed algorithm vs 8.25 s for the rejection method (see page 73). That's a speedup by a factor of only 1.06. I am wondering if something significantly better (> 1.5) is possible.
I could implement and verify various algorithms proposed in papers, and benchmark them. Doing this correctly would take quite a bit of time. I am hoping that someone has done it, and can give me a reference.
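For reference, the rejection baseline mentioned in the requirements can be sketched as follows. This is a minimal illustration, not a tuned benchmark candidate; the function names and the `rng` parameter are my own choices.

```python
import random

def is_derangement(perm):
    """True if no element is left in its original position."""
    return all(p != i for i, p in enumerate(perm))

def derangement_by_rejection(n, rng=random):
    """Shuffle until the permutation has no fixed point.

    Each attempt succeeds with probability close to 1/e, so about
    e (roughly 2.7) shuffles are needed on average.
    """
    if n == 1:
        raise ValueError("a single element cannot be deranged")
    perm = list(range(n))
    while True:
        rng.shuffle(perm)
        if is_derangement(perm):
            return perm

print(derangement_by_rejection(8))
```

Since every permutation is equally likely after a shuffle, every derangement is equally likely to be the first one accepted, which is why this method is the natural yardstick.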
Here is an idea for an algorithm that may work for you. Generate the derangement in cycle notation. So (1 2) (3 4 5) represents the derangement 2 1 4 5 3. (That is (1 2) is a cycle and so is (3 4 5).)
Put the first element in the first place (in cycle notation you can always do this) and take a random permutation of the rest. Now we just need to find out where the parentheses go for the cycle lengths.
As https://mathoverflow.net/questions/130457/the-distribution-of-cycle-length-in-random-derangement notes, in a permutation, the length of a random cycle is uniformly distributed. In derangements the lengths are not uniformly distributed. But the number of derangements of length m is m!/e rounded up for even m and down for odd m. So what we can do is pick a length uniformly distributed in the range 2..n and accept it with the probability that the remaining elements would, proceeding randomly, be a derangement. This cycle length will be correctly distributed. And then once we have the first cycle length, we repeat for the next until we are done.
The procedure done the way I described is simpler to implement but mathematically equivalent to taking a random derangement (by rejection), and writing down the first cycle only. Then repeating. It is therefore possible to prove that this produces all derangements with equal probability.
With this approach done naively, we will be taking an average of 3 rolls before accepting a length. However we then cut the problem in half on average. So the number of random numbers we need to generate for placing the parentheses is O(log(n)). Compared with the O(n) random numbers for constructing the permutation, this is a rounding error. However it can be optimized by noting that the highest probability for accepting is 0.5. So if we accept with twice the probability of randomly getting a derangement if we proceeded, our ratios will still be correct and we get rid of most of our rejections of cycle lengths.
If most of the time is spent in the random number generator, for large n this should run at approximately 3x the rate of the rejection method. In practice it won't be as good because switching from one representation to another is not actually free. But you should get speedups of the order of magnitude that you wanted.
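A rough sketch of the cycle-notation idea, under a few assumptions of mine: elements are 0-based, the whole sequence is shuffled (rather than pinning the first element) and then cut into cycles, and the acceptance test uses exact integers, d[r] out of r!, instead of floating point. The names `derangement_counts` and `derangement_by_cycles` are hypothetical, and the doubling optimisation is left out. Computing factorials makes this unsuitable for large n as written.

```python
import math
import random

def derangement_counts(n):
    """d[m] = number of derangements of m elements, for m = 0..n."""
    d = [1, 0]
    for m in range(2, n + 1):
        d.append((m - 1) * (d[m - 1] + d[m - 2]))
    return d[: n + 1]

def derangement_by_cycles(n, rng=random):
    d = derangement_counts(n)
    if d[n] == 0:
        raise ValueError("a single element cannot be deranged")
    order = list(range(n))
    rng.shuffle(order)               # random sequence that will be cut into cycles
    perm = [None] * n
    start = 0
    while start < n:
        m = n - start                # elements not yet assigned to a cycle
        while True:
            k = rng.randint(2, m)    # proposed length of the next cycle
            rest = m - k
            # accept with probability d[rest] / rest!, i.e. the chance that
            # a random arrangement of the remaining elements is a derangement
            if rng.randrange(math.factorial(rest)) < d[rest]:
                break
        cycle = order[start:start + k]
        for a, b in zip(cycle, cycle[1:] + cycle[:1]):
            perm[a] = b              # each element maps to its successor in the cycle
        start += k
    return perm

print(derangement_by_cycles(5))
```

A proposal that would leave exactly one element over is never accepted (d[1] = 0), and a cycle that uses up all remaining elements is always accepted (d[0] = 0! = 1).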
This is just an idea, but I think it can produce uniformly distributed derangements.
It needs a helper buffer with at most around N/2 elements, where N is the number of items to be arranged.
First, choose a random(1,N) position for value 1.
Note: positions run from 1 to N instead of 0 to N-1 for simplicity.
Then for value 2, the position will be random(1,N-1) if value 1 fell on position 2, and random(1,N-2) otherwise.
The algorithm walks the list and counts only the not-yet-used positions until it reaches the chosen random position for value 2; position 2 itself is of course skipped.
For value 3, the algorithm checks whether position 3 is already used. If used, pos3 = random(1,N-2); if not, pos3 = random(1,N-3).
Again, the algorithm walks the list and counts only the not-yet-used positions until the count reaches pos3, and then places value 3 there.
This continues for the next values until all the values have been placed.
That should generate derangements with uniform probability.
The optimization will focus on how fast the algorithm can reach pos#.
Instead of walking the list to count the not-yet-used positions one by one, the algorithm can use a heap-like search over the unused positions, or some other method. This is a separate problem to be solved: how to reach an unused item given its position count in a list of unused items.
I'm curious ... and mathematically uninformed. So I ask innocently, why wouldn't a "simple shuffle" be sufficient?
for i from array_size-1 downto 1:   # assume zero-based arrays
    j = random(0, i-1)              # inclusive bounds, so j is never equal to i
    swap_elements(i, j)
Since the random function will never produce a value equal to i it will never leave an element where it started. Every element will be moved "somewhere else."
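One way to probe this question empirically is to run the loop many times on a small array and look at which results come out. The sketch below does that for 4 elements; the helper names are mine. The loop is usually known as Sattolo's algorithm, and it only ever produces permutations that form one single cycle.

```python
import random

def simple_shuffle(n, rng=random):
    """The loop from the post: swap position i with a random j < i."""
    a = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)         # 0 <= j < i, so j is never i
        a[i], a[j] = a[j], a[i]
    return a

def cycle_lengths(perm):
    """Sorted lengths of the cycles of a permutation given as a list."""
    seen, lengths = set(), []
    for s in range(len(perm)):
        if s not in seen:
            length, x = 0, s
            while x not in seen:
                seen.add(x)
                x = perm[x]
                length += 1
            lengths.append(length)
    return sorted(lengths)

rng = random.Random(0)
outcomes = {tuple(simple_shuffle(4, rng)) for _ in range(2000)}
print(len(outcomes))                                  # distinct results seen
print({tuple(cycle_lengths(p)) for p in outcomes})    # their cycle structures
```

For 4 elements there are 9 derangements, but only the 6 that are a single 4-cycle can appear here; the 3 derangements made of two 2-cycles, such as (1 2)(3 4), are never generated. So every output is a derangement, but the sampling is not uniform over all derangements.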
Let d(n) be the number of derangements of an array A of length n.
d(n) = (n-1) * (d(n-1) + d(n-2))
The d(n) arrangements are achieved by:
1. First, swapping A[0] with one of the remaining n-1 elements.
2. Next, either deranging all n-1 remaining elements, or deranging the n-2 remaining elements that exclude the index which received A[0] from the initial array.
How can we generate a derangement uniformly at random?
1. Perform the swap of step 1 above.
2. Randomly decide which path we're taking in step 2,
with probability d(n-1)/(d(n-1)+d(n-2)) of deranging all remaining elements.
3. Recurse down to derangements of size 2-3 which are both precomputed.
Wikipedia has d(n) = floor(n!/e + 0.5) (exactly). You can use this to calculate the probability of step 2 exactly in constant time for small n. For larger n the factorial can be slow, but all you need is the ratio. It's approximately (n-1)/n. You can live with the approximation, or precompute and store the ratios up to the max n you're considering.
Note that (n-1)/n converges very quickly.
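Here is one possible way to turn the recurrence into an iterative sampler. Treat it as a sketch under my own assumptions: it works from the last index downwards, keeps a `done` flag per position, and uses exact integer counts d(n) rather than the (n-1)/n approximation. The name `derangement_by_recurrence` is hypothetical.

```python
import random

def derangement_by_recurrence(n, rng=random):
    if n == 1:
        raise ValueError("a single element cannot be deranged")
    d = [1, 0]                           # d[m] = number of derangements of m items
    for m in range(2, n + 1):
        d.append((m - 1) * (d[m - 1] + d[m - 2]))
    a = list(range(n))
    done = [False] * n                   # positions removed from further swapping
    u = n                                # number of positions still in play
    i = n - 1
    while u >= 2:
        if not done[i]:
            j = rng.randrange(i)         # pick a partner below i that is still in play
            while done[j]:
                j = rng.randrange(i)
            a[i], a[j] = a[j], a[i]
            # Step 2 of the answer: with probability d(u-1)/(d(u-1)+d(u-2)) keep
            # position j in play; otherwise retire it and derange the other u-2.
            if rng.randrange(d[u - 1] + d[u - 2]) >= d[u - 1]:
                done[j] = True
                u -= 1
            u -= 1                       # position i is settled either way
        i -= 1
    return a

print(derangement_by_recurrence(8))
```

Enumerating the branches by hand for n = 4 gives each of the 9 derangements probability 1/9, which is at least consistent with uniformity; a proof for general n would follow the recurrence step by step.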
How can I solve the following problem as efficiently as possible (perhaps with linear complexity)?
As an input, I have some intervals (start, end), each of which holds a value (integer). There is no fixed number of intervals.
I want to find the group of intervals that do not overlap AND whose values sum to the highest possible total (so the number of intervals doesn't matter as much as the resulting value).
I was thinking of implementing it as a graph with weighted edges and using perhaps Dijkstra or something similar. But the problem is inserting into the graph, which would take far too much time. How can I make this better (or perhaps implement the graph efficiently)?
This problem is known as weighted interval scheduling.
The idea is to sort the intervals by their right ends and then use dynamic programming to find the weight of the heaviest subset that ends in a specific interval or before it. You can use binary search to efficiently find the rightmost interval that can be picked before the current one. The time complexity is O(N log N).
You can read more about it here: https://courses.cs.washington.edu/courses/cse521/13wi/slides/06dp-sched.pdf.
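A minimal sketch of that approach, assuming intervals are given as (start, end, value) tuples and that two intervals which merely touch (one ends where the next starts) do not count as overlapping. The function name is my own.

```python
from bisect import bisect_right

def max_weight_schedule(intervals):
    """Best total value of pairwise non-overlapping (start, end, value) intervals."""
    ivs = sorted(intervals, key=lambda iv: iv[1])     # sort by right end
    ends = [end for _, end, _ in ivs]
    best = [0] * (len(ivs) + 1)      # best[k]: optimum using only the first k intervals
    for k, (start, _end, value) in enumerate(ivs, 1):
        # p = how many of the earlier intervals end at or before this start
        p = bisect_right(ends, start, 0, k - 1)
        best[k] = max(best[k - 1],        # skip interval k
                      best[p] + value)    # take it on top of the compatible prefix
    return best[-1]

jobs = [(1, 3, 5), (2, 5, 6), (4, 6, 5), (6, 7, 4), (5, 8, 11), (7, 9, 2)]
print(max_weight_schedule(jobs))   # 17, from (2, 5, 6) and (5, 8, 11)
```

The sort and the N binary searches each cost O(N log N), which matches the bound quoted above.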
Start by sorting the intervals with respect to their start. Then, define the function f(i) that gives the maximum sum, considering all intervals including i and up. This function can be calculated with a dynamic program. To specify f(i), you have 2 options:
Include i and continue to the next interval that does not overlap i, so:
i.value + f(nextNonOverlappingInterval)
Do not include i and continue with the next interval:
f(nextInterval)
So overall, your function is:
f(i) = max(i.value + f(nextNonOverlappingInterval), f(nextInterval))
Set up a dynamic program to calculate f beginning at the last interval and propagate to the first, which then has your solution.
It is also possible to use alternative formulations of this problem to calculate from front to back or to use memoization to solve it.
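The back-to-front formulation can be sketched like this, again assuming (start, end, value) tuples and that touching intervals do not overlap. Unlike a value-only version, this one also walks the table afterwards to recover which intervals were chosen. The names `best_intervals` and `nxt` are mine.

```python
from bisect import bisect_left

def best_intervals(intervals):
    """Return (best total, chosen intervals) using f(i) computed from the back."""
    ivs = sorted(intervals)                    # sort by start
    starts = [s for s, _, _ in ivs]
    n = len(ivs)
    f = [0] * (n + 1)                          # f[n] = 0: no intervals left
    nxt = [n] * n                              # index of the next non-overlapping interval
    for i in range(n - 1, -1, -1):
        _, end, value = ivs[i]
        nxt[i] = bisect_left(starts, end, i + 1)
        f[i] = max(value + f[nxt[i]], f[i + 1])
    chosen, i = [], 0
    while i < n:
        if f[i] == f[i + 1]:                   # skipping interval i loses nothing
            i += 1
        else:                                  # interval i is part of the optimum
            chosen.append(ivs[i])
            i = nxt[i]
    return f[0], chosen

jobs = [(1, 3, 5), (2, 5, 6), (4, 6, 5), (6, 7, 4), (5, 8, 11), (7, 9, 2)]
print(best_intervals(jobs))   # (17, [(2, 5, 6), (5, 8, 11)])
```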
I have input array A
A[0], A[1], ... , A[N-1]
I want a function Max(T,A) which returns B, the maximum of A over the preceding moving window of size T, where
B[i+T] = max(A[i], A[i+1], ..., A[i+T])
Using a max heap to keep track of the maximum over the current moving window A[i] to A[i+T], this algorithm is O(N log(T)) in the worst case.
I would like to know whether there is a better algorithm, maybe an O(N) one.
O(N) is possible using Deque data structure. It holds pairs (Value; Index).
at every step:
    if (!Deque.Empty) and (Deque.Head.Index <= CurrentIndex - T) then
        Deque.ExtractHead;
        //Head is too old, it is leaving the window
    while (!Deque.Empty) and (Deque.Tail.Value > CurrentValue) do
        Deque.ExtractTail;
        //remove elements that have no chance to become minimum in the window
    Deque.AddTail(CurrentValue, CurrentIndex);
    CurrentMin = Deque.Head.Value
    //Head value is minimum in the current window
    //For the moving maximum asked about, flip the comparison in the while loop to <
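A Python sketch of the same deque technique, written for the maximum that the question asks about. It assumes a window of exactly w consecutive samples and stores only indices in the deque (the values are looked up in the array); `sliding_max` is my own name.

```python
from collections import deque

def sliding_max(a, w):
    """Maximum of each length-w window of a, in O(len(a)) total time."""
    dq = deque()             # indices whose values decrease from head to tail
    out = []
    for i, x in enumerate(a):
        if dq and dq[0] <= i - w:
            dq.popleft()     # head is too old, it is leaving the window
        while dq and a[dq[-1]] <= x:
            dq.pop()         # these can never be the maximum while x is in the window
        dq.append(i)
        if i >= w - 1:       # the first full window ends at index w-1
            out.append(a[dq[0]])
    return out

print(sliding_max([1, 3, -1, -3, 5, 3, 6, 7], 3))   # [3, 3, 5, 5, 6, 7]
```

Each index is appended once and removed at most once, so the total work is linear even though a single step can pop several elements.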
This is called RMQ (range minimum query). I once wrote an article about it (with C++ code); see http://attiix.com/2011/08/22/4-ways-to-solve-%C2%B11-rmq/
Or you may prefer the Wikipedia article, Range Minimum Query.
After the preprocessing, you can get the max number of any given range in O(1).
There is a sub-field in image processing called Mathematical Morphology. The operation you are implementing is a core concept in this field, called dilation. Obviously, this operation has been studied extensively and we know how to implement it very efficiently.
The most efficient algorithm for this problem was proposed in 1992 and 1993, independently by van Herk, and Gil and Werman. This algorithm needs exactly 3 comparisons per sample, independently of the size of T.
Some years later, Gil and Kimmel further refined the algorithm to need only 2.5 comparisons per sample, though the increased complexity of the method might offset the fewer comparisons (I find that more complex code runs more slowly). I have never implemented this variant.
The HGW algorithm, as it's called, needs two intermediate buffers of the same size as the input. For ridiculously large inputs (billions of samples), you could split up the data into chunks and process it chunk-wise.
In short, you walk through the data forward, computing the cumulative max over chunks of size T. You do the same walking backward. Each of these requires one comparison per sample. Finally, the result is the maximum over one value in each of these two temporary arrays, which is the third comparison. For data locality, you can do the two passes over the input at the same time.
I guess you could even do a running version, where the temporary arrays are of length 2*T, but that would be more complex to implement.
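To make the forward/backward idea concrete, here is a small sketch of it in Python. It assumes a trailing window of w samples with no padding at the edges, so the output has len(a) - w + 1 entries; the names `g`, `h` and `hgw_sliding_max` are my own, and this is an illustration of the scheme rather than a tuned implementation.

```python
def hgw_sliding_max(a, w):
    """Maxima of all length-w windows of a via forward/backward block scans."""
    n = len(a)
    if w < 1 or w > n:
        raise ValueError("window size must be between 1 and len(a)")
    g = list(a)    # g[i]: max from the start of i's block up to i
    h = list(a)    # h[i]: max from i up to the end of i's block
    for i in range(1, n):
        if i % w:                        # not the first sample of a block
            g[i] = max(g[i - 1], a[i])
    for i in range(n - 2, -1, -1):
        if (i + 1) % w:                  # not the last sample of a block
            h[i] = max(h[i + 1], a[i])
    # A window [i, i+w-1] touches at most two blocks:
    # h[i] covers its part of the left block, g[i+w-1] its part of the right block.
    return [max(h[i], g[i + w - 1]) for i in range(n - w + 1)]

print(hgw_sliding_max([1, 3, -1, -3, 5, 3, 6, 7], 3))   # [3, 3, 5, 5, 6, 7]
```

The two scans and the final merge each use at most one comparison per sample, independent of w, which is where the figure of 3 comparisons per sample comes from.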
van Herk, "A fast algorithm for local minimum and maximum filters on rectangular and octagonal kernels", Pattern Recognition Letters 13(7):517-521, 1992
Gil, Werman, "Computing 2-D min, median, and max filters", IEEE Transactions on Pattern Analysis and Machine Intelligence 15(5):504-507, 1993
Gil, Kimmel, "Efficient dilation, erosion, opening, and closing algorithms", IEEE Transactions on Pattern Analysis and Machine Intelligence 24(12):1606-1617, 2002
(Note: cross-posted from this related question on Code Review.)
My program contains polygons which have the form of a vector containing points (2 dimensional double coordinates, stored in a self-made structure). I'm looking for a quick way of finding the smallest square containing my polygon (ie. knowing the maximal and minimal coordinates of all the points).
Is there a quicker way than just parsing all the points and storing the minimum and maximum values?
The algorithm you are describing is straightforward: iterate over all your points and find the minimum and maximum for each coordinate. This is an O(n) algorithm, n being the number of points you have.
You can't do better, since you will need to check at least all your points once, otherwise the last one could be outside the square you found.
Now, the complexity is at best O(n), so you just have to minimize the constant factors, but in that case they are already pretty small: only one loop over your vector, looking for two maximums and two minimums.
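As a small illustration of that single loop (in Python rather than the asker's language, with points assumed to be (x, y) pairs and `bounding_box` a name of my choosing):

```python
def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) for an iterable of (x, y) pairs."""
    it = iter(points)
    try:
        min_x, min_y = max_x, max_y = next(it)
    except StopIteration:
        raise ValueError("need at least one point") from None
    for x, y in it:
        # a value below the minimum cannot also exceed the maximum, hence elif
        if x < min_x:
            min_x = x
        elif x > max_x:
            max_x = x
        if y < min_y:
            min_y = y
        elif y > max_y:
            max_y = y
    return min_x, min_y, max_x, max_y

print(bounding_box([(1.0, 2.0), (-3.0, 5.0), (4.0, -1.0)]))   # (-3.0, -1.0, 4.0, 5.0)
```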
You can either iterate through all points and find max and min values, or do some preprocessing, for example, store your points in treap (http://en.wikipedia.org/wiki/Treap).
There is no way w/o some preprocessing to do it better than just iterating over all points.
I'm not sure if there can be any faster way to find the min & max values in an array of values than linear time. The only 'optimization' I can think of is to find these values on one of the other occasions you're iterating the array (filling it/performing a function on all points), then perform checks on any data update.
Short version: how to most efficiently represent and add two random variables given by lists of their realizations?
Mildly longer version:
For a work project, I need to add several random variables, each of which is given by a list of values. For example, the realizations of rand. var. A are {1,2,3} and the realizations of B are {5,6,7}. Hence, what I need is the distribution of A+B, i.e. {1+5,1+6,1+7,2+5,2+6,2+7,3+5,3+6,3+7}. And I need to do this kind of adding several times (let's denote this number of additions as COUNT, where COUNT might reach 720) for different random variables (C, D, ...).
The problem: if I use this naive algorithm of summing each realization of A with each realization of B, the complexity is exponential in COUNT. Hence, for the case where each r.v. is given by three values, the amount of calculation for COUNT=720 is 3^720 ≈ 3.4 × 10^343, which will take till the end of our days to calculate :) Not to mention that in real life, the length of each r.v. is going to be 5000+.
Solutions:
1/ The first solution is to use the fact that I am OK with rounding, i.e. having integer values of realizations. This way, I can represent each r.v. as a vector in which the index corresponding to a realization holds a value of 1 (when the r.v. has this realization once). So for a r.v. A and a vector of realizations indexed from 0 to 10, the vector representing A would be [0,1,1,1,0,0,0,...] and the representation for B would be [0,0,0,0,0,1,1,1,0,0,0]. Now I create A+B by going through these vectors and doing the same thing as above (sum each realization of A with each realization of B and encode it into the same vector structure; quadratic complexity in vector length). The upside of this approach is that the complexity is bounded. The problem with this approach is that in real applications, the realizations of A will be in the interval [-50000, 50000] with a granularity of 1. Hence, after adding two random variables, the span of A+B grows to [-100K, 100K], and after 720 additions, the span of SUM(A, B, ...) grows to [-36M, 36M], and even quadratic complexity (compared to exponential complexity) on arrays this large will take forever.
2/ To have shorter arrays, one could possibly use a hashmap, which would most likely reduce the number of operations (array accesses) involved in A+B, as the assumption is that some non-trivial portion of the theoretical span [-50K, 50K] will never be a realization. However, as more and more random variables are summed, the number of realizations increases exponentially while the span increases only linearly, hence the density of numbers in the span increases over time. And this would kill the hashmap's benefits.
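To illustrate what solutions 1/ and 2/ compute, here is a small sketch of the hashmap variant using a value-to-count map. It assumes the variables are independent (every realization of A is paired with every realization of B, as in the example above); `add_rv` is a name I made up.

```python
from collections import Counter

def add_rv(a, b):
    """Distribution of A+B, with a and b given as {value: count} maps."""
    out = Counter()
    for x, cx in a.items():
        for y, cy in b.items():
            out[x + y] += cx * cy    # cx*cy pairs of realizations give the sum x+y
    return out

A = Counter([1, 2, 3])
B = Counter([5, 6, 7])
print(sorted(add_rv(A, B).items()))   # [(6, 1), (7, 2), (8, 3), (9, 2), (10, 1)]
```

Written this way, the operation is a discrete convolution of the two count vectors. For dense integer vectors, convolutions can also be computed with FFT-based routines in roughly O(M log M) for span M instead of O(M^2), which may be worth testing against the sizes quoted above.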
So the question is: how can I do this problem efficiently? The solution is needed for calculating a VaR in electricity trading where all distributions are given empirically and are like no ordinary distributions, hence formulas are of no use, we can only simulate.
Using math was considered as the first option as half of our dept. are mathematicians. However, the distributions that we're going to add are badly behaved and the COUNT=720 is an extreme. More likely, we are going to use COUNT=24 for a daily VaR. Taking into account the bad behaviour of distributions to add, for COUNT=24 the central limit theorem would not hold too closely (the distro of SUM(A1, A2, ..., A24) would not be close to normal). As we're calculating possible risks, we'd like to get a number as precise as possible.
The intended use is this: you have hourly cashflows from some operation. The distribution of cashflows for one hour is the r.v. A. For the next hour, it's r.v. B, etc. And your question is: what is the largest loss in 99 percent of cases? So you model the cashflows for each of those 24 hours and add these cashflows as random variables so as to get a distribution of the total cashflow over the whole day. Then you take the 0.01 quantile.
Try to reduce the number of passes required to make the whole addition, possibly reducing it to a single pass for every list, including the final one.
I don't think you can cut down on the total number of additions.
In addition, you should look into parallel algorithms and multithreading, if applicable.
At this point, most processors are able to perform additions in parallel, given proper instructions (SSE), which will make the additions many times faster (still not a cure for the complexity problem).
As you said in your question, you're going to need an awful lot of computation to get the exact answer. So it's not going to happen.
However, as you're dealing with random values, it would be possible to apply some mathematics to the problem. Wouldn't the result of all these additions approach the normal distribution? For example, consider rolling a single die. Each number has equal probability, so the realisations don't follow a normal distribution. However, if you roll two dice and sum them, the totals already cluster around the middle (a triangular distribution), and as more dice are added the sum gets closer and closer to a normal distribution. So I think the result of your computation is going to approximate a normal distribution, and it becomes a problem of finding the average value and the sigma value for a given set of inputs. You can work out the upper and lower bounds for each input as well as their averages, and I'm sure a bit of Googling will provide methods for applying functions to normal distributions.
I guess there is a corollary question and that is what the results are used for? Knowing how the results are used will inform the decision on how the results are created.
Ignoring the programmatic solutions, you can cut down the total number of additions quite significantly as your data set grows.
If we define four groups W, X, Y and Z, each with three elements, by your own maths this leads to a large number of operations:
W + X => 9 operations
(W + X) + Y => 27 operations
(W + X + Y) + Z => 81 operations
TOTAL: 117 operations
However, if we assume a strictly-ordered definition of your "add" operation so that two sets {a,b} and {c,d} always result in {a+c,a+d,b+c,b+d} then your operation is associative. That means that you can do this:
W + X => 9 operations
Y + Z => 9 operations
(W + X) + (Y + Z) => 81 operations
TOTAL: 99 operations
This is a saving of 18 operations, for a simple case. If you extend the above to 6 groups of 3 members, the total number of operations can be dropped from 1089 to 837, a saving of about 23%. This improvement is more pronounced the more data you have (more sets or more elements will give more savings).
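The operation counts above can be checked with a few lines of Python. This sketch only counts additions under the assumption that combining sets of sizes p and q costs p*q operations and yields p*q elements (no merging of equal sums); both function names are mine.

```python
def sequential_cost(sizes):
    """Operations for ((W + X) + Y) + Z ... done left to right."""
    total, current = 0, sizes[0]
    for s in sizes[1:]:
        current *= s
        total += current
    return total

def pairwise_cost(sizes):
    """Operations when neighbours are combined in pairs, level by level."""
    total, level = 0, list(sizes)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            combined = level[i] * level[i + 1]
            total += combined
            nxt.append(combined)
        if len(level) % 2:           # odd one out moves up unchanged
            nxt.append(level[-1])
        level = nxt
    return total

print(sequential_cost([3] * 4), pairwise_cost([3] * 4))   # 117 99
print(sequential_cost([3] * 6), pairwise_cost([3] * 6))   # 1089 837
```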
Further, this opens the problem to better parallelisation: if you have 200 groups to process, you can start by combining the 100 pairs in parallel, then the 50 pairs of results, then 25, etc. This will allow a large degree of parallelism that should give you much better performance. (For example, 720 sets would be added in ~10 parallel operations as each parallel add will allow increasing COUNT by a factor of 2.)
I'm absolutely no expert on this, but it would seem an ideal problem for using the parallel processing capability of a typical GPU - my understanding is that something like CUDA would make short work of processing all these calculations in parallel.
EDIT: If your real question is "what's your largest loss" then this is a much easier problem. Given that every value in the ultimate set is the sum of one value from each "component" set, your biggest loss will generally be found by combining the lowest value from each component set. Finding these lower values (one value per set) is a much simpler job, and you then only need sum together that limited set of values.
There are basically two methods. An approximative one and an exact one...
Approximative method models the sum of random variables by a lot of samplings. Basically, having random variables A, B we randomly sample from each r.v. 50K times, add the sampled values (here SSE can help a lot) and we have a distribution of A+B. This is how mathematicians would do this in Mathematica.
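A plain-Python sketch of the sampling approach (no SSE or vectorisation here). It assumes each r.v. is a list of equally likely realizations and that the variables are sampled independently; `sampled_quantile` and its parameters are hypothetical names.

```python
import random

def sampled_quantile(variables, q=0.01, draws=50_000, rng=random):
    """Estimate the q-quantile of the sum by drawing one value from each r.v."""
    totals = sorted(
        sum(rng.choice(v) for v in variables)   # one simulated total
        for _ in range(draws)
    )
    return totals[int(q * draws)]

A = [-5, -3, -2]
B = [0, 1, 2]
C = [7, 8, 9]
print(sampled_quantile([A, B, C], rng=random.Random(0)))
```

The cost is proportional to draws times the number of variables, regardless of how many realizations each variable has, at the price of sampling error in the tail.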
Exact method utilizes something Dan Puzey proposed, namely summing only some small portion of each r.v.'s density. Let's say we have random variables with the following "densities" (where each value is of the same likelihood for simplicity's sake)
A = {-5,-3,-2}
B = {+0,+1,+2}
C = {+7,+8,+9}
The sum of A+B+C is going to be
{2,3,3,4,4,4,4,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,8,8,8,9}
and if I want to know the whole distribution precisely, I have no other choice than summing each elem of A with each elem of B and then each elem of this sum with each elem of C. However, if I only want the 99% VaR of this sum, i.e. the 1st percentile of this sum, I only have to sum the smallest elements of A,B,C.
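The full distribution quoted above can be reproduced by brute force, which is handy as a reference when testing any shortcut. A minimal check using the same A, B, C:

```python
from itertools import product

A = [-5, -3, -2]
B = [0, 1, 2]
C = [7, 8, 9]

# every combination of one realization from each variable
sums = sorted(a + b + c for a, b, c in product(A, B, C))
print(len(sums))    # 27
print(sums[0])      # 2: the smallest total, reached only by -5 + 0 + 7
```

With 27 equally likely outcomes, 1% of the mass is less than a single outcome, so the 1st percentile here is simply the smallest sum.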
More precisely, I will take nA,nB,nC smallest elements from each distribution. To determine nA,nB,nC let's set these to 1 first. Then, increase nA by one if A[nA] = min( A[nA], B[nB], C[nC]) (counting on that A,B,C are sorted). This way, I can get the nA, nB, nC smallest elements of A,B,C which I will have to sum together (each with each other) and take the X-th smallest sum (where X is 1% multiplied by total combination count of sums, i.e. 3*3*3 for A,B,C). This also tells when to stop increasing nA,nB,nC - stop when nA*nB*nC > X.
However, done like this I am repeating the same redundancy, i.e. I am calculating the whole distribution of A+B+C left of the 1st percentile. Even so, this will be MUCH shorter than calculating the whole distribution of A+B+C. But I believe there should be a simple iterative algorithm that gives exactly the required VaR number in O(a*b), where a is the number of added r.v.s and b is the max number of elements in the density of each r.v.
I will be glad for any comments on whether I am correct.