I have a 2D matrix of positive real values, stored as follow:
vector<vector<double>> matrix;
Each cell can have a value greater than or equal to 0, and this value represents how likely the cell is to be chosen. In particular, for example, a cell with a value of 3 is three times as likely to be chosen as a cell with a value of 1.
I need to select N cells of the matrix (0 <= N <= total number of cells) randomly, but according to their probability to be selected.
How can I do that?
The algorithm should be as fast as possible.
I describe two methods, A and B.
A works in time approximately N * (number of cells) and uses O(log(number of cells)) space. It is good when N is small.
B works in time approximately (number of cells + N) * O(log(number of cells)) and uses O(number of cells) space. So it is good when N is large (or even 'medium'), but it uses a lot more memory, and in practice it might be slower in some regimes for that reason.
Method A:
The first thing you need to do is normalize the entries. (It's not clear to me whether you assume they are normalized or not.) That means summing all the entries and dividing each one by the sum. (This part is potentially slow, so it's better if you can assume or require that it has already happened.)
Then you sample like this:
Choose a random [i,j] entry of the matrix (by choosing i,j each uniformly randomly from the range of integers 0 to n-1).
Choose a uniformly random real number p in the range [0, 1].
Check if matrix[i][j] > p. If so, return the pair [i,j]. If not, go back to step 1.
Why does this work? The probability that we end at step 3 with any particular output is equal to the probability that [i,j] was selected in step 1 (which is the same for every entry) times the probability that the number p was small enough. This is proportional to the value matrix[i][j], so the sampling chooses each entry with the correct proportions. It's also possible that at step 3 we go back to the start -- does that bias things? Basically, no. The reason is: suppose we arbitrarily choose a number k and then consider the distribution of the algorithm conditioned on stopping after exactly k rounds. No matter which value of k we condition on, the distribution we sample from has to be exactly right by the above argument, because once we eliminate the case that p is too large, the remaining possibilities all keep their correct proportions. Since the distribution is correct for each value of k that we might condition on, and the overall distribution (not conditioned on k) is an average of the distributions for each value of k, the overall distribution is correct as well.
If you want to analyze the number of rounds typically needed in a rigorous way, you can do it by analyzing the probability that we actually stop at step 3 in any particular round. The rounds are independent and this probability is the same for every round, so the number of rounds is geometrically distributed. That means it is tightly concentrated around its mean, and we can determine the mean from that probability.
The probability that we stop at step 3 can be determined by considering the conditional probability that we stop at step 3, given that we chose any particular entry [i][j]. By the formulas for conditional expectation, you get that
Pr[ stop at step 3 ] = sum_{i,j} ( 1/(n^2) * Matrix[i,j] )
Since we assumed the matrix is normalized, this sum is just 1/n^2. So the expected number of rounds is about n^2 (that is, n^2 up to a constant factor), no matter what the entries of the matrix are. You can't hope to do a lot better than that, I think -- that's about the same amount of time it takes to just read all the entries of the matrix, and it's hard to sample from a distribution that you cannot even read in full.
Note: What I described is a way to correctly sample a single element -- to get N elements from one matrix, you can just repeat it N times.
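For concreteness, here is a minimal C++ sketch of Method A, assuming the matrix is square (n x n) and already normalized as described above; the function names sample_one and sample_n are mine:

#include <random>
#include <utility>
#include <vector>

std::pair<int, int> sample_one(const std::vector<std::vector<double>>& matrix,
                               std::mt19937& gen)
{
    const int n = static_cast<int>(matrix.size());
    std::uniform_int_distribution<int> cell(0, n - 1);
    std::uniform_real_distribution<double> real(0.0, 1.0);
    while (true) {
        const int i = cell(gen);     // step 1: pick a cell uniformly at random
        const int j = cell(gen);
        const double p = real(gen);  // step 2: pick p uniformly in [0, 1]
        if (matrix[i][j] > p)        // step 3: accept, otherwise try again
            return std::make_pair(i, j);
    }
}

// To get N cells, just repeat the single-element sampler N times (with replacement).
std::vector<std::pair<int, int>> sample_n(
    const std::vector<std::vector<double>>& matrix, int N, std::mt19937& gen)
{
    std::vector<std::pair<int, int>> result;
    result.reserve(N);
    for (int k = 0; k < N; ++k)
        result.push_back(sample_one(matrix, gen));
    return result;
}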
Method B:
Basically you just want to compute a cumulative histogram and sample from it by inversion, so that you know you get exactly the right distribution. Computing the histogram is expensive, but once you have it, getting samples is cheap and easy.
In C++ it might look like this:
#include <cstdlib>   // rand, RAND_MAX
#include <map>
#include <utility>
#include <vector>

typedef unsigned int uint;
typedef std::pair<uint, uint> upair;
typedef std::map<double, upair> histogram_type;

std::vector<upair> sample_cells(const std::vector<std::vector<double>>& Matrix, uint N)
{
    // Make histogram (a cumulative sum keyed by the running total)
    histogram_type histogram;
    double cumulative = 0.0;
    for (uint i = 0; i < Matrix.size(); ++i) {
        for (uint j = 0; j < Matrix[i].size(); ++j) {
            if (Matrix[i][j] <= 0.0)
                continue;   // skip zero-weight cells so they can never be selected
            cumulative += Matrix[i][j];
            histogram[cumulative] = std::make_pair(i, j);
        }
    }

    std::vector<upair> result;
    for (uint k = 0; k < N; ++k) {
        // Do a sample (this should never repeat... if it does not find a lower bound you
        // could also assert false quite reasonably, since it means something is wrong with
        // the rand() implementation)
        while (1) {
            // Or, for best results, use std::mt19937 or boost::mt19937 and sample a real
            // in the range [0, 1] here.
            double p = cumulative * (rand() / static_cast<double>(RAND_MAX));
            histogram_type::iterator it = histogram.lower_bound(p);
            if (it != histogram.end()) {
                result.push_back(it->second);
                break;
            }
        }
    }
    return result;
}
Here the time to make the histogram is something like (number of cells) * O(log(number of cells)), since inserting into the map takes O(log n) time. You need an ordered data structure in order to get the cheap N * O(log(number of cells)) lookups later when you do repeated sampling. Possibly you could choose a more specialized data structure to go faster, but I think there's only limited room for improvement.
Edit: As #Bob__ points out in the comments, in method (B) as written there is potentially going to be some error due to floating-point round-off if the matrices are quite large, even using type double, at this line:
cumulative += Matrix[i][j];
The problem is that, if cumulative becomes much larger than Matrix[i][j], beyond what the floating-point precision can handle, then each time this statement is executed you may pick up small errors that accumulate and introduce significant inaccuracy.
As he suggests, if that happens, the most straightforward way to fix it is to sort the values Matrix[i][j] first and accumulate them in increasing order. You could even do this in the general implementation to be safe -- sorting these values isn't going to take more time asymptotically than you already spend anyway.
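For example, here is a small sketch of that fix, reusing the typedefs from the snippet above: collect the weights first, sort them ascending, and only then accumulate, so the running sum swamps the small addends less badly.

#include <algorithm>
#include <tuple>
#include <vector>

histogram_type make_histogram_sorted(const std::vector<std::vector<double>>& Matrix)
{
    // Gather (weight, i, j) triples, skipping zero-weight cells.
    std::vector<std::tuple<double, uint, uint>> cells;
    for (uint i = 0; i < Matrix.size(); ++i)
        for (uint j = 0; j < Matrix[i].size(); ++j)
            if (Matrix[i][j] > 0.0)
                cells.emplace_back(Matrix[i][j], i, j);

    std::sort(cells.begin(), cells.end());   // ascending by weight

    // Accumulate smallest-to-largest into the cumulative histogram.
    histogram_type histogram;
    double cumulative = 0.0;
    for (const auto& c : cells) {
        cumulative += std::get<0>(c);
        histogram[cumulative] = std::make_pair(std::get<1>(c), std::get<2>(c));
    }
    return histogram;
}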
I have the following problem:
Generate M uniformly random integers from the range 0-N, where N >> M, where no pair has a difference less than K, and where M >> K.
At the moment the best method I can think of is to maintain a sorted list, then determine the lower bound of the current generated integer and test it against the elements just below and above it, inserting it in between if both gaps are large enough. This is of complexity O(n log n).
Would there happen to be a more efficient algorithm?
An example of the problem:
Generate 1000 uniformly random integers between zero and 100million where the difference between any two integers is no less than 1000
A comprehensive way to solve this would be to:
Determine all the combinations of n-choose-m that satisfy the constraint; let's call this set X.
Select a uniformly random integer i in the range [0,|X|).
Select the i'th combination from X as the result.
This solution is problematic when the n-choose-m is large, as enumerating and storing all possible combinations will be extremely costly. Hence an efficient online generating solution is sought.
Note: The following is a C++ implementation of the solution provided by pentadecagon
#include <algorithm>
#include <random>
#include <vector>

std::vector<int> generate_random(const int n, const int m, const int k)
{
    // There is no valid solution unless m values, spaced at least k apart, fit in [0, n].
    if (n < (m - 1) * k)
        return std::vector<int>();

    std::random_device source;
    std::mt19937 generator(source());
    std::uniform_int_distribution<> distribution(0, n - (m - 1) * k);

    std::vector<int> result_list;
    result_list.reserve(m);
    for (int i = 0; i < m; ++i)
    {
        result_list.push_back(distribution(generator));
    }

    std::sort(std::begin(result_list), std::end(result_list));
    for (int i = 0; i < m; ++i)
    {
        result_list[i] += (i * k);
    }
    return result_list;
}
http://ideone.com/KOeR4R
EDIT: I adapted the text for the requirement to create ordered sequences, each with the same probability.
Create random numbers a_i for i=0..M-1 without duplicates. Sort them. Then create numbers
b_i=a_i + i*(K-1)
Given the construction, those numbers b_i have the required gaps, because the a_i already have gaps of at least 1. In order to make sure the b values cover exactly the required range [1..N], you must pick the a_i from the range [1..N-(M-1)*(K-1)]. This way you get truly independent numbers -- well, as independent as possible given the required gap. Because of the sorting you get O(M log M) performance again, but this shouldn't be too bad; sorting is typically very fast. In Python it looks like this:
import random

def random_list( N, M, K ):
    s = set()
    while len(s) < M:
        s.add( random.randint( 1, N-(M-1)*(K-1) ) )
    res = sorted( s )
    for i in range(M):
        res[i] += i * (K-1)
    return res
First off: this will be an attempt to show that there's a bijection between the (M+1)-compositions (with the slight modification that we will allow addends to be 0) of the value N - (M-1)*K and the valid solutions to your problem. After that, we only have to pick one of those compositions uniformly at random and apply the bijection.
Bijection:
Let

N - (M-1)*K = x_0 + x_1 + ... + x_M,   with every x_i >= 0.

Then the x_i form an (M+1)-composition (with 0 addends allowed) of the value on the left (notice that the x_i do not have to be monotonically increasing!).

From this we get a valid solution

0 <= m_1 < m_2 < ... < m_M <= N,   with m_(i+1) - m_i >= K for all i,

by setting the values m_i as follows:

m_i = x_0 + x_1 + ... + x_(i-1) + (i-1)*K   for i = 1, ..., M.

We see that the distance between m_i and m_(i+1) is x_i + K, which is at least K, and that m_M = N - x_M is at most N (compare the choice of the composition we started out with). This means that every (M+1)-composition that fulfills the conditions above defines exactly one valid solution to your problem. (You'll notice that we only use x_M as a way to make the sum turn out right; we don't use it for the construction of the m_i.)

To see that this gives a bijection, we need to see that the construction can be reversed; for this purpose, let

0 <= m_1 < m_2 < ... < m_M <= N,   with m_(i+1) - m_i >= K for all i,

be a given solution fulfilling your conditions. To get the composition it is constructed from, define the x_i as follows:

x_0 = m_1,   x_i = m_(i+1) - m_i - K for i = 1, ..., M-1,   x_M = N - m_M.

Now first, all x_i are at least 0, so that's alright. To see that they form a valid composition (again, every x_i is allowed to be 0) of the value given above, consider:

x_0 + x_1 + ... + x_M = m_1 + ( (m_2 - m_1 - K) + ... + (m_M - m_(M-1) - K) ) + (N - m_M)
                      = m_1 + (m_M - m_1) - (M-1)*K + N - m_M
                      = N - (M-1)*K.

The second equality holds because the sum in the middle telescopes and cancels out almost all of the m_i.
So we've seen that the described construction gives a bijection between the described compositions of N - (M-1)*K and the valid solutions to your problem. All we have to do now is pick one of those compositions uniformly at random and apply the construction to get a solution.
Picking a composition uniformly at random
Each of the described compositions can be uniquely identified in the following way (the standard stars-and-bars encoding): reserve N - (M-1)*K spaces for the unary notation of that value, and another M spaces for M commas. We get an (M+1)-composition of N - (M-1)*K by choosing M of the N - (M-1)*K + M spaces, putting commas there, and filling the rest with |. Then let x_0 be the number of | before the first comma, x_M the number of | after the last comma, and every other x_i the number of | between commas i and i+1. So all we have to do is pick an M-element subset of the integer interval [1, N - (M-1)*K + M] uniformly at random, which we can do for example with a Fisher-Yates shuffle in O(N + M log M) (we need to sort the M delimiters to build the composition), since M*K needs to be in O(N) for any solutions to exist. So if N is bigger than M by at least a logarithmic factor, then this is linear in N.
Note: #DavidEisenstat suggested that there are more space efficient ways of picking the M-element subset of that interval; I'm not aware of any, I'm afraid.
You can make the algorithm error-proof with the simple input validation the construction above gives us: check that N >= (M-1)*K and that all three values are at least 1 (or 0, if you define the empty set as a valid solution for that case).
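For what it's worth, here is a small C++ sketch of this sampler under the reconstruction above: choose the M comma positions as a random M-element subset of [1, N - (M-1)*K + M] (using a set, like the Python answer above, rather than a full Fisher-Yates shuffle), and map the sorted positions c_1 < ... < c_M directly to m_i = c_i - i + (i-1)*K. All names are mine.

#include <random>
#include <set>
#include <vector>

std::vector<long long> sample_with_gaps(long long N, int M, long long K)
{
    std::vector<long long> result;
    if (M <= 0 || N < static_cast<long long>(M - 1) * K)
        return result;   // no valid solution (or nothing to do)

    const long long slots = N - static_cast<long long>(M - 1) * K + M;
    std::mt19937_64 gen{std::random_device{}()};
    std::uniform_int_distribution<long long> dist(1, slots);

    std::set<long long> commas;   // M distinct comma positions
    while (static_cast<int>(commas.size()) < M)
        commas.insert(dist(gen));

    int i = 1;
    for (long long c : commas) {  // std::set iterates in increasing order
        result.push_back(c - i + static_cast<long long>(i - 1) * K);
        ++i;
    }
    return result;
}

The values come out already sorted and lie in the range [0, N], matching the question.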
Why not do this:
for (int i = 0; i < M; ++i) {
    pick a random number between K and N/M
    add this number to (N/M) * i
}

Now you have M random numbers, distributed evenly along N, all of which have a difference of at least K. It runs in O(M) time. As an added bonus, it's already sorted. :-)
EDIT:
Actually, the "pick a random number" part shouldn't be between K and N/M, but between min(K, [K - (N/M * i - previous value)]) and N/M. That would ensure that the differences are still at least K, without excluding values that shouldn't be missed.
Second EDIT:
Well, the first case shouldn't be between K and N/M -- it should be between 0 and N/M. Just like you need special casing when you get close to the N/M*i border, we need a special initial case.
Aside from that, the issue you brought up in your comments was fair representation, and you're right. As my pseudocode is presented, it currently completely misses the excess between N/M*M and N. It's another edge case; simply change the random values of your last range.
Now, in this case, your distribution will be different for the last range. Since you have more numbers, you have slightly less chance for each number than you do for all the other ranges. My understanding is that because you're using ">>", this shouldn't really impact the distribution, i.e. the difference in size in the sample set should be nominal. But if you want to make it more fair, you divide the excess equally among each range. This makes your initial range calculation more complex - you'll have to augment each range based on how much remainder there is divided by M.
There are lots of special cases to look out for, but they're all able to be handled. I kept the pseudocode very basic just to make sure that the general concept came through clearly. If nothing else, it should be a good starting point.
Third and Final EDIT:
For those worried that the distribution has a forced evenness, I still claim that there's nothing saying it can't. The selection is uniformly distributed in each segment. There is a linear way to keep it uneven, but that also has a trade-off: if one value is selected extremely high (which should be unlikely given a very large N), then all the other values are constrained:
#include <random>
#include <vector>

std::mt19937 gen{std::random_device{}()};
std::vector<int> values;
int prevValue = 0;
for (int i = 0; i < M; ++i) {
    // Leave room for the remaining (M - 1 - i) values, each of which needs a gap of K.
    int maxRange = N - (((M - 1) - i) * K) - prevValue;
    int nextValue = std::uniform_int_distribution<int>(0, maxRange)(gen);
    prevValue += nextValue;
    values.push_back(prevValue);   // store the value just generated
    prevValue += K;
}
This is still linear and random and allows unevenness, but the bigger prevValue gets, the more constrained the other numbers become. Personally, I prefer my second edit answer, but this is an available option that given a large enough N is likely to satisfy all the posted requirements.
Come to think of it, here's one other idea. It requires a lot more data maintenance, but it still depends only on M (not on N) and is probably the fairest distribution:
What you need to do is maintain a vector of your valid data ranges and a vector of probability scales. A valid data range is just a [low, high] interval in which a new value can still be placed while keeping every pairwise difference at least K. The idea is that you first use the scaled probabilities to pick a random data range, then you randomly pick a value within that range. You remove the old valid data range and replace it with 0, 1 or 2 new data ranges in the same position, depending on how many are still valid. All of these actions are constant time other than handling the weighted probabilities, which is O(M), done in a loop M times, so the total should be O(M^2). That should be much better than O(N log N), because N >> M.
Rather than pseudocode, let me work an example using OP's original example:
0th iteration: valid data ranges are from [0...100Mill], and the weight for this range is 1.0.
1st iteration: Randomly pick one element in the one element vector, then randomly pick one element in that range.
If the element is, e.g. 12345678, then we remove the [0...100Mill] and replace it with [0...12344678] and [12346678...100Mill]
If the element is, e.g. 500, then we remove the [0...100Mill] and replace it with just [1500...100Mill], since [0...500] is no longer a valid range. The only time we will replace it with 0 ranges is in the unlikely event that you have a range with only one number in it and it gets picked. (In that case, you'll have 3 numbers in a row that are exactly K apart from each other.)
The weights for the ranges are their lengths divided by the total length, e.g. 12344678/(12344678 + (100Mill - 12346678)) and (100Mill - 12346678)/(12344678 + (100Mill - 12346678)).
In the next iterations you do the same thing: randomly pick a number between 0 and 1, determine which range that value falls into according to the weights, randomly pick a number in that range, and then update your ranges and weights.
By the time it's done we're no longer acting in O(M), but we're still only dependent on M instead of N. And this actually gives a distribution that is both uniform and fair.
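Here is a rough C++ sketch of this range-maintenance idea as I read it (the weighted range pick is done with std::discrete_distribution; all names are mine):

#include <cstdint>
#include <random>
#include <utility>
#include <vector>

std::vector<int64_t> sample_by_ranges(int64_t N, int M, int64_t K)
{
    std::mt19937_64 gen{std::random_device{}()};

    // Each entry is an inclusive [low, high] interval that is still valid to pick from.
    std::vector<std::pair<int64_t, int64_t>> ranges{{0, N}};
    std::vector<int64_t> picks;

    for (int step = 0; step < M && !ranges.empty(); ++step) {
        // Pick a range with probability proportional to its length: O(#ranges).
        std::vector<double> weights;
        for (const auto& r : ranges)
            weights.push_back(static_cast<double>(r.second - r.first + 1));
        std::discrete_distribution<std::size_t> which(weights.begin(), weights.end());
        const std::size_t idx = which(gen);

        const int64_t lo = ranges[idx].first;
        const int64_t hi = ranges[idx].second;
        const int64_t v  = std::uniform_int_distribution<int64_t>(lo, hi)(gen);
        picks.push_back(v);

        // Replace the chosen range with 0, 1 or 2 ranges that stay at least K away from v.
        ranges.erase(ranges.begin() + static_cast<std::ptrdiff_t>(idx));
        if (v - K >= lo) ranges.push_back({lo, v - K});
        if (v + K <= hi) ranges.push_back({v + K, hi});
    }
    return picks;   // unsorted; may hold fewer than M values if the ranges run out
}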
Hope one of these ideas works for you!
I did write some code for doing this:
for i in xrange(len(Derivative)):
    if ((Derivative[i-1] > Derivative[i]) and (Derivative[i+1] < Derivative[i]) and (Derivative[i-1] > 0.0) and (Derivative[i+1] < 0.0) and (Derivative[i] > 0.0)):
        print str(i+1)
Here Derivative is a list in which I have to detect zero crossings, specifically the ones where the value just before the crossing is positive and the value just after it is negative.
I have attached the graph of Derivative to further elucidate the problem!
I wish to know if there is a better way of doing this in Python -- I mean shorter and more precise code?
You only need to compare the current derivative with the one following it. As a result, you can delete any references to Derivative[i-1]. If Derivative[i] is greater than zero, and Derivative[i+1] is less than zero, then necessarily Derivative[i+1] < Derivative[i]. So you can delete that condition as well. This leaves:
for i in xrange(len(Derivative)-1):
    if Derivative[i] > 0 and Derivative[i+1] < 0:
        print "crossing between indexes {} and {}".format(i, i+1)
Also, I shortened the argument to xrange by one. Otherwise, you'd get list index out of range.
I have an array of 20 numbers (64-bit int), something like 10, 25, 36, 43, ..., 118, 121 (sorted numbers).
Now, I have to give millions of numbers as input (say 17, 30).
What I have to give as output is:
for Input 17:
17 is < 25 and > 10. So, output will be index 0.
for Input 30:
30 is < 36 and > 25. So, output will be index 1.
Now, I can do it using linear search or binary search. Is there any faster way to do it? The input numbers are random (Gaussian).
If you know the distribution, you can direct your search in a smarter way.
Here is the rough idea of this variant of binary search:
Assume that your data is expected to be distributed uniformly on 0 to 100.
If you observe the value 0, you start at the beginning. If your value is 37, you start at 37% of the array you have. This is the key difference to binary search: you don't always start at 50%, but you try to start in the expected "optimal" position.
This also works for Gaussian distributed data, if you know the parameters (If you don't know them, you can still estimate them easily from the observed data). You would compute the Gaussian CDF, and this yields the place to start your search.
Now for the next step, you need to refine your search. At the position you looked at, there was a different value. You can use this to re-estimate the position to continue searching.
Now even if you don't know the distribution, this can work very well. Say you start with a binary search and have already looked at the objects at 50% and 25%. Instead of going to 37.5% next, you can make a better guess if your query value was, for example, very close to the 50% entry. Unless your data set is very "clumpy" (and your queries are not correlated to the data), this should still outperform "naive" binary search that always splits in the middle.
http://en.wikipedia.org/wiki/Interpolation_search
The expected average runtime apparently is O(log log n), according to Wikipedia.
Update: someone complained that with just 20 numbers things are different. Yes, they are. With 20 numbers linear search may be best, because of CPU caching: linearly scanning a small amount of memory that fits into the CPU cache can be really fast, in particular with an unrolled loop. But that case is quite pathetic and uninteresting IMHO.
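For concreteness, here is a minimal sketch of interpolation search over the 20 sorted numbers; it returns the bin index in the question's convention (or -1 if the query is below the first number). It assumes roughly uniform (or CDF-transformed) values; the name find_bin is mine.

#include <cstdint>
#include <vector>

int find_bin(const std::vector<int64_t>& a, int64_t x)
{
    int lo = 0, hi = static_cast<int>(a.size()) - 1;
    if (x < a[lo]) return -1;    // below the first number
    if (x >= a[hi]) return hi;   // at or above the last number
    while (hi - lo > 1) {
        // Guess where x sits between a[lo] and a[hi] instead of always halving.
        int mid = lo + static_cast<int>(
            static_cast<double>(x - a[lo]) / static_cast<double>(a[hi] - a[lo]) * (hi - lo));
        if (mid <= lo) mid = lo + 1;   // keep the probe strictly inside (lo, hi)
        if (mid >= hi) mid = hi - 1;
        if (a[mid] <= x) lo = mid;
        else             hi = mid;
    }
    return lo;   // a[lo] <= x < a[lo + 1], e.g. 17 -> 0 and 30 -> 1 for the question's array
}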
I believe the best option for you is to use upper_bound -- it will find the first value in the array bigger than the one you are searching for.
Still, depending on the problem you are trying to solve, lower_bound or binary_search may be what you need.
All of these algorithms have logarithmic complexity.
Nothing will be better than binary search, since your array is sorted.
Linear search is O(n) while binary search is O(log n).
Edit:
Interpolation search makes an extra assumption (the elements have to be uniformly distributed) and does more comparisons per iteration.
You can try both and empirically measure which is better for your case
In fact, this problem is quite interesting because it can be recast in an information-theoretic framework.
Given 20 numbers, you will end up with 21 bins (including "less than the first" and "greater than the last").
For each incoming number, you are to map to one of these 21 bins. This mapping is done by comparison. Each comparison gives you 1 bit of information (< or >= -- two states).
So suppose the incoming number requires 5 comparisons in order to figure out which bin it belongs to, then it is equivalent to using 5 bits to represent that number.
Our goal is to minimize the number of comparisons! We have 1 million numbers each belonging to 21 ordered code words. How do we do that?
This is exactly an entropy compression problem.
Let a[1],.. a[20], be your 20 numbers.
Let p(n) = pr { incoming number is < n }.
Build the decision tree as follows.
Step 1.
let i = argmin |p(a[i]) - 0.5|
define p0(n) = p(n) / (sum(p(j), j=0...a[i-1])), and p0(n)=0 for n >= a[i].
define p1(n) = p(n) / (sum(p(j), j=a[i]...a[20])), and p1(n)=0 for n < a[i].
Step 2.
let i0 = argmin |p0(a[i0]) - 0.5|
let i1 = argmin |p1(a[i1]) - 0.5|
and so on...
and by the time we're done, we end up with:
i, i0, i1, i00, i01, i10, i11, etc.
each one of these i gives us the comparison position.
so now our algorithm is as follows:
let u = input number.
if (u < a[i]) {
    if (u < a[i0]) {
        if (u < a[i00]) {
        } else {
        }
    } else {
        if (u < a[i01]) {
        } else {
        }
    }
} else {
    similarly...
}
so the i's define a tree, and the if statements are walking the tree. we can just as well put it into a loop, but it's easier to illustrate with a bunch of if.
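For concreteness, here is a rough C++ sketch of building such a tree, assuming we are given cdf[i] = Pr[input < a[i]] for each boundary (0-based here, unlike the 1-based a[1..20] above). Picking the boundary whose cdf is closest to the middle of the remaining probability mass is the same as the argmin |p(a[i]) - 0.5| rule applied to the renormalized p0/p1. All names are mine.

#include <cmath>
#include <cstdint>
#include <memory>
#include <vector>

struct Node {
    int boundary = 0;                   // index into a[]; the test is u < a[boundary]
    std::unique_ptr<Node> left, right;  // left: u < a[boundary], right: u >= a[boundary]
};

// Build a decision tree over the boundary indices [lo, hi), given that the input is
// known to land in the probability interval (plow, phigh).
std::unique_ptr<Node> build(const std::vector<double>& cdf,
                            int lo, int hi, double plow, double phigh)
{
    if (lo >= hi) return nullptr;              // no boundaries left: this is a bin
    const double target = 0.5 * (plow + phigh);
    int best = lo;
    for (int i = lo; i < hi; ++i)              // boundary that splits the mass most evenly
        if (std::fabs(cdf[i] - target) < std::fabs(cdf[best] - target))
            best = i;
    auto node = std::make_unique<Node>();
    node->boundary = best;
    node->left  = build(cdf, lo, best, plow, cdf[best]);
    node->right = build(cdf, best + 1, hi, cdf[best], phigh);
    return node;
}

// Walk the tree for one query; returns the bin index 0..a.size(), i.e. how many
// boundaries are <= u.
int lookup(const Node* node, const std::vector<int64_t>& a, int64_t u)
{
    int bin = 0;
    while (node) {
        if (u < a[node->boundary]) {
            node = node->left.get();
        } else {
            bin = node->boundary + 1;   // a[boundary] and everything before it are <= u
            node = node->right.get();
        }
    }
    return bin;
}

Build the tree once (e.g. root = build(cdf, 0, 20, 0.0, 1.0)) and then call lookup for each of the million queries.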
so for example, if you knew that your data were uniformly distributed between 0 and 2^63, and your 20 numbers were
0,1,2,3,...19
then
i = 20 (notice that there is no i1)
i0 = 10
i00 = 5
i01 = 15
i000 = 3
i001 = 7
i010 = 13
i011 = 17
i0000 = 2
i0001 = 4
i0010 = 6
i0011 = 9
i00110 = 8
i0100 = 12
i01000 = 11
i0110 = 16
i0111 = 19
i01110 = 18
ok so basically, the comparison would be as follows:
if (u < a[20]) {
    if (u < a[10]) {
        if (u < a[5]) {
        } else {
            ...
        }
    } else {
        ...
    }
} else {
    return 21
}
so note here, that I am not doing binary search! I am first checking the end point. why?
there is 100*((2^63)-20)/(2^63) percent chance that it will be greater than a[20]. this is basically like 99.999999999999999783159565502899% chance!
so this algorithm, as it is, has an expected number of comparisons of about 1 for a dataset with the properties specified above! (this is better than log log :p)
notice what I have done here is I am basically using fewer compares to find numbers that are more probable and more compares to find numbers that are less probable. for example, the number 18 requires 6 comparisons (1 more than needed with binary search); however, the numbers 20 to 2^63 require only 1 comparison. this same principle is used for lossless (entropy) data compression -- use fewer bits to encode code words that appear often.
building the tree is a one time process and you can use the tree 1 million times later.
the question is... when does this decision tree become binary search? homework exercise! :p the answer is simple. it's similar to when you can't compress a file any more.
ok, so I didn't pull this out of my behind... the basis is here:
http://en.wikipedia.org/wiki/Arithmetic_coding
You could perform binary search using std::lower_bound and std::upper_bound. These give you back iterators, so you can use std::distance to get an index.
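For example, a small sketch of the std::upper_bound approach, using an abbreviated version of the numbers from the question; the printed index matches the convention in the question (17 -> 0, 30 -> 1), and -1 would mean the input is below the first number.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    const std::vector<int64_t> a{10, 25, 36, 43, 118, 121};
    const std::vector<int64_t> queries{17, 30};
    for (int64_t x : queries) {
        auto it = std::upper_bound(a.begin(), a.end(), x);   // first element > x
        std::cout << x << " -> index " << std::distance(a.begin(), it) - 1 << '\n';
    }
    return 0;
}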
The Test Cover problem can be defined as follows:
Suppose we have a set of n diseases and a set of m tests we can perform to check for symptoms. We also are given the following:
an n x m matrix A where A[i][j] is a binary value representing the result of running the jth test on a patient with the ith disease (1 indicates a positive result, 0 indicates a negative one);
the cost of running test j, c_j; and that
any patient will have exactly one disease
The task is to find a set of tests that can uniquely identify each of the n diseases at minimal cost.
This problem can be formulated as an Integer Linear Program, where we want to minimize the objective function \sum_{j=1}^{m} c_j x_j, where x_j = 1 if we choose to include test j in our set, and 0 otherwise.
My question is:
What is the set of linear constraints for this problem?
Incidentally, I believe this problem is NP-hard (as is Integer Linear Programming in general).
Well, if I am correct, you just need to ensure
\sum_j x_j * A[i][j] >= 1   for all i
Let T be the matrix that results from deleting the jth column of A for all j such that x_j = 0.
Then choosing a set of tests that can uniquely distinguish any two diseases is equivalent to ensuring that every row of T is unique.
Observe that two rows k and l are identical if and only if (T[k][j] XOR T[l][j]) = 0 for all j.
So, the constraints we want are
\sum_{j=1}^{m} x_j(A[k][j] XOR A[l][j]) >= 1
for all 1 <= k <= n and 1 <= l <= n such that k != l.
Note that the constraints above are linear, since we can just pre-compute the coefficient (A[k][j] XOR A[l][j]).
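Putting this together with the objective and the binary variables from the question, the whole ILP reads:

minimize     \sum_{j=1}^{m} c_j x_j
subject to   \sum_{j=1}^{m} (A[k][j] XOR A[l][j]) x_j >= 1   for all pairs of diseases k != l
             x_j \in {0, 1}   for j = 1, ..., m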