Weighted random numbers
My question is about sampling weighted random numbers under the additional condition that the weights of the elements change frequently.
Suppose there are N elements to pick with different weights.
For static weights, Walker's alias method takes O(N) time to set up the alias table, but sampling costs O(1), so it is one of the best ways to achieve my goal.
The binary search method also takes O(N) to build the cumulative array, and sampling costs O(log N).
However, in my case the weights change frequently, so the time complexity of modifying the weights is also important.
So I want to know whether there is an existing library or algorithm for which both modifying the data structure and sampling take less than O(N) time.
EDIT: While reading the comments, I realized I need to impose additional conditions. In each modification phase only a few weights (mostly two) are modified, and those modifications do not change the total sum of the weights (normalization condition).
If there is a solution, I also want to know if it can be used when the weights are real numbers too.
I'm facing the same problem. I will describe my current plan for solving it, but will be grateful for any other suggestions and/or implementation pointers.
My current plan is to adapt the algorithm for Dynamic Order Statistics, as described in Section 14.1 of "Introduction to Algorithms" by Cormen/Leiserson/Rivest. You put your elements into a balanced binary tree, such as a red-black tree, with weights as keys. You augment the tree so that each node stores the sum of the weights in its subtree. The root then stores the sum of weights in the whole tree, say S. The subtree sums can be updated during tree operations in the same way as subtree sizes for dynamic order statistics. To do weighted sampling, you sample a number x uniformly from [0, S), then search down the tree for the node N such that the sum of weights of nodes preceding N in in-order traversal is <= x, but that sum plus N's weight is > x, similar to the OS-Select operation for dynamic order statistics.
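To make the descent concrete, here is a minimal sketch of the select step on a tree whose nodes carry subtree weight sums. It is not a red-black tree: the tree is built balanced once from a list and never rebalanced, and nodes are placed by list position rather than by weight. The names Node, build, select and sample are made up for illustration.

```python
import random

class Node:
    """Tree node holding one item, its weight, and the weight sum of its subtree."""
    def __init__(self, item, weight):
        self.item, self.weight = item, weight
        self.left = self.right = None
        self.total = weight  # this node's weight plus both subtrees

def build(items, weights, lo=0, hi=None):
    """Build a height-balanced tree over items[lo:hi] (assumption: no later rebalancing)."""
    if hi is None:
        hi = len(items)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    node = Node(items[mid], weights[mid])
    node.left = build(items, weights, lo, mid)
    node.right = build(items, weights, mid + 1, hi)
    node.total += sum(c.total for c in (node.left, node.right) if c)
    return node

def select(node, x):
    """Return the node N with (weight before N) <= x < (weight before N) + N.weight."""
    while True:
        left_total = node.left.total if node.left else 0
        if x < left_total:
            node = node.left
        elif x < left_total + node.weight or node.right is None:
            # The `node.right is None` test only guards against float round-off.
            return node
        else:
            x -= left_total + node.weight
            node = node.right

def sample(root, rng=random):
    """Draw one item with probability proportional to its weight."""
    return select(root, rng.random() * root.total).item
```

In a real self-balancing tree the total fields would additionally have to be repaired on every rotation and along the path of every weight change, which is what keeps updates at O(log N).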
I ran into the following interview question:
We need a data structure to keep n points on the X-axis such that we get efficient implementations of Insert(x), Delete(x) and Find(a, b) (giving the number of points in the interval [a, b]). Assume that the maximum number returned by Find(a, b) is k.
(1) We can create a data structure that performs the three operations in O(log n).
(2) We can create a data structure that performs Insert and Delete in O(log n) and Find in O(k + log n).
I know that Find is like a range query on 1D points (except that in this question we only count, i.e. we need the number of elements, not the elements themselves). If we use for example an AVL tree, then we get the time complexities of option (2).
But I was surprised when told that (1) is the correct answer. Why is (1) the right answer?
The answer is indeed (1).
The idea of an AVL tree is fine, and your conclusions are right for a plain AVL tree. But you can extend the AVL tree such that each node has one extra property: the number of nodes in its own subtree (which also tells you how many values sit in its left subtree, i.e. precede it within that subtree). You would have to take care in the AVL operations (including rotations) that this extra property is kept up to date. But this can be done with a constant overhead per visited node, so it does not impact the time complexities of Insert or Delete.
Then Find could just search for the value a and, while walking down from the root, add up the sizes of the left subtrees it leaves behind (plus one for each node it moves right from). That gives the number of values less than a. It does the same for value b, this time counting the values less than or equal to b. The subtraction of these two counts will give the required result. There are some boundary cases to take into consideration, like when a is present in the tree: then that node itself should be counted too, which is why the first count uses "less than" and the second "less than or equal". It may be that no value is less than a. Then that count is simply 0 in the subtraction.
Clearly this makes Find independent of its return value (up to k). The two binary searches give it a time complexity of O(log n).
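A minimal sketch of the counting idea follows. Note the swap: it uses a treap (randomized priorities) instead of an AVL tree, only because that keeps the code short; the size bookkeeping and the two rank searches are the same idea. All names here (Node, insert, delete, count_less, find) are made up for the example, and the O(log n) bounds hold in expectation rather than in the worst case.

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.prio = random.random()  # random priority keeps the tree balanced in expectation
        self.left = self.right = None
        self.size = 1                # number of nodes in this subtree

def size(t):
    return t.size if t else 0

def pull(t):
    """Recompute t.size from its children and return t."""
    t.size = 1 + size(t.left) + size(t.right)
    return t

def split(t, key):
    """Split t into (keys < key, keys >= key)."""
    if t is None:
        return None, None
    if t.key < key:
        a, b = split(t.right, key)
        t.right = a
        return pull(t), b
    a, b = split(t.left, key)
    t.left = b
    return a, pull(t)

def merge(a, b):
    """Merge two treaps where every key in a is <= every key in b."""
    if a is None or b is None:
        return a or b
    if a.prio > b.prio:
        a.right = merge(a.right, b)
        return pull(a)
    b.left = merge(a, b.left)
    return pull(b)

def insert(t, key):
    a, b = split(t, key)
    return merge(merge(a, Node(key)), b)

def delete(t, key):
    """Remove one occurrence of key, if present."""
    if t is None:
        return None
    if key < t.key:
        t.left = delete(t.left, key)
    elif key > t.key:
        t.right = delete(t.right, key)
    else:
        return merge(t.left, t.right)
    return pull(t)

def count_less(t, key, inclusive=False):
    """Number of stored keys < key (or <= key when inclusive)."""
    n = 0
    while t:
        if t.key < key or (inclusive and t.key == key):
            n += size(t.left) + 1   # t and its whole left subtree are counted
            t = t.right
        else:
            t = t.left
    return n

def find(t, a, b):
    """Number of points in the closed interval [a, b]."""
    return count_less(t, b, inclusive=True) - count_less(t, a)
```

Usage is functional in style: root = insert(root, x) and root = delete(root, x), starting from root = None.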
The limits are 100,000 (10^5) nodes and at most 2 edges per node.
How can we find a maximum independent set for this graph in O(n) or O(n log n) time? Anything slower exceeds the time limit. By the way, I only need the number of nodes in the set, not necessarily the set itself.
I know of the greedy approximation that runs in O(n): pick a node with the lowest degree, add it to the set, remove it and all its neighbors, and repeat until the graph is empty. This approximation works for many cases. The thing is, with these restrictions, isn't there an algorithm that always works?
On that class of graphs, if you greedily choose the node with the lowest degree and delete it and its neighbors, then you'll get an optimal solution, in linear time.
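The reason this works is that with at most 2 edges per node every connected component is a path or a cycle. Here is a rough sketch of the greedy, assuming nodes are numbered 0..n-1 and the edges come as pairs. The function name is made up, and for brevity it uses a binary heap with lazy deletion, which gives O(n log n); bucketing nodes by degree instead would bring it down to linear time.

```python
import heapq

def greedy_independent_set_size(n, edges):
    """Size of the set built by repeatedly taking a vertex of minimum current degree."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:                      # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    alive = [True] * n
    heap = [(len(adj[v]), v) for v in range(n)]
    heapq.heapify(heap)
    chosen = 0
    while heap:
        deg, v = heapq.heappop(heap)
        if not alive[v] or deg != len(adj[v]):
            continue                    # stale heap entry, a fresher one exists or v is gone
        chosen += 1
        removed = [v, *adj[v]]          # v joins the set; v and its neighbors leave the graph
        for u in removed:
            alive[u] = False
        for u in removed:
            for w in adj[u]:
                adj[w].discard(u)
                if alive[w]:            # degree of a surviving vertex dropped: re-queue it
                    heapq.heappush(heap, (len(adj[w]), w))
            adj[u].clear()
    return chosen
```

For example, a path on 5 nodes gives 3 and a cycle on 5 nodes gives 2.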
Suppose you have an input file:
<total vertices>
<x-coordinate 1st location><y-coordinate 1st location>
<x-coordinate 2nd location><y-coordinate 2nd location>
<x-coordinate 3rd location><y-coordinate 3rd location>
How can Prim's algorithm be used to find the MST for these locations? I understand this problem is typically solved using an adjacency matrix. Any references would be great if applicable.
If you already know Prim's algorithm, it is easy: create an adjacency matrix with adj[i][j] = distance between location i and location j.
I'm just going to describe some implementations of Prim's and hopefully that gets you somewhere.
First off, your question doesn't specify how edges are input to the program. You have a total number of vertices and the locations of those vertices. How do you know which ones are connected?
Assuming you have the edges and their weights (as @doomster said above, the weight may be the planar distance between the points, since they are coordinates), we can start thinking about our implementation. The "Time complexity" section of Wikipedia's article on Prim's algorithm describes three different data structures that result in three different run times.
The simplest is the adjacency matrix. As you might guess from the name, the matrix describes nodes that are "adjacent". To be precise, there are |v| rows and columns (where |v| is the number of vertices). The value at adjacencyMatrix[i][j] varies depending on the usage. In our case it's the weight of the edge (i.e. the distance) between node i and j (this means that you need to index the vertices in some way. For instance, you might add the vertices to a list and use their position in the list).
Now using this adjacency matrix our algorithm is as follows:
1. Create a dictionary that maps every vertex to its "distance". Initially the distance of all of the nodes is infinity.
2. Create another dictionary to keep track of "parents". We use this to generate the MST. It's more natural to keep track of edges, but it's actually easier to implement by keeping track of "parents". Note that if you root a tree (i.e. designate some node as the root), then every node (other than the root) has precisely one parent. So by producing this dictionary of parents we'll have our MST!
3. Create a new list with a randomly chosen node v from the original list.
4. Remove v from the distance dictionary and add it to the parent dictionary with null as its parent (i.e. it's the "root").
5. Go through the row in the adjacency matrix for that node. For any node w that is connected (for non-connected nodes you have to set their adjacency matrix value to some special value: 0, -1, int max, etc.), update its "distance" in the dictionary to adjacencyMatrix[v][w] and set its parent to v. The idea is that it's not "infinitely far away" anymore... we know we can get there from v.
6. While the distance dictionary is not empty (i.e. while there are nodes we still need to connect to):
   a. Look over the dictionary and find the vertex x with the smallest distance.
   b. Remove x from the dictionary and add it to our new list of vertices.
   c. For each of its neighbors that is still in the dictionary, if adjacencyMatrix[x][neighbor] is smaller than distance[neighbor], update the neighbor's distance to adjacencyMatrix[x][neighbor] and its parent to x. Basically, if there is a faster way to get to neighbor then the distance dictionary should be updated to reflect that; and if we then add neighbor to the new list we know which edge we actually added (because the parent dictionary says that its parent was x).
7. We're done. Output the MST however you want (everything you need is contained in the parents dictionary).
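The steps above can be sketched roughly as follows for the coordinate input from the question, assuming every pair of locations is connected and the weight is the planar distance. For brevity this uses plain lists instead of dictionaries and always starts from vertex 0 rather than a random one; prim_mst is just a name picked for the example.

```python
import math

def prim_mst(points):
    """Return (parent, total_length) of a minimum spanning tree over 2D points."""
    n = len(points)
    # Complete graph: adj[i][j] is the planar distance between locations i and j.
    adj = [[math.dist(p, q) for q in points] for p in points]
    distance = [math.inf] * n   # cheapest known edge from the tree to each vertex
    parent = [None] * n         # parent[v] is the tree vertex that v attaches to
    in_tree = [False] * n
    distance[0] = 0.0           # vertex 0 is the root
    total = 0.0
    for _ in range(n):
        # O(|V|) scan for the closest vertex that is not in the tree yet.
        x = min((v for v in range(n) if not in_tree[v]), key=distance.__getitem__)
        in_tree[x] = True
        total += distance[x]
        for w in range(n):
            # Only touch distance and parent when x really offers a cheaper edge.
            if not in_tree[w] and adj[x][w] < distance[w]:
                distance[w] = adj[x][w]
                parent[w] = x
    return parent, total
```

The MST edges are the pairs (v, parent[v]) for every v other than the root.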
I admit there is a bit of a leap from the wikipedia page to the actual implementation as outlined above. I think the best way to approach this gap is to just brute force the code. By that I mean, if the pseudocode says "find the min [blah] such that [foo] is true" then write whatever code you need to perform that, and stick it in a separate method. It'll definitely be inefficient, but it'll be a valid implementation. The issue with graph algorithms is that there are 30 ways to implement them and they are all very different in performance; the wikipedia page can only describe the algorithm conceptually. The good thing is that once you implement it some way, you can find optimizations quickly ("oh, if I keep track of this state in this separate data structure, I can make this lookup way faster!"). By the way, the runtime of this is O(|V|^2). I'm too lazy to detail that analysis, but loosely it's because:
1. All initialization is O(|V|) at worst.
2. We do the loop O(|V|) times and take O(|V|) time to look over the dictionary to find the minimum node. So the total time to find the minimum node multiple times is O(|V|^2).
3. The time it takes to update the distance dictionary is O(|E|) because we only process each edge once. Since |E| is O(|V|^2) this is also O(|V|^2).
4. Keeping track of the parents is O(|V|).
5. Outputting the tree is O(|V| + |E|) = O(|E|) at worst.
Adding all of these (none of them should be multiplied except within (2)) we get O(|V|^2)
The implementation with a heap is O(|E| log |V|) and it's very, very similar to the above. The only difference is that updating the distance is O(log |V|) instead of O(1) (because it's a heap), BUT finding/removing the min element is O(log |V|) instead of O(|V|) (because it's a heap). The analysis is quite similar, and you end up with something like O(|V| log |V| + |E| log |V|) = O(|E| log |V|), as desired.
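For comparison, a sketch of the heap variant, assuming a connected graph given as an adjacency list where adj[u] holds (weight, v) pairs. Python's heapq has no decrease-key operation, so this pushes a new entry whenever a distance improves and skips outdated entries when they are popped; that keeps the O(|E| log |V|) bound. The name prim_heap is made up.

```python
import heapq

def prim_heap(n, adj):
    """adj[u] is a list of (weight, v) pairs; returns (parent, total_weight)."""
    parent = [None] * n
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known edge weight into each vertex
    best[0] = 0                 # vertex 0 is the root
    heap = [(0, 0)]             # entries are (edge weight, vertex)
    total = 0
    while heap:
        d, u = heapq.heappop(heap)
        if in_tree[u]:
            continue            # outdated entry: u was already added via a cheaper edge
        in_tree[u] = True
        total += d
        for w, v in adj[u]:
            if not in_tree[v] and w < best[v]:
                best[v] = w
                parent[v] = u
                heapq.heappush(heap, (w, v))
    return parent, total
```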
Actually... I'm a bit confused why the adjacency matrix implementation cares about it being an adjacency matrix. It could just as well be implemented using an adjacency list. I think the key part is how you store the distances. I could be way off in my implementation outlined above, but I am pretty sure it implements Prim's algorithm and satisfies the time complexity constraints outlined by Wikipedia.
I have a sorted set (std::set to be precise) that contains elements with an assigned weight. I want to randomly choose N elements from this set, while the elements with higher weight should have a bigger probability of being chosen. Any element can be chosen multiple times.
I want to do this as efficiently as possible - I want to avoid any copying of the set (it might get very large) and run at O(N) time if it is possible. I'm using C++ and would like to stick to a STL + Boost only solution.
Does anybody know if there is a function in STL/Boost that performs this task? If not, how to implement one?
You need to calculate (and possibly cache, if you think of performance) the sum of all weights in your set. Then, generate N random numbers ranging up to this value. Finally, iterate your set, counting the sum of the weights you encountered so far. Inspect all the (remaining) random numbers. If the number falls between the previous and the next value of the sum, insert the value from the set and remove your random number. Stop when your list of random numbers is empty or you've reached the end of the set.
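A sketch of that procedure, shown in Python only to illustrate the idea (the question is about C++). One addition that is not in the description above: the random numbers are sorted first, so a single pass over the set is enough and no number is inspected twice. It assumes the container can be iterated twice and yields (value, weight) pairs; the function name is made up.

```python
import random

def weighted_sample_one_pass(items, n, rng=random):
    """Pick n values with replacement, with probability proportional to weight."""
    total = sum(w for _, w in items)
    # Sorted targets let one sweep over the items serve all n draws.
    targets = sorted(rng.random() * total for _ in range(n))
    picked = []
    running = 0.0   # sum of the weights seen so far
    i = 0           # index of the next unserved target
    last = None
    for value, weight in items:
        running += weight
        while i < n and targets[i] < running:
            picked.append(value)
            i += 1
        last = value
        if i == n:
            break
    # Float round-off guard: any target left over belongs to the last item.
    picked.extend([last] * (n - i))
    return picked
```

The result comes out in the container's iteration order, not in random order, so shuffle it afterwards if the order matters. The cost is one sort of the N numbers plus one pass over the set, and nothing from the set is copied.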
I don't know about any libraries, but it sounds like you have a weighted roulette wheel. Here's a reference with some pseudo-code, although the context is related to genetic algorithms:
As for "as efficiently as possible," that would depend on some characteristics of the data. In the application of the weighted roulette wheel, when searching for the index you could consider a binary search instead. However, it is not the case that each slot of the roulette wheel is equally likely, so it may make sense to examine them in order of their weights.
A lot depends on the amount of extra storage you're willing to expend to make the selection faster.
If you're not willing to use any extra storage, @Alex Emelianov's answer is pretty much what I was thinking of posting. If you're willing to use some extra storage (and possibly a different data structure than std::set) you could create a tree (like a set uses) but at each node of the tree, you'd also store the (weighted) number of items to the left of that node. This will let you map from a generated number to the correct associated value with logarithmic (rather than linear) complexity.
Consider a sequence of n positive real numbers, (a_i), and its partial sum sequence, (s_i). Given a number x ∊ (0, s_n], we have to find i such that s_{i−1} < x ≤ s_i. Also we want to be able to change one of the a_i’s without having to update all partial sums. Both can be done in O(log n) time by using a binary tree with the a_i’s as leaf node values, and the values of the non-leaf nodes being the sum of the values of the respective children. If n is known and fixed, the tree doesn’t have to be self-balancing and can be stored efficiently in a linear array. Furthermore, if n is a power of two, only 2n − 1 array elements are required. See Blue et al., Phys. Rev. E 51 (1995), pp. R867–R868 for an application. Given the genericity of the problem and the simplicity of the solution, I wonder whether this data structure has a specific name and whether there are existing implementations (preferably in C++). I’ve already implemented it myself, but writing data structures from scratch always seems like reinventing the wheel to me. I’d be surprised if nobody had done it before.
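For readers who want to see the layout, here is a minimal sketch of such an array-stored tree (in Python rather than C++, purely for illustration). Assumptions of this sketch: the leaf count is padded with zeros up to a power of two, indices are 0-based, and slot 0 of the array is left unused so that the children of node p sit at 2p and 2p + 1, which costs 2n slots instead of 2n − 1. The class and method names are made up.

```python
class SumTree:
    """Complete binary tree in an array: leaves hold weights, inner nodes hold child sums."""

    def __init__(self, weights):
        self.n = 1
        while self.n < len(weights):
            self.n *= 2                        # pad the leaf count to a power of two
        self.tree = [0.0] * (2 * self.n)       # root at index 1, leaves at n .. 2n-1
        self.tree[self.n:self.n + len(weights)] = weights
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def total(self):
        return self.tree[1]

    def update(self, i, weight):
        """Set a_i = weight and refresh the sums on the path to the root: O(log n)."""
        pos = self.n + i
        self.tree[pos] = weight
        pos //= 2
        while pos >= 1:
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def find(self, x):
        """Return the 0-based i with s_{i-1} < x <= s_i, assuming 0 < x <= total()."""
        pos = 1
        while pos < self.n:
            left = self.tree[2 * pos]
            if x <= left:
                pos = 2 * pos                  # target lies in the left subtree
            else:
                x -= left                      # skip the left subtree's mass
                pos = 2 * pos + 1
        return pos - self.n
```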
This is known as a finger tree in functional programming but apparently there are implementations in imperative languages. In the articles there is a link to a blog post explaining an implementation of this data structure in C# which could be useful to you.
A Fenwick tree (aka binary indexed tree) is a data structure that maintains a sequence of elements and can compute the cumulative sum of any range of consecutive elements in O(log n) time. Changing the value of any single element takes O(log n) time as well.
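A compact sketch of a Fenwick tree, in Python for illustration, with one extra method (find) that walks down the implicit tree to locate the element a given cumulative value falls into. That search assumes non-negative values. The class and method names are made up, and indices are 0-based on the outside.

```python
class Fenwick:
    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)   # 1-based internally
        for i, v in enumerate(values):
            self.add(i, v)

    def add(self, i, delta):
        """Add delta to element i: O(log n)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i                  # jump to the next node covering i

    def prefix_sum(self, count):
        """Sum of the first `count` elements: O(log n)."""
        s = 0
        while count > 0:
            s += self.tree[count]
            count -= count & -count      # strip the lowest set bit
        return s

    def range_sum(self, lo, hi):
        """Sum of elements lo .. hi-1."""
        return self.prefix_sum(hi) - self.prefix_sum(lo)

    def find(self, x):
        """Smallest i such that the sum of elements 0..i exceeds x (needs 0 <= x < total)."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= x:
                pos = nxt                # the whole block fits below x: skip over it
                x -= self.tree[nxt]
            step >>= 1
        return pos
```

Changing a single element to a new value is add(i, new_value - old_value), so the caller has to keep track of the old value.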