Is it possible to perform a range addition update, adding a linear function to a max-segment tree? - c++

I was thinking about this and was encountering a lot of bugs while trying to do this. Is it possible?

I believe you are asking whether you can do the following update:
Given the update A B C, you want to add C to all elements from A to B.
The problem is to do the update on the segment tree would normally take O(N * logN) time given that N is the maximum number of elements. However, the key idea about implementing a segment tree is that you would like to suppose range queries and that normally you are not interested in all O(N^2) ranges but rather a much smaller subset of them.
You can enhance the range updates using lazy propagation which would generally mean that you do an update, but you do not update all of the nodes in the segment tree -> you update to some point, but you do not continue futher down the tree since it not needed.
Say that you have updated everything up to a node K which is responsible for the range [10; 30] for example. Later, you do a "get info" query on [20;40]. Obviously, you will have to visit node K, but you are not interested on the whole range, but rather the range [20;30], which is actually its right child.
What you have to do is "push" the update of node K to its left child and then its right child and continue as needed.
In general this would mean that when doing an update, you will do an update only until you find a suitable node(s) for the interval you update, but not go any further. This yields O(logN) time.
Then, when quering you continue propagating the updates down the tree when you reach a node for which you know you have saved some update for later. This also yields O(logN) time.
Some good material: Lazy propagation on segment trees

Related

Clear mark attribute of a node in extract-min operation of Fibonacci Heap

In the DECREASE-KEY operation of Fibonacci Heap, whenever a node is cut from its parent and added to the root list, its mark attribute is set to FALSE. However, in the EXTRACT-MIN operation, the children of the min-node are added to the root list but their mark attributes aren't set to FALSE. Why is there such inconsistency?
Moreover, in the linking operation where a node is made the child of another node, the mark attribute of the new child is set to FALSE. The EXTRACT-MIN operation performs this linking operation multiple times. But in the amortized analysis of EXTRACT-MIN operation described in the CLRS book, the authors claim that the number of marked nodes doesn't change in EXTRACT-MIN operation. They use m(H) to denote the number of marked nodes both before and after EXTRACT-MIN operation. I am quoting the exact line from the book:
The potential before extracting the minimum node is t(H)+2m(H), and
the potential afterward is at most (D(n)+1)+2m(H).
Here D(n) is the maximum degree of any node in an n-node Fibonacci Heap, t(H) is the number of trees in the Fibonacci Heap and m(H) is the number of marked nodes in the Fibonacci Heap.
Isn't this calculation wrong?
Let's take a step back - why do we need mark bits in a Fibonacci heap in the first place?
The idea behind a Fibonacci heap is to make the DECREASE-KEY operation as fast as possible. To that end, the basic strategy of DECREASE-KEY is to take the element whose key is being decreased and to cut it from its parent if the heap property is violated. Unfortunately, if we do that, then we lose the exponential connection between the order of a tree and the number of nodes in the tree. That's a problem because the COALESCE step of an EXTRACT-MIN operation links trees based on their orders and assumes that each tree's order says something about how many nodes it contains. With that connection broken, all the nice runtime bounds we want go away.
As a compromise, we introduce mark bits. If a node loses a child, it gets marked to indicate "something was lost here." Once a marked node loses a second child, then it gets cut from its parent. Over time, if you do a huge number of cuts in a single tree, eventually the cuts propagate up to root of the tree, decreasing the tree order. That makes the tree behave differently during a COALESCE step.
With that said, the important detail here is that mark bits are only relevant for non-root nodes. If a node is a root of a tree and it loses a child, this immediately changes its order in a way that COALESCE will notice. Therefore, you can essentially ignore the mark bit of any tree root, since it never comes up. It's probably a wise idea to clear a node's mark bit before moving it up to the root list just so you don't have to clear it later, but that's more of an implementation detail than anything else.

Time complexity of operations on Order Statistics Tree and 1D points problem

I ran into the following interview question:
We need a data structure to keep n points on the X-axis such that we get efficient implementations of
Insert(x), Delete(x) and Find (a, b) (giving the number of points in
the interval [a, b]). Assume that the maximum number returned by Find(a, b) is k.
We can create a data structure that performs the three operations in O(log n)
We can create a data structure that performs Insert and Delete in O(log n) and Find in O(k + log n).
I know from general information that Find is like a Range on 1D points (but for counting elements in this question, i.e we need the number of elements). If we use for example an AVL tree, then we get the time complexities of option (2).
But I was surprised when told that (1) is the correct answer. Why is (1) the right answer?
The answer is indeed (1).
The idea of an AVL tree is fine, and your conclusions are right. But you can extend the AVL tree such that each node has one extra property: the number of values that precede the node's own value. You would have to take care in the AVL operations (including rotations) that this extra property is kept up to date. But this can be done with a constant overhead, so it does not impact the time complexities of Insert or Delete.
Then Find could just search the node with value a (or the one with the greatest value less than a), and do the same for value b. From both nodes that you find you get the extra property. The subtraction of these two will give the required result. There are some boundary cases to take into consideration, like when a is present in the tree, then that node itself should be counted too, otherwise not. It may be that no node is found with a value less than or equal to a. Then the missing property should be taken as a 0 in the subtraction.
Clearly this makes Find independent of its return value (up to k). The two binary searches give it a time complexity of O(logn).

What is the most efficient data structure for designing a PRIM algorithm?

I am designing a Graph in c++ using a hash table for its elements. The hashtable is using open addressing and the Graph has no more than 50.000 edges. I also designed a PRIM algorithm to find the minimum spanning tree of the graph. My PRIM algorithm creates storage for the following data:
A table named Q to put there all the nodes in the beginning. In every loop, a node is visited and in the end of the loop, it's deleted from Q.
A table named Key, one for each node. The key is changed when necessary (at least one time per loop).
A table named Parent, one for each node. In each loop, a new element is inserted in this table.
A table named A. The program stores here the final edges of the minimum spanning tree. It's the table that is returned.
What would be the most efficient data structure to use for creating these tables, assuming the graph has 50.000 edges?
Can I use arrays?
I fear that the elements for every array will be way too many. I don't even consider using linked lists, of course, because the accessing of each element will take to much time. Could I use hash tables?
But again, the elements are way to many. My algorithm works well for Graphs consisting of a few nodes (10 or 20) but I am sceptical about the situation where the Graphs consist of 40.000 nodes. Any suggestion is much appreciated.
(Since comments were getting a bit long): The only part of the problem that seems to get ugly for very large size, is that every node not yet selected has a cost and you need to find the one with lowest cost at each step, but executing each step reduces the cost of a few effectively random nodes.
A priority queue is perfect when you want to keep track of lowest cost. It is efficient for removing the lowest cost node (which you do at each step). It is efficient for adding a few newly reachable nodes, as you might on any step. But in the basic design, it does not handle reducing the cost of a few nodes that were already reachable at high cost.
So (having frequent need for a more functional priority queue), I typically create a heap of pointers to objects and in each object have an index of its heap position. The heap methods all do a callback into the object to inform it whenever its index changes. The heap also has some external calls into methods that might normally be internal only, such as the one that is perfect for efficiently fixing the heap when an existing element has its cost reduced.
I just reviewed the documentation for the std one
http://en.cppreference.com/w/cpp/container/priority_queue
to see if the features I always want to add were there in some form I hadn't noticed before (or had been added in some recent C++ version). So far as I can tell, NO. Most real world uses of priority queue (certainly all of mine) need minor extra features that I have no clue how to tack onto the standard version. So I have needed to rewrite it from scratch including the extra features. But that isn't actually hard.
The method I use has been reinvented by many people (I was doing this in C in the 70's, and wasn't first). A quick google search found one of many places my approach is described in more detail than I have described it.
http://users.encs.concordia.ca/~chvatal/notes/pq.html#heap

Prim's algorithm for dynamic locations

Suppose you have an input file:
<total vertices>
<x-coordinate 1st location><y-coordinate 1st location>
<x-coordinate 2nd location><y-coordinate 2nd location>
<x-coordinate 3rd location><y-coordinate 3rd location>
...
How can Prim's algorithm be used to find the MST for these locations? I understand this problem is typically solved using an adjacency matrix. Any references would be great if applicable.
If you already know prim, it is easy. Create adjacency matrix adj[i][j] = distance between location i and location j
I'm just going to describe some implementations of Prim's and hopefully that gets you somewhere.
First off, your question doesn't specify how edges are input to the program. You have a total number of vertices and the locations of those vertices. How do you know which ones are connected?
Assuming you have the edges (and the weights of those edges. Like #doomster said above, it may be the planar distance between the points since they are coordinates), we can start thinking about our implementation. Wikipedia describes three different data structures that result in three different run times: http://en.wikipedia.org/wiki/Prim's_algorithm#Time_complexity
The simplest is the adjacency matrix. As you might guess from the name, the matrix describes nodes that are "adjacent". To be precise, there are |v| rows and columns (where |v| is the number of vertices). The value at adjacencyMatrix[i][j] varies depending on the usage. In our case it's the weight of the edge (i.e. the distance) between node i and j (this means that you need to index the vertices in some way. For instance, you might add the vertices to a list and use their position in the list).
Now using this adjacency matrix our algorithm is as follows:
Create a dictionary which contains all of the vertices and is keyed by "distance". Initially the distance of all of the nodes is infinity.
Create another dictionary to keep track of "parents". We use this to generate the MST. It's more natural to keep track of edges, but it's actually easier to implement by keeping track of "parents". Note that if you root a tree (i.e. designate some node as the root), then every node (other than the root) has precisely one parent. So by producing this dictionary of parents we'll have our MST!
Create a new list with a randomly chosen node v from the original list.
Remove v from the distance dictionary and add it to the parent dictionary with a null as its parent (i.e. it's the "root").
Go through the row in the adjacency matrix for that node. For any node w that is connected (for non-connected nodes you have to set their adjacency matrix value to some special value. 0, -1, int max, etc.) update its "distance" in the dictionary to adjacencyMatrix[v][w]. The idea is that it's not "infinitely far away" anymore... we know we can get there from v.
While the dictionary is not empty (i.e. while there are nodes we still need to connect to)
Look over the dictionary and find the vertex with the smallest distance x
Add it to our new list of vertices
For each of its neighbors, update their distance to min(adjacencyMatrix[x][neighbor], distance[neighbor]) and also update their parent to x. Basically, if there is a faster way to get to neighbor then the distance dictionary should be updated to reflect that; and if we then add neighbor to the new list we know which edge we actually added (because the parent dictionary says that its parent was x).
We're done. Output the MST however you want (everything you need is contained in the parents dictionary)
I admit there is a bit of a leap from the wikipedia page to the actual implementation as outlined above. I think the best way to approach this gap is to just brute force the code. By that I mean, if the pseudocode says "find the min [blah] such that [foo] is true" then write whatever code you need to perform that, and stick it in a separate method. It'll definitely be inefficient, but it'll be a valid implementation. The issue with graph algorithms is that there are 30 ways to implement them and they are all very different in performance; the wikipedia page can only describe the algorithm conceptually. The good thing is that once you implement it some way, you can find optimizations quickly ("oh, if I keep track of this state in this separate data structure, I can make this lookup way faster!"). By the way, the runtime of this is O(|V|^2). I'm too lazy to detail that analysis, but loosely it's because:
All initialization is O(|V|) at worse
We do the loop O(|V|) times and take O(|V|) time to look over the dictionary to find the minimum node. So basically the total time to find the minimum node multiple times is O(|V|^2).
The time it takes to update the distance dictionary is O(|E|) because we only process each edge once. Since |E| is O(|V|^2) this is also O(|V|^2)
Keeping track of the parents is O(|V|)
Outputting the tree is O(|V| + |E|) = O(|E|) at worst
Adding all of these (none of them should be multiplied except within (2)) we get O(|V|^2)
The implementation with a heap is O(|E|log(|V|) and it's very very similar to the above. The only difference is that updating the distance is O(log|V|) instead of O(1) (because it's a heap), BUT finding/removing the min element is O(log|V|) instead of O(|V|) (because it's a heap). The time complexity is quite similar in analysis and you end up with something like O(|V|log|V| + |E|log|V|) = O(|E|log|V|) as desired.
Actually... I'm a bit confused why the adjacency matrix implementation cares about it being an adjacency matrix. It could just as well be implemented using an adjacency list. I think the key part is how you store the distances. I could be way off in my implementation outlined above, but I am pretty sure it implements Prim's algorithm is satisfies the time complexity constraints outlined by wikipedia.

How to repeatedly insert elements into a sorted list fast

I do not have formal CS training, so bear with me.
I need to do a simulation, which can abstracted away to the following (omitting the details):
We have a list of real numbers representing the times of events. In
each step, we
remove the first event, and
as a result of "processing" it, a few other events may get inserted into the list at a strictly later time
and repeat this many times.
Questions
What data structure / algorithm can I use to implement this as efficiently as possible? I need to increase the number of events/numbers in the list significantly. The priority is to make this as fast as possible for a long list.
Since I'm doing this in C++, what data structures are already available in the STL or boost that will make it simple to implement this?
More details:
The number of events in the list is variable, but it's guaranteed to be between n and 2*n where n is some simulation parameter. While the event times are increasing, the time-difference of the latest and earliest events is also guaranteed to be less than a constant T. Finally, I suspect that the density of events in time, while not constant, also has an upper and lower bound (i.e. all the events will never be strongly clustered around a single point in time)
Efforts so far:
As the title of the question says, I was thinking of using a sorted list of numbers. If I use a linked list for constant time insertion, then I have trouble finding the position where to insert new events in a fast (sublinear) way.
Right now I am using an approximation where I divide time into buckets, and keep track of how many event are there in each bucket. Then process the buckets one-by-one as time "passes", always adding a new bucket at the end when removing one from the front, thus keeping the number of buckets constant. This is fast, but only an approximation.
A min-heap might suit your needs. There's an explanation here and I think STL provides the priority_queue for you.
Insertion time is O(log N), removal is O(log N)
It sounds like you need/want a priority queue. If memory serves, the priority queue adapter in the standard library is written to retrieve the largest items instead of the smallest, so you'll have to specify that it use std::greater for comparison.
Other than that, it provides just about exactly what you've asked for: the ability to quickly access/remove the smallest/largest item, and the ability to insert new items quickly. While it doesn't maintain all the items in order, it does maintain enough order that it can still find/remove the one smallest (or largest) item quickly.
I would start with a basic priority queue, and see if that's fast enough.
If not, then you can look at writing something custom.
http://en.wikipedia.org/wiki/Priority_queue
A binary tree is always sorted and has faster access times than a linear list. Search, insert and delete times are O(log(n)).
But it depends whether the items have to be sorted all the time, or only after the process is finished. In the latter case a hash table is probably faster. At the end of the process you then would copy the items to an array or a list and sort it.