Heap height includes root level

Is the heap root included in the count of the heap height?
The heap height should represent the number of levels of the heap structure.
According to the formulas
min number of elements = 2^(h-1),
max number of elements = 2^h - 1,
the root level is counted toward the heap height. However, I found some descriptions where that's not the case.
Question: is there any rule for when the root shouldn't be included in the count of the heap height?
Also, I already found the answer that in a heap, if both children have the same value and the parent element should be swapped, it is safe to swap with either one (there is no preference between the left and right child). Can someone confirm this, so I can completely discard any doubts?
UPDATE:
According to the definition of height, a tree consisting of only the root has height 0, so the correct formulas for the min and max number of elements are:
min = 2^h,
max = 2^(h+1) - 1.
Officially, by the definition of tree height, the root level isn't included in the height count. Thanks trincot for the help with resolving this doubt.
Although this is the correct way to express heap height, there are some resources online that include the root level in the height count (which is wrong by the official definition).
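As a quick sanity check of these formulas, here is a tiny sketch (my own illustration, with hypothetical helper names) that prints the min and max node counts for small heights:

    def min_nodes(h):
        # Levels 0..h-1 must be full (2^h - 1 nodes), plus at least
        # one node on level h, so the minimum is 2^h.
        return 2 ** h

    def max_nodes(h):
        # All levels 0..h full: 2^(h+1) - 1 nodes.
        return 2 ** (h + 1) - 1

    for h in range(4):
        print(h, min_nodes(h), max_nodes(h))
    # 0 1 1
    # 1 2 3
    # 2 4 7
    # 3 8 15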

As a heap data structure is a tree with specific properties (specifically, it is a rooted, complete, binary tree, whose values obey the heap property), the definition of the height of a heap is the same as the definition of the height of a rooted tree.
Wikipedia defines it as follows:
The height of a node is the length of the longest downward path to a leaf from that node. The height of the root is the height of the tree. [...] a tree with only a single node (hence both a root and leaf) has depth and height zero. Conventionally, an empty tree (tree with no nodes, if such are allowed) has height −1.
With this definition the height of a tree is one less than the number of levels in a tree.
This is the definition that is almost always used, but in rare cases you'll find a definition where the height is the number of levels (for instance here). That is the less popular convention.
Here is an argument in defence of the more popular definition:
Imagine the tree to be a pieced together set of ropes, each piece of rope having the same size (e.g. 10cm), and the places where the ends are tied together, are the vertices. Now pick up the "root" and hold it high up in the air. The ropes all hang vertically, all forming one vertical column. Now measure the distance from the root to the lowest point of this hanging structure: that is physically what we would call the height of that structure. It is the sum of lengths of the little connected ropes that form the longest path. This corresponds with the popular definition (as on Wikipedia) of a tree's height. It depends on the number of edges of the longest path, not on the number of nodes on that path.
As to your other question: yes, if a parent should sift down the heap and both children have the same value, then it is safe to swap it with either one.
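To make the tie case concrete, here is a minimal array-based max-heap sift-down sketch (my own illustration, not from the question): the strict > comparison means the left child wins ties; changing it to >= would prefer the right child, and either variant restores the heap property.

    def sift_down(a, i):
        # a is a max-heap stored in a Python list; children of i sit
        # at indices 2i+1 and 2i+2.
        n = len(a)
        while True:
            left, right = 2 * i + 1, 2 * i + 2
            largest = i
            if left < n and a[left] > a[largest]:
                largest = left
            # With equal children, this strict > keeps `largest` on the
            # left child; >= would pick the right one. Both are valid.
            if right < n and a[right] > a[largest]:
                largest = right
            if largest == i:
                return
            a[i], a[largest] = a[largest], a[i]
            i = largest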

Related

Obtain lowest sum from matrix, picking one value per row and column

Essentially I have a matrix of floats ranging from 0-1, and I need to find the combination of values with the lowest sum. The kicker is that once a value is selected, no other values from that row or column may be used. All the columns must be used.
If the matrix's width is greater than its height, it will be padded with 1s to make the matrix square. If the height is greater than the width, simply not all the rows will be used, but all of the columns must ALWAYS be used.
I have looked into binary trees and Dijkstra's algorithm for this task, but both seem to get far too complex with larger matrices. Ideally I'm looking for an algorithm or implementation that will provide a very good guess in a relatively short amount of time. Anything optimized for C++ would be great!
I think a greedy approach should work here for the "good guess"/optimized part:
Put all the elements in an array as tuples <value, row, column>.
Sort the array by the <value> component of the tuples.
Greedily pick elements from the beginning, keeping track of the used columns/rows using either a bitset or a boolean matrix, as @Thomas Mathews suggested (see the sketch below).
The total complexity will be O(NM log(NM)), where N is the number of rows and M the number of columns.
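A minimal sketch of that greedy pass (my own Python illustration, assuming the matrix has already been padded square as the question describes):

    def greedy_min_sum(matrix):
        n = len(matrix)
        # Step 1: flatten into (value, row, column) tuples.
        cells = [(matrix[r][c], r, c) for r in range(n) for c in range(n)]
        # Step 2: sort by value (tuples compare element-wise).
        cells.sort()
        # Step 3: greedily take cells whose row and column are still free.
        used_rows, used_cols = set(), set()
        total = 0.0
        for value, r, c in cells:
            if r not in used_rows and c not in used_cols:
                used_rows.add(r)
                used_cols.add(c)
                total += value
                if len(used_cols) == n:
                    break
        return total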
Amit's suggestion to change the title actually led me to the solution. It is an "Assignment Problem" and the solution is to use the Hungarian algorithm. I knew there had to be something out there already; I just wasn't sure how to phrase it to find the answer until now. Thanks for all the help.
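If a library is an option, SciPy exposes a solver for exactly this assignment problem (shown here in Python as an illustration; for C++ you would reach for one of the many Hungarian-algorithm implementations instead):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    cost = np.array([[0.9, 0.2, 0.7],
                     [0.4, 0.8, 0.1],
                     [0.6, 0.3, 0.5]])
    row_ind, col_ind = linear_sum_assignment(cost)  # solves the assignment problem
    print(cost[row_ind, col_ind].sum())  # lowest sum, one cell per row and column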
You can follow the Dijkstra algorithm for the shortest path, assuming you are constructing a tree. At the root node you select a length of 0, and for each node you select the next accessible element that gives you the shortest path from the root node, and store that length (from the root) in the node. At each iteration you add, for all the leaves, the arc that makes the total length smallest, and continue until you get a path of N nodes (or a bitmask of 0, see below). The first branch of N nodes from the root will be the shortest path. At each node you can store a bitmap of the already visited nodes (or you can determine it by looking at the parents), as the only nodes reachable from it are the unvisited ones. Or you can keep a bitmap of the non-visited ones instead. This makes the search easier, as you can stop as soon as no bits are set in the mask.
You have not shown any code or intent to solve the problem, so I'll do the same thing (it seems to be some kind of homework, and you seem not to have worked on it at all so far). This is an academic problem, already presented in many programming courses in relation to Simplex and operations research, in object/resource assignment, so there must be plenty of literature about it.

Path finding on a large list of nodes? Around 100,000 nodes

I have a list of nodes as 2D coordinates (arrays of float), and the goal is to find how many nodes are linked to a given source node.
Two nodes are defined as linked if the distance between them is less than or equal to 10. Linking is also transitive: if the distance between A and B is <= 10, the distance between B and C is <= 10, and the distance between A and C is > 10, then A and C are still linked, since the path would be A->B->C. So, in theory, it is a typical graph search problem.
Here is the problem: I have around 100,000 nodes in the list, each node being a 2D coordinate point. Since the list is enormous, conventional traversal and pathfinding algorithms like DFS or BFS would take O(n^2) just to construct the adjacency list, which is not ideal and not what I am looking for.
I researched on the internet and found that a quadtree or a k-d tree would probably be the best to implement in this case. I have made my own quadtree class; I just don't understand how to implement a search algorithm like DFS on it. Or is there something else that I am missing?
A quadtree groups points by splitting 2D space into quarters, either until each point has a quadrant to itself, or until you reach a minimum size, after which you lump all points within the quadrant into a list.
Since you're trying to find all points within a maximum distance of each point in your source list, you don't need to go all the way down to one-point-per-cell. To pick a cutoff, I would do performance tests on some different values, but as a starting point for minimum quadrant size the maximum connection distance for points is probably a good guess.
So now you have all of your points grouped into a tree and you need to know how to actually find nearby ones.
Since the quadtree encodes spatial information, to find points within a certain distance of any given point, you would descend the quadtree and use that spatial information to exclude entire quadrants from your search. To do this, you would check whether the nearest bound of each quadrant is beyond the maximum distance from the point you are searching from. If the closest edge of that quadrant is beyond the maximum distance, then none of the points in that quadrant can possibly be within the maximum distance, so there is no need to explore that part of the tree. (This is similar to how a binary search doesn't need to search parts of a sorted array or tree, because it knows that those parts cannot possibly contain the value being searched for).
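The exclusion test itself is just a point-to-box distance; a hypothetical helper (my own sketch, assuming axis-aligned quadrant bounds) captures the idea:

    def min_dist_to_box(px, py, x0, y0, x1, y1):
        # Distance from point (px, py) to the nearest edge of the
        # axis-aligned box [x0, x1] x [y0, y1]; 0 if the point is inside.
        dx = max(x0 - px, 0.0, px - x1)
        dy = max(y0 - py, 0.0, py - y1)
        return (dx * dx + dy * dy) ** 0.5

    # A quadrant can be skipped entirely when
    # min_dist_to_box(...) > max_connection_distance.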
Once you get down to the level of the quadtree where you have a single point or a list of points, you would do a regular Euclidean distance check with those points to see whether they are actually within the maximum distance. (Don't forget to check for equality; otherwise you'll find the same point you're searching around.)
So, for example, if you were searching for points near a point in the bottom-right top-level quadrant, there would be no need to search the other three top-level quadrants, because all three of them would be beyond the maximum distance. This saves you from exploring all of the sub-quadrants in those parts of the tree and from doing distance comparisons against all of those points.
If, however, you are searching for a point near the edge of a quadrant, you do need to check neighboring quadrants, because the nearest bound will be close enough that you cannot exclude the possibility of a valid point being in that quadrant.
In your particular case, you would make use of this by building the quadtree once, and then looping over the original list of points and doing the search I described above to find all points near that point. You would then use the found-points to build a connectivity graph, which could be efficiently traversed by Depth/Breadth-First-Search or could be given edge-weights to be used with a more complex, weighted search like Dijkstra's Algorithm or A*.
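Putting that together, here is a sketch of the final step (assuming a hypothetical query_radius(i) that uses the quadtree to return the indices of all points within the link distance of point i):

    from collections import deque

    def count_linked(source, query_radius):
        # Breadth-first search over the implicit graph: neighbours are
        # produced on demand by the quadtree query instead of an O(n^2)
        # precomputed adjacency list.
        seen = {source}
        frontier = deque([source])
        while frontier:
            i = frontier.popleft()
            for j in query_radius(i):  # all points within distance 10 of i
                if j not in seen:
                    seen.add(j)
                    frontier.append(j)
        return len(seen) - 1  # linked nodes, excluding the source itself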

Binary Tree Questions

Currently studying for an exam, and whilst reading through some notes, I had a few questions.
I know that the height of a binary search tree is log(n). Does this mean the depth is also log(n)?
What is the maximum depth of a node in a full binary tree with n nodes? This relates to the first question: if the height of a binary tree is log(n), would the maximum depth also be log(n)?
I know that the time complexity of searching for a node in a binary search tree is O(log(n)), which I understand. However, I read that the worst-case time complexity is O(n). In what scenario would it take O(n) time to find an element?
THIS IS A PRIORITY QUEUE/HEAP QUESTION. My lecture notes say the following:
If we use an (unsorted) array for a priority queue, enqueuing takes O(1) and dequeuing takes O(n). In a sorted array, enqueue takes O(n) and dequeue takes O(1).
I'm having a hard time understanding this. Can anyone explain?
Sorry for all the questions, really need some clarity on a few of these topics.
Caveat: I'm a little rusty, but here goes ...
Height and depth of a binary tree are synonymous, more or less. Height is the maximum depth along any path from root to leaf. But when you traverse a tree, you have a concept of current depth: the root node has depth 0, its children depth 1, its grandchildren depth 2. If we stop here, the tree has 3 levels, but the height and the maximum depth [we visited] are both 2. Otherwise, the two terms are often interchanged when talking about the tree overall.
Before we get to some more of your questions, it's important to note that binary trees come in various flavors: balanced or unbalanced. In a perfectly balanced tree, all nodes except those at maximum depth will have their left/right links non-null. For example, take n = 1024 nodes in the tree. Perfectly balanced, the height is log2(n), which is 10 (since 1024 == 2^10).
When you search a perfectly balanced tree, the search is O(log2(n)) because starting from the root node, you choose to follow either left or right, and each time you do, you eliminate 1/2 of the nodes. In such a tree with 1024 elements, the depth is 10 and you make 10 such left/right decisions.
Most tree algorithms, when you add a new node, will rebalance the tree on the fly (e.g. AVL or RB (red-black) trees). So you get a more-or-less perfectly balanced tree, all the time.
But ...
Let's consider a really bad algorithm. When you add a new node, it just appends it to the left link of the child with the greatest depth [or the new node becomes the new root]. The idea is a fast append, and "we'll rebalance later".
If we search this "bad" tree after adding n nodes, the tree looks like a doubly linked list using the parent link and the left link [remember, all right links are NULL]. This is a linear-time search, i.e. O(n).
We did this deliberately, but it can still happen with some tree algorithms and/or combinations of data. That is, the data is such that it naturally gets placed on the left link, because that's where it's correct to place it based on the algorithm's placement function.
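The tree above is contrived, but a plain, non-rebalancing BST degenerates the same way on already-sorted input; a short sketch (my own illustration):

    class Node:
        def __init__(self, key):
            self.key, self.left, self.right = key, None, None

    def insert(root, key):
        # Plain BST insert with no rebalancing.
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        else:
            root.right = insert(root.right, key)
        return root

    root = None
    for k in range(1, 11):  # sorted input: 1, 2, ..., 10
        root = insert(root, k)
    # Every key goes to the right: the "tree" is now a chain of 10 nodes,
    # and searching for 10 visits all of them -- O(n), not O(log n).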
Priority queues are like regular queues except each piece of data has a priority number associated with it.
In an ordinary queue, you just push/append onto the end. When you dequeue, you shift/pop from the front. You never need to insert anything in the middle. Thus, enqueue and dequeue are both O(1) operations.
The O(n) comes from the fact that if you have to do an insertion into the middle of an array, you have to "part the waters" to make space for the element you want to insert. For example, if you need to insert after the first element [which is array[0]], you will be placing the new element at array[1], but first you have to move array[1] to array[2], array[2] to array[3], ... For an array of n, this is O(n) effort.
When removing an element from an array, it is similar, but in reverse. If you want to remove array[1], you grab it, then you must "close the gap" left by your removal by array[1] = array[2], array[2] = array[3], ... Once again, an O(n) operation.
In a sorted array, to dequeue you just pop off the end; it's the one you want already. Hence O(1). To add an element, it's an insertion into the correct place. If your array is 1,2,3,7,9,12,17 and you want to add 6, that becomes the new value at array[3], and you first have to move 7,9,12,17 out of the way as above.
An unsorted priority queue just appends to the array, hence O(1). But to find the correct element to dequeue, you scan the array: array[0], array[1], ..., remembering the position of the first element with a given priority, and if you find a better priority, you remember that instead. When you hit the end, you know which element you need; say it's at index j. Now you have to remove index j from the array, and that's an O(n) operation as above.
It's slightly more complex than all that, but not by too much.
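The two layouts side by side as code (my own sketch; the element's value doubles as its priority, larger meaning higher):

    import bisect

    # Unsorted array: O(1) enqueue, O(n) dequeue.
    def enqueue_unsorted(a, x):
        a.append(x)  # just tack it onto the end

    def dequeue_unsorted(a):
        j = max(range(len(a)), key=lambda i: a[i])  # O(n) scan for the best
        return a.pop(j)  # popping from the middle closes the gap: also O(n)

    # Sorted array: O(n) enqueue, O(1) dequeue.
    def enqueue_sorted(a, x):
        bisect.insort(a, x)  # finding the slot is O(log n), but shifting
                             # the elements after it is O(n)

    def dequeue_sorted(a):
        return a.pop()  # highest priority is already at the end: O(1)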

Prim's algorithm for dynamic locations

Suppose you have an input file:
<total vertices>
<x-coordinate 1st location> <y-coordinate 1st location>
<x-coordinate 2nd location> <y-coordinate 2nd location>
<x-coordinate 3rd location> <y-coordinate 3rd location>
...
How can Prim's algorithm be used to find the MST for these locations? I understand this problem is typically solved using an adjacency matrix. Any references would be great if applicable.
If you already know Prim's, it is easy: create the adjacency matrix with adj[i][j] = distance between location i and location j.
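For example, in Python (a sketch; locations is assumed to be the list of (x, y) pairs read from the input file):

    from math import dist  # Python 3.8+

    def build_adjacency(locations):
        # adj[i][j] = Euclidean distance between location i and location j
        n = len(locations)
        return [[dist(locations[i], locations[j]) for j in range(n)]
                for i in range(n)]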
I'm just going to describe some implementations of Prim's and hopefully that gets you somewhere.
First off, your question doesn't specify how edges are input to the program. You have a total number of vertices and the locations of those vertices. How do you know which ones are connected?
Assuming you have the edges and their weights (as @doomster said above, the weight may be the planar distance between the points, since they are coordinates), we can start thinking about our implementation. Wikipedia describes three different data structures that result in three different run times: http://en.wikipedia.org/wiki/Prim's_algorithm#Time_complexity
The simplest is the adjacency matrix. As you might guess from the name, the matrix describes nodes that are "adjacent". To be precise, there are |v| rows and columns (where |v| is the number of vertices). The value at adjacencyMatrix[i][j] varies depending on the usage. In our case it's the weight of the edge (i.e. the distance) between node i and j (this means that you need to index the vertices in some way. For instance, you might add the vertices to a list and use their position in the list).
Now using this adjacency matrix our algorithm is as follows:
Create a dictionary which contains all of the vertices and is keyed by "distance". Initially the distance of all of the nodes is infinity.
Create another dictionary to keep track of "parents". We use this to generate the MST. It's more natural to keep track of edges, but it's actually easier to implement by keeping track of "parents". Note that if you root a tree (i.e. designate some node as the root), then every node (other than the root) has precisely one parent. So by producing this dictionary of parents we'll have our MST!
Create a new list with a randomly chosen node v from the original list.
Remove v from the distance dictionary and add it to the parent dictionary with a null as its parent (i.e. it's the "root").
Go through the row in the adjacency matrix for that node. For any node w that is connected (for non-connected nodes you have to set their adjacency matrix value to some special value. 0, -1, int max, etc.) update its "distance" in the dictionary to adjacencyMatrix[v][w]. The idea is that it's not "infinitely far away" anymore... we know we can get there from v.
While the dictionary is not empty (i.e. while there are nodes we still need to connect to)
Look over the dictionary and find the vertex with the smallest distance x
Add it to our new list of vertices
For each of its neighbors, update their distance to min(adjacencyMatrix[x][neighbor], distance[neighbor]) and also update their parent to x. Basically, if there is a faster way to get to neighbor then the distance dictionary should be updated to reflect that; and if we then add neighbor to the new list we know which edge we actually added (because the parent dictionary says that its parent was x).
We're done. Output the MST however you want (everything you need is contained in the parents dictionary)
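Here is a rough translation of those steps into code (a sketch, not a tuned implementation; math.inf plays the role of the "special value" for non-connected pairs, and vertex 0 stands in for the randomly chosen start node):

    import math

    def prim(adj):
        # adj: |V| x |V| matrix of edge weights, math.inf where no edge exists.
        # Assumes a connected graph.
        n = len(adj)
        parent = {0: None}                       # vertex 0 is the root
        distance = {w: adj[0][w] for w in range(1, n)}
        best_parent = {w: 0 for w in range(1, n)}
        while distance:
            x = min(distance, key=distance.get)  # O(|V|) scan for the minimum
            del distance[x]
            parent[x] = best_parent[x]           # lock in the MST edge to x
            for w in distance:                   # relax x's remaining neighbours
                if adj[x][w] < distance[w]:
                    distance[w] = adj[x][w]
                    best_parent[w] = x
        return parent  # parent[v] is v's neighbour in the MST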
I admit there is a bit of a leap from the wikipedia page to the actual implementation as outlined above. I think the best way to approach this gap is to just brute force the code. By that I mean, if the pseudocode says "find the min [blah] such that [foo] is true" then write whatever code you need to perform that, and stick it in a separate method. It'll definitely be inefficient, but it'll be a valid implementation. The issue with graph algorithms is that there are 30 ways to implement them and they are all very different in performance; the wikipedia page can only describe the algorithm conceptually. The good thing is that once you implement it some way, you can find optimizations quickly ("oh, if I keep track of this state in this separate data structure, I can make this lookup way faster!"). By the way, the runtime of this is O(|V|^2). I'm too lazy to detail that analysis, but loosely it's because:
All initialization is O(|V|) at worst
We do the loop O(|V|) times and take O(|V|) time to look over the dictionary to find the minimum node. So basically the total time to find the minimum node multiple times is O(|V|^2).
The time it takes to update the distance dictionary is O(|E|) because we only process each edge once. Since |E| is O(|V|^2) this is also O(|V|^2)
Keeping track of the parents is O(|V|)
Outputting the tree is O(|V| + |E|) = O(|E|) at worst
Adding all of these (none of them should be multiplied except within (2)) we get O(|V|^2)
The implementation with a heap is O(|E| log |V|), and it's very similar to the above. The only difference is that updating the distance is O(log |V|) instead of O(1), BUT finding/removing the min element is O(log |V|) instead of O(|V|) (both because it's a heap). The analysis is quite similar, and you end up with something like O(|V| log |V| + |E| log |V|) = O(|E| log |V|) as desired.
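And a sketch of that heap variant using Python's heapq (my own illustration; since heapq has no decrease-key operation, this uses the common lazy-deletion workaround of skipping stale entries when they are popped):

    import heapq

    def prim_heap(adj):
        # adj: adjacency list; adj[v] is a list of (u, weight) pairs.
        # Assumes a connected graph.
        n = len(adj)
        parent = {}
        in_tree = [False] * n
        pq = [(0.0, 0, 0)]                 # (edge weight, vertex, its parent)
        while pq:
            w, v, p = heapq.heappop(pq)    # O(log |V|) pop of the min edge
            if in_tree[v]:
                continue                   # stale entry: v joined the tree already
            in_tree[v] = True
            parent[v] = p
            for u, weight in adj[v]:
                if not in_tree[u]:
                    heapq.heappush(pq, (weight, u, v))
        parent[0] = None                   # the root has no parent
        return parent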
Actually... I'm a bit confused about why the adjacency-matrix implementation cares about it being an adjacency matrix. It could just as well be implemented using an adjacency list. I think the key part is how you store the distances. I could be way off in my implementation outlined above, but I am pretty sure it implements Prim's algorithm and satisfies the time complexity constraints outlined by Wikipedia.

Given an even number of vertices, how to find an optimum set of pairs based on proximity?

The problem:
We have a set of n vertices in 3D euclidean space, and there is an even number of these vertices.
We want to pair them up based on their proximity. In other words, we'd like to be able to find a set of vertex pairs, where the vertices in each pair are as close as possible together.
In doing this, we want to sacrifice the proximity between the vertices of the other pairs as little as possible.
I am not looking for the most optimal solution (if it even strictly exists/can be done), just a reasonable one that can be computed relatively quickly.
A relatively awful brute-force approach is to choose a vertex, loop through the rest to find its nearest neighbour, pair them, and repeat until none are left. Of course, as we near the end of the list, the closest remaining vertex could be very far away, but it is the only choice, so this can fail badly on the third point above.
A common approach for this kind of problem (especially if n is large) is to precompute a spatial index structure, such as a k-d tree or an octree, and perform the search for nearest neighbours with its help. The nodes of the octree put the available points into bins, so you can be sure the points in a bin are mutually close, and you also minimize the number of comparisons.
A sketch of the implementation with an octree: you need a Node class that stores its bounding box. A derived LeafNode class stores a small number of points up to a maximum (e.g. k = 20), which are added with an insert function. A derived NonLeafNode class stores references to 8 subnodes (which may be either LeafNodes or NonLeafNodes).
The tree is represented by a root node; all insertions and queries start there. The tree is built up by inserting the first k points into a LeafNode. When the (k+1)-st point is inserted, the bounding box is split into 8 sub-boxes and the contained points are sorted into them. The current LeafNode is replaced by a NonLeafNode with 8 subnodes.
This is iterated until all points are in the tree.
For nearest-neighbour searches, the tree is traversed starting from the root node by comparing against the bounding boxes. If the query point is within a node's bounding box, the traversal goes into that node. Note that even once you have found a nearest candidate, you also need to check the neighbouring nodes in the octree, since a closer point may lie just across a box boundary.
For a k-d tree implementation, check the Wikipedia page; it looks quite straightforward.
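If an off-the-shelf index is acceptable, SciPy's cKDTree already implements this kind of structure; a sketch of the nearest-neighbour step with it:

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(1000, 3)  # n x 3 array of 3D vertices
    tree = cKDTree(points)
    # k=2: the closest hit for a point queried against its own tree is
    # the point itself, so the second hit is the actual nearest neighbour.
    dists, idx = tree.query(points, k=2)
    nearest = idx[:, 1]  # nearest[i] is the index of point i's neighbour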
Since you are not looking for an optimal solution, here's a heuristic you may consider.
For each point p, compute two points: its nearest neighbour and its farthest neighbour (the points closest to and farthest from p, respectively). Now let q be the point whose farthest neighbour is farthest away (q is an extreme point of the input). Match q with its nearest neighbour, delete both of them, and recursively compute the matching for the remaining points.
This is certainly NOT optimal, but it does seem to do reasonably well on small input sets. If you need an optimal solution, you should read about the Euclidean matching problem.
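A sketch of this heuristic (my own illustration, using brute-force neighbour scans for clarity, O(n^2) per step; in practice you would back the nearest/farthest lookups with a spatial index like the octree above):

    from math import dist  # works for 3D tuples in Python 3.8+

    def pair_up(points):
        # points: list of distinct (x, y, z) tuples; len(points) is even.
        remaining = list(points)
        pairs = []
        while remaining:
            # q: the point whose farthest neighbour is farthest away.
            q = max(remaining,
                    key=lambda p: max(dist(p, r) for r in remaining if r != p))
            nearest = min((r for r in remaining if r != q),
                          key=lambda r: dist(q, r))
            pairs.append((q, nearest))
            remaining.remove(q)
            remaining.remove(nearest)
        return pairs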