Confused about definition of a 'median' when constructing a kd-Tree

I'm trying to build a kd-tree for searching through a set of points, but am getting confused about the use of 'median' in the Wikipedia article. For ease of reference, the Wikipedia article gives the pseudo-code for kd-tree construction as:
function kdtree (list of points pointList, int depth)
{
    if pointList is empty
        return nil;
    else
    {
        // Select axis based on depth so that axis cycles through all valid values
        var int axis := depth mod k;
        // Sort point list and choose median as pivot element
        select median by axis from pointList;
        // Create node and construct subtrees
        var tree_node node;
        node.location := median;
        node.leftChild := kdtree(points in pointList before median, depth+1);
        node.rightChild := kdtree(points in pointList after median, depth+1);
        return node;
    }
}
I'm getting confused about the "select median..." line, simply because I'm not quite sure what is the 'right' way to apply a median here.
As far as I know, the median of an odd-sized (sorted) list of numbers is the middle element (i.e., for a list of 5 things, element number 3, or index 2 in a standard zero-based array), and the median of an even-sized list is the sum of the two 'middle' elements divided by two (i.e., for a list of 6 things, the median is the sum of elements 3 and 4, or 2 and 3 if zero-indexed, divided by 2).
However, surely that definition does not work here as we are working with a distinct set of points? How then does one choose the correct median for an even-sized list of numbers, especially for a length 2 list?
I appreciate any and all help, thanks!
-Stephen

It appears to me that you understand the meaning of median, but are confused about something else. What do you mean by a distinct set of points?
The code presented by Wikipedia is a recursive function. You have a set of points, so you create a root node and choose a median of the set. Then you call the function recursively - for the left subtree you pass in a parameter with all the points smaller than the split-value (the median) of the original list, for the right subtree you pass in the equal and larger ones. Then for each subtree a node is created where the same thing happens. It goes like this:
First step (root node):
Original set: 1 2 3 4 5 6 7 8 9 10
Split value (median): 5.5
Second step - left subtree:
Set: 1 2 3 4 5
Split value (median): 3
Second step - right subtree:
Set: 6 7 8 9 10
Split value (median): 8
Third step - left subtree of left subtree:
Set: 1 2
Split value (median): 1.5
Third step - right subtree of left subtree:
Set: 3 4 5
Split value (median): 4
Etc.
So the median is chosen for each node in the tree based on the set of numbers (points, data) which go into that subtree. Hope this helps.
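As a rough illustration of the even-sized case, here is a minimal sketch of the construction in C++. It assumes one common convention that is not spelled out above: the "median" is simply the point that lands at the middle index (the upper of the two middle elements for an even-sized list), so the split is always an actual point and never an average. The names Point, KdNode and buildKdTree are made up for this example, and k is fixed at 2.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

using Point = std::array<double, 2>;  // assumption: k = 2 for this sketch

struct KdNode {
    Point location;
    std::unique_ptr<KdNode> left, right;
};

// Builds a kd-tree over points[first, last). The element at the middle index
// after a partial sort along the current axis becomes the node's location.
std::unique_ptr<KdNode> buildKdTree(std::vector<Point>& points,
                                    std::size_t first, std::size_t last,
                                    std::size_t depth = 0) {
    if (first >= last) return nullptr;
    const std::size_t axis = depth % 2;
    const std::size_t mid = first + (last - first) / 2;
    // nth_element places the median at mid, smaller-or-equal points before it
    // and larger-or-equal points after it, without fully sorting the range.
    std::nth_element(points.begin() + first, points.begin() + mid,
                     points.begin() + last,
                     [axis](const Point& a, const Point& b) {
                         return a[axis] < b[axis];
                     });
    auto node = std::make_unique<KdNode>();
    node->location = points[mid];
    node->left = buildKdTree(points, first, mid, depth + 1);
    node->right = buildKdTree(points, mid + 1, last, depth + 1);
    return node;
}
```

With this convention a list of length 2 puts the larger point in the node and the smaller one in the left subtree, leaving the right subtree empty.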

You have to choose an axis with as many elements on one side as on the other. If the number of points is odd, or the points are positioned in such a way that this isn't possible, just choose an axis that gives as even a distribution as possible.

Related

Pre-Order Traversals Of All Possible Binary Trees Given In-Order Traversal

I was looking at questions on binary trees, and I came across the following:
Given the in-order traversal of a binary tree, print the pre-order traversals of all possible binary trees satisfying the given in-order traversal.
For e.g, if the in-order traversal is: {4, 5, 7}
The possible trees are:
4         4         5         7         7
 \         \       / \       /         /
  5         7     4   7     4         5
   \       /                 \       /
    7     5                   5     4
Therefore, the pre-order traversals are:
4 5 7
4 7 5
5 4 7
7 4 5
7 5 4
The solution I came up with:
Traverse the given in-order list. Upon each iteration, select an element from the list and make it the root of our tree. All elements preceding the current one will be part of the left subtree and all elements succeeding it will form the right subtree. We can then recursively do the same for the left and right subtrees.
For instance, in the above example, I begin by selecting 4 as the root of my tree. Now since there are no elements preceding 4, I cannot have a left subtree. I look at the remaining elements. They will form the right subtree. I select 5 to be the root of this subtree. For this, I am only left with one choice: to construct the right subtree of 5 from 7. That gives the first tree of the example.
Now, I keep 4 as the root, and instead of selecting 5, I select 7 as the root of the right subtree of 4. This leads me to the second tree of the example above.
That much is fine. The problem comes up with the code. I've spent quite some time translating the above solution to code, but I haven't been completely successful.
This is what I've tried in C++:
void solve(vector<int> inOrder, int beg, int end, string &s, bool &flag)
{
    for(int i = beg; i <= end; ++i)
    {
        s += to_string(inOrder[i]);
        flag = false;
        solve(inOrder, beg, i - 1, s, flag);
        solve(inOrder, i + 1, end, s, flag);
        if(s.size() == inOrder.size()) {flag = true; cout << s << endl;}
        if(s.size() && flag) {s.pop_back();}
    }
}
I use a string to store the current permutation of elements in the pre-order traversal. Elements are appended to the string when a permutation satisfies the in-order traversal.
Naturally, elements must be subsequently removed from the string to make way for other permutations. However, I haven't been able to figure out when to remove an element. In the above code, I append an element the first time it is encountered, and I start removing elements when the string has size equal to that of the in-order list.
So, let's say I begin with 4. I append 4 to the string. I do the same for 5 and then 7. Now the string size equals the total number of elements. So I remove the last one. My string is now 45. Since there are no more possible combinations with the current string, I remove 5. I'm left with 4. Now, I can append 7 and then 5, leading to 475. This works fine in this case, but I haven't been able to make it work for other combinations. It fails when I begin by making 5 as the root, instead of 4.
So my question is, how exactly should I proceed to solve the above problem? Am I even doing it the right way? Or should I give up this approach and think of something else? If yes, what direction should I proceed in?
I'm not looking for an exact solution, only a hint as to what I'm missing or what I could do better.
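One possible direction, sketched under the assumption that it is acceptable to return the traversals instead of printing them while recursing: have each call return every pre-order sequence for its sub-range, and combine each left result with each right result. That sidesteps the question of when to remove elements from a shared string. The function name allPreOrders and the half-open range [beg, end) are choices made for this sketch, not part of the original attempt.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Returns the pre-order traversal of every binary tree whose in-order
// traversal is inOrder[beg, end). An empty range yields one empty traversal,
// which stands for "no subtree here".
std::vector<std::vector<int>> allPreOrders(const std::vector<int>& inOrder,
                                           std::size_t beg, std::size_t end) {
    if (beg >= end) return std::vector<std::vector<int>>(1);
    std::vector<std::vector<int>> result;
    for (std::size_t root = beg; root < end; ++root) {
        // Everything before root forms the left subtree, everything after it
        // forms the right subtree.
        const auto lefts = allPreOrders(inOrder, beg, root);
        const auto rights = allPreOrders(inOrder, root + 1, end);
        for (const auto& l : lefts) {
            for (const auto& r : rights) {
                std::vector<int> pre{inOrder[root]};  // root first
                pre.insert(pre.end(), l.begin(), l.end());
                pre.insert(pre.end(), r.begin(), r.end());
                result.push_back(std::move(pre));
            }
        }
    }
    return result;
}
```

For the in-order list {4, 5, 7}, calling allPreOrders(inOrder, 0, 3) should give the five sequences listed in the question.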

Quicksort w/ "median of three" pivot selection: Understanding the process

We're being introduced to Quicksort (with arrays) in our class. I've been running into walls trying to wrap my head around how they want our Quicksort assignment to work with the "median of three" pivot selection method. I just need a high-level explanation of how it all works. Our text doesn't help and I'm having a hard time Googling to find a clear explanation.
This is what I think I understand so far:
The "median of three" function takes the elements at index 0 (first), array_end_index (last), and (index 0 + array_end_index)/2 (middle). The index with the median value of those 3 is calculated. The corresponding index is returned.
Function parameters below:
/* @param left
* the left boundary for the subarray from which to find a pivot
* @param right
* the right boundary for the subarray from which to find a pivot
* @return
* the index of the pivot (middle index); -1 if provided with invalid input
*/
int QS::medianOfThree(int left, int right){}
Then, in the "partition" function, the number whose index matches with the one returned by the "median of three" function acts as the pivot. My assignment states that, in order to proceed with partitioning the array, the pivot must lie in-between the left and right boundaries. The problem is, our "median of three" function returned one of three indices: the first, the middle, or the last index. Only one of those three indices(middle) could ever be "in-between" anything.
Function parameters below:
/* @param left
* the left boundary for the subarray to partition
* @param right
* the right boundary for the subarray to partition
* @param pivotIndex
* the index of the pivot in the subarray
* @return
* the pivot's ending index after the partition completes; -1 if
* provided with bad input
*/
int QS::partition(int left, int right, int pivotIndex){}
What am I misunderstanding?
Here are the entire descriptions of the functions:
/*
* sortAll()
*
* Sorts elements of the array. After this function is called, every
* element in the array is less than or equal its successor.
*
* Does nothing if the array is empty.
*/
void QS::sortAll(){}
/*
* medianOfThree()
*
* The median of three pivot selection has two parts:
*
* 1) Calculates the middle index by averaging the given left and right indices:
*
* middle = (left + right)/2
*
* 2) Then bubble-sorts the values at the left, middle, and right indices.
*
* After this method is called, data[left] <= data[middle] <= data[right].
* The middle index will be returned.
*
* Returns -1 if the array is empty, if either of the given integers
* is out of bounds, or if the left index is not less than the right
* index.
*
* @param left
* the left boundary for the subarray from which to find a pivot
* @param right
* the right boundary for the subarray from which to find a pivot
* @return
* the index of the pivot (middle index); -1 if provided with invalid input
*/
int QS::medianOfThree(int left, int right){}
/*
* Partitions a subarray around a pivot value selected according to
* median-of-three pivot selection.
*
* The values which are smaller than the pivot should be placed to the left
* of the pivot; the values which are larger than the pivot should be placed
* to the right of the pivot.
*
* Returns -1 if the array is null, if either of the given integers is out of
* bounds, or if the first integer is not less than the second integer, OR IF THE
* PIVOT IS NOT BETWEEN THE TWO BOUNDARIES.
*
* @param left
* the left boundary for the subarray to partition
* @param right
* the right boundary for the subarray to partition
* @param pivotIndex
* the index of the pivot in the subarray
* @return
* the pivot's ending index after the partition completes; -1 if
* provided with bad input
*/
int QS::partition(int left, int right, int pivotIndex){}

Start with understanding quicksort first, median-of-three next.
To perform a quicksort you:
1. Pick an item from the array you are sorting (any item will do, but which is the best one to go for we'll come back to).
2. Reorder the array so that all items less than the one you picked are before it in the array, and all of those greater than it are after it.
3. Recursively do the above to the sets before and after the item you picked.
Step 2 is called the "partition operation". Consider if you had the following:
3 2 8 4 1 9 5 7 6
Now say you picked the first of those numbers as your pivot element (the one we picked at step 1). After we apply step 2 we end up with something like:
2 1 3 4 8 9 5 7 6
The value 3 is now in the correct place, and every element is on the correct side of it. If we now sort the left-hand side we end up with:
1 2 3 4 8 9 5 7 6.
Now, let's consider just the elements to the right of it:
4 8 9 5 7 6.
If we pick 4 to pivot next we end up changing nothing, as it was in the correct position to begin with. The set of elements to the left of it is empty, so there is nothing to do there. We now need to sort the set:
8 9 5 7 6.
If we use 8 as our pivot we could end up with:
5 7 6 8 9.
The 9 now on its right is only one element, so obviously already sorted. The 5 7 6 is left to sort. If we pivot on the 5 we end up leaving it alone, and we just need to sort 7 6 into 6 7.
Now, considering all those changes in the wider context, what we have ended up with is:
1 2 3 4 5 6 7 8 9.
So to summarise again, quicksort picks one item, moves elements around it so that they are all correctly positioned relative to that one item, and then does the same thing recursively with the remaining two sets until there are no unsorted blocks left, and everything is sorted.
Let's come back to the matter I fudged over there when I said "any item will do". While it's true that any item will do, which item you do pick will affect the performance. If you are lucky you will end up doing this in a number of operations proportional to n log n, where n is the number of elements. If you're just reasonably lucky it'll be a slightly bigger number, still proportional to n log n. If you're really unlucky it'll be a number proportional to n^2.
So which is the best item to pick? The best number is the item that will end up in the middle after you have done the partition operation. But we don't know what item that is, because to find the middle item we have to sort all of the items, and that's what we were trying to do in the first place.
So, there are a few strategies we can take:
Just go for the first one, because, meh, why not?
Go for the middle one, because maybe the array is already sorted or nearly sorted for some reason, and if not it's not any worse a choice than any other.
Pick a random one.
Pick the first one, middle one and last one, and go for the median of those three, because it's at least going to be the best of those three options.
Pick the median-of-three for the first third of the array, the median-of-three of the second third, the median-of-three of the last third and then finally go with the median of those three medians.
These have different pros and cons, but generally speaking each of those options gives better results in picking the best pivot than the previous, at the cost of spending more time and effort on picking that pivot. (Random has the further advantage of beating cases where someone is deliberately trying to create data that you will have worst-case behaviour on, as part of some sort of DoS attack.)
"My assignment states that, in order to proceed with partitioning the array, the pivot must lie in-between the left and right boundaries."
Yes. Consider above again when we had sorted 3 into the correct position, and sorted the left:
1 2 3 4 8 9 5 7 6.
Now, here we need to sort the range 4 8 9 5 7 6. The boundary is the line between the 3 and the 4 and the line between the 6 and the end of the array (or, another way of looking at it, the boundary is the 4 and the 6, but it's an inclusive boundary including those items). The three we pick are hence the 4 (first), the 6 (last) and either the 9 or the 5, depending on whether we round down or up in dividing the count by 2 (we probably round down, as that's usual in integer division, so the 9). All of these are inside the boundary of the partition we are currently sorting. Our median-of-three is hence 6 (or, if we did round up, we'd have gone for the 5).
(Incidentally, a magically perfect pivot-chooser that always picked the best pivot would have picked either the 6 or the 7, so picking 6 is pretty good here, though there are still times when median-of-three will be unlucky and end up picking the 3rd worst option, or perhaps even an arbitrary choice out of 3 equal elements, all of which were the worst. It's just much less likely to happen than with other approaches.)

The documentation for medianOfThree says:
* 2) Then bubble-sorts the values at the left, middle, and right indices.
*
* After this method is called, data[left] <= data[middle] <= data[right].
* The middle index will be returned.
So your description does not match the documentation. What you are doing is sorting the first, middle and last elements in-place in your data, and always returning the middle index.
So it is guaranteed that the pivot index lies between the boundaries (unless the middle index happens to coincide with a boundary...).
Even so, there's nothing wrong with pivoting the boundaries...
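To make that concrete, here is a minimal sketch of how the documented behaviour could fit together. It is an assumption-laden illustration, not the assignment's reference solution: the functions are free functions taking the data as a std::vector<int> parameter instead of members of the QS class, the partition function is named partitionRange here, and it uses a simple "park the pivot at the right end" scheme that the assignment does not prescribe.

```cpp
#include <utility>
#include <vector>

// Orders data[left], data[middle], data[right] in place and returns middle.
// Returns -1 on invalid input, mirroring the documentation quoted above.
int medianOfThree(std::vector<int>& data, int left, int right) {
    const int n = static_cast<int>(data.size());
    if (n == 0 || left < 0 || right >= n || left >= right) return -1;
    const int middle = left + (right - left) / 2;
    // A three-element bubble sort: afterwards left <= middle <= right.
    if (data[left] > data[middle]) std::swap(data[left], data[middle]);
    if (data[middle] > data[right]) std::swap(data[middle], data[right]);
    if (data[left] > data[middle]) std::swap(data[left], data[middle]);
    return middle;
}

// Partitions data[left..right] (inclusive) around data[pivotIndex] and
// returns the pivot's final index, or -1 on invalid input.
int partitionRange(std::vector<int>& data, int left, int right, int pivotIndex) {
    const int n = static_cast<int>(data.size());
    if (n == 0 || left < 0 || right >= n || left >= right ||
        pivotIndex < left || pivotIndex > right) return -1;
    std::swap(data[pivotIndex], data[right]);  // park the pivot at the end
    const int pivot = data[right];
    int store = left;
    for (int i = left; i < right; ++i) {
        if (data[i] < pivot) std::swap(data[i], data[store++]);
    }
    std::swap(data[store], data[right]);  // move the pivot into its final place
    return store;
}

// Sorts data[left..right] (inclusive) using the two helpers above.
void quickSort(std::vector<int>& data, int left, int right) {
    if (left >= right) return;
    const int pivotIndex = medianOfThree(data, left, right);
    const int p = partitionRange(data, left, right, pivotIndex);
    quickSort(data, left, p - 1);
    quickSort(data, p + 1, right);
}
```

In this sketch "between the boundaries" is read inclusively (left <= pivotIndex <= right), which is one interpretation of the assignment's wording.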

Calculating the "median of three" is a way to get a pseudo-median element of your array and to use the value at that index as your pivot. It's a simple way to get a rough estimate of what the median of the array would be, leading to better performance.
Why would this be useful? Because in theory, you want to have this partition value to be the true median of your array, so when you do quicksort on this array, the pivot would have divided this array equally and enables the nice O(NlogN) sorting time that quick sort gives you.
Example: Your array is:
[5,3,1,7,9]
The median of three would look at 5, 1, and 9. The median is obviously 5, so this is the pivot value we want to consider for the partition function of quick sort. What you can do next is swap this index with the last and get
[9,3,1,7,5]
Now we attempt to have all values that are smaller than 5 on the left of the middle, all values bigger than five on the right of the middle. We now get
[1,3,7,9,5]
Swap the last element (where we stored the partition value) with the middle
[1,3,5,9,7]
And that's the idea of using the median of three. Imagine if our pivot was 1 or 9: the array we would get would not be a good case for quicksort.

Finding PostOrder traversal from LevelOrder traversal without constructing the tree

Given a binary tree where the value of each internal node is 1 and of each leaf node is 0. Every internal node has exactly two children. Now, given the level-order traversal of this tree, return the postorder traversal of the same tree.
This question can be easily solved if I construct the tree and then do its postorder traversal, which takes O(n) time. But is it possible to print the postorder traversal without building up the tree?

It's definitely possible.
Assuming it's a perfect binary tree (every level completely filled), the shape of the tree is uniquely determined once the number of nodes is known.
Deem the level order traversal as an array, for example, 1 2 3 4 5 6 7.
It represents such tree:
1
2 3
4 5 6 7
What you want to get is the post order traversal: 4 5 2 6 7 3 1.
The first step is to calculate how deep the tree is:
depth = ceil(log(2, LevelOrderArray.length)) // =3 for this example
after that, set up a counter = 0, and extract nodes from the bottom level of the original array, one by one:
for(i=0, i<LevelOrderArray.length, i++){
postOrderArray[i] = LevelOrderArray[ 2 ^ (depth-1) +i ] //4,5,....
counter++; //1,2,.....
}
But notice that whenever the counter is divisible by 2, you need to retrieve another node from the level above:
if(counter mod 2^1 == 0)
postOrderArray[i] = LevelOrderArray[ 2 ^ (depth -2) + (++i) ] // =2 here,
//which is the node you need after 4 and 5, and get 3 after 6 and 7 at the 2nd round
Don't increment the counter here, because the counter represents how many nodes you have retrieved from the bottom level.
Every time 2^2 = 4 nodes have been popped out, retrieve another node from the 3rd level (counting from the bottom):
if(counter mod 2^2 == 0)
postOrderArray[i] = LevelOrderArray[ 2 ^ (depth -3) + (++i) ] // =1
Every time 2^3 = 8 nodes have been popped out, again, retrieve another node from the 4th level
.... until the loop is finished.
This is not strict C++ code, only the concept. If you fully understand the algorithm, the values of the tree nodes don't matter at all, even though they are all 0 and 1. The point is that you don't build the tree in the program; you build it in your mind instead and convert that into an algorithm.
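For comparison, the same idea can be written more compactly with index arithmetic. This is only a sketch under the assumption made in this answer, namely that every level is completely filled, so the children of the node at index i of the level-order array sit at 2*i + 1 and 2*i + 2. It would not handle an arbitrary tree in which only some internal nodes have children on a given level. The function name postOrderFromLevelOrder is made up for the example.

```cpp
#include <cstddef>
#include <vector>

// Appends the post-order traversal of the subtree rooted at index i of a
// level-order array to out. No tree nodes are allocated; the array itself
// plays the role of the tree.
void postOrderFromLevelOrder(const std::vector<int>& levelOrder, std::size_t i,
                             std::vector<int>& out) {
    if (i >= levelOrder.size()) return;                   // past the last level
    postOrderFromLevelOrder(levelOrder, 2 * i + 1, out);  // left subtree
    postOrderFromLevelOrder(levelOrder, 2 * i + 2, out);  // right subtree
    out.push_back(levelOrder[i]);                         // node itself last
}
```

Starting from index 0 on the array 1 2 3 4 5 6 7 produces 4 5 2 6 7 3 1, the post-order given above.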

Calculating the vertical sum in binary tree

This is some code I came across which calculates the vertical sum in a binary tree. As the code doesn't have any documentation at all, I am unable to understand how it actually works and what exactly the condition if(base==hd) does.
Help needed :)
void vertical_line(int base,int hd,struct node * node)
{
    if(!node) return;
    vertical_line(base-1,hd,node->left);
    if(base==hd) cout<<node->data<<" ";
    vertical_line(base+1,hd,node->right);
}
void vertical_sum(struct node * node)
{
    int l=0,r=0;
    struct node * temp=node;
    while(temp->left){
        --l;temp=temp->left;
    }
    temp=node;
    while(temp->right){
        ++r;temp=temp->right;
    }
    for(int i=l;i<=r;i++)
    {
        cout<<endl<<"VERTICAL LINE "<<i-l+1<<" : ";
        vertical_line(0,i,node);
    }
}

It is trying to display the tree in vertical order, so let's first understand what vertical order is.
Take the following tree as an example:
4
2 6
1 3 5 7
Following is the distribution of nodes across vertical lines
Vertical Line 1 - 1
Vertical Line 2 - 2
Vertical Line 3 - 3,4,5
Vertical Line 4 - 6
Vertical Line 5 - 7
How did we decide that 3,4,5 are part of vertical line 3? We need to find the horizontal distance of nodes from the root to decide whether they belong to the same line or not. We start with the root, which has a horizontal distance of zero. If we move left we decrement the distance of the parent by 1, and if we move right we increment the distance of the parent by 1. The same applies to every node in the tree, i.e. if the parent has a horizontal distance of d, then its left child's distance is d-1 and its right child's distance is d+1.
In this case node 4 has distance 0. Node 2 is the left child of 4, so its distance is -1 (decrement by 1). Node 6 is the right child of 4, so its distance is 1 (increment by 1).
Node 2 has distance -1. Node 3 is the right child of 2, so its distance is 0 (increment by 1).
Similarly for node 5. Nodes 3,4,5 have a horizontal distance of zero, so they fall on the same vertical line.
Now coming to your code
while(temp->left){
    --l;temp=temp->left;
}
In this loop, you are computing the distance of the farthest node from the root on the left-hand side (this doesn't work all the time, we will discuss that later). Every time you move left, you decrement the value of l by 1.
while(temp->right){
    ++r;temp=temp->right;
}
With logic similar to the above, you are computing the distance of the farthest node from the root on the right-hand side.
Now that you know the farthest distances on the left-hand and right-hand sides, you display the nodes vertically:
for(int i=l;i<=r;i++)
{
    cout<<endl<<"VERTICAL LINE "<<i-l+1<<" : ";
    vertical_line(0,i,node);
}
Every iteration of the above loop displays the nodes on one vertical line. (This is not efficient.) You are calling the vertical_line method for every line:
void vertical_line(int base,int hd,struct node * node)
{
    if(!node) return;
    vertical_line(base-1,hd,node->left);
    if(base==hd) cout<<node->data<<" ";
    vertical_line(base+1,hd,node->right);
}
The above method prints the nodes that fall on line hd. It walks the entire tree, computing the horizontal distance of every node in base. If a node is part of vertical line hd, then base == hd, which is when your code prints the value of the node.
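Since the quoted code only prints the nodes on each line and re-walks the whole tree once per line, here is a minimal sketch of an alternative that actually accumulates the sums in a single traversal. It is not the code from the question: the node struct below is an assumed layout (the question does not show its definition), and collectVerticalSums is a name invented for this example.

```cpp
#include <map>

// Assumed node layout; the question's own struct definition is not shown.
struct node {
    int data;
    node* left;
    node* right;
};

// Adds each node's value to the running sum of its vertical line. hd is the
// horizontal distance of the current node from the root: -1 per step left,
// +1 per step right. The map keeps the lines ordered from left to right.
void collectVerticalSums(const node* n, int hd, std::map<int, int>& sums) {
    if (!n) return;
    sums[hd] += n->data;
    collectVerticalSums(n->left, hd - 1, sums);
    collectVerticalSums(n->right, hd + 1, sums);
}
```

Calling collectVerticalSums(root, 0, sums) on the example tree above should leave five entries in sums, with the entry for distance 0 holding 3 + 4 + 5 = 12. Because every node is visited, this does not depend on following only the leftmost and rightmost paths to find the range of lines.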

Fastest way to find median in dynamically growing range

Can anyone suggest any methods or link to implementations of fast median finding for dynamic ranges in C++? For example, suppose that on each iteration of my program the range grows, and I want to find the median at each run.
Range
4
3,4
8,3,4
2,8,3,4
7,2,8,3,4
So the above would ultimately produce 5 median values, one for each line.

The best you can get without also keeping track of a sorted copy of your array is re-using the old median and updating this with a linear-time search of the next-biggest value. This might sound simple, however, there is a problem we have to solve.
Consider the following list (sorted for easier understanding, but you keep them in an arbitrary order):
1, 2, 3, 3, 3, 4, 5
// *
So here, the median is 3 (the middle element since the list is sorted). Now if you add a number which is greater than the median, this potentially "moves" the median to the right by one half index. I see two problems: How can we advance by one half index? (Per definition, the median is the mean value of the next two values.) And how do we know at which 3 the median was, when we only know the median was 3?
This can be solved by storing not only the current median but also the position of the median within the numbers of same value, here it has an "index offset" of 1, since it's the second 3. Adding a number greater than or equal to 3 to the list changes the index offset to 1.5. Adding a number less than 3 changes it to 0.5.
When this number becomes less than zero, the median changes. It also has to change if it goes beyond the count of equal numbers (minus 1), in this case 2, meaning the new median is more than the last equal number. In both cases, you have to search for the next smaller / next greater number and update the median value. To always know what the upper limit for the index offset is (in this case 2), you also have to keep track of the count of equal numbers.
This should give you a rough idea of how to implement median updating in linear time.

I think you can use a min-max-median heap. Each time the array is updated, you just need log(n) time to find the new median value. For a min-max-median heap, the root is the median value, the left tree is a min-max heap, while the right side is a max-min heap. Please refer to the paper "Min-Max Heaps and Generalized Priority Queues" for the details.
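A simpler structure with the same logarithmic update cost, and a different technique from the min-max-median heap described in that paper, is a pair of ordinary heaps: a max-heap for the smaller half of the values and a min-heap for the larger half. The following is a minimal sketch of that two-heap approach in C++; the class name RunningMedian and its methods are invented for this example.

```cpp
#include <functional>
#include <queue>
#include <vector>

class RunningMedian {
public:
    // Inserts a value in O(log n) and rebalances the two halves.
    void add(int value) {
        if (lower_.empty() || value <= lower_.top()) lower_.push(value);
        else upper_.push(value);
        // Keep lower_ the same size as upper_, or exactly one larger.
        if (lower_.size() > upper_.size() + 1) {
            upper_.push(lower_.top());
            lower_.pop();
        } else if (upper_.size() > lower_.size()) {
            lower_.push(upper_.top());
            upper_.pop();
        }
    }

    // Precondition: at least one value has been added.
    double median() const {
        if (lower_.size() > upper_.size()) return lower_.top();
        return (lower_.top() + upper_.top()) / 2.0;
    }

private:
    std::priority_queue<int> lower_;  // max-heap holding the smaller half
    std::priority_queue<int, std::vector<int>, std::greater<int>> upper_;  // min-heap holding the larger half
};
```

Feeding it the values from the question in arrival order (4, 3, 8, 2, 7) and calling median() after each add should give 4, 3.5, 4, 3.5 and 4.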

Find some code below; I have reworked it to give the output you need.
private void button1_Click(object sender, EventArgs e)
{
    string range = "7,2,8,3,4";
    decimal median = FindMedian(range);
    MessageBox.Show(median.ToString());
}
public decimal FindMedian(string source)
{
    // Create a copy of the input, and sort the copy
    int[] temp = source.Split(',').Select(m=> Convert.ToInt32(m)).ToArray();
    Array.Sort(temp);
    int count = temp.Length;
    if (count == 0) {
        throw new InvalidOperationException("Empty collection");
    }
    else if (count % 2 == 0) {
        // count is even, average two middle elements
        int a = temp[count / 2 - 1];
        int b = temp[count / 2];
        return (a + b) / 2m;
    }
    else {
        // count is odd, return the middle element
        return temp[count / 2];
    }
}
