Balancing KD Tree - C++

So when balancing a KD tree you're supposed to find the median and then put all the elements that are less on the left subtree and those greater on the right. But what happens if you have multiple elements with the same value as the median? Do they go in the left subtree, the right, or do you discard them?
I ask because I've tried several approaches and each affects the results of my nearest-neighbor search. There are also cases where all the elements in a given section of the tree have exactly the same value, and I don't know how to split them up in that case.

It does not really matter where you put them, as long as you keep your tree balanced: place as many on the left as needed to keep the optimal balance!
If your current search radius touches the median, you will have to check the other side anyway, and that is all the handling tied objects need. This is usually cheaper than some complex scheme for attaching multiple elements to one node.
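As a minimal sketch of this idea, assuming 2-D points: splitting by array position rather than by value keeps the halves balanced no matter how many keys tie with the median, because std::nth_element places the median element at the middle index and equal keys simply fall on whichever side the index puts them.

    #include <algorithm>
    #include <vector>

    struct Point { double coord[2]; };

    // Build an implicit kd-tree in place: pts[mid] becomes the splitting
    // node, and the two index halves are always balanced, ties or not.
    void buildKdTree(std::vector<Point>& pts, int begin, int end, int axis) {
        if (end - begin <= 1) return;
        int mid = begin + (end - begin) / 2;
        std::nth_element(pts.begin() + begin, pts.begin() + mid, pts.begin() + end,
                         [axis](const Point& a, const Point& b) {
                             return a.coord[axis] < b.coord[axis];
                         });
        buildKdTree(pts, begin, mid, (axis + 1) % 2);
        buildKdTree(pts, mid + 1, end, (axis + 1) % 2);
    }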

When doing a search-style algorithm, it is often a good idea to allow elements equal to your median on both sides of the median.
One method is to put median-equaling elements on the same side they were on before you did your partition. Another method is to put the first one on the left, the second one on the right, and so on.
Another solution is to have a "clumping" data structure that just counts things that are equal instead of storing each one individually. (If they have extra state, you can store that extra state instead of just a count.)
I don't know which is appropriate in your situation.
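For illustration, the clumping idea can be as simple as a counted map (the names here are made up for the sketch):

    #include <map>

    // Count duplicates per distinct key instead of storing each element.
    std::map<double, int> clump;

    void add(double key)       { ++clump[key]; }
    void removeOne(double key) { if (--clump[key] == 0) clump.erase(key); }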

That depends on your purpose.
For problems such as exact matching or range search, allowing repetitions of the same value on both sides will complicate the query, and repeating the same value in both subtrees adds to the time complexity.
One solution is to store all of the medians (the values equal to the value of the median) on the node itself, neither left nor right. Most variants of kd-trees store the medians on the internal nodes. If there happen to be many of them, you may consider keeping a separate (k-1)-d tree just for the medians.
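A node along those lines might look like this (a sketch only; Point and the field names are illustrative):

    #include <vector>

    struct Point { double coord[2]; };

    // Every element equal to the median lives on the node itself, so
    // neither subtree ever repeats the splitting value.
    struct KdNode {
        int axis;                   // splitting dimension
        double splitValue;          // median on that axis
        std::vector<Point> ties;    // all elements equal to the median
        KdNode* left = nullptr;     // strictly less than splitValue
        KdNode* right = nullptr;    // strictly greater than splitValue
    };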

Related

Obtain lowest sum from matrix, picking one value per row and column

Essentially I have a matrix of floats ranging from 0 to 1, and I need to find the combination of values with the lowest sum. The kicker is that once a value is selected, no other values from that row or column may be used. All the columns must be used.
In case the matrix's width is greater than its height, it will be padded with 1's to make the matrix square. In case the height is greater than the width, simply not all the rows will be used, but all of the columns must ALWAYS be used.
I have looked into binary trees and Dijkstra's algorithm for this task, but both seem to get far too complex with larger matrices. Ideally I'm looking for an algorithm or implementation which will provide a very good guess in a relatively short amount of time. Anything optimized for C++ would be great!
I think a greedy approach should work here for the good-guess/optimized part.
Put all the elements in an array as tuples <value, row, column>.
Sort the array by the <value> component of the tuple.
Greedily pick elements from the beginning, keeping track of the used columns/rows with either a bitset or a boolean matrix, as suggested by Thomas Mathews.
The total complexity will be O(NM log(NM)), where N is the number of rows and M the number of columns.
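A minimal sketch of that greedy pass (function and variable names are mine, not from the question):

    #include <algorithm>
    #include <tuple>
    #include <utility>
    #include <vector>

    // Sort all cells by value, then take the cheapest cell whose row and
    // column are both still free. Fast, but not guaranteed optimal.
    std::vector<std::pair<int, int>>
    greedyAssign(const std::vector<std::vector<double>>& m) {
        int rows = m.size(), cols = m[0].size();
        std::vector<std::tuple<double, int, int>> cells;
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c)
                cells.emplace_back(m[r][c], r, c);
        std::sort(cells.begin(), cells.end());

        std::vector<bool> rowUsed(rows), colUsed(cols);
        std::vector<std::pair<int, int>> picks;
        for (const auto& [v, r, c] : cells)
            if (!rowUsed[r] && !colUsed[c]) {
                rowUsed[r] = colUsed[c] = true;
                picks.emplace_back(r, c);
            }
        return picks;
    }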
Amit's suggestion to change the title actually led me to finding the solution. It is an "Assignment Problem" and the solution is to use the Hungarian algorithm. I knew there had to be something out there already, I just wasn't sure how to phrase it to find the answer until now. Thanks for all the help.
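For reference, here is a sketch of the standard Hungarian method (Kuhn-Munkres with row/column potentials), adapted for a cost matrix of doubles; this is the textbook formulation, not code from the question:

    #include <limits>
    #include <vector>

    // Min-cost assignment for an n x m cost matrix a (n <= m).
    // Returns match[i] = column assigned to row i. p[j] is the row
    // currently matched to column j; u, v are the dual potentials.
    std::vector<int> hungarian(const std::vector<std::vector<double>>& a) {
        int n = a.size(), m = a[0].size();
        const double INF = std::numeric_limits<double>::infinity();
        std::vector<double> u(n + 1, 0), v(m + 1, 0);
        std::vector<int> p(m + 1, 0), way(m + 1, 0);
        for (int i = 1; i <= n; ++i) {
            p[0] = i;
            int j0 = 0;
            std::vector<double> minv(m + 1, INF);
            std::vector<bool> used(m + 1, false);
            do {                          // grow an alternating tree from row i
                used[j0] = true;
                int i0 = p[j0], j1 = 0;
                double delta = INF;
                for (int j = 1; j <= m; ++j)
                    if (!used[j]) {
                        double cur = a[i0 - 1][j - 1] - u[i0] - v[j];
                        if (cur < minv[j]) { minv[j] = cur; way[j] = j0; }
                        if (minv[j] < delta) { delta = minv[j]; j1 = j; }
                    }
                for (int j = 0; j <= m; ++j)
                    if (used[j]) { u[p[j]] += delta; v[j] -= delta; }
                    else minv[j] -= delta;
                j0 = j1;
            } while (p[j0] != 0);
            do {                          // augment along the found path
                int j1 = way[j0];
                p[j0] = p[j1];
                j0 = j1;
            } while (j0);
        }
        std::vector<int> match(n, -1);
        for (int j = 1; j <= m; ++j)
            if (p[j] > 0) match[p[j] - 1] = j - 1;
        return match;
    }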
You can follow the Dijkstra algorithm for the shortest path, assuming you are constructing a tree. In the root node you select a length of 0, and for each node you select the next accessible element that gives you the shortest path from the root node, storing that length (from the root) in the node. At each iteration you add, over all the leaves, the arc that makes the total length smallest, and continue until you get a path of N nodes (or a bitmask of 0, see below). The first branch of N nodes from the root will be the shortest path. At each node you can store a bitmap of the already-visited nodes (or you can determine it by looking at the parents), since the only nodes reachable from it are the unvisited ones. Alternatively, keep a bitmap of the non-visited nodes; this makes the search easier, as you can stop as soon as no bits are set in the mask.
You have not shown any code or any attempt to solve the problem, so I'll do the same (it seems to be some kind of homework, and you seem not to have worked on it at all so far). This is an academic problem, covered in many programming courses in relation to Simplex and operations research, in object/resource assignment, so there must be plenty of literature about it.

Hard sorting problem - what type of algorithm should I be using?

The problem:
N nodes are related to each other by a 'closeness' factor ranging from 0 to 1, where a factor of 1 means that the two nodes have nothing in common and 0 means the two nodes are exactly alike.
If two nodes are both close to another node (i.e. they have a factor close to 0) then this doesn't mean that they will be close together, although probabilistically they do have a much higher chance of being close together.
The question:
If another node is placed in the set, find the node that it is closest to in the shortest possible amount of time.
This isn't a homework question; it's a real-world problem that I need to solve - but I've never taken any algorithm courses etc., so I don't have a clue what sort of algorithm I should be researching.
I can index all of the nodes before another one is added and gather closeness data between each node, but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution. Any ideas or help would be much appreciated :)
Because your 'closeness' metric obeys the triangle inequality, you should be able to use a variant of BK-trees to organize your elements. Adapting them to real numbers should simply be a matter of choosing an interval on which to quantize your numbers, and otherwise using the standard BK-tree procedure. Some experimentation may be required - you might want to increase the resolution of the quantization as you progress down the tree, for instance.
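A minimal sketch of that idea, with the metric quantized into fixed buckets; the item type, the stand-in metric, and all names here are illustrative, not from the question:

    #include <algorithm>
    #include <cmath>
    #include <map>
    #include <memory>
    #include <vector>

    const int BUCKETS = 100;   // resolution of the distance quantization

    // Stand-in metric on doubles; any metric obeying the triangle
    // inequality with values in [0, 1] would do.
    double closeness(double a, double b) { return std::min(1.0, std::fabs(a - b)); }

    int quantize(double d) { return static_cast<int>(std::floor(d * BUCKETS)); }

    struct Node {
        double item;
        std::map<int, std::unique_ptr<Node>> children;  // keyed by quantized distance
    };

    void insert(std::unique_ptr<Node>& root, double item) {
        if (!root) { root = std::make_unique<Node>(); root->item = item; return; }
        insert(root->children[quantize(closeness(root->item, item))], item);
    }

    // Collect all items within radius r of query, pruning subtrees whose
    // quantized edge cannot satisfy the triangle inequality.
    void search(const Node* node, double query, double r, std::vector<double>& out) {
        if (!node) return;
        double d = closeness(node->item, query);
        if (d <= r) out.push_back(node->item);
        int lo = quantize(std::max(0.0, d - r)), hi = quantize(std::min(1.0, d + r));
        for (const auto& [k, child] : node->children)
            if (k >= lo && k <= hi) search(child.get(), query, r, out);
    }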
"but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution"
Without any other information about the relationships between nodes, this is the only way you can do it, since you have to figure out the closeness factor between the new node and each existing node. An O(n) algorithm can be a perfectly decent solution.
One addition you might consider - keep in mind we have no idea what data structure you are using for your objects - is to organize all present nodes into a graph, where nodes with factors below a certain threshold can be considered connected, so you can first check nodes that are more likely to be similar/related.
If you want the optimal algorithm in terms of speed, but O(n^2) space, then for each node create a sorted list of other nodes (ordered by closeness).
When you get a new node, it has to be added to every other node's sorted list, and all the other nodes need to be added to its list.
To find the node closest to a given node, just take the first node on that node's list.
Since you already need O(n^2) space (in order to store all the closeness information you need basically an NxN matrix where A[i,j] represents the closeness between i and j) you might as well sort it and get O(1) retrieval.
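A sketch of that precomputation, given the NxN closeness matrix (the names are illustrative):

    #include <algorithm>
    #include <vector>

    // For each node i, sort every other node by closeness to i, so the
    // nearest neighbour of i is simply nb[i][0]. O(n^2 log n) build,
    // O(n^2) space, O(1) lookup.
    std::vector<std::vector<int>>
    buildNeighbourLists(const std::vector<std::vector<double>>& c) {
        int n = c.size();
        std::vector<std::vector<int>> nb(n);
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j)
                if (j != i) nb[i].push_back(j);
            std::sort(nb[i].begin(), nb[i].end(),
                      [&](int a, int b) { return c[i][a] < c[i][b]; });
        }
        return nb;
    }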
If this closeness forms a linear spectrum (such that closeness to something implies closeness to other things that are close to it, and not being close implies not being close to those close to it), then you can simply do a binary or interpolation search on insertion for closeness, handling one extra complexity: at each point you have to see whether closeness increases or decreases below or above.
For example, if we consider letters - A is close to B but far from Z - then the pre-existing elements can be kept sorted, say: A, B, E, G, K, M, Q, Z. To insert, say, 'F', you start by comparing with the middle element, [3] G, and the one following it, [4] K. You establish that F is closer to G than to K, so the best match is either at G or to its left, and you move halfway into the unexplored region to the left: 3/2 = [1] B, followed by [2] E, and you find E is closer to F, so the match is either at E or to its right. Halving the space between the earlier checks at [3] and [1], you test at [2], find it equally distant, and insert in between.
EDIT: it may work better in probabilistic situations, and require fewer comparisons, to start at the ends of the spectrum and work your way in (e.g. compare F to A and Z, decide it's closer to A, see whether A or the halfway point [3] G is closer). Also, it might be good to finish with a comparison to the closest few points on either side of where the binary/interpolation search led you.
ACM Surveys September 2001 carried two papers that might be relevant, at least for background. "Searching in Metric Spaces", lead author Chavez, and "Searching in High Dimensional Spaces - Index Structures for Improving the Performance of Multimedia Databases", lead author Bohm. From memory, if all you have is the triangle inequality, you can use it to some effect, but if you can trim your data down to a sensible number of dimensions, you can do better by using a search structure that knows about this dimensional structure.
Facebook has this thing where it puts you and all of your friends in a graph, then slowly moves everyone around until people are grouped together based on mutual friends and so on.
It looked to me like they just made anything < 0.5 an attractive force and anything > 0.5 a repulsive force, and moved people with every iteration based on the net force. After a couple hundred iterations, it was looking pretty darn good.
Note: this is not an algorithm, it is a heuristic. In the Facebook implementation I saw, two people were not able to reach equilibrium and kept dancing around each other; it turned out they were actually the same person with two different accounts.
Also, it took about 15 minutes on a decent computer for ~100 nodes. YMMV.
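A rough sketch of that heuristic in 2-D; the step size, iteration count, and the 0.5 threshold are guesses taken from the description above, not from any actual implementation:

    #include <cmath>
    #include <vector>

    struct Vec { double x = 0, y = 0; };

    // Pairs with closeness < 0.5 attract, pairs > 0.5 repel; every node
    // moves a little along its net force each iteration.
    void layout(const std::vector<std::vector<double>>& c,
                std::vector<Vec>& pos, int iterations = 200, double step = 0.01) {
        int n = pos.size();
        for (int it = 0; it < iterations; ++it) {
            std::vector<Vec> force(n);
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < n; ++j) {
                    if (i == j) continue;
                    double dx = pos[j].x - pos[i].x, dy = pos[j].y - pos[i].y;
                    double dist = std::sqrt(dx * dx + dy * dy) + 1e-9;
                    double sign = (c[i][j] < 0.5) ? 1.0 : -1.0;  // attract vs. repel
                    force[i].x += sign * dx / dist;
                    force[i].y += sign * dy / dist;
                }
            for (int i = 0; i < n; ++i) {
                pos[i].x += step * force[i].x;
                pos[i].y += step * force[i].y;
            }
        }
    }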
It looks suspiciously like a Nearest Neighbor Search problem (also called a similarity search)

Binary search through quad-tree

I have a quad tree whose leaf nodes represent pixels. There is a method that prunes the quad tree, and another method that calculates the number of leaves that would remain if the tree were pruned. The prune method accepts an integer tolerance which is used as a limit on the difference between nodes when deciding whether to prune. Anyway, I want to write a function that takes one argument, leavesLeft. It should calculate the minimum tolerance necessary to ensure that, upon pruning, no more than leavesLeft leaves remain in the tree. The hint is to use binary search recursively to do this. My question is that I am unable to make the connection between binary search and this function I need to write; I am not sure how it would be implemented. I know that the maximum tolerance allowable is 256*256*3 = 196,608, but apart from that I don't know how to get started. Can anyone guide me in the right direction?
You want to look up Nick's spatial index quadtree and Hilbert curve.
1. Write a method that just tries all possible tolerance values and checks whether each would leave exactly enough nodes.
2. Write a test case and see if it works.
3. Don't do step 1. Use a binary search over all possible tolerance values to do the same as step 1, but quicker.
If you don't know how to implement a binary search, it's better to try it on a simple integer array first. Anyway, if you do step 1, just store the number of leaves left in an array (with the tolerance as index), and then run a binary search on that. To turn this into step 3, notice that you don't need the entire array: simply replace the array by a function that calculates its values on demand and you're done.
Say you plugged in tolerance = 0. Then you'd get an extreme answer like zero leaves left or all the leaves left (not sure how it works from your question). Say you plug in tolerance = 196,608. You'd get an answer at the other extreme. You know the answer you're looking for is somewhere in between.
So you plug in a tolerance number halfway between 0 and 196,608: a tolerance of 98,304. If the number of leaves left is too high, then you know the correct tolerance is somewhere between 0 and 98,304; if it's too low, the correct tolerance is somewhere between 98,304 and 196,608. (Or maybe the high/low parts are reversed; I'm not sure from your question.)
That's binary search. You keep cutting the range of possible values in half by checking the one in the middle. Eventually you narrow it down to the correct tolerance. Of course you'll need to look up binary search in order to implement it correctly.
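A sketch of that recursion; leavesAt stands in for the existing method that reports how many leaves would survive pruning at a given tolerance (assumed non-increasing as tolerance grows), and the function name is made up:

    #include <functional>

    // Smallest tolerance t in [lo, hi] with leavesAt(t) <= leavesLeft.
    int minTolerance(const std::function<int(int)>& leavesAt, int leavesLeft,
                     int lo = 0, int hi = 196608) {
        if (lo >= hi) return lo;          // range narrowed to a single value
        int mid = lo + (hi - lo) / 2;
        if (leavesAt(mid) <= leavesLeft)
            return minTolerance(leavesAt, leavesLeft, lo, mid);      // mid works; try smaller
        else
            return minTolerance(leavesAt, leavesLeft, mid + 1, hi);  // mid prunes too little
    }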

Choosing N random numbers from a set

I have a sorted set (std::set to be precise) that contains elements with an assigned weight. I want to randomly choose N elements from this set, while the elements with higher weight should have a bigger probability of being chosen. Any element can be chosen multiple times.
I want to do this as efficiently as possible - I want to avoid any copying of the set (it might get very large) and run in O(N) time if possible. I'm using C++ and would like to stick to an STL + Boost-only solution.
Does anybody know if there is a function in STL/Boost that performs this task? If not, how to implement one?
You need to calculate (and possibly cache, if you care about performance) the sum of all weights in your set. Then generate N random numbers ranging up to this value. Finally, iterate over your set, keeping a running sum of the weights you have encountered so far. Inspect all the (remaining) random numbers; if a number falls between the previous and the next value of the running sum, output the current element of the set and remove that random number. Stop when your list of random numbers is empty or you've reached the end of the set.
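A sketch of that single pass, sorting the draws first so one sweep over the set matches them all; Element and its fields are illustrative:

    #include <algorithm>
    #include <random>
    #include <set>
    #include <vector>

    struct Element {
        int id;
        double weight;
        bool operator<(const Element& o) const { return id < o.id; }
    };

    // O(|set| + N log N): one pass over the set, draws pre-sorted.
    std::vector<Element> pickWeighted(const std::set<Element>& s, int n) {
        double total = 0;
        for (const auto& e : s) total += e.weight;

        std::mt19937 gen{std::random_device{}()};
        std::uniform_real_distribution<double> dist(0.0, total);
        std::vector<double> draws(n);
        for (auto& d : draws) d = dist(gen);
        std::sort(draws.begin(), draws.end());

        std::vector<Element> picked;
        double running = 0;
        std::size_t k = 0;
        for (const auto& e : s) {
            running += e.weight;
            while (k < draws.size() && draws[k] <= running) {
                picked.push_back(e);   // an element can be picked repeatedly
                ++k;
            }
        }
        return picked;
    }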
I don't know about any libraries, but it sounds like you have a weighted roulette wheel. Here's a reference with some pseudo-code, although the context is related to genetic algorithms: http://www.cse.unr.edu/~banerjee/selection.htm
As for "as efficiently as possible," that would depend on some characteristics of the data. In the application of the weighted roulette wheel, when searching for the index you could consider a binary search instead. However, it is not the case that each slot of the roulette wheel is equally likely, so it may make sense to examine them in order of their weights.
A lot depends on the amount of extra storage you're willing to expend to make the selection faster.
If you're not willing to use any extra storage, Alex Emelianov's answer is pretty much what I was thinking of posting. If you're willing to use some extra storage (and possibly a different data structure than std::set), you could create a tree (like the one a set uses), but at each node also store the total weight of the items to the left of that node. This lets you map from a generated number to the correct associated value with logarithmic (rather than linear) complexity.
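As an array-based stand-in for that augmented-tree idea (a real augmented tree would additionally support O(log n) inserts and weight updates), the same O(log n) lookup can be had from a one-time prefix sum; all names here are mine:

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // Store the running total of weights once; each draw is then a
    // single binary search over the cumulative sums.
    struct WeightedPicker {
        std::vector<double> cumulative;   // cumulative[i] = sum of weights [0..i]
        std::mt19937 gen{std::random_device{}()};

        explicit WeightedPicker(const std::vector<double>& weights) {
            cumulative.resize(weights.size());
            std::partial_sum(weights.begin(), weights.end(), cumulative.begin());
        }

        std::size_t pick() {   // returns an index, distributed by weight
            std::uniform_real_distribution<double> d(0.0, cumulative.back());
            return std::upper_bound(cumulative.begin(), cumulative.end(), d(gen))
                   - cumulative.begin();
        }
    };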

Binary tree number of nodes with a given level

I need to write a program that counts the number of nodes at a certain level in a binary tree.
I mean something like numberofnodes(int level){}.
I tried writing it without any success, because I don't know how to get to a certain level and then count the nodes there.
Do it with a recursive function which only descends to a certain level.
Well, there are many ways you can do this. Best is to have a single-dimensional array that keeps track of the number of nodes you add/remove at each level. Considering your requirement, that would be the simplest way.
However, if you are provided with just a binary tree, you WILL have to traverse down that many levels and count the nodes; I do not see any other alternative.
To go to a certain level, you will typically need a variable called 'current_depth' that tracks the level you are on. Once you reach the level of interest, you increment your count each time a node there is visited (typically an in-order traversal would suffice). Hope this helped.
I'm assuming that your binary tree is not necessarily complete (i.e., not every node has two or zero children or this becomes trivial). I'm also assuming that you are supposed to count just the nodes at a certain level, not at that level or deeper.
There are many ways to do what you're asked, but you can think of this as a graph search problem - you are given a starting node (the root of the tree), a way to traverse edges (the child links), and a criteria - a certain distance from the root.
At this point you probably learned graph search algorithms - which algorithm sounds like a natural fit for your task?
In general terms:
Recursion.
In each step of the recursion, you need some way to measure what level you are on, and therefore how far down the tree you still need to go from where you are now.
Recursion parts:
What are your base cases? Under what conditions do you say "Okay, time to stop the recursion"?
How can you count something in a recursion without any global count?
I think the easiest is to simply follow the recursive nature of the tree with an accumulator to track the current level (or maybe the number of levels remaining to be descended).
The non-recursive branch of that function is when you hit the level in question. At that point you simply return 1 (because you found one node at that level).
The recursive branch simply sums the number of nodes at that level returned by the left and right recursive invocations.
In code (this assumes the root has level 0):

    struct Node { Node* left; Node* right; };

    int count(Node* x, int currentLevel, int desiredLevel) {
        if (currentLevel == desiredLevel)
            return 1;            // found one node at the desired level
        int left = 0, right = 0;
        if (x->left != nullptr)
            left = count(x->left, currentLevel + 1, desiredLevel);
        if (x->right != nullptr)
            right = count(x->right, currentLevel + 1, desiredLevel);
        return left + right;
    }
So to get the number of nodes at level 3, you call:
count(root, 0, 3)