I have a quad tree where the leaf nodes represent pixels. There is a method that prunes the quad tree, and another method that calculates the number of leaves that would remain if the tree were pruned. The prune method accepts an integer tolerance, which is used as a limit on the difference between nodes when deciding whether to prune. Anyway, I want to write a function that takes one argument, leavesLeft. It should calculate the minimum tolerance necessary to ensure that, upon pruning, no more than leavesLeft leaves remain in the tree. The hint is to use binary search recursively to do this. My problem is that I can't make the connection between binary search and the function I need to write; I'm not sure how it would be implemented. I know that the maximum allowable tolerance is 256*256*3 = 196,608, but apart from that I don't know how to get started. Can anyone guide me in the right direction?
You want to look up Nick's spatial index article on quadtrees and Hilbert curves.
1. Write a method that just tries all possible tolerance values and checks whether each would leave few enough leaves.
2. Write a test case and see if it works.
3. Don't actually do step 1. Use a binary search over all possible tolerance values to do the same as step 1, but quicker.
If you don't know how to implement a binary search, try it on a simple integer array first. Anyway, if you do step 1, just store the number of leaves left in an array (with the tolerance as the index), and then run a binary search on that array. To turn this into step 3, notice that you don't need the entire array: simply replace the array with a function that calculates the values on demand, and you're done.
Say you plugged in tolerance = 0. Then you'd get an extreme answer like zero leaves left or all the leaves left (not sure how it works from your question). Say you plug in tolerance = 196,608. You'd get an answer at the other extreme. You know the answer you're looking for is somewhere in between.
So you plug in a tolerance number halfway between 0 and 196,608: a tolerance of 98,304. If the number of leaves left is too high, then you know the correct tolerance is somewhere between 0 and 98,304; if it's too low, the correct tolerance is somewhere between 98,304 and 196,608. (Or maybe the high/low parts are reversed; I'm not sure from your question.)
That's binary search. You keep cutting the range of possible values in half by checking the one in the middle. Eventually you narrow it down to the correct tolerance. Of course you'll need to look up binary search in order to implement it correctly.
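To make that concrete, here is a minimal recursive sketch in C++. countLeavesAfterPrune is a hypothetical name standing in for your tree's "leaves remaining after pruning at this tolerance" method, and the sketch assumes that raising the tolerance never increases the leaf count (which is what makes the halving valid):

```cpp
// Hypothetical stand-in for the tree's "how many leaves would remain
// if pruned at this tolerance" method.
int countLeavesAfterPrune(int tolerance);

// Smallest tolerance in [lo, hi] that leaves at most leavesLeft leaves.
int minTolerance(int lo, int hi, int leavesLeft) {
    if (lo >= hi)
        return lo;                                 // range narrowed to one value
    int mid = lo + (hi - lo) / 2;
    if (countLeavesAfterPrune(mid) <= leavesLeft)
        return minTolerance(lo, mid, leavesLeft);  // mid works; try smaller
    return minTolerance(mid + 1, hi, leavesLeft);  // too many leaves; go higher
}

// Initial call, using the maximum tolerance from the question:
// minTolerance(0, 196608, leavesLeft);
```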
I'm looking for an algorithm similar to binary search that works with data structures that are circular in nature, like a circular buffer, for example. I'm working on a problem which is quite complicated, but I was able to strip it down, so it's easier to describe (and, I hope, easier to find a solution for).
Let's say we have an array of numbers with both its ends connected, and a view window which can move forward and backward and can read a value from the array (something like a C++ iterator that can go forward and backward). One of the values in the array is zero, which is the "sweet point" we want to find.
What we know about the values in the array is:
they are sorted, which means that when we move our window forward the numbers grow (and vice versa),
they are not evenly spaced: if, for example, we read "16", it doesn't mean that going 16 elements backward will reach zero,
last but not least: there is a point in the array up to which the values are positive, but after which they are "flipped over" and start at a negative value (as if we kept adding ones to an integer variable until the counter wrapped around).
That last property is where my first attempt at this problem with binary search fails. Also, if I may add, reading a value is an expensive operation, so the less often it is done, the better.
PS: I'm looking for C++ code, but if you know C#, Java, JavaScript or Python and you'd like to write the algorithm in one of those languages, that's no problem :).
If I understand correctly, you have an array with random access (if only sequential access is allowed, the problem is trivial; that "window" concept does not seem relevant), holding a sequence of positive then negative numbers with a zero in between, but with this sequence rotated arbitrarily. (Seeing the array as a ring buffer just obscures the reasoning.)
Hence you have three sections, with signs +-+ or -+-, and by looking at the extreme elements, you can tell which of the two patterns holds.
Now the bad news: no dichotomic search can work, because whatever the order in which you sample the array, you can keep hitting elements of the same sign until the very end (in the extreme case of a single element of the opposite sign).
This contrasts with the standard dichotomic case, which corresponds to a +- or -+ pattern: there, hitting two elements of the same sign lets you discard the whole section in between.
If the positive and negative subsequences are known to have length at least M, then by sampling every M/2-th element you are certain to find a change of sign, and you can start two dichotomies from there.
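A small sketch of that sampling remedy, assuming the minimum run length M is known in advance; sign() is a hypothetical accessor returning the sign of element i:

```cpp
#include <cstddef>
#include <utility>

// Hypothetical accessor: returns +1 or -1 for the sign of element i.
int sign(std::size_t i);

// Scan in strides of M/2 so that no run of length >= M can be stepped
// over; returns two sampled indexes between which the sign changes.
// A dichotomy can then be run inside that bracket (and a second one
// for the other boundary).
std::pair<std::size_t, std::size_t> bracketSignChange(std::size_t n, std::size_t M) {
    std::size_t prev = 0;
    int s = sign(prev);
    for (std::size_t i = M / 2; i < n; i += M / 2) {
        if (sign(i) != s)
            return {prev, i};   // a boundary lies somewhere in (prev, i)
        prev = i;
    }
    return {prev, 0};           // the boundary wraps past the last sample
}
```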
You can solve your problem using a galloping (exponential) search.
For simplicity I assume there are no duplicate items.
Start from the back and progress to the left, in the direction of smaller values. You begin with a jump of one index to the left, and each subsequent jump is exponentially bigger. With each jump to the left you should find a smaller value. If you encounter a greater value, that means zero is somewhere between the last two visited indexes. The only case in which you will never encounter a greater value is when the zero is exactly at the beginning of the array.
After the jump from index i to i-j that jumped over zero, you have a range in which zero resides. Since the jump went too far, try jumping from i to i-j/2. If that's still too far (overjumped zero), try i-j/4, and so on; this time each attempted jump is exponentially smaller, and with each step you halve the possible range in which zero resides. On the other hand, if i-j is too far but i-j/2 is too near (hasn't reached zero yet), you try i-j/2-j/4. I hope you get the idea now.
This has O(lg n) complexity.
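Here is a rough C++ sketch of that idea. read() is a hypothetical stand-in for the question's expensive window accessor, the sketch assumes no duplicates, and for brevity it returns the first index past the jump-over point rather than dealing with the exact value there:

```cpp
#include <cstddef>

// Hypothetical expensive accessor standing in for the question's window.
long read(std::size_t i);

std::size_t findSweetPoint(std::size_t n) {
    std::size_t i = n - 1;                       // start at the back
    long cur = read(i);
    std::size_t jump = 1;                        // doubled after every step taken
    bool bracketed = false;
    while (i > 0) {
        std::size_t j = (jump <= i) ? jump : i;  // clamp the final jump to index 0
        long next = read(i - j);
        if (next > cur) {                        // greater value: we jumped over it
            jump = j;
            bracketed = true;
            break;
        }
        i -= j;
        cur = next;
        jump *= 2;
    }
    if (!bracketed)
        return 0;                                // reached the front still descending
    std::size_t lo = i - jump + 1, hi = i;       // the target lies in [lo, hi]
    while (lo < hi) {                            // now halve the bracket, as described
        std::size_t mid = lo + (hi - lo) / 2;
        if (read(mid) <= cur)
            hi = mid;                            // mid is already past the point
        else
            lo = mid + 1;                        // mid is still before it
    }
    return lo;
}
```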
Essentially I have a matrix of floats ranging from 0 to 1, and I need to find the combination of values with the lowest sum. The kicker is that once a value is selected, no other values from that row or column may be used. All the columns must be used.
In the case that the matrix's width is greater than its height, it will be padded with 1s to make it square. In the case that the height is greater than the width, simply not all the rows will be used, but all of the columns must ALWAYS be used.
I have looked into binary trees and Dijkstra's algorithm for this task, but both seem to get far too complex with larger matrices. Ideally I'm looking for an algorithm or implementation which will provide a very good guess in a relatively short amount of time. Anything optimized for C++ would be great!
I think a greedy approach should work here for the good-guess/optimized part:
1. Put all the elements in an array as tuples <value, row, column>.
2. Sort the array by the value component of the tuple.
3. Greedily pick elements from the beginning, keeping track of the used columns/rows with either a bitset or a boolean matrix, as suggested by Thomas Mathews.
The total complexity will be O(NM log(NM)), where N is the number of rows and M the number of columns.
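A minimal C++ sketch of that heuristic, assuming the matrix has already been padded square with 1s as the question describes (all names here are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>
#include <vector>

// Greedy approximation of the assignment cost: repeatedly take the
// cheapest cell whose row and column are both still free.
double greedyAssignment(const std::vector<std::vector<float>>& m) {
    const std::size_t n = m.size();
    std::vector<std::tuple<float, std::size_t, std::size_t>> cells;
    cells.reserve(n * n);
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            cells.emplace_back(m[r][c], r, c);
    std::sort(cells.begin(), cells.end());       // ascending by value
    std::vector<bool> rowUsed(n), colUsed(n);
    double sum = 0;
    for (const auto& [v, r, c] : cells) {
        if (rowUsed[r] || colUsed[c])
            continue;                            // row or column already taken
        rowUsed[r] = colUsed[c] = true;
        sum += v;
    }
    return sum;
}
```

Note this is only the good-guess heuristic; as the asker notes below, the exact optimum is an assignment problem, solved by the Hungarian algorithm.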
Amit's suggestion to change the title actually led me to finding the solution. It is an "Assignment Problem" and the solution is to use the Hungarian algorithm. I knew there had to be something out there already, I just wasn't sure how to phrase it to find the answer until now. Thanks for all the help.
You can follow the Dijkstra algorithm for the shortest path, assuming you are constructing a tree. In the root node you select a length of 0, and for each node you select the next accessible element that gives the shortest path from the root node, storing that length (from the root) in the node. At each iteration, across all the leaves, you add the arc that makes the total length smallest, and you continue until you get an N-node path (or a bitmask of 0, see below). The first branch of N nodes from the root will be the shortest path. At each node you can store a bitmap of the already-visited nodes (or you can determine it by looking at the parents), since the possible nodes from it are the unvisited ones only. Alternatively, keep a bitmap of the non-visited ones; this makes the search easier, as you can stop as soon as no bits are set in the mask.
You have not shown any code or any attempt to solve the problem, so I'll do the same (it seems to be some kind of homework, and you don't appear to have worked on it at all so far). This is an academic problem, covered in many programming courses in relation to the simplex method and operations research, in object/resource assignment, so there must be plenty of literature about it.
So when balancing a KD-tree you're supposed to find the median and then put all the elements that are less than the median in the left subtree and those greater in the right. But what happens if you have multiple elements with the same value as the median? Do they go in the left subtree, the right, or do you discard them?
I ask because I've tried multiple approaches and it affects the results of my nearest-neighbor search. There are also cases where all the elements in a given section of the tree have exactly the same value, so I don't know how to split them up in that case.
It does not really matter where you put them, but preferably keep your tree balanced: place as many on the left as needed to maintain the optimal balance!
If your current search radius touches the median, you will have to check the other part anyway; that is all you need in order to handle tied objects on the other side, and it is usually cheaper than some complex scheme for attaching multiple elements anywhere.
When doing a search style algorithm, it is often a good idea to put elements equal to your median on both sides of the median.
One method is to put median-equaling elements on the same side they were on before you did your partition. Another method is to put the first one on the left, the second one on the right, and so on.
Another solution is to have a "clumping" data structure that just counts elements that are equal, instead of storing each one individually. (If they carry extra state, you can store that extra state instead of just a count.)
I don't know which is appropriate in your situation.
That depends on your purpose.
For problems such as exact matching or range search, allowing the same value to appear on both sides complicates the query, and repeating the same value in both leaves adds to the time complexity.
A solution is to store all of the medians (the values equal to the median's value) on the node itself, neither left nor right. Most variants of kd-trees store the medians in the internal nodes anyway. If they happen to be many, you may consider utilizing another (k-1)-d tree just for the medians.
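For illustration, a node layout along those lines might look like the following; the names are hypothetical, and the point is only that duplicates of the split value live on the node itself:

```cpp
#include <vector>

struct Point { float coords[3]; };   // whatever your points look like

// Ties on the split value are stored on the node, so the subtrees hold
// strictly smaller / strictly larger values along this node's axis.
struct KDNode {
    int axis;                  // which coordinate this node splits on
    float median;              // the split value along that axis
    std::vector<Point> ties;   // every point whose coords[axis] == median
    KDNode* left = nullptr;    // strictly less than median on this axis
    KDNode* right = nullptr;   // strictly greater than median on this axis
};
```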
The problem:
N nodes are related to each other by a 'closeness' factor ranging from 0 to 1, where a factor of 1 means that the two nodes have nothing in common and 0 means the two nodes are exactly alike.
If two nodes are both close to another node (i.e. they have a factor close to 0) then this doesn't mean that they will be close together, although probabilistically they do have a much higher chance of being close together.
The question:
If another node is placed in the set, find the node that it is closest to in the shortest possible amount of time.
This isn't a homework question; it's a real-world problem I need to solve, but I've never taken any algorithms courses, so I don't have a clue what sort of algorithm I should be researching.
I can index all of the nodes before another one is added and gather closeness data between each pair of nodes, but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution. Any ideas or help would be much appreciated :)
Because your 'closeness' metric obeys the triangle inequality, you should be able to use a variant of BK-trees to organize your elements. Adapting them to real numbers should simply be a matter of choosing an interval on which to quantize your numbers, and otherwise using the standard BK-tree procedure. Some experimentation may be required; you might want to increase the resolution of the quantization as you progress down the tree, for instance.
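Here is a rough sketch of that quantized BK-tree idea; closeness() is your metric (assumed, per the above, to satisfy the triangle inequality), QUANT is the quantization interval you would have to tune experimentally, and all names are illustrative:

```cpp
#include <map>
#include <vector>

using Item = int;                            // stand-in for your node type
double closeness(const Item&, const Item&);  // hypothetical metric

constexpr double QUANT = 0.05;               // quantization interval (tune it)

struct BKNode {
    Item item;
    std::map<int, BKNode*> children;         // keyed by quantized distance
};

void insert(BKNode*& root, const Item& x) {
    if (!root) { root = new BKNode{x, {}}; return; }
    int key = static_cast<int>(closeness(root->item, x) / QUANT);
    insert(root->children[key], x);          // descend into that distance bucket
}

// Collect every item within radius r of the query.
void search(const BKNode* n, const Item& q, double r, std::vector<Item>& out) {
    if (!n) return;
    double d = closeness(n->item, q);
    if (d <= r) out.push_back(n->item);
    // Triangle inequality: matches can only sit in buckets near d; the
    // extra +/- 1 bucket absorbs the quantization error.
    int lo = static_cast<int>((d - r) / QUANT) - 1;
    int hi = static_cast<int>((d + r) / QUANT) + 1;
    for (const auto& [key, child] : n->children)
        if (key >= lo && key <= hi)
            search(child, q, r, out);
}
```

To find the closest node you would search with a small radius and grow it until a candidate turns up; a production version would also need to free the nodes.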
"but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution"
Without any other information about the relationships between nodes, this is the only way you can do it, since you have to figure out the closeness factor between the new node and each existing node. An O(n) algorithm can be a perfectly decent solution.
One addition you might consider (keeping in mind we have no idea what data structure you are using for your objects) is to organize all present nodes into a graph, where nodes with factors below a certain threshold are considered connected, so that you can first check the nodes that are more likely to be similar/related.
If you want the optimal algorithm in terms of speed, but O(n^2) space, then for each node create a sorted list of other nodes (ordered by closeness).
When you get a new node, you have to add it to the sorted list of every other node, and all the other nodes need to be added to its own list.
To find the closest node to a given node, just take the first entry on that node's list.
Since you already need O(n^2) space in order to store all the closeness information (basically an NxN matrix where A[i,j] represents the closeness between i and j), you might as well keep it sorted and get O(1) retrieval.
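A sketch of that O(n^2)-space index; closeness() is your metric and all names are illustrative:

```cpp
#include <map>
#include <set>
#include <utility>

double closeness(int a, int b);   // hypothetical: 0 = identical, 1 = unrelated

// node id -> (closeness, other node id), kept sorted by closeness
std::map<int, std::set<std::pair<double, int>>> lists;

void addNode(int id) {
    std::set<std::pair<double, int>> mine;
    for (auto& [other, list] : lists) {
        double c = closeness(id, other);
        list.insert({c, id});        // update every existing node's list
        mine.insert({c, other});     // and build the new node's own list
    }
    lists[id] = std::move(mine);     // O(n log n) work per insertion
}

int closestTo(int id) {
    return lists[id].begin()->second;   // first entry = smallest factor
}
```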
If this closeness forms a linear spectrum (such that closeness to something implies closeness to the other things close to it, and not being close implies not being close to those), then you can simply do a binary or interpolation search on insertion for closeness, handling one extra complication: at each point you have to see whether closeness increases or decreases below or above.
For example, if we consider letters (A is close to B but far from Z), the pre-existing elements can be kept sorted, say: A, B, E, G, K, M, Q, Z. To insert, say, 'F', you start by comparing with the middle element, [3] G, and the one following it, [4] K. You establish that F is closer to G than to K, so the best match is either at G or to its left, and you move halfway into the unexplored region to the left: 3/2 = [1] B, followed by [2] E, and you find E is closer to F, so the match is either at E or to its right. Halving the space between the earlier checks at [3] and [1], you test at [2], find it equally distant, and insert F in between.
EDIT: in probabilistic situations it may work better, and require fewer comparisons, to start at the ends of the spectrum and work your way in (e.g. compare F to A and Z, decide it's closer to A, then see whether A or the halfway point [3] G is closer). Also, it might be good to finish with a comparison to the closest few points on either side of where the binary/interpolation search led you.
ACM Computing Surveys, September 2001, carried two papers that might be relevant, at least for background: "Searching in Metric Spaces" (lead author Chávez) and "Searching in High Dimensional Spaces - Index Structures for Improving the Performance of Multimedia Databases" (lead author Böhm). From memory: if all you have is the triangle inequality, you can use it to some effect, but if you can trim your data down to a sensible number of dimensions, you can do better with a search structure that knows about that dimensional structure.
Facebook has this thing where it puts you and all of your friends in a graph, then slowly moves everyone around until people are grouped together based on mutual friends and so on.
It looked to me like they just made anything <0.5 an attractive force, anything >0.5 a repulsive force, and moved people with every iteration based on the net force. After a couple hundred iterations, it was looking pretty darn good.
Note: this is not an algorithm, it is a heuristic. In the Facebook implementation I saw, two people were never able to reach equilibrium and kept dancing around each other. It turned out they were actually the same person with two different accounts.
Also, it took about 15 minutes on a decent computer and ~100 nodes. YMMV.
It looks suspiciously like a Nearest Neighbor Search problem (also called a similarity search)
I have a sorted set (std::set to be precise) that contains elements with an assigned weight. I want to randomly choose N elements from this set, while the elements with higher weight should have a bigger probability of being chosen. Any element can be chosen multiple times.
I want to do this as efficiently as possible: I want to avoid any copying of the set (it might get very large) and run in O(N) time if possible. I'm using C++ and would like to stick to an STL + Boost only solution.
Does anybody know if there is a function in STL/Boost that performs this task? If not, how to implement one?
You need to calculate (and possibly cache, if you care about performance) the sum of all weights in your set. Then generate N random numbers ranging up to this value. Finally, iterate over your set, keeping a running sum of the weights encountered so far, and inspect all the (remaining) random numbers: whenever a number falls between the previous and the current value of the running sum, output the current element and remove that random number. Stop when your list of random numbers is empty or you've reached the end of the set.
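A minimal C++ sketch of that single-pass scheme; the element layout and names are illustrative, and sorting the random numbers first lets the set be scanned exactly once:

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <set>
#include <vector>

struct Item {
    int value;
    double weight;
    bool operator<(const Item& o) const { return value < o.value; }
};

std::vector<Item> pickWeighted(const std::set<Item>& items, std::size_t n) {
    double total = 0;                          // cache this in real code
    for (const auto& it : items) total += it.weight;

    std::mt19937 gen{std::random_device{}()};
    std::uniform_real_distribution<double> dis(0.0, total);
    std::vector<double> targets(n);
    for (auto& t : targets) t = dis(gen);
    std::sort(targets.begin(), targets.end()); // so one pass suffices

    std::vector<Item> picked;
    double running = 0;
    std::size_t k = 0;
    for (const auto& it : items) {
        running += it.weight;
        while (k < n && targets[k] <= running) {   // this element owns target k
            picked.push_back(it);
            ++k;
        }
        if (k == n) break;
    }
    return picked;
}
```

Note the sort makes the whole thing O(N log N + |set|) rather than the hoped-for O(N), but it avoids rescanning the set for every sample.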
I don't know about any libraries, but it sounds like you have a weighted roulette wheel. Here's a reference with some pseudo-code, although the context is related to genetic algorithms: http://www.cse.unr.edu/~banerjee/selection.htm
As for "as efficiently as possible", that would depend on some characteristics of the data. In the weighted roulette wheel, when searching for the index you could consider a binary search instead. However, the slots of the roulette wheel are not all equally likely, so it may make sense to examine them in order of their weights.
A lot depends on the amount of extra storage you're willing to expend to make the selection faster.
If you're not willing to use any extra storage, Alex Emelianov's answer is pretty much what I was thinking of posting. If you're willing to use some extra storage (and possibly a different data structure than std::set), you could create a tree (like the one a set uses), but at each node of the tree also store the (weighted) number of items to the left of that node. This lets you map from a generated number to the correct associated value with logarithmic (rather than linear) complexity.