Hard sorting problem - what type of algorithm should I be using? - c++

The problem:
N nodes are related to each other by a 'closeness' factor ranging from 0 to 1, where a factor of 1 means that the two nodes have nothing in common and 0 means the two nodes are exactly alike.
If two nodes are both close to another node (i.e. they have a factor close to 0) then this doesn't mean that they will be close together, although probabilistically they do have a much higher chance of being close together.
The question:
If another node is placed in the set, find the node that it is closest to in the shortest possible amount of time.
This isn't a homework question, this is a real world problem that I need to solve - but I've never taken any algorithm courses etc so I don't have a clue what sort of algorithm I should be researching.
I can index all of the nodes before another one is added and gather closeness data between each node, but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution. Any ideas or help would be much appreciated :)

If your 'closeness' metric obeys the triangle inequality, you should be able to use a variant of BK-trees to organize your elements. Adapting them to real numbers should simply be a matter of choosing an interval on which to quantize your distances, and otherwise using the standard BK-tree procedure. Some experimentation may be required: you might want to increase the resolution of the quantization as you progress down the tree, for instance.
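As a rough illustration of the quantized BK-tree idea, here is a minimal sketch. The class name QuantizedBkTree, the fixed bucket width `step` and the caller-supplied `dist` function are all assumptions made for the example, and the pruning is only valid if `dist` really is a metric (triangle inequality included).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <limits>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// BK-tree over items of type T, with real-valued distances quantized into
// buckets of width `step`. Pruning is only correct if `dist` is a true metric.
template <typename T>
class QuantizedBkTree {
public:
    using Distance = std::function<double(const T&, const T&)>;

    QuantizedBkTree(Distance dist, double step) : dist_(std::move(dist)), step_(step) {
        assert(step_ > 0.0);
    }

    void insert(const T& item) {
        if (!root_) { root_ = std::make_unique<Node>(item); return; }
        Node* cur = root_.get();
        for (;;) {
            // Descend into the child whose bucket matches the distance to `cur`.
            auto& child = cur->children[bucket_of(dist_(cur->item, item))];
            if (!child) { child = std::make_unique<Node>(item); return; }
            cur = child.get();
        }
    }

    // Returns a pointer to the nearest stored item, or nullptr if the tree is empty.
    const T* nearest(const T& query) const {
        const T* best = nullptr;
        double best_d = std::numeric_limits<double>::infinity();
        std::vector<const Node*> stack;
        if (root_) stack.push_back(root_.get());
        while (!stack.empty()) {
            const Node* n = stack.back();
            stack.pop_back();
            const double d = dist_(n->item, query);
            if (d < best_d) { best_d = d; best = &n->item; }
            // Triangle inequality: anything closer to the query than best_d lies at
            // a distance between d - best_d and d + best_d from this node.
            const long lo = bucket_of(std::max(0.0, d - best_d));
            const long hi = bucket_of(d + best_d);
            for (auto it = n->children.lower_bound(lo);
                 it != n->children.end() && it->first <= hi; ++it) {
                stack.push_back(it->second.get());
            }
        }
        return best;
    }

private:
    struct Node {
        explicit Node(const T& v) : item(v) {}
        T item;
        std::map<long, std::unique_ptr<Node>> children;  // bucket -> subtree
    };

    long bucket_of(double d) const { return static_cast<long>(std::floor(d / step_)); }

    Distance dist_;
    double step_;
    std::unique_ptr<Node> root_;
};
```

A smaller `step` gives more, narrower buckets and therefore sharper pruning at the cost of a bushier tree, which is the trade-off to experiment with.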

"but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution"
Without any other information about the relationships between nodes, this is the only way you can do it, since you have to compute the closeness factor between the new node and each existing node. An O(n) algorithm can be a perfectly decent solution.
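For reference, the plain O(n) scan is only a few lines. This is a sketch in which `closest_by_scan` and the `closeness` callback are hypothetical names; the callback is assumed to return the factor from the question (0 means exactly alike).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the existing node closest to `candidate`.
// `closeness(a, b)` is supplied by the caller and returns a factor in [0, 1],
// where 0 means "exactly alike" (the convention used in the question).
template <typename Node, typename Closeness>
std::size_t closest_by_scan(const std::vector<Node>& nodes, const Node& candidate,
                            Closeness closeness) {
    assert(!nodes.empty());
    std::size_t best = 0;
    double best_factor = closeness(nodes[0], candidate);
    for (std::size_t i = 1; i < nodes.size(); ++i) {
        const double f = closeness(nodes[i], candidate);
        if (f < best_factor) {  // strictly better, so ties keep the earlier node
            best_factor = f;
            best = i;
        }
    }
    return best;
}
```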
One addition you might consider - keep in mind we have no idea what data structure you are using for your objects - is to organize all present nodes into a graph, where nodes with factors below a certain threshold can be considered connected, so you can first check nodes that are more likely to be similar/related.

If you want the optimal algorithm in terms of speed, but O(n^2) space, then for each node create a sorted list of other nodes (ordered by closeness).
When you get a new node, you have to add it to the indexed list of all the other nodes, and all the other nodes need to be added to its list.
To find the closest node, just take the first entry on that node's list.
Since you already need O(n^2) space (in order to store all the closeness information you need basically an NxN matrix where A[i,j] represents the closeness between i and j) you might as well sort it and get O(1) retrieval.

If this closeness forms a linear spectrum (such that closeness to something implies closeness to other things that are close to it, and not being close implies not being close to those close), then you can simply do a binary or interpolation search on insertion for closeness, handling one extra complexity: at each point you have to see whether closeness increases or decreases below or above it.
For example, if we consider letters - A is close to B but far from Z - then the pre-existing elements can be kept sorted, say: A, B, E, G, K, M, Q, Z. To insert say 'F', you start by comparing with the middle element, [3] G, and the one following that: [4] K. You establish that F is closer to G than K, so the best match is either at G or to the left, and we move halfway into the unexplored region to the left... 3/2=[1] B, followed by E, and we find E's closer to F, so the match is either at E or to its right. Halving the space between our earlier checks at [3] and [1], we test at [2] and find it equally-distant, so insert it in between.
EDIT: in probabilistic situations it may work better, and require fewer comparisons, to start at the ends of the spectrum and work your way in (e.g. compare F to A and Z, decide it's closer to A, then see whether A or the halfway point [3] G is closer). Also, it might be good to finish with a comparison to the closest few points on either side of where the binary/interpolation search led you.
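If the elements really do sit on a single axis, as in the letters example, the search can be sketched with std::lower_bound instead of hand-written halving. The function name `closest_in_sorted` is made up for the example, and it assumes closeness is simply the difference between two values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the element of `sorted` closest to `key`; ties go to the lower index.
// Assumes closeness is just the difference between values on one axis.
std::size_t closest_in_sorted(const std::vector<char>& sorted, char key) {
    assert(!sorted.empty());
    // First element that is not less than key: the best match is this element
    // or the one immediately before it.
    const auto it = std::lower_bound(sorted.begin(), sorted.end(), key);
    if (it == sorted.begin()) return 0;
    if (it == sorted.end()) return sorted.size() - 1;
    const std::size_t hi = static_cast<std::size_t>(it - sorted.begin());
    const std::size_t lo = hi - 1;
    return (key - sorted[lo] <= sorted[hi] - key) ? lo : hi;
}
```

With the list A, B, E, G, K, M, Q, Z and the key 'F', the search lands between E and G, which are equally distant, so 'F' would be inserted between them.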

ACM Computing Surveys, September 2001, carried two papers that might be relevant, at least for background: "Searching in Metric Spaces", lead author Chávez, and "Searching in High-Dimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases", lead author Böhm. From memory, if all you have is the triangle inequality, you can use it to some effect, but if you can trim your data down to a sensible number of dimensions, you can do better by using a search structure that knows about this dimensional structure.

Facebook has this thing where it puts you and all of your friends in a graph, then slowly moves everyone around until people are grouped together based on mutual friends and so on.
It looked to me like they just made anything <0.5 an attractive force, anything >0.5 a repulsive force, and moved people with every iteration based on the net force. After a couple hundred iterations, it was looking pretty darn good.
Note: this is not an algorithm; it is a heuristic. In the Facebook implementation I saw, two people were not able to reach equilibrium and kept dancing around each other. It turns out they were actually the same person with two different accounts.
Also, it took about 15 minutes on a decent computer and ~100 nodes. YMMV.

It looks suspiciously like a nearest neighbor search problem (also called a similarity search).


How can I remove too close points in a list

I have a list of points with x,y coordinates:
List_coord = [(462, 435), (491, 953), (617, 285), (657, 378)]
The length of this list (4 elements here) can be very large, from a few hundred up to 35000 elements.
I want to remove points from this list that are closer to each other than a threshold.
Note: points are never at the exact same position.
My current code for that:
while iteration<5:
    for pt in List_coord:
        for PT in List_coord:
            if (abs(pt[0]-PT[0])+abs(pt[1]-PT[1]))!=0 and abs(pt[0]-PT[0])<threshold and abs(pt[1]-PT[1])<threshold:
Explanation of my terrible code :) :
I check whether the distance is 0, which means that I am comparing a point with itself; then I check the distance in x and in y.
I need a few iterations to avoid missing a removal, because the list changes inside the loop itself...
This code works, but it is a very slow process!
I am sure there is another, much easier method, but I wasn't able to find it, even though some already answered questions are close to mine.
Note: I would like to avoid using an extra library for this code if possible.
Python will be a bit slow at this ;-)
The solution you will probably want is called quad-trees, but I'll mention a simpler approach first, in case it's preferable.
The usual approach is to group the points so that you can easily reject points that are clearly far away from each other.
One approach might be to sort the list twice, once by x and once by y. Two points that are too close must be close in each dimension taken separately. Thus your inner loop can break out early: if it sees a point that is too far away from the outer point in the sorted direction, it knows for a fact that all later points in that list are also too far away, so it doesn't have to look any further. Do this in X and Y and you're set!
This approach is going to tend to be dominated by the O(n log n) sort times. However, if all of your points share a single x value, you'll end up doing the same slow O(n^2) iteration that you're doing right now because you never terminate the inner loop early.
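A minimal sketch of the sort-then-scan idea, written in C++ for illustration (the logic maps directly onto Python's sorted() and a nested loop). It sorts by x only, uses the question's test (too close means both |dx| and |dy| are below the threshold), and keeps the first point of each close group in sorted order; `remove_close_points` and `Point` are names invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;  // (x, y)

// Keeps a point only if no already-kept point is within `threshold` in both x and y.
std::vector<Point> remove_close_points(std::vector<Point> points, int threshold) {
    assert(threshold >= 0);
    std::sort(points.begin(), points.end());  // orders by x, then by y
    std::vector<Point> kept;
    for (const Point& p : points) {
        bool too_close = false;
        // Walk back through the kept points (largest x first) and stop as soon
        // as the gap in x alone rules out any further match.
        for (auto it = kept.rbegin(); it != kept.rend(); ++it) {
            if (p.first - it->first >= threshold) break;
            if (std::abs(p.second - it->second) < threshold) {
                too_close = true;
                break;
            }
        }
        if (!too_close) kept.push_back(p);
    }
    return kept;
}
```

As noted above, this still degrades to O(n^2) when many points share nearly the same x value, because the early exit never triggers.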
The more robust solution is to use quadtrees. Quadtrees are designed to solve the kind of problem you are looking at. The idea is to build a tree such that you can rapidly exclude large numbers of points. I'd recommend this.
If your number of points gets too large, I'd recommend getting a clustering library. Efficient clustering is a very difficult task, and often done in C++ or another fast language.

Algorithm to assemble a simplified jigsaw puzzle where all edges are identified

Are there any kind of algorithms out there that can assist and accelerate in the construction of a jigsaw puzzle where the edges are already identified and each edge is guaranteed to fit exactly one other edge (or no edges if that piece is a corner or border piece)?
I've got a data set here that is roughly represented by the following structure:
struct tile {
    int a, b, c, d;
} tile[SOME_LARGE_NUMBER] = ...;
Each side (a, b, c, and d) is uniquely indexed within the puzzle so that only one other tile will match an edge (if that edge has a match, since corner and border tiles might not).
Unfortunately there are no guarantees past that. The order of the tiles within the array is random, the only guarantee is that they're indexed from 0 to SOME_LARGE_NUMBER. Likewise, the side UIDs are randomized as well. They all fall within a contiguous range (where the max of that range depends on the number of tiles and the dimensions of the completed puzzle), but that's about it.
I'm trying to assemble the puzzle in the most efficient way possible, so that I can ultimately address the completed puzzle using rows and columns through a two dimensional array. How should I go about doing this?
The tile[] data defines an undirected graph where each node links with 2, 3 or 4 other nodes. Choose a node with just 2 links and set that as your origin. The two links from this node define your X and Y axes. If you follow, say, the X axis link, you will arrive at a node with 3 links — one pointing back to the origin, and two others corresponding to the positive X and Y directions. You can easily identify the link in the X direction, because it will take you to another node with 3 links (not 4).
In this way you can easily find all the pieces along one side until you reach the far corner, which only has two links. Of all the pieces found so far, the only untested links are pointing in the Y direction. This makes it easy to place the next row of pieces. Simply continue until all the pieces have been placed.
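The walk described above could be sketched roughly as follows. Several things here are assumptions rather than facts from the question: two tiles are taken to fit when they carry the same side ID, border sides are taken to have IDs that occur only once, the puzzle is assumed to be at least 3x3, and the names `Tile` and `assemble` are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <unordered_map>
#include <vector>

struct Tile { int a, b, c, d; };

// Returns grid[row][col] = index into `tiles`.
// Assumptions: matching sides share an ID, border side IDs occur only once,
// and the puzzle has at least 3 rows and 3 columns.
std::vector<std::vector<int>> assemble(const std::vector<Tile>& tiles) {
    const int n = static_cast<int>(tiles.size());
    // Map each side ID to the tiles carrying it, then derive tile-to-tile links.
    std::unordered_map<int, std::vector<int>> by_side;
    for (int t = 0; t < n; ++t)
        for (int side : {tiles[t].a, tiles[t].b, tiles[t].c, tiles[t].d})
            by_side[side].push_back(t);
    std::vector<std::vector<int>> links(n);
    for (const auto& entry : by_side) {
        const std::vector<int>& owners = entry.second;
        if (owners.size() == 2) {
            links[owners[0]].push_back(owners[1]);
            links[owners[1]].push_back(owners[0]);
        }
    }

    // Origin: any corner, i.e. a tile with exactly two links.
    int corner = -1;
    for (int t = 0; t < n && corner < 0; ++t)
        if (links[t].size() == 2) corner = t;
    assert(corner >= 0);

    // First row: follow border tiles (fewer than 4 links) to the far corner.
    std::vector<bool> placed(n, false);
    std::vector<std::vector<int>> grid(1);
    grid[0].push_back(corner);
    placed[corner] = true;
    int cur = links[corner][0];  // either neighbour can serve as the X direction
    for (;;) {
        grid[0].push_back(cur);
        placed[cur] = true;
        if (links[cur].size() == 2) break;  // reached the far corner
        int next = -1;
        for (int nb : links[cur])
            if (!placed[nb] && links[nb].size() < 4) next = nb;
        assert(next >= 0);
        cur = next;
    }

    // Remaining rows: the only unplaced link of the tile above points downward.
    const std::size_t cols = grid[0].size();
    while (grid.size() * cols < tiles.size()) {
        std::vector<int> row;
        for (int above : grid.back()) {
            int below = -1;
            for (int nb : links[above])
                if (!placed[nb]) below = nb;
            assert(below >= 0);
            placed[below] = true;
            row.push_back(below);
        }
        grid.push_back(row);
    }
    return grid;
}
```

Building the side-ID map is linear in the number of tiles on average, and each tile is then visited a constant number of times, so the whole assembly is roughly O(n). The sketch recovers positions only; it does not work out how each tile is rotated.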
This might not be what you are looking for, but because you asked for the "most efficient way possible", here is a relatively recent scientific solution.
In general, jigsaw puzzles are a complex combinatorial problem (NP-complete) and require some help from academia to solve efficiently. State-of-the-art algorithms were recently beaten by genetic algorithms.
Depending on your puzzle sizes (and desire to study scientific stuff ;)) you might be interested in this paper: A Genetic Algorithm-Based Solver for Very Large Jigsaw Puzzles. GAs work around, in surprising ways, some of the problems you encounter in classic algorithms.
Note that genetic algorithms are embarrassingly parallel, so there is a straightforward way to run them on parallel machines such as multi-core CPUs, GPUs (CUDA/OpenCL) and even distributed/cloud frameworks, which can make them hundreds to thousands of times faster. GPU-accelerated GAs unlock puzzle sizes unavailable to conventional algorithms.

Get all build orders from a given model.

I have a problem. For my work I need all possible build orders for some components. As a simple example you can imagine a simple Lego pyramid.
I tried some kind of DFS but it didn't work out. Some possibilities are missing at the end.
Can anyone help me with that? Language should be C++ but I just need a hint not a complete algorithm.
Some information: The models are available as XML files. There you can find all neighbourhood relationships in all 3 directions (x, y, z). All pieces have a unique name/ID. The starting point is not defined. There is no restriction on the build order, so you don't have to finish one level of the pyramid before starting another. I know there are a lot of possible build orders. Even the 3x3 base on its own has a lot of possibilities (nine factorial). But that doesn't matter at the moment.
Please I need help.
First, treat each layer (or "course") as an independent problem. Consider the nine bricks on the bottom; ignoring all others, there are 9! possible orders, so generate those, call them P. Likewise the 4! possible orders for the middle bricks are Q. We can ignore the single brick at the top for now.
Iterate over P and Q. Given an ordering of the bottom bricks, p, and of the middle bricks, q, it may be that the first move of q (i.e. laying the first mid-level brick) is possible before the bottom is complete, so we can intersperse that move with the moves of p; for each permitted time of the first of q, iterate over the permitted times of the second of q, and for each of them iterate over the permitted times of the third, and so on.
Notice that the top brick must always be placed last. Good thing, too.
Is that enough to go on?
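If it helps, here is a different way to get the same set of orders: plain backtracking that, at every step, tries each brick whose supporting bricks are already in place. This is not the layer-by-layer interleaving described above, just a sketch of an alternative. The `supports` representation is hypothetical and would have to be derived from the z-neighbour data in the XML.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// supports[i] lists the bricks that must already be in place before brick i can
// be laid (hypothetical representation, to be derived from the z-neighbours).
void enumerate_orders(const std::vector<std::vector<int>>& supports,
                      std::vector<int>& current, std::vector<bool>& placed,
                      std::vector<std::vector<int>>& out) {
    const std::size_t n = supports.size();
    assert(placed.size() == n);
    if (current.size() == n) {  // every brick placed: record this order
        out.push_back(current);
        return;
    }
    for (std::size_t i = 0; i < n; ++i) {
        if (placed[i]) continue;
        bool ready = true;
        for (int s : supports[i]) {
            if (!placed[s]) { ready = false; break; }
        }
        if (!ready) continue;
        // Place brick i, recurse, then undo the choice.
        placed[i] = true;
        current.push_back(static_cast<int>(i));
        enumerate_orders(supports, current, placed, out);
        current.pop_back();
        placed[i] = false;
    }
}

std::vector<std::vector<int>> all_build_orders(const std::vector<std::vector<int>>& supports) {
    std::vector<int> current;
    std::vector<bool> placed(supports.size(), false);
    std::vector<std::vector<int>> out;
    enumerate_orders(supports, current, placed, out);
    return out;
}
```

Storing every order in a vector is only practical for small models, since the count grows factorially; for larger ones the recording step would have to become a callback.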
If there aren't any limits on the order of placement, then there are exactly n! different orders in which the blocks can be placed. In that case, a simple solution is to put all the blocks (or rather, their IDs) into a vector and generate all the permutations with std::next_permutation.
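A minimal sketch of that loop, assuming the bricks are identified by the integers 0..n-1 (the helper name `for_each_order` is made up for the example):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Calls `visit` once for every ordering of the brick IDs 0..n-1.
template <typename Visitor>
void for_each_order(int n, Visitor visit) {
    assert(n >= 0);
    std::vector<int> ids(n);
    for (int i = 0; i < n; ++i) ids[i] = i;  // must start from the sorted order
    do {
        visit(ids);
    } while (std::next_permutation(ids.begin(), ids.end()));
}
```

std::next_permutation returns false once it wraps around to the sorted sequence, which is why the vector has to start out sorted for the loop to cover all n! orders.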

c++ discrete distribution sampling with frequently changing probabilities

Problem: I need to sample from a discrete distribution constructed of certain weights e.g. {w1,w2,w3,..}, and thus probability distribution {p1,p2,p3,...}, where pi=wi/(w1+w2+...).
Some of the wi's change very frequently, but only a very small proportion of all wi's. The distribution itself thus has to be renormalised every time that happens, and therefore I believe the alias method does not work efficiently, because one would need to rebuild the whole distribution from scratch every time.
The method I am currently thinking of is a binary tree (heap method), where all wi's are stored in the lowest level, then the sum of each pair in the next level up, and so on. The sum of all of them will be at the top level, which is also the normalisation constant. Thus, to update the tree after a change in wi, one needs to make log(n) changes, and the same number of steps to draw a sample from the distribution.
Q1. Do you have a better idea on how to achieve it faster?
Q2. The most important part: I am looking for a library which has already done this.
explanation: I did this myself several years ago by building a heap structure in a vector, but since then I have learned many things, including discovering libraries ( :) ) and containers such as map... Now I need to rewrite that code with more functionality, and I want to get it right this time:
so Q2.1: is there a nice way to make a C++ map ordered and searched not by index, but by a cumulative sum of its elements (this is how we sample, right?..). (That is my current theory of how I would like to do it, but it doesn't have to be this way...)
Q2.2 Maybe there is some even nicer way to do the same? I would believe this problem is so frequent that I am very surprised I could not find some sort of library which would do it for me...
Thank you very much, and I am very sorry if this has been asked in some other form, please direct me towards it, but I have spent a good while looking...
Edit: There is a possibility that I might need to remove or add the elements as well, but I think I could avoid it, if that makes a huge difference, thus leaving only changing the value of the weights.
Edit2: weights are reals in general, I would have to think if I could make them integers...
I would actually use a hash set of strings (don't remember the C++ container for it, you might need to implement your own though). Put wi elements for each i, with the values "w1_1", "w1_2",... all through "w1_[w1]" (that is, w1 elements starting with "w1_").
When you need to sample, pick an element at random using a uniform distribution. If you picked w5_*, say you picked element 5. Because of the number of elements in the hash, this will give you the distribution you were looking for.
Now, when wi changes from A to B, just add B-A elements to the hash (if B>A), or remove the last A-B elements of wi (if A>B).
Adding new elements and removing old elements is trivial in this case.
Obviously the problem is 'pick an element at random'. If your hash is a closed hash, you pick an array cell at random, if it's empty - just pick one at random again. If you keep your hash 3 or 4 times larger than the total sum of weights, your complexity will be pretty good: O(1) for retrieving a random sample, O(|A-B|) for modifying the weights.
Another option, since only a small part of your weights change, is to split the weights into two - the fixed part and the changed part. Then you only need to worry about changes in the changed part, and the difference between the total weight of changed parts and the total weight of unchanged parts. Then for the fixed part your hash becomes a simple array of numbers: 1 appears w1 times, 2 appears w2 times, etc..., and picking a random fixed element is just picking a random number.
Updating your normalisation factor when you change a value is trivial. This might suggest an algorithm.
w_sum = w_sum_old - w_i_old + w_i_new;
If you leave p_i as a computed property p_i = w_i / w_sum you would avoid recalculating the entire p_i array, at the cost of calculating p_i every time it is needed. You would, however, be able to update many statistical properties without recalculating the entire sum:
expected_something = (something_1 * w_1 + something_2 * w_2 + ...) / w_sum;
With a bit of algebra you can update expected_something by subtracting the contribution with the old weight and add the contribution with the new weight, multiplying and dividing with the normalization factors as required.
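A sketch of that bookkeeping, under the assumption that the quantity of interest is a weighted mean of fixed per-outcome values (the class name `RunningExpectation` is invented for the example): it keeps w_sum and the unnormalised numerator, and corrects both when a single weight changes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Keeps the weight sum and the weighted sum of the values up to date when
// individual weights change, so nothing is recomputed from scratch.
class RunningExpectation {
public:
    RunningExpectation(std::vector<double> weights, std::vector<double> values)
        : w_(std::move(weights)), x_(std::move(values)) {
        assert(w_.size() == x_.size());
        for (std::size_t i = 0; i < w_.size(); ++i) {
            w_sum_ += w_[i];
            wx_sum_ += w_[i] * x_[i];
        }
    }

    void set_weight(std::size_t i, double w_new) {
        assert(i < w_.size());
        w_sum_ += w_new - w_[i];             // w_sum = w_sum_old - w_i_old + w_i_new
        wx_sum_ += (w_new - w_[i]) * x_[i];  // same correction for the numerator
        w_[i] = w_new;
    }

    double probability(std::size_t i) const { return w_[i] / w_sum_; }  // p_i on demand
    double expected() const { return wx_sum_ / w_sum_; }

private:
    std::vector<double> w_, x_;
    double w_sum_ = 0.0, wx_sum_ = 0.0;
};
```

With real-valued weights, rounding error accumulates over many incremental updates, so recomputing both sums from scratch every so often is a sensible precaution.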
If, during the sampling, you keep track of which outcomes are part of the sample, it would be possible to propagate the probability updates to the generated sample. Would this make it possible for you to update rather than recalculate values related to the sample? I think a bitmap could provide an efficient way to store an index of which outcomes were used to build the sample.
One way of storing the probabilities together with the sums is to start with all N probabilities. In the next N/2 positions you store the sums of the pairs, after that the N/4 sums of those pairs, and so on. Where the sums are located can, obviously, be calculated in O(1) time. This data structure is a sort of heap, but upside down.
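For comparison, here is a sketch of a sum tree kept in a single vector. It uses the common layout with the root at index 1 and the leaves in the second half, which is a different arrangement from the "leaves first" one just described; the class name `WeightedSampler` is invented for the example. Both updating a weight and drawing a sample take O(log n) steps.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Binary sum tree over n weights in one vector of size 2n: leaves live at
// [n, 2n) and every internal node i holds tree[2i] + tree[2i + 1].
class WeightedSampler {
public:
    explicit WeightedSampler(const std::vector<double>& weights)
        : n_(weights.size()), tree_(2 * weights.size(), 0.0) {
        assert(n_ > 0);
        for (std::size_t i = 0; i < n_; ++i) tree_[n_ + i] = weights[i];
        for (std::size_t i = n_ - 1; i >= 1; --i) tree_[i] = tree_[2 * i] + tree_[2 * i + 1];
    }

    // Changes one weight and refreshes the sums on the path up to the root.
    void set_weight(std::size_t i, double w) {
        assert(i < n_ && w >= 0.0);
        std::size_t node = n_ + i;
        tree_[node] = w;
        for (node /= 2; node >= 1; node /= 2)
            tree_[node] = tree_[2 * node] + tree_[2 * node + 1];
    }

    double total() const { return tree_[1]; }  // the normalisation constant

    // Descends from the root to the leaf whose share of [0, total()) contains u.
    std::size_t find(double u) const {
        std::size_t node = 1;
        while (node < n_) {
            const double left = tree_[2 * node];
            if (u < left) {
                node = 2 * node;
            } else {
                u -= left;
                node = 2 * node + 1;
            }
        }
        return node - n_;
    }

    template <typename Rng>
    std::size_t sample(Rng& rng) const {
        std::uniform_real_distribution<double> dist(0.0, total());
        return find(dist(rng));
    }

private:
    std::size_t n_;
    std::vector<double> tree_;
};
```

Recomputing each parent from its two children, rather than adding a delta, avoids drift in the stored sums when the weights are real numbers. Adding or removing elements would need a resize and rebuild in this sketch.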

Balancing KD Tree

So when balancing a KD tree you're supposed to find the median and then put all the elements that are less on the left subtree and those greater on the right. But what happens if you have multiple elements with the same value as the median? Do they go in the left subtree, the right or do you discard them?
I ask because I've tried doing multiple things and it affects the results of my nearest neighbor search algorithm and there are some cases where all the elements for a given section of the tree will all have the exact same value and so I don't know how to split them up in that case.
It does not really matter where you put them. Preferably, keep your tree balanced. So place as many on the left as needed to keep the optimal balance!
If your current search radius touches the median, you will have to check the other part, that's all you need to handle tied objects on the other side. This is usually cheaper than some complex handling of attaching multiple elements anywhere.
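To make that concrete, here is a sketch of an implicit 2-D kd-tree built with std::nth_element, where ties with the median may land on either side, and a nearest-neighbour search that visits the far side whenever the search radius reaches the splitting plane. The function names and the 2-D `Point2` type are choices made for the example only.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Point2 = std::array<double, 2>;

// Reorders pts[lo, hi) so the median along the splitting axis sits in the middle
// of every subrange. Points equal to the median may end up on either side.
void build(std::vector<Point2>& pts, std::size_t lo, std::size_t hi, int depth) {
    if (hi - lo <= 1) return;
    const std::size_t mid = lo + (hi - lo) / 2;
    const std::size_t axis = static_cast<std::size_t>(depth % 2);
    const auto first = pts.begin();
    std::nth_element(first + static_cast<std::ptrdiff_t>(lo),
                     first + static_cast<std::ptrdiff_t>(mid),
                     first + static_cast<std::ptrdiff_t>(hi),
                     [axis](const Point2& a, const Point2& b) { return a[axis] < b[axis]; });
    build(pts, lo, mid, depth + 1);
    build(pts, mid + 1, hi, depth + 1);
}

double squared_distance(const Point2& a, const Point2& b) {
    const double dx = a[0] - b[0], dy = a[1] - b[1];
    return dx * dx + dy * dy;
}

// Updates best / best_d2 with the point of pts[lo, hi) nearest to q.
void nearest(const std::vector<Point2>& pts, std::size_t lo, std::size_t hi, int depth,
             const Point2& q, std::size_t& best, double& best_d2) {
    if (lo >= hi) return;
    const std::size_t mid = lo + (hi - lo) / 2;
    const std::size_t axis = static_cast<std::size_t>(depth % 2);
    const double d2 = squared_distance(pts[mid], q);
    if (d2 < best_d2) { best_d2 = d2; best = mid; }
    const double delta = q[axis] - pts[mid][axis];
    const bool go_left = delta < 0.0;
    // Search the side containing q first.
    if (go_left) nearest(pts, lo, mid, depth + 1, q, best, best_d2);
    else nearest(pts, mid + 1, hi, depth + 1, q, best, best_d2);
    // Visit the other side only if the search radius reaches the splitting
    // plane; using <= means tied values on the far side are still checked.
    if (delta * delta <= best_d2) {
        if (go_left) nearest(pts, mid + 1, hi, depth + 1, q, best, best_d2);
        else nearest(pts, lo, mid, depth + 1, q, best, best_d2);
    }
}

// Index (into the reordered vector) of the point nearest to q.
std::size_t nearest_index(const std::vector<Point2>& pts, const Point2& q) {
    assert(!pts.empty());
    std::size_t best = 0;
    double best_d2 = std::numeric_limits<double>::infinity();
    nearest(pts, 0, pts.size(), 0, q, best, best_d2);
    return best;
}
```

Call build(pts, 0, pts.size(), 0) once before searching. Because the far side is checked whenever the radius touches the plane, the result does not depend on which side the tied elements happened to land on.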
When doing a search style algorithm, it is often a good idea to put elements equal to your median on both sides of the median.
One method is to put elements equal to the median on the "same side" as where they were before you did your partition. Another method is to put the first one on the left, the second one on the right, and so on.
Another solution is to have a clumping data structure that just "counts" things that are equal instead of storing each one individually. (if they have extra state, then you can store that extra state instead of just a count)
I don't know which is appropriate in your situation.
That depends on your purpose.
For problems such as exact matching or range search, allowing the same value to appear on both sides will complicate the query, and repetition of the same value in both leaves will add to the time complexity.
A solution is storing all of the medians (the values that are equal to the value of median) on the node, neither left nor right. Most variants of kd-trees store the medians on the internal nodes. If they happen to be many, you may consider utilizing another (k-1)d tree for the medians.