Minimizing Sum of Distances: Optimization Problem - c++

The actual question goes like this:
McDonald's is planning to open a number of joints (say n) along a straight highway. These joints require warehouses to store their food. A warehouse can store food for any number of joints, but has to be located at one of the joints only. McD has a limited number of warehouses (say k) available, and wants to place them in such a way that the average distance of joints from their nearest warehouse is minimized.
Given an array (n elements) of coordinates of the joints and an integer 'k', return an array of 'k' elements giving the coordinates of the optimal positioning of warehouses.
Sorry, I don't have any examples available since I'm writing this down from memory. Anyway, one sample could be:
array={1,3,4,5,7,7,8,10,11} (n=9)
k=1
Ans: {7}
This is what I've been thinking: For k=1, we can simply find out the median of the set, which would give the optimal location of the warehouse. However, for k>1, the given set should be divided into 'k' subsets (disjoint, and of contiguous elements of the superset), and median for each subset would give the warehouse locations. However, I don't understand on what basis the 'k' subsets should be formed. Thanks in advance.
EDIT: There's a variation to this problem also: Instead of sum/avg, minimize the maximum distance between a joint and its closest warehouse. I don't get this either..

The straight highway makes this an exercise in dynamic programming, working from left to right along the highway. A partial solution can be described by the location of the rightmost warehouse and the number of warehouses placed. The cost of the partial solution will be the total distance to the nearest warehouse (for fixed k, minimising this is the same as minimising the average) or the maximum distance so far to the closest warehouse.
At each stage you have worked out the answers for the leftmost N joints and have them indexed by number of warehouses used and position of the rightmost warehouse - you need to save only the best cost. Now consider the next joint and work out the best solution for N+1 joints and all possible values of k and rightmost warehouse, using the answers you have stored for N joints to speed this up. Once you have worked out the best cost solution covering all the joints you know where its rightmost warehouse is, which gives you the location of one warehouse. Go back to the solution that has that warehouse as the rightmost joint and find out what solution that was based on. That gives you one more rightmost warehouse - and so you can work your way back to the location of all the warehouses for the best solution.
I tend to get the cost of working this out wrong, but with N joints and k warehouses to place you have N steps to take, each of them based on considering no more than Nk previous solutions, so I reckon the cost is kN^2.
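A minimal sketch of one way to write such a dynamic program for the sum-of-distances version. It is a variant rather than the exact state described above: it uses the asker's observation that the joints split into k contiguous groups, each served from its median, and indexes the table by "number of warehouses used" and "number of leftmost joints covered". The function name placeWarehouses and the table names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Returns k warehouse coordinates (each at a joint) that minimise the total
// distance from every joint to its nearest warehouse. Runs in O(k * n^2).
std::vector<long long> placeWarehouses(std::vector<long long> xs, int k) {
    const int n = static_cast<int>(xs.size());
    assert(k >= 1 && k <= n);
    std::sort(xs.begin(), xs.end());

    // prefix[i] = xs[0] + ... + xs[i-1], so group costs take O(1).
    std::vector<long long> prefix(n + 1, 0);
    for (int i = 0; i < n; ++i) prefix[i + 1] = prefix[i] + xs[i];

    // Cost of serving joints l..r (inclusive) from a warehouse at their median.
    auto cost = [&](int l, int r) {
        const int m = (l + r) / 2;
        const long long left = xs[m] * (m - l) - (prefix[m] - prefix[l]);
        const long long right = (prefix[r + 1] - prefix[m + 1]) - xs[m] * (r - m);
        return left + right;
    };

    const long long INF = std::numeric_limits<long long>::max() / 2;
    // best[j][i] = minimal cost of serving the first i joints with j warehouses.
    std::vector<std::vector<long long>> best(k + 1, std::vector<long long>(n + 1, INF));
    // cut[j][i] = index where the last group starts in that best solution.
    std::vector<std::vector<int>> cut(k + 1, std::vector<int>(n + 1, 0));
    best[0][0] = 0;
    for (int j = 1; j <= k; ++j) {
        for (int i = j; i <= n; ++i) {
            for (int p = j - 1; p < i; ++p) {  // last group is joints p..i-1
                if (best[j - 1][p] == INF) continue;
                const long long c = best[j - 1][p] + cost(p, i - 1);
                if (c < best[j][i]) { best[j][i] = c; cut[j][i] = p; }
            }
        }
    }

    // Walk the stored cuts backwards to recover one median per group.
    std::vector<long long> result(k);
    for (int j = k, i = n; j >= 1; --j) {
        const int p = cut[j][i];
        result[j - 1] = xs[(p + i - 1) / 2];
        i = p;
    }
    return result;
}
```

Called with the sample array from the question and k=1, this returns the median, 7. The three nested loops give the kN^2 cost estimated above.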

This is NOT a clustering problem, it's a special case of a facility location problem. You can solve it using a general integer / linear programming package, but because the problem is on a line, there may be more efficient (and less expensive software-wise) algorithms that would work. You might consider dynamic programming, since there are probably combinations of facilities that could be eliminated rather quickly. Look into the P-Median problem for more info.

Related

Multidimensional array v. Sorting multiple arrays

I've hit a snag in continuing my work in a C++ program, I'm not sure what the best way to approach my problem is. Here is the situation in non-programming terms: I have a list of children and each child has a specific weight, age, and happiness. I have a way that people can visually view the bones of the child that is specific to these characteristics. (Think of an MMO character customization where there are sliders for each characteristic and when you slide the weight slider to heavy, the walk cycle looks like the character is heavier).
Before, my system had a set walk cycle for each end of the spectrum for each characteristic. For example, there is one specific walk cycle for the heaviest walk, one for the lightest walk, one for the youngest walk, etc. There was no middle input: the output was determined by the position of the slider on the scale, with the heaviest and the lightest walk cycles averaged according to a percentage given by the slider position.
Now to the problem: I have a large library of preset walk cycles, and each walk cycle has a specific weight, age, and happiness. So Joe has a weight of 4, an age of 7, and a happiness level of 8, and Sally has 2, 3, 5. Suppose the sliders are moved to specific values (weight 5, age 8, happiness 7). Only one slider can be moved at a time, and the slider that was moved last is the most important characteristic to match. I want to find the child in my library that is closest to all three of these values; here, Joe would be the closest.
I was told to check out using a 3-dimensional array, but I would rather use an array of child objects and do multiple searches on that array. I am a rookie, and I know the search will take a bit of computing time, but I keep leaning towards using the single array. I could also use a two-dimensional array, but I'm not sure. What data structure would be the best to search for three values in?
Thank you for any help!
How many different values can each slider take? If there are say ten values for each slider this would mean there are 10*10*10=1000 different possible character classes. If your library has less than 1000 walk cycles just reading through them all looking for the nearest match is probably gonna be fast enough.
Of course, if there are 100 values for each slider then you may want something more clever. My point is that there are some things that don't have to be optimized.
Also is your library of walk cycles fixed once and for all? If so perhaps you could pre compute the walk cycle for each setting of the sliders and write that to a static array.
I agree with Wilf that the number of walk cycles is critical, as even if there are say 100,000 cycles you could easily use a brute-force find-the-minimum over...
weight_factor * diff(candidate.weight, target.weight) +
age_factor * diff(candidate.age, target.age) +
happiness_factor * diff(candidate.happiness, target.happiness)
...where the factor for the last-moved slider was higher than the others.
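A minimal sketch of that brute-force scan, assuming diff is the absolute difference and that the last-moved slider simply gets a larger factor. The names Cycle, Slider and closestCycle and the default emphasis of 3.0 are made up for illustration and would need tuning.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Cycle { double weight, age, happiness; };
enum class Slider { Weight, Age, Happiness };

// Returns the index of the library entry with the smallest weighted
// difference from the target. The last-moved slider counts `emphasis` times
// as much as the other two (an assumed weighting).
std::size_t closestCycle(const std::vector<Cycle>& library, const Cycle& target,
                         Slider lastMoved, double emphasis = 3.0) {
    assert(!library.empty());
    const double weightFactor = lastMoved == Slider::Weight ? emphasis : 1.0;
    const double ageFactor = lastMoved == Slider::Age ? emphasis : 1.0;
    const double happinessFactor = lastMoved == Slider::Happiness ? emphasis : 1.0;

    std::size_t best = 0;
    double bestScore = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < library.size(); ++i) {
        const Cycle& c = library[i];
        const double score = weightFactor * std::fabs(c.weight - target.weight) +
                             ageFactor * std::fabs(c.age - target.age) +
                             happinessFactor * std::fabs(c.happiness - target.happiness);
        if (score < bestScore) { bestScore = score; best = i; }
    }
    return best;
}
```

With Joe {4, 7, 8} and Sally {2, 3, 5} in the library and a target of {5, 8, 7}, the scan picks Joe whichever slider moved last.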
For more cycles than that you'd want to limit the search space somewhat, and some indices would be useful, e.g.:
map<int, map<int, map<int, vector<Cycle*>>>> cycles_by_weight_age_happiness;
You'd populate that by adding a pointer to each walk cycle - characterised by { weight, age, happiness } - to cycles_by_weight_age_happiness[rw(weight)][ra(age)][rh(happiness)], where each of rw, ra and rh rounds the parameters to whatever granularity you like (e.g. round weight down to the nearest 5kg, group ages by the integer part of log base 1.5 of age, leave happiness alone). Then to search you evaluate the entries "around" your target { rw(weight), ra(age), rh(happiness) } indices... the further from there you deviate (especially on the last-slider-moved parameter), the less likely you are to find a better fit than you've already seen, so tune to taste.
The above indexing is a refinement of what I think Wilf intended - just using functions to decouple the mapping from value space into vectors in the index.

C++: Find neighbouring grid points from calibration picture from unsorted list

I have 4 lists of the x and y coordinates of calibration points. Those are in no particular order and not aligned on any axis (they come from a real calibration picture with slight rotation and distortion), but the lists have the same indexing and cannot be sorted in such a way that each list is ascending/descending. They also hold floating-point rather than integer values. I am now trying to find the four neighbouring points for a given point.
E.g. searching for the neighbours of the point [150,150] would return [140,140], [140,160], [160,140], [160,160] (except for them actually being more like [139.581239,138.28812]).
At the moment I have to look through all calibration points for each point to check. There are about 500 calibration points.
Later during the process, I need to know the 4 neighbours for a random point within the 1600x1400 grid for multiple million times. So it is crucial to find those points as fast as possible to avoid calculation time of days or even weeks.
My first approach was checking each of the ~500 calibration points for each point to check, looking at their relative position to the checking point (x_calib > x and y_calib > y would be somewhere in the top right region of the point) and calculating their distance to it. The closest point in each region (top left, top right, lower left, lower right) would then be the respective neighbour point. That seems not to be efficient at all and takes a lot of time.
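For reference, a minimal sketch of that first approach in C++, under the assumption that the calibration points are kept as one list of (x, y) pairs and that comparing squared distances is acceptable (it avoids a square root per point). The names Point and quadrantNeighbours and the quadrant numbering are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Point { double x, y; };

// For a query point q, returns the index of the nearest calibration point in
// each of the four regions around q, or -1 where a region is empty.
// Region number = (x_calib >= q.x ? 1 : 0) + (y_calib >= q.y ? 2 : 0).
std::array<int, 4> quadrantNeighbours(const std::vector<Point>& calib, Point q) {
    std::array<int, 4> best{-1, -1, -1, -1};
    std::array<double, 4> bestDist2;
    bestDist2.fill(std::numeric_limits<double>::infinity());

    for (std::size_t i = 0; i < calib.size(); ++i) {
        const double dx = calib[i].x - q.x;
        const double dy = calib[i].y - q.y;
        const int region = (dx >= 0 ? 1 : 0) + (dy >= 0 ? 2 : 0);
        const double dist2 = dx * dx + dy * dy;  // squared distance, no sqrt needed
        if (dist2 < bestDist2[region]) {
            bestDist2[region] = dist2;
            best[region] = static_cast<int>(i);
        }
    }
    return best;
}
```

Each query is a single pass over the calibration points, so its cost grows linearly with the size of the list.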
The second approach was creating a rainbow table for each of the 1600x1400 points and saving the respective neighbours (to be exact, saving the index in the list of coordinates). Later on, the process would check this rainbow table at position [x,y,0], [x,y,1], [x,y,2] and [x,y,3] to get the 4 indices of the 4 neighbour points. Though calculating the rainbow table takes some time (~20 minutes for those ~2 million points), this approach speeds up the later processing. Unfortunately, this approach makes it difficult to debug the later steps of the process because it takes this much time before the rest even starts.
I still think there should be room for optimization and I would appreciate any suggestion or help to speed up the whole thing. I already read about kd-trees but did not quite see how to use one here. I'm hoping that there's an approach for this kind of unsorted (and unsortable) list of points which is more efficient than the rainbow table - or which is at least faster at creating the table.
Thanks in advance!

'Stable' multi-dimensional scaling algorithm

I have a wireless mesh network of nodes, each of which is capable of reporting its 'distance' to its neighbors, measured in (simplified) signal strength to them. The nodes are geographically in 3d space but because of radio interference, the distance between nodes need not be trigonometrically (trigonomically?) consistent. I.e., given nodes A, B and C, the distance between A and B might be 10, between A and C also 10, yet between B and C 100.
What I want to do is visualize the logical network layout in terms of connectedness of nodes, i.e. include the logical distance between nodes in the visual.
So far my research has shown the multidimensional scaling (MDS) is designed for exactly this sort of thing. Given that my data can be directly expressed as a 2d distance matrix, it's even a simpler form of the more general MDS.
Now, there seem to be many MDS algorithms, see e.g. http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html and http://tapkee.lisitsyn.me/ . I need to do this in C++ and I'm hoping I can use a ready-made component, i.e. not have to re-implement an algo from a paper. So, I thought this: https://sites.google.com/site/simpmatrix/ would be the ticket. And it works, but:
The layout is not stable, i.e. every time the algorithm is re-run, the position of the nodes changes (see differences between image 1 and 2 below - this is from having been run twice, without any further changes). This is due to the initialization matrix (which contains the initial location of each node, which the algorithm then iteratively corrects) that is passed to this algorithm - I pass an empty one and then the implementation derives a random one. In general, the layout does approach the layout I expected from the given input data. Furthermore, between different runs, the direction of nodes (clockwise or counterclockwise) can change. See image 3 below.
The 'solution' I thought was obvious, was to pass a stable default initialization matrix. But when I put all nodes initially in the same place, they're not moved at all; when I put them on one axis (node 0 at 0,0 ; node 1 at 1,0 ; node 2 at 2,0 etc.), they are moved along that axis only. (see image 4 below). The relative distances between them are OK, though.
So it seems like this algorithm only changes distance between nodes, but doesn't change their location.
Thanks for reading this far - my questions are (I'd be happy to get just one or a few of them answered as each of them might give me a clue as to what direction to continue in):
Where can I find more information on the properties of each of the many MDS algorithms?
Is there an algorithm that derives the complete location of each node in a network, without having to pass an initial position for each node?
Is there a solid way to estimate the location of each point so that the algorithm can then correctly scale the distance between them? I have no geographic location of each of these nodes, that is the whole point of this exercise.
Are there any algorithms to keep the 'angle' at which the network is derived constant between runs?
If all else fails, my next option is going to be to use the algorithm I mentioned above, increase the number of iterations to keep the variability between runs at around a few pixels (I'd have to experiment with how many iterations that would take), then 'rotate' each node around node 0 to, for example, align nodes 0 and 1 on a horizontal line from left to right; that way, I would 'correct' the location of the points after their relative distances have been determined by the MDS algorithm. I would have to correct for the order of connected nodes (clockwise or counterclockwise) around each node as well. This might become hairy quite quickly.
Obviously I'd prefer a stable algorithmic solution - increasing iterations to smooth out the randomness is not very reliable.
Thanks.
EDIT: I was referred to cs.stackexchange.com and some comments have been made there; for algorithmic suggestions, please see https://cs.stackexchange.com/questions/18439/stable-multi-dimensional-scaling-algorithm .
Image 1 - with random initialization matrix:
Image 2 - after running with same input data, rotated when compared to 1:
Image 3 - same as previous 2, but nodes 1-3 are in another direction:
Image 4 - with the initial layout of the nodes on one line, their position on the y axis isn't changed:
Most scaling algorithms effectively set "springs" between nodes, where the resting length of the spring is the desired length of the edge. They then attempt to minimize the energy of the system of springs. When you initialize all the nodes on top of each other though, the amount of energy released when any one node is moved is the same in every direction. So the gradient of energy with respect to each node's position is zero, so the algorithm leaves the node where it is. Similarly if you start them all in a straight line, the gradient is always along that line, so the nodes are only ever moved along it.
(That's a flawed explanation in many respects, but it works for an intuition)
Try initializing the nodes to lie on the unit circle, on a grid or in any other fashion such that they aren't all co-linear. Assuming the library algorithm's update scheme is deterministic, that should give you reproducible visualizations and avoid degeneracy conditions.
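A minimal sketch of such a deterministic starting layout, placing the nodes evenly on the unit circle. The names Vec2 and circleLayout are made up for illustration, and how the result is copied into the library's initialization matrix depends on that library's API, which is not shown here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { double x, y; };

// Places n nodes evenly on the unit circle. The result is the same on every
// call, and for three or more nodes no three of them are collinear.
std::vector<Vec2> circleLayout(std::size_t n) {
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<Vec2> positions(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double angle = twoPi * static_cast<double>(i) / static_cast<double>(n);
        positions[i] = {std::cos(angle), std::sin(angle)};
    }
    return positions;
}
```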
If the library is non-deterministic, either find another library which is deterministic, or open up the source code and replace the randomness generator with a PRNG initialized with a fixed seed. I'd recommend the former option though, as other, more advanced libraries should allow you to set edges you want to "ignore" too.
I have read the code of the "SimpleMatrix" MDS library and found that it uses a random permutation matrix to decide the order of points. After fixing the permutation order (just use srand(12345) instead of srand(time(0))), the result for the same data is unchanged.
Obviously there's no exact solution in general to this problem; with just 4 nodes ABCD and distances AB=BC=AC=AD=BD=1 CD=10 you cannot clearly draw a suitable 2D diagram (and not even a 3D one).
What those algorithms do is just place springs between the nodes and then simulate a repulsion/attraction (depending on whether the spring is shorter or longer than the prescribed distance), probably also adding spatial friction to avoid resonance and explosion.
To keep a "stable" diagram just build a solution and then only update the distances, re-using the current position from previous solution as starting point. Picking two fixed nodes and aligning them seems a good idea to prevent a slow drift but I'd say that spring forces never end up creating a rotational momentum and thus I'd expect that just scaling and centering the solution should be enough anyway.
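If aligning on two fixed nodes turns out to be needed, a minimal sketch of that post-processing step could look like the following. It moves node 0 to the origin, rotates the layout so node 1 lies on the positive x axis, and mirrors it if node 2 ends up below the axis, which pins the clockwise/counterclockwise ambiguity. The choice of nodes 0, 1 and 2 as references and the names Vec2 and normalizeLayout are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { double x, y; };

// Normalises a 2D layout in place so repeated runs can be compared:
// node 0 at the origin, node 1 on the positive x axis, node 2 with y >= 0.
void normalizeLayout(std::vector<Vec2>& pos) {
    if (pos.size() < 2) return;

    // Translate so that node 0 sits at the origin.
    const Vec2 origin = pos[0];
    for (Vec2& p : pos) { p.x -= origin.x; p.y -= origin.y; }

    // Rotate by the negative of node 1's angle, bringing it onto the +x axis.
    const double angle = std::atan2(pos[1].y, pos[1].x);
    const double c = std::cos(-angle);
    const double s = std::sin(-angle);
    for (Vec2& p : pos) {
        const double x = p.x * c - p.y * s;
        const double y = p.x * s + p.y * c;
        p = {x, y};
    }

    // Mirror across the x axis if needed, so the orientation is always the same.
    if (pos.size() > 2 && pos[2].y < 0) {
        for (Vec2& p : pos) p.y = -p.y;
    }
}
```

Translation, rotation and mirroring all preserve the distances between nodes, so the layout produced by the MDS step is not distorted.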

Efficient algorithm to deal with big-data network files for computing n nearest nodes

Problem:
I have two network files with me (say NET1 and NET2) - each has a set of nodes with unique ID for each node and geographic coordinates X and Y. Each node in NET2 is to have n connections to NET1 and the ID of n nodes will be determined by the minimum straight line distance. The output will have three fields IDs of node in NET1, NET2 and the distance between them. All the files are in tab delimited format.
One way forward..
One way to implement this is for each node in NET2, we loop through each node in NET1 and compute all NET1-NET2 distance combinations. Sort it by NET2 node id and by distance and write out the first four records for each node. But the problem is there are close to 2 million nodes on NET1, 2000 nodes in NET2 - that is 4 billion distances to be calculated and written in the first step of this algorithm... and the runtime is quite forbidding!
Request:
I was curious if any of you folks out there has faced similar issue. I would love to hear from y'all about any algorithms and data structures that can be used to speed the processing. I know that the scope of this question is very broad but I hope someone can point me the right way as I have very limited experience optimizing codes for data of this scale.
Languages:
I am trying in C++, Python and R.
Please pitch in with ideas! Help greatly appreciated!
kd-tree is one of the options. It allows you to find nearest neighbor (or a set of nearest neighbors) in reasonable time. Of course, you have to build the tree in the beginning and it takes some time. But generally, kd-tree is suitable, if you don't have to add/remove nodes in runtime, which seems to be your case. It also has better performance with lower dimension (in your case the dimension is 2).
Another possible data structure is an octree (a quadtree in 2D). It's a simpler data structure (quite easy to implement), but a kd-tree can be more efficient.
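A minimal sketch of a 2D kd-tree with an n-nearest query, assuming the nodes fit in memory and are not modified after the tree is built. The names Node and KdTree are made up for illustration, and the tree is stored implicitly in a reordered array rather than with pointers; a tested library would be a safer choice for production use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

struct Node { int id; double x, y; };

class KdTree {
public:
    explicit KdTree(std::vector<Node> nodes) : pts_(std::move(nodes)) {
        build(0, pts_.size(), 0);
    }

    // Returns up to n (squared distance, id) pairs, nearest first.
    std::vector<std::pair<double, int>> nearest(double x, double y, std::size_t n) const {
        Heap heap;  // max-heap: the worst candidate kept so far is on top
        if (n > 0) search(0, pts_.size(), 0, x, y, n, heap);
        std::vector<std::pair<double, int>> out;
        for (; !heap.empty(); heap.pop()) out.push_back(heap.top());
        std::reverse(out.begin(), out.end());
        return out;
    }

private:
    using Heap = std::priority_queue<std::pair<double, int>>;

    // Reorders pts_[lo, hi) so the median along `axis` sits at the midpoint,
    // then recurses on both halves with the other axis.
    void build(std::size_t lo, std::size_t hi, int axis) {
        if (hi - lo <= 1) return;
        const std::size_t mid = lo + (hi - lo) / 2;
        std::nth_element(pts_.begin() + lo, pts_.begin() + mid, pts_.begin() + hi,
                         [axis](const Node& a, const Node& b) {
                             return axis == 0 ? a.x < b.x : a.y < b.y;
                         });
        build(lo, mid, 1 - axis);
        build(mid + 1, hi, 1 - axis);
    }

    void search(std::size_t lo, std::size_t hi, int axis, double x, double y,
                std::size_t n, Heap& heap) const {
        if (lo >= hi) return;
        const std::size_t mid = lo + (hi - lo) / 2;
        const Node& p = pts_[mid];
        const double dx = p.x - x, dy = p.y - y;
        const double d2 = dx * dx + dy * dy;
        if (heap.size() < n) {
            heap.emplace(d2, p.id);
        } else if (d2 < heap.top().first) {
            heap.pop();
            heap.emplace(d2, p.id);
        }
        // Signed distance from the query to the splitting line through p.
        const double delta = axis == 0 ? x - p.x : y - p.y;
        const bool goLeftFirst = delta < 0;
        if (goLeftFirst) search(lo, mid, 1 - axis, x, y, n, heap);
        else search(mid + 1, hi, 1 - axis, x, y, n, heap);
        // Visit the far side only if it could still hold a closer point.
        if (heap.size() < n || delta * delta < heap.top().first) {
            if (goLeftFirst) search(mid + 1, hi, 1 - axis, x, y, n, heap);
            else search(lo, mid, 1 - axis, x, y, n, heap);
        }
    }

    std::vector<Node> pts_;
};
```

For the problem above, the tree would be built once over the NET1 nodes, and nearest(x, y, n) called for each NET2 node, taking the square root of each returned distance before writing it out.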

Select all points in a matrix within 30m of another point

So if you look at my other posts, it's no surprise I'm building a robot that can collect data in a forest, and stick it on a map. We have algorithms that can detect tree centers and trunk diameters and can stick them on a cartesian XY plane.
We're planning to use certain 'key' trees as natural landmarks for localizing the robot, using triangulation and trilateration among other methods, but programming this and keeping data straight and efficient is getting difficult using just Matlab.
Is there a technique for sub-setting an array or matrix of points? Say I have 1000 trees stored over 1km (1000m), is there a way to say, select only points within 30m radius of my current location and work only with those?
I would just use a GIS, but I'm doing this in Matlab and I'm unaware of any GIS plugins for Matlab.
I forgot to mention, this code is going online, meaning it's going on a robot for real-time execution. I don't know if, as the map grows to several miles, using a different data structure will help or if calculating every distance to a random point is what a spatial database is going to do anyway.
I'm thinking of mirroring the array of trees, into two arrays, one sorted by X and the other by Y. Then bubble sorting to determine the 30m range in that. I do the same for both arrays, X and Y, and then have a third cross link table that will select the individual values. But I don't know, what that's called, how to program that and I'm sure someone already has so I don't want to reinvent the wheel.
You are looking for a spatial database like a quadtree or a kd-tree. I found two kd-tree implementations here and here, but didn't find any quadtree implementations for Matlab.
The simple solution of calculating all the distances and scanning through seems to run almost instantaneously:
lim = 1;
num_trees = 1000;
trees = randn(num_trees,2); %# list of trees as Nx2 matrix
cur = randn(1,2); %# current point as 1x2 vector
dists = hypot(trees(:,1) - cur(1), trees(:,2) - cur(2)); %# distance from all trees to current point
nearby = trees((dists <= lim),:); %# find the nearby trees, pull them from the original matrix
On a 1.2 GHz machine, I can process 1 million trees (1 MTree?) in < 0.4 seconds.
Are you running the Matlab code directly on the robot? Are you using the Real-Time Workshop or something? If you need to translate this to C, you can replace hypot with sqr(trees[i].x - pos.x) + sqr(trees[i].y - pos.y), and replace the limit check with < lim^2. If you really only need to deal with 1 KTree, I don't know that it's worth your while to implement a more complicated data structure.
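A minimal sketch of that translation, written in C++ rather than plain C, with sqr written out as a multiplication. The names Tree and treesWithin are made up for illustration, and the comparison keeps the <= of the Matlab version.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Tree { double x, y; };

// Returns the indices of all trees within `lim` of `pos`. Comparing squared
// distances against lim^2 avoids computing a square root for every tree.
std::vector<std::size_t> treesWithin(const std::vector<Tree>& trees, Tree pos, double lim) {
    assert(lim >= 0.0);
    const double lim2 = lim * lim;
    std::vector<std::size_t> nearby;
    for (std::size_t i = 0; i < trees.size(); ++i) {
        const double dx = trees[i].x - pos.x;
        const double dy = trees[i].y - pos.y;
        if (dx * dx + dy * dy <= lim2) nearby.push_back(i);
    }
    return nearby;
}
```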
You can transform your cartesian coordinates into polar coordinates with CART2POL. Then selecting points inside a certain radius will be straightforward.
[THETA,RHO] = cart2pol(X-X0,Y-Y0);
selected = RHO < 30;
where X0, Y0 are coordinates of the current location.
My guess is that trees are distributed roughly evenly through the forest. If that is the case, simply use 30x30 (or 15x15) grid blocks as hash keys into a closed hash table. Look up the keys for all blocks intersecting the search circle, and check all hash entries starting at that key until one is flagged as the last in its "bucket."
0---------10---------20---------30--------40---------50----- address # line
(0,0) (0,30) (0,60) (30,0) (30,30) (30,60) hash key values
(1,3) (10,15) (3,46) (24,9.) (23,65.) (15,55.) tree coordinates + "." flag
For example, to get the trees in (0,0)…(30,30), map (0,0) to the address 0 and read entries (1,3), (10,15), reject (3,46) because it's out of bounds, read (24,9), and stop because it's flagged as the last tree in that sector.
To get trees in (0,60)…(30,90), map (0,60) to address 20. Skip (24, 9), read (23, 65), and stop as it's last.
This will be quite memory efficient as it avoids storing pointers, which would otherwise be of considerable size relative to the actual data. Nevertheless, closed hashing requires leaving some empty space.
The illustration isn't "to scale" as in reality there would be space for several entries between the hash key markers. So you shouldn't have to skip any entries unless there are more trees than average in a local preceding sector.
This does use hash collisions to your advantage, so it's not as random as a hash function typically is. (Not every entry corresponds to a distinct hash value.) However, as dense sections of forest will often be adjacent, you should randomize the mapping of sectors to "buckets," so a given dense sector will hopefully overflow into a less dense one, or the next, or the next.
Additionally, there is the issue of empty sectors and terminating iteration. You could insert a dummy tree into each sector to mark it as empty, or some other simple hack.
Sorry for the long explanation. This kind of thing is simpler to implement than to document. But the performance and the footprint can be excellent.
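Use some sort of spatially partitioned data structure. A simple solution would be to create a 2d array of lists, each containing all objects within a 30m x 30m region. You then only need to compare against the objects in the lists whose regions overlap the search circle: at most four lists if the circle's diameter is no larger than a cell, and at most a 3 x 3 block of nine if its radius is.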
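A minimal sketch of such a 2d array of lists, assuming all coordinates lie inside a known rectangle starting at (0, 0) and that the search radius is no larger than the cell size. Under that assumption the sketch scans the 3 x 3 block of cells around the query, since a circle whose radius equals the cell size can reach into each neighbouring cell. The names Tree and TreeGrid are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Tree { double x, y; };

class TreeGrid {
public:
    // Covers [0, width) x [0, height) with square cells of side cellSize.
    TreeGrid(double width, double height, double cellSize)
        : cell_(cellSize),
          cols_(static_cast<int>(std::ceil(width / cellSize))),
          rows_(static_cast<int>(std::ceil(height / cellSize))),
          cells_(static_cast<std::size_t>(cols_) * static_cast<std::size_t>(rows_)) {
        assert(cellSize > 0 && cols_ > 0 && rows_ > 0);
    }

    void insert(const Tree& t) { cells_[index(cellOf(t.x, cols_), cellOf(t.y, rows_))].push_back(t); }

    // Returns all trees within `radius` of (x, y). Requires radius <= cell
    // size, so that only the 3 x 3 block around the query has to be checked.
    std::vector<Tree> query(double x, double y, double radius) const {
        assert(radius >= 0 && radius <= cell_);
        std::vector<Tree> out;
        const int c0 = cellOf(x, cols_);
        const int r0 = cellOf(y, rows_);
        for (int r = r0 - 1; r <= r0 + 1; ++r) {
            for (int c = c0 - 1; c <= c0 + 1; ++c) {
                if (r < 0 || c < 0 || r >= rows_ || c >= cols_) continue;
                for (const Tree& t : cells_[index(c, r)]) {
                    const double dx = t.x - x, dy = t.y - y;
                    if (dx * dx + dy * dy <= radius * radius) out.push_back(t);
                }
            }
        }
        return out;
    }

private:
    // Cell number along one axis, clamped to the grid (coordinates are
    // assumed to lie inside the covered rectangle).
    int cellOf(double v, int count) const {
        const int i = static_cast<int>(std::floor(v / cell_));
        return std::min(std::max(i, 0), count - 1);
    }
    std::size_t index(int c, int r) const {
        return static_cast<std::size_t>(r) * static_cast<std::size_t>(cols_) + static_cast<std::size_t>(c);
    }

    double cell_;
    int cols_, rows_;
    std::vector<std::vector<Tree>> cells_;
};
```

Inserting a new tree only appends to one list, which suits a map that keeps growing while the robot runs.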
Plenty of more complex (and potentially beneficial) solutions could also be used - something like bi-trees are a bit more complex to implement (not by much though), but could get more optimum performance (especially in cases where the density of objects varies considerably).
You could look at the voronoi diagram support in matlab:
http://www.mathworks.com/access/helpdesk/help/techdoc/ref/voronoi.html
If you base the voronoi polygons on your key trees, and cluster the neighbouring trees into those polygons, that would partition your search space by proximity (finding the enclosing polygon for a given non-key point is fast), but ultimately you're going to get down to computing key to non-key distances by pythagoras or trig and comparing them.
For a few thousand points (trees) brute force might be fast enough if you have a reasonable processor on board. Compute the distance of every other tree from tree n, then select those within 30 m. This is the same as having all trees in the same voronoi polygon.
It's been a few years since I worked in GIS, but I found the following useful: 'Computational Geometry in C', Joseph O'Rourke, ISBN 0-521-44592-2 (paperback).