Given a set of points, find the smallest subset of points from which circles of diameter n can be drawn to encompass all points - python-2.7

I've got a list of places with associated lat/lon data (sites). I'm trying to find the fewest bases from which to visit the sites (minimizing travel occurrences). Any ideas? I've mostly been working with Python (2.7.3), but any suggestions/examples are welcome.

This can be viewed as the set cover problem.
Using Wikipedia's terminology, your universe will be the cities. If there are m cities, there will be m sets. The k-th set corresponds to the k-th city and includes all cities within the required travel radius of k, including k itself. The task is to find the smallest number of sets that cover the universe (put another way, the smallest number of cities from which you can reach every city in your universe).
The bad news is that the problem is NP-hard. There are, however, heuristics.
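One of the standard heuristics is the greedy algorithm: repeatedly pick the city whose set covers the most still-uncovered cities, until everything is covered. It runs in polynomial time and is within a factor of ln(m) of optimal. Here is a minimal sketch in Python; the radius_km parameter and the haversine helper are my own illustrative choices, not part of your problem statement:

from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    # great-circle distance in km between two (lat, lon) pairs
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def greedy_bases(sites, radius_km):
    # reach[k] = indices of all sites within the radius of site k
    # (every site covers itself, so the loop below must terminate)
    reach = dict((k, set(j for j, q in enumerate(sites)
                         if haversine_km(p, q) <= radius_km))
                 for k, p in enumerate(sites))
    uncovered = set(range(len(sites)))
    bases = []
    while uncovered:
        # greedy step: the site covering the most still-uncovered sites
        k = max(reach, key=lambda k: len(reach[k] & uncovered))
        bases.append(k)
        uncovered -= reach[k]
    return bases

This is Python 2.7-compatible. Greedy is not guaranteed optimal, but for a few hundred sites it is usually a good starting point.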

Related

Binary Snap [AIO 2015]

This is a question from the Australian Informatics Olympiad
The question is:
Have you ever heard of Melodramia, my friend? It is a land of forbidden forests and boundless swamps, of sprinting heroes and dashing heroines. And it is home to two dragons, Rose and Scarlet, who, despite their competitive streak, are the best of friends.
Rose and Scarlet love playing Binary Snap, a game for two players. The game is played with a deck of cards, each with a numeric label from 1 to N. There are two cards with each possible label, making 2N cards in total. The game goes as follows:
Rose shuffles the cards and places them face down in front of Scarlet.
Scarlet then chooses either the top card, or the second-from-top card from the deck and reveals it.
Scarlet continues to do this until the deck is empty. If at any point the card she reveals has the same label as the previous card she revealed, the cards are a Dragon Pair, and whichever dragon shouts "Snap!" first gains a point.
After many millennia of playing, the dragons noticed that having more possible Dragon Pairs would often lead to a more exciting game. It is for this reason they have summoned you, the village computermancer, to write a program that reads in the order of cards in the shuffled deck and outputs the maximum number of Dragon Pairs that the dragons can find.
I'm not sure how to solve this. I thought of something which is wrong (choosing the maximum over all cards, when each card is compared with its previous occurrence).
Here's my code as of now:
#include <iostream>
#include <fstream>
using namespace std;

int main() {
    ifstream fin("snapin.txt");
    ofstream fout("snapout.txt");
    int n;
    fin >> n;
    int arr[(2 * n) + 1];
    for (int i = 0; i < 2 * n; i++) {
        fin >> arr[i];
    }
    int dp[(2 * n) + 1];
    int maxi = 0;
    int pos[n + 1];
    for (int i = 0; i < n + 1; i++) {
        pos[i] = -1;
    }
    int count = 0;
    for (int i = 2; i < (2 * n) - 2; i++) {
        if (pos[arr[i]] == -1) {
            pos[arr[i]] = i;
        } else {
            dp[i] = pos[arr[i]] + 1;
            maxi = max(dp[i], maxi);
        }
        dp[i] = max(dp[i], maxi);
    }
    fout << dp[2 * n - 1];
}
Ok, let's get some basic measurements of the problem out of the way first:
There are 2N cards. 1 card is drawn at a time, without replacement. Therefore there are 2N draws, taking the deck from size 2N (before the first draw) to size 0 (after the last draw).
The final draw takes place from a deck of size 1, and must take the last remaining card.
The 2N-1 preceding draws take place with deck sizes 2N, 2N-1, ..., 3, 2. For each of these you have a choice between the top two cards: 2N-1 decisions, each with 2 possibilities.
The brute force search space is therefore 2^(2N-1).
That is exponential growth, every optimization scientist's favorite sort of challenge.
If N is small, say 20, the brute force method needs to search "only" a trillion possibilities, which you can get done in a few thousand seconds on a readily available PC that does a few billion operations per second (each solution takes more than one CPU instruction to check).
If N is not quite as small, perhaps 100, the brute force method is akin to breaking the encryption on government secrets.
Not happy with the brute force approach then? I'm not either.
Before we get to the optimal solution, let’s take a break to explore what the Markov assumption is and what it means for us. It shows up in different fields using different verbiage, but I’ll just paraphrase it in a way that is particularly useful for this problem involving gameplay choices:
Markov Assumption
A process is Markov if and only if the choices available to you in the future depend only on what you have now, and not on how you got it.
A bad but often used real-world example is the stock market. Not only do taxation differences between short-term and long-term capital gains make history important in a small way, but investors do trend analysis and remember what stocks have done before, which affects future behavior in a big way.
A better example, especially for StackOverflow, is that of Turing machines and computer processors. What your program does next depends on the current instruction pointer and the contents of memory, but not the history of memory that’s since been overwritten. But there are many more. As we’ll see shortly, the Binary Snap problem can be formulated as Markov.
Now let’s talk about what makes the Markov assumption so important. For that, we’ll use the Travelling Salesman Problem. No, the Travelling International Salesman Problem. Still too messy. Let’s try the “Travelling International Salesman with a Single-Entry Visa Problem”. But we’ll go through all three of them briefly:
Travelling Salesman Problem
A salesman has to visit potential buyers in N cities. Plan an itinerary for the salesman which minimizes the total cost of visiting all N cities (variations: at least once / exactly once), given a matrix a(j,k) which is the cost of travel from city j to city k.
Another variation is whether the starting city is predetermined or not.
Travelling International Salesman Problem
The cities the salesman needs to visit are split between two (or more) nations. A subset of the cities have border crossings and have travel options to all cities. The other cities can only reach cities which are either in the same country or are border-equipped.
Alternatively, instead of cities along the border, use cities with international airports. Won’t make a difference in the end.
The cost matrix for this problem looks rather like the flag of the Dominican Republic. Travel between interior cities of country A is permitted, as is travel between interior cities of country B (blue fields). Border cities connect with interior and border cities in both countries (white cross). And direct travel between an interior city of country A and one of country B is impossible (red areas).
Travelling International Salesman with a Single-Entry Visa
Now not only does the salesman need to visit cities in both countries, but he can only cross the border once.
(For travel fanatics, assume he starts in a third country and has single-entry visas for both countries, so he can’t visit some of A, all of B, then return to A for the rest).
Let’s look at an extremely simple case first: Only one border city. We’ll use one additional trick, the one from proof by induction: We assume that all problems smaller than the current one can be solved.
It should be fairly obvious that the Markov assumption holds when the salesman reaches the border city. No matter what path he took through country A, he has exactly the same choice of paths through country B.
But there’s a really important point here: Any path through country A ending at the border and any path through country B starting at the border, can be combined into a feasible full itinerary. If we have two full itineraries x and y, and x spent more money in country A than y did, then even if x has a lower total cost than the total cost of y, we can plan a path better than both, using the portion of y in country A and the portion of x in country B. I’m going to call that “splicing”. The Markov assumption lets us do it, by making all roads leading to the border interchangeable!
In fact, we can look just at the cities of country A, pick the best of all routes to the border, and forget about all the other options as soon as (in our plan) the salesman steps across into B.
This means instead of having factorial(N_A) * factorial(N_B) routes to look at, there’s only factorial(N_A) + factorial(N_B). Which is pretty much factorial(N_A) times better. Wow, is this Markov thing helpful or what?
Ok, that was too easy. Let’s mess it all up by having N_AB border cities instead of just one. Now if I have a path x which costs less in country B and a path y which costs less in country A, but they cross the border in different cities, I can’t just splice them together. So I have to keep track of all the paths through all the cities again, right?
Not exactly. What if, instead of throwing away all the paths through country A except the best y path, I actually keep one path ending in each border city (the lowest cost of all paths ending in the same border city). Now, for any path x I look at in country B, I have a path y_endpt(x) that uses the same border city, to splice it with. So I have to solve the country A and country B partitions each N_AB times to find the best splice of a complete itinerary, for total work of N_AB * factorial(N_A) + N_AB * factorial(N_B), which is still way better than factorial(N_A) * factorial(N_B).
Enough development of tools. Let’s get back to the dragons, since they are subtle and quick to anger and I don’t want to be eaten or burnt to a crisp.
I claim that at any step T of the Binary Snap game, if we consider our “location” a pair of (card just drawn, card on top of deck), the Markov assumption will hold. These are the only things that determine our future options. All the cards below the top one in the deck must be in the same order no matter what we did before. And for knowing whether to count a Snap! with the next card, we need to know the last one taken. And that’s it!
Furthermore, there are N possible labels on the card last drawn, and N possible for the top card on the deck, for a total of N^2 “border cities”. As we start playing the game, there are two choices on the first turn, two on the second, two on the third, so we start out with 2^T possible game states (and a count of Snap!s for each). But by the pigeonhole principle, when 2^T > N^2, some of these plays must end in exactly the same game state (“border city”) as each other, and when that happens, we only need to keep the "dominating" one that got the best score on the way there.
Final complexity bound: 2*N timesteps, from no more than N^2 game states, with 2 draw choices at each, equals an upper limit of 4*N^3 simulated draws.
And that means the same trillion calculations that allowed us to do N=20 with the brute force method, now permit right around N=8000.
That makes the dragons happy, which makes us alive and well.
Implementation note: Since the challenge didn’t ask for the order of draws, but just the highest attainable number of snaps, all the data you need to keep track of, in addition to the initial ordering of the cards, is the time T and a 2-dimensional array (N rows, N columns) holding the best score with which you can reach each state at time T.
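To make that concrete, here is a memoized sketch in Python (Python just to keep one example language on this page; the question's code is C++). The (i, held, last) encoding and all names are mine: i is the index of the next never-seen card, held is the label of a skipped card now sitting on top (None if the top is simply deck[i]), and last is the previous draw, so (held, last) at a given position carries the same information as the (card just drawn, card on top) pair above:

import sys

def max_dragon_pairs(deck):
    total = len(deck)                       # 2N cards
    sys.setrecursionlimit(4 * total + 100)  # sketch only; iterate for large N
    memo = {}

    def best(i, held, last):
        if i == total:
            # only the held card (if any) remains, and it must be drawn
            return 0 if held is None else int(held == last)
        key = (i, held, last)
        if key in memo:
            return memo[key]
        if held is None:
            # take the top card deck[i] ...
            r = int(deck[i] == last) + best(i + 1, None, deck[i])
            if i + 1 < total:
                # ... or take the second card, leaving deck[i] on top
                r = max(r, int(deck[i + 1] == last)
                           + best(i + 2, deck[i], deck[i + 1]))
        else:
            # take the held card on top ...
            r = int(held == last) + best(i, None, held)
            # ... or take the second card, which is deck[i]
            r = max(r, int(deck[i] == last) + best(i + 1, held, deck[i]))
        memo[key] = r
        return r

    return best(0, None, None)

There are O(N^2) (held, last) pairs per position and two choices at each, matching the 4*N^3 bound above.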
Real world applications: If you take this approach and apply it to a digital radio (fixed uniform bit timing, discrete signal levels) receiving a signal using a convolutional error-correcting code, you have the Viterbi decoder. If you apply it to acquired medical data, with variable timing intervals and continuous signal levels, and add some other gnarly math, you get my doctoral project.

Finding the Distance Between Two Lines that represent GPS routes (MATLAB, Java, C++, or Python)

I have been researching and trying to figure this one out to no avail. I have found many ways not to solve this...
The gist of the problem: I am looking for a method to calculate the deviance from an original path traveled by way of GPS coordinates. I have multiple csv files that contain latitude, longitude, and UTC time. I have created KML files from this information for a visual viewing of the deviance and now would like to put a value on this deviation. I have chosen a route as a reference and would like to measure the other routes against the reference route. There are multiple routes, each having its own reference route, each of which has many runs. No two runs are the same, and some of the routes deviate more than the next. I cannot use time, only lat and lon, since the runs were completed over many weeks of data collection.
What I have tried thus far:
Haversine and Equirectangular formulas (looping through and measuring point to point).
Outcome: The coordinates only line up for a short period of time and the difference in the number of points varies greatly.
Area under each curve: was going to find the difference of the two routes by this method.
Outcome: Really unsure how to proceed, nor find equations suitable for this calculation.
There were a couple more feeble attempts, but have been working on this for a few weeks now, with not much to show for and still unsure on how to proceed.
Any help or ideas would be greatly appreciated.
Possible solution 1: Instead of calculating the "sideways" deviation between the two routes, just compare the respective arc lengths (Matlab: arclength).
Possible solution 2: To compare two routes, each going from the same start A to the same end point B: Draw a straight line between A and B, place a number of equidistant points along AB, and then average the perpendicular distance from these points on AB to the paths you want to compare. The absolute difference between the cumulative deviations from the straight-line reference is your deviation.
Possible solution 3: Calculate the arc length of each route. Place a number of equidistant points along each route. Average the distance between these points.
Both solutions 2 and 3 will depend on the number of points you place, but with a higher number of points the average deviation will converge. Note that these solutions are both related to calculating the area under each curve.
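For instance, here is a sketch of solution 3 in Python/NumPy, assuming the lat/lon tracks have already been projected to a local planar x/y frame (an equirectangular approximation is fine over a few km); resample and mean_deviation are illustrative names:

import numpy as np

def resample(path, n):
    # place n equidistant points along a polyline given as an (m, 2) array
    seg = np.hypot(*np.diff(path, axis=0).T)     # segment lengths
    s = np.concatenate(([0.0], np.cumsum(seg)))  # cumulative arc length
    t = np.linspace(0.0, s[-1], n)
    return np.column_stack((np.interp(t, s, path[:, 0]),
                            np.interp(t, s, path[:, 1])))

def mean_deviation(route, reference, n=200):
    # solution 3: average distance between corresponding resampled points
    a = resample(np.asarray(route, dtype=float), n)
    b = resample(np.asarray(reference, dtype=float), n)
    return np.hypot(*(a - b).T).mean()

Because both routes are resampled to the same number of points, the mismatch in the number of raw GPS fixes per run no longer matters.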

How to solve binary labeling with graph cut?

I have 32 segments of the overlapped regions of two images. I have to assign each of the segments to one of the two images based on lowest cost. So it is a binary labeling problem, and the energy minimization function is shown above.
L is a vector of length 32 (equal to the number of segments), and the value of each element depends on its index, which corresponds to the segment number. Say, if the 3rd segment is assigned to image 1, then L(2) = 0, and if the 14th segment is assigned to image 2, then L(13) = 1. That is, each L[x] is either 0 or 1, so there are 2^32 possible assignments of L. I could compute E(L) for each combination and, after performing 2^32 calculations, take the minimum E(L) and use that combination. This is what my intuition suggests, but it is impractical because the complexity is exponential.
But, many literatures suggest this binary labeling problem can be solved as a graph cut problem with max flow/min cut algorithm. But, how do I formulate this problem as max flow/min cut problem? The 32 segments are the nodes of the graph, but what would be the weight of the edges? And what would be the capacity?
The formulation as a graph theory problem, and a proof of the "if and only if" relationship, can be found in "What Energy Functions Can Be Minimized via Graph Cuts?" by Vladimir Kolmogorov and Ramin Zabih.
The key idea is to construct a directed edge between i and j of weight V_ij(0,1) + V_ij(1,0) - V_ij(0,0) - V_ij(1,1).
If V_ij(1,0) - V_ij(0,0) > 0, you also need to construct a directed edge between the source and i of weight V_ij(1,0) - V_ij(0,0).
Otherwise you need to construct a directed edge between i and the destination of weight V_ij(0,0) - V_ij(1,0).
Similarly, if V_ij(0,1) - V_ij(0,0) > 0, you also need to construct a directed edge between the source and j of weight V_ij(0,1) - V_ij(0,0).
Otherwise you need to construct a directed edge between j and the destination of weight V_ij(0,0) - V_ij(0,1).
Note that the min-cut of this graph will be offset by V(0,0) minus the sum of weights on edges connecting to the destination.
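Here is a sketch of that construction in Python with networkx; the s/t node names and the pairwise[(i, j)] input format are mine, and the recipe above covers only the pairwise terms (per-segment data terms would add further terminal edges in the same way). Note the i -> j weight must be non-negative, i.e. the energy must be submodular:

import networkx as nx

def min_cut_labels(num_segments, pairwise):
    # pairwise[(i, j)] = 2x2 table v with v[a][b] = V_ij(a, b)
    g = nx.DiGraph()
    g.add_nodes_from(range(num_segments))
    g.add_nodes_from(['s', 't'])

    def add(u, v, w):   # keep only positive weights, accumulating capacity
        if w > 0:
            if g.has_edge(u, v):
                g[u][v]['capacity'] += w
            else:
                g.add_edge(u, v, capacity=w)

    for (i, j), v in pairwise.items():
        # the directed edge between i and j
        add(i, j, v[0][1] + v[1][0] - v[0][0] - v[1][1])
        # terminal edges for i: at most one of these weights is positive
        add('s', i, v[1][0] - v[0][0])
        add(i, 't', v[0][0] - v[1][0])
        # terminal edges for j
        add('s', j, v[0][1] - v[0][0])
        add(j, 't', v[0][0] - v[0][1])

    cut_value, (source_side, _) = nx.minimum_cut(g, 's', 't')
    # which side of the cut means label 0 or 1 depends on the paper's
    # sign conventions -- verify against Kolmogorov & Zabih before relying on it
    return cut_value, [0 if i in source_side else 1
                       for i in range(num_segments)]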

Minimizing Sum of Distances: Optimization Problem

The actual question goes like this:
McDonald's is planning to open a number of joints (say n) along a straight highway. These joints require warehouses to store their food. A warehouse can store food for any number of joints, but has to be located at one of the joints only. McD has a limited number of warehouses (say k) available, and wants to place them in such a way that the average distance of joints from their nearest warehouse is minimized.
Given an array (n elements) of coordinates of the joints and an integer 'k', return an array of 'k' elements giving the coordinates of the optimal positioning of warehouses.
Sorry, I don't have any examples available since I'm writing this down from memory. Anyway, one sample could be:
array={1,3,4,5,7,7,8,10,11} (n=9)
k=1
Ans: {7}
This is what I've been thinking: For k=1, we can simply find out the median of the set, which would give the optimal location of the warehouse. However, for k>1, the given set should be divided into 'k' subsets (disjoint, and of contiguous elements of the superset), and median for each subset would give the warehouse locations. However, I don't understand on what basis the 'k' subsets should be formed. Thanks in advance.
EDIT: There's a variation to this problem also: Instead of sum/avg, minimize the maximum distance between a joint and its closest warehouse. I don't get this either..
The straight highway makes this an exercise in dynamic programming, working from left to right along the highway. A partial solution can be described by the location of the rightmost warehouse and the number of warehouses placed. The cost of the partial solution will be the total distance to the nearest warehouse (for fixed k, minimising this is the same as minimising the average) or, for the variation, the maximum distance so far to the closest warehouse.
At each stage you have worked out the answers for the leftmost N joints, indexed by the number of warehouses used and the position of the rightmost warehouse - you need to save only the best cost. Now consider the next joint and work out the best solutions for N+1 joints, for all possible values of k and of the rightmost warehouse, using the stored answers for N joints to speed this up. Once you have the best-cost solution covering all the joints, you know where its rightmost warehouse is, which gives you the location of one warehouse. Go back to the stored solution that had that warehouse as its rightmost and find out which solution it was based on. That gives you one more rightmost warehouse - and so you can work your way back to the locations of all the warehouses in the best solution.
I tend to get the cost of working this out wrong, but with N joints and k warehouses to place you have N steps to take, each based on considering no more than Nk previous solutions, so I reckon the cost is about kN^2.
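Here is a sketch of that DP in Python (assuming k <= n; building the group-cost table is cubic as written, and prefix sums would make it quadratic; all names are mine):

def place_warehouses(joints, k):
    xs = sorted(joints)
    n = len(xs)
    # cost[i][j] = total distance if joints i..j share one warehouse,
    # which then sits at their median
    cost = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            med = xs[(i + j) // 2]
            cost[i][j] = sum(abs(x - med) for x in xs[i:j + 1])
    INF = float('inf')
    # dp[m][j] = least total distance for the first j joints, m warehouses
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0
    for m in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(j):   # joints i..j-1 form the rightmost group
                c = dp[m - 1][i] + cost[i][j - 1]
                if c < dp[m][j]:
                    dp[m][j], back[m][j] = c, i
    # walk the stored choices back to recover the warehouse positions
    res, j = [], n
    for m in range(k, 0, -1):
        i = back[m][j]
        res.append(xs[(i + j - 1) // 2])
        j = i
    return sorted(res)

On the sample above, place_warehouses([1, 3, 4, 5, 7, 7, 8, 10, 11], 1) returns [7]. For the min-max variation, replace the sum in the cost table with each group's maximum distance to its best warehouse and combine groups with max instead of +.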
This is NOT a clustering problem, it's a special case of a facility location problem. You can solve it using a general integer / linear programming package, but because the problem is on a line, there may be more efficient (and less expensive software-wise) algorithms that would work. You might consider dynamic programming since there are probably combination of facilities that could be eliminated rather quickly. Look into the P-Median problem for more info.

Select all points in a matrix within 30m of another point

So if you look at my other posts, it's no surprise I'm building a robot that can collect data in a forest, and stick it on a map. We have algorithms that can detect tree centers and trunk diameters and can stick them on a cartesian XY plane.
We're planning to use certain 'key' trees as natural landmarks for localizing the robot, using triangulation and trilateration among other methods, but programming this and keeping data straight and efficient is getting difficult using just Matlab.
Is there a technique for sub-setting an array or matrix of points? Say I have 1000 trees stored over 1km (1000m), is there a way to say, select only points within 30m radius of my current location and work only with those?
I would just use a GIS, but I'm doing this in Matlab and I'm unaware of any GIS plugins for Matlab.
I forgot to mention, this code is going online, meaning it's going on a robot for real-time execution. I don't know if, as the map grows to several miles, using a different data structure will help or if calculating every distance to a random point is what a spatial database is going to do anyway.
I'm thinking of mirroring the array of trees into two arrays, one sorted by X and the other by Y, then scanning to determine the 30 m range in each. I'd do the same for both arrays, X and Y, and then have a third cross-link table to select the individual values. But I don't know what that's called or how to program it, and I'm sure someone already has, so I don't want to reinvent the wheel.
(Figures: Cartesian plane, GIS)
You are looking for a spatial database like a quadtree or a kd-tree. I found two kd-tree implementations here and here, but didn't find any quadtree implementations for Matlab.
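For comparison, in Python the whole radius query is one call with SciPy's kd-tree (scipy.spatial.cKDTree and its query_ball_point method; the random data here just stands in for your tree map):

import numpy as np
from scipy.spatial import cKDTree

xy = np.random.uniform(0, 1000, (1000, 2))  # stand-in for 1000 tree positions
kd = cKDTree(xy)
idx = kd.query_ball_point([500.0, 500.0], r=30.0)  # indices within 30 m
nearby = xy[idx]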
The simple solution of calculating all the distances and scanning through seems to run almost instantaneously:
lim = 1;
num_trees = 1000;
trees = randn(num_trees,2); %# list of trees as Nx2 matrix
cur = randn(1,2); %# current point as 1x2 vector
dists = hypot(trees(:,1) - cur(1), trees(:,2) - cur(2)); %# distance from all trees to current point
nearby = trees((dists <= lim),:); %# find the nearby trees, pull them from the original matrix
On a 1.2 GHz machine, I can process 1 million trees (1 MTree?) in < 0.4 seconds.
Are you running the Matlab code directly on the robot? Are you using the Real-Time Workshop or something? If you need to translate this to C, you can replace hypot with the squared distance (trees[i].x - pos.x)*(trees[i].x - pos.x) + (trees[i].y - pos.y)*(trees[i].y - pos.y), and compare it against lim*lim. If you really only need to deal with 1 KTree, I don't know that it's worth your while to implement a more complicated data structure.
You can transform your Cartesian coordinates into polar coordinates with CART2POL. Then selecting points inside a certain radius will be straightforward.
[THETA,RHO] = cart2pol(X-X0,Y-Y0);
selected = RHO < 30;
where X0, Y0 are coordinates of the current location.
My guess is that trees are distributed roughly evenly through the forest. If that is the case, simply use 30x30 (or 15x15) grid blocks as hash keys into a closed hash table. Look up the keys for all blocks intersecting the search circle, and check all hash entries starting at that key until one is flagged as the last in its "bucket."
0---------10---------20---------30--------40---------50----- address # line
(0,0) (0,30) (0,60) (30,0) (30,30) (30,60) hash key values
(1,3) (10,15) (3,46) (24,9.) (23,65.) (15,55.) tree coordinates + "." flag
For example, to get the trees in (0,0)…(30,30), map (0,0) to the address 0 and read entries (1,3), (10,15), reject (3,46) because it's out of bounds, read (24,9), and stop because it's flagged as the last tree in that sector.
To get trees in (0,60)…(30,90), map (0,60) to address 20. Skip (24, 9), read (23, 65), and stop as it's last.
This will be quite memory efficient as it avoids storing pointers, which would otherwise be of considerable size relative to the actual data. Nevertheless, closed hashing requires leaving some empty space.
The illustration isn't "to scale" as in reality there would be space for several entries between the hash key markers. So you shouldn't have to skip any entries unless there are more trees than average in a local preceding sector.
This does use hash collisions to your advantage, so it's not as random as a hash function typically is. (Not every entry corresponds to a distinct hash value.) However, as dense sections of forest will often be adjacent, you should randomize the mapping of sectors to "buckets," so a given dense sector will hopefully overflow into a less dense one, or the next, or the next.
Additionally, there is the issue of empty sectors and terminating iteration. You could insert a dummy tree into each sector to mark it as empty, or some other simple hack.
Sorry for the long explanation. This kind of thing is simpler to implement than to document. But the performance and the footprint can be excellent.
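For what it's worth, here is a toy sketch of that layout in Python. The real footprint win comes from a flat numeric array in C or Matlab rather than a Python list, this version terminates a bucket's run at the first empty slot instead of using the "." end-flag, and the bucket size and salt are illustrative:

import random

class SectorHash(object):
    def __init__(self, num_buckets, slots_per_bucket, cell=30.0, seed=42):
        self.cell = cell
        self.stride = slots_per_bucket
        self.size = num_buckets * slots_per_bucket
        self.table = [None] * self.size   # one flat, mostly-full array
        # randomize the sector -> bucket mapping, as suggested above
        self.salt = random.Random(seed).randrange(1 << 30)

    def _home(self, gx, gy):
        b = (hash((gx, gy)) ^ self.salt) % (self.size // self.stride)
        return b * self.stride

    def insert(self, x, y):
        i = self._home(int(x // self.cell), int(y // self.cell))
        while self.table[i] is not None:  # linear probe past full slots
            i = (i + 1) % self.size
        self.table[i] = (x, y)

    def sector(self, gx, gy):
        # scan from the home slot; the first empty slot ends the run, and
        # entries that overflowed in from other sectors are rejected by
        # recomputing their sector, as in the walkthrough above
        i, out = self._home(gx, gy), []
        while self.table[i] is not None:
            x, y = self.table[i]
            if int(x // self.cell) == gx and int(y // self.cell) == gy:
                out.append((x, y))
            i = (i + 1) % self.size
        return out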
Use some sort of spatially partitioned data structure. A simple solution would be to simply create a 2d array of lists containing all objects within a 30m x 30m region. Worst case is then that you only need to compare against the objects in four of those lists.
Plenty of more complex (and potentially beneficial) solutions could also be used - something like bi-trees is a bit more complex to implement (not by much, though) but could give better performance, especially in cases where the density of objects varies considerably.
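A sketch of the simple grid-of-lists version in Python (cell size and names are illustrative; the same structure maps directly onto Matlab cell arrays or containers.Map):

from collections import defaultdict
from math import hypot

CELL = 30.0  # grid cell edge length, metres

def build_grid(trees):
    # grid[(gx, gy)] = list of (x, y) tree positions in that cell
    grid = defaultdict(list)
    for (x, y) in trees:
        grid[(int(x // CELL), int(y // CELL))].append((x, y))
    return grid

def nearby(grid, cx, cy, radius=30.0):
    # with radius == CELL the search circle can only touch the 3x3 block
    # of cells around the query point; filter those by true distance
    gx, gy = int(cx // CELL), int(cy // CELL)
    return [(x, y)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            for (x, y) in grid.get((gx + dx, gy + dy), ())
            if hypot(x - cx, y - cy) <= radius]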
You could look at the voronoi diagram support in matlab:
http://www.mathworks.com/access/helpdesk/help/techdoc/ref/voronoi.html
If you base the voronoi polygons on your key trees, and cluster the neighbouring trees into those polygons, that would partition your search space by proximity (finding the enclosing polygon for a given non-key point is fast), but ultimately you're going to get down to computing key to non-key distances by pythagoras or trig and comparing them.
For a few thousand points (trees) brute force might be fast enough if you have a reasonable processor on board. Compute the distance of every other tree from tree n, then select those within 30 m. This is the same as having all trees in the same voronoi polygon.
It's been a few years since I worked in GIS, but I found the following useful: "Computational Geometry in C", Joseph O'Rourke, ISBN 0-521-44592-2, paperback.