How can I remove too-close points from a list - python-2.7

I have a list of points with x,y coordinates:
List_coord = [(462, 435), (491, 953), (617, 285), (657, 378)]
The list length (4 elements here) can be large: from a few hundred up to 35,000 elements.
I want to remove from this list the points that are closer together than a given threshold.
Note: points are never at exactly the same position.
My current code for that:
iteration = 0          # threshold is assumed to be defined elsewhere
while iteration < 5:
    for pt in List_coord:
        for PT in List_coord:
            # skip the point itself (distance 0), then compare the x and y distances
            if (abs(pt[0]-PT[0]) + abs(pt[1]-PT[1])) != 0 and abs(pt[0]-PT[0]) < threshold and abs(pt[1]-PT[1]) < threshold:
                List_coord.remove(PT)
    iteration = iteration + 1
Explanation of my terrible code :) :
I check whether the distance is 0: if it is, it means I am comparing the same point with itself.
Then I check the distance in x and in y.
Iteration:
I need a few passes to avoid missing a removal, because the list changes inside the loop itself...
This code works, but it is a very slow process!
I am sure there is a much easier method, but I wasn't able to find one, even though some already-answered questions are close to mine.
Note: I would like to avoid using an extra library for this code, if possible.

Python will be a bit slow at this ;-)
The solution you will probably want is called quad-trees, but I'll mention a simpler approach first, in case it's preferable.
The usual approach is to group the points so that you can easily reject points that are clearly far away from each other.
One approach might be to sort the list twice, once by x once by y. You can prove that if two points are too-close, they must be close in one dimension or the other. Thus your inner loop can break out early. If it sees a point that is too far away from the outer point in the sorted direction, it can know for a fact that all future points in that list are also too far away. Thus it doesn't have to look any further. Do this in X and Y and you're set!
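Here is a minimal sketch of that idea in Python, sorting on x only (which is already enough for correctness; the degenerate case discussed next is why a second pass on y can help). The function name and the closeness test (|dx| < threshold and |dy| < threshold, matching the question's code) are just illustrative choices:

def remove_close_sorted(points, threshold):
    pts = sorted(points)          # sorts by x first, then y
    kept = []
    for p in pts:
        too_close = False
        # walk backwards over already-kept points; once dx >= threshold,
        # every earlier kept point is even farther away in x, so stop.
        for q in reversed(kept):
            if p[0] - q[0] >= threshold:
                break
            if abs(p[1] - q[1]) < threshold:
                too_close = True
                break
        if not too_close:
            kept.append(p)
    return kept

With the example list, remove_close_sorted(List_coord, 120) keeps (462, 435), (491, 953) and (617, 285) and drops (657, 378), which is within 120 of (617, 285) in both x and y.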
This approach is going to tend to be dominated by the O(n log n) sort times. However, if all of your points share a single x value, you'll end up doing the same slow O(n^2) iteration that you're doing right now because you never terminate the inner loop early.
The more robust solution is to use quadtrees. Quadtrees are designed to solve the kind of problem you are looking at. The idea is to build a tree such that you can rapidly exclude large numbers of points. I'd recommend this.
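As a rough illustration only (the class and helper names here are invented, and the same |dx| < threshold and |dy| < threshold closeness test as the question's code is assumed), a minimal point quadtree applied to this particular thinning job might look like the following; a real implementation would want tuning:

class QuadTree(object):
    def __init__(self, x0, y0, x1, y1, capacity=8):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []
        self.children = None            # four sub-quadrants once this node splits

    def _child_for(self, p):
        x0, y0, x1, y1 = self.bounds
        xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        return self.children[(1 if p[0] >= xm else 0) + (2 if p[1] >= ym else 0)]

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        self.children = [QuadTree(x0, y0, xm, ym, self.capacity),
                         QuadTree(xm, y0, x1, ym, self.capacity),
                         QuadTree(x0, ym, xm, y1, self.capacity),
                         QuadTree(xm, ym, x1, y1, self.capacity)]
        for p in self.points:
            self._child_for(p).insert(p)
        self.points = []

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
        else:
            self.points.append(p)
            if len(self.points) > self.capacity:
                self._split()

    def has_point_near(self, p, r):
        x0, y0, x1, y1 = self.bounds
        # prune whole quadrants that cannot contain a point within r of p
        if p[0] < x0 - r or p[0] > x1 + r or p[1] < y0 - r or p[1] > y1 + r:
            return False
        for q in self.points:
            if abs(p[0] - q[0]) < r and abs(p[1] - q[1]) < r:
                return True
        if self.children is not None:
            return any(c.has_point_near(p, r) for c in self.children)
        return False

def remove_close_quadtree(points, threshold):
    tree = QuadTree(min(p[0] for p in points), min(p[1] for p in points),
                    max(p[0] for p in points) + 1, max(p[1] for p in points) + 1)
    kept = []
    for p in points:
        if not tree.has_point_near(p, threshold):
            tree.insert(p)
            kept.append(p)
    return kept

The query only descends into quadrants whose bounding boxes come within threshold of the point, which is where the large-scale exclusion comes from.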
If your number of points gets too large, I'd recommend getting a clustering library. Efficient clustering is a very difficult task, and often done in C++ or another fast language.

Related

What's the most efficient way to store a subset of column indices of a big matrix in C++?

I am working with a very big matrix X (say, 1,000-by-1,000,000). My algorithm goes as follows:
Scan the columns of X one by one and, based on some filtering rules, identify only the subset of columns that are needed. Denote the subset of column indices by S. Its size depends on the filter, so it is unknown before the computation and will change if the filtering rules change.
Loop over S and do some computation with column x_i for each i in S. This step needs to be parallelized with OpenMP.
Repeat steps 1 and 2 100 times with changed filtering rules, defined by a parameter.
I am wondering what the best way is to implement this procedure in C++. Here are two ways I can think of:
(a) Use a 0-1 array (of length 1,000,000) to indicate the needed columns in step 1 above; then in step 2, loop from 1 to 1,000,000, check the indicator with an if-else, and do the computation if the indicator is 1 for that column;
(b) Use std::vector for S and push_back the column index whenever a column is identified as needed; then loop only over S, each time extracting a column index from S and doing the computation. (I considered this approach, but I have read that push_back can be expensive, even when just storing integers.)
Since my algorithm is very time-consuming, I assume that a small time saving in this basic step would mean a lot overall. So my question is: should I try (a), (b), or some even better way, for better performance (and for working with OpenMP)?
Any suggestions/comments for achieving a better speedup are much appreciated. Thank you very much!
To me, it seems that step #1 really does not matter much: at the end of the day, you are going to wind up with a set of columns, however it is represented.
What is really going to matter is what happens when you unleash the parallelized step #2.
An array of ones and zeros, however large, should be fairly simple to parallelize over, while a more 'advanced' data structure might well, in this case, just get in the way.
One thousand megabits, these days? Sure, done, no problem (and if not, a simple array of bit-sets). However many simultaneously executing entities you have should be able to navigate such a data structure in parallel with a minimum of conflict. Therefore, my gut says: big bit-sets win.
I think you will find std::vector easier to use. Regarding push_back, the cost is incurred when the vector reallocates (and possibly copies) its data. To avoid that (if it matters), you could reserve capacity for 1,000,000 elements up front with vector::reserve. Your vector is then about 8 MB, insignificant compared to your problem size, and only an order of magnitude or two bigger than a bitmap would be, while being a lot simpler to deal with: if we call your vector S, then your access to the ith interesting column is just x[S[i]].
(Based on my gut feeling) I'd probably go for pushing back into a vector, but the answer is quite simple: Measure both methods (they are both trivial to implement). Most likely you won't see a noticeable difference.

Find the closest point out of a vector of points

I have a vector of pointers to a very simple Point class:
class Point {
public:
    float x;
    float y;
    float z;
};
How do I find the closest object to a reference point using the STL?
Do I need to sort the vector first, or is there a more efficient way?
Sorting takes O(n log n), so it's not very efficient. You can do it in O(n) by just iterating through the elements and remembering the best match.
Using for_each from <algorithm>, you can define a function that keeps track of the closest elements and completes in O(n).
Or, you can probably even use min_element, also from <algorithm>.
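To illustrate the single-pass idea (sketched in Python for consistency with the rest of this page rather than in C++; std::min_element with a custom comparator expresses the same thing), assuming the x, y, z members of the Point class above:

def closest_point(points, ref):
    # one pass, keeping the best match so far; squared distance is enough for comparisons
    def dist2(p):
        return (p.x - ref.x) ** 2 + (p.y - ref.y) ** 2 + (p.z - ref.z) ** 2
    return min(points, key=dist2)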
The basic question here is how often you'll be doing queries against a single set of points.
If you're going to find one nearest point in the set one time, then @Luchian is right: you might as well leave the points unsorted and do a linear search to find the right point.
If you'll do a relatively large number of queries against the same set of points, it's worthwhile to organize the point data to improve query speed. @izomorphius has already mentioned a k-d tree, and that's definitely a good suggestion. Another possibility (admittedly, quite similar) is an oct-tree. Between the two, I find an oct-tree quite a bit easier to understand. In theory, a k-d tree should be slightly more efficient (on average), but I've never seen much difference -- though perhaps with different data the difference would become significant.
Note, however, that building something like a k-d tree or oct-tree isn't terribly slow, so you don't need to do an awful lot of queries against a set of points to justify building one. One query clearly doesn't justify it, and two probably won't either -- but contrary to what Luchian implies, O(N log N) (just for example) isn't really very slow. Roughly speaking, log(N) is the number of digits in the number N, so O(N log N) isn't really a whole lot slower than O(N). That, in turn, means you don't need a particularly large number of queries to justify organizing the data to speed up each one.
You cannot go faster than a linear scan if all you know is that the points are in a vector. However, if you have additional knowledge, a lot can be improved. For instance, if you know all the points are ordered and lie on the same line, there is a logarithmic solution.
Also there are better data structures to solve your problem for instance a k-d tree. It is not part of the STL - you will have to implement it yourself but it is THE data structure to use to solve the problem you have.
You can try to use a quadtree (http://en.wikipedia.org/wiki/Quadtree) or something similar.

Efficient Longest arithmetic progression for a set of linear Points

The longest arithmetic progression of a set of numbers {a1, a2, a3, ..., an} is defined as a subset {b1, b2, b3, ..., bk} such that b_{i+1} - b_i is constant.
I would like to extend this problems to a set of two dimensional points lying on a straight line.
Let's define Dist(P1, P2), the distance between two points P1(X1, Y1) and P2(X2, Y2) on a line, as
Dist(P1, P2) = Dist((X1, Y1), (X2, Y2)) = (X2 - X1)² + (Y2 - Y1)²
Now, for a given set of points, I need to find the longest arithmetic progression such that Dist(Pi, Pi+1) is constant, assuming they all lie on the same line (m and c are constant).
I have researched a bit but could not figure out an algorithm better than O(n²).
In fact, what I currently do is maintain a dictionary, say
DistDict=dict()
and say Points are defined in a List as
Points = [(X1,Y1),(X2,Y2),....]
then this is what I am doing
for i, pi in enumerate(Points):
    for pj in Points[i+1:]:
        DistDict.setdefault(dist(pi, pj), set([])).add((pi, pj))
So what I end up with is a dictionary mapping each distance to the set of point pairs at that distance. The only thing left to do is to scan through it to find the longest set.
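That final scan is just a max over the dictionary's values, something like:

# the "scan for the longest set" step (assuming DistDict is built as above)
longest_pairs = max(DistDict.values(), key=len)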
I am just wondering whether this ought to have a better solution, but somehow I can't figure one out. I have also seen a couple of similar older SO posts, but none that I can find gives something more efficient than O(n²). Is this somehow an NP-hard problem, so that we can never do better, or, if not, what approach could be taken?
Please note that I came across a post which claims an efficient divide-and-conquer algorithm, but I couldn't make head or tail of it.
Any help in this regard?
Note: I am tagging this Python because I understand Python better than, say, Matlab or Ruby. C/C++/Java is also fine, as I am somewhat proficient in these too :-)
To sum things up: as @TonyK pointed out, if you assume that the points lie on a straight line, you can reduce the problem to the one-dimensional case that has already been discussed extensively here. That solution uses Fast Fourier Transforms, as mentioned by @YochaiTimmer.
Additional note: the problem is almost certainly not NP-hard, since it has an efficient O(n log n) solution; if it were NP-hard, that would imply P = NP.
You could study the Fast Fourier Transform methods used for multiplication, which run in O(N log N).
You might be able to do something similar with your problem.
Firstly, your definition of distance is wrong. You have to take the square root. Secondly, if you know that all the points lie on a straight line, you can simply ignore the y-coordinates (unless the line is vertical) or the x-coordinates (unless the line is horizontal). Then it reduces to the problem in your first paragraph.
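A small sketch of that reduction (the helper name is only illustrative): since the points are collinear, one coordinate already fixes their order and relative spacing along the line, so the 2D problem collapses to the 1D one from the first paragraph.

def to_1d(points):
    xs = [p[0] for p in points]
    if len(set(xs)) > 1:                   # line is not vertical: x alone is enough
        return sorted(xs)
    return sorted(p[1] for p in points)    # vertical line: fall back to y

# positions = to_1d(Points)   # then solve the 1D longest-AP problem on `positions`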

Hard sorting problem - what type of algorithm should I be using?

The problem:
N nodes are related to each other by a 'closeness' factor ranging from 0 to 1, where a factor of 1 means that the two nodes have nothing in common and 0 means the two nodes are exactly alike.
If two nodes are both close to another node (i.e. they have a factor close to 0) then this doesn't mean that they will be close together, although probabilistically they do have a much higher chance of being close together.
The question:
If another node is placed in the set, find the node that it is closest to in the shortest possible amount of time.
This isn't a homework question, this is a real world problem that I need to solve - but I've never taken any algorithm courses etc so I don't have a clue what sort of algorithm I should be researching.
I can index all of the nodes before another one is added and gather closeness data between each node, but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution. Any ideas or help would be much appreciated :)
Because your 'closeness' metric obeys the triangle inequality, you should be able to use a variant of BK-trees to organize your elements. Adapting them to real numbers should simply be a matter of choosing an interval on which to quantize your numbers, and otherwise using the standard BK-tree procedure. Some experimentation may be required - you might want to increase the resolution of the quantization as you progress down the tree, for instance.
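A rough sketch of that idea (every name here is illustrative, and `distance` is an assumed user-supplied function returning the closeness factor): quantize the real-valued closeness into integer buckets and use the ordinary BK-tree insert and range query.

BUCKETS = 100

def quantize(d):
    return int(round(d * BUCKETS))      # map a closeness in [0, 1] to an integer 0..100

class BKTree(object):
    def __init__(self, item, distance):
        self.item = item
        self.distance = distance
        self.children = {}              # quantized distance -> subtree

    def add(self, item):
        d = quantize(self.distance(self.item, item))
        if d in self.children:
            self.children[d].add(item)
        else:
            self.children[d] = BKTree(item, self.distance)

    def within(self, item, radius):
        # return all stored items whose quantized distance to `item` is <= radius
        d = quantize(self.distance(self.item, item))
        found = []
        if d <= radius:
            found.append(self.item)
        # triangle inequality: only subtrees keyed in [d - radius, d + radius] can match
        for k, child in self.children.items():
            if d - radius <= k <= d + radius:
                found.extend(child.within(item, radius))
        return found

To find the nearest existing node to a new one, query `within` with a small radius and widen it until something comes back; how well this prunes depends on how faithfully the quantized closeness behaves as a metric.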
"but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution"
Without any other information about the relationships between nodes, this is the only way you can do it, since you have to compute the closeness factor between the new node and each existing node. An O(n) algorithm can be a perfectly decent solution.
One addition you might consider - keep in mind we have no idea what data structure you are using for your objects - is to organize all present nodes into a graph, where nodes with factors below a certain threshold can be considered connected, so you can first check nodes that are more likely to be similar/related.
If you want the optimal algorithm in terms of speed, but O(n^2) space, then for each node create a sorted list of other nodes (ordered by closeness).
When you get a new node, you have to insert it into the sorted list of every other node, and all the other nodes need to be added to its own list.
To find the closest node, just find the first node on any node's list.
Since you already need O(n^2) space (in order to store all the closeness information you need basically an NxN matrix where A[i,j] represents the closeness between i and j) you might as well sort it and get O(1) retrieval.
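A small sketch of that bookkeeping (the names and the closeness(a, b) function are assumptions, not from the question; nodes are assumed orderable, which is only used to break exact ties):

import bisect

neighbors = {}   # node -> list of (closeness, other_node), kept sorted by closeness

def add_node(new, closeness):
    # insert `new`, update every existing node's sorted list, return the closest existing node
    neighbors[new] = []
    for other in list(neighbors):
        if other is new:
            continue
        c = closeness(new, other)
        bisect.insort(neighbors[other], (c, new))   # keep the other node's list sorted
        bisect.insort(neighbors[new], (c, other))
    return neighbors[new][0][1] if neighbors[new] else None

Retrieval of the closest node is then just the first entry of a node's list, at the cost of the O(n^2) storage described above.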
If this closeness forms a linear spectrum (such that closeness to something implies closeness to other things that are close to it, and not being close implies not being close to those close), then you can simply do a binary or interpolation sort on insertion for closeness, handling one extra complexity: at each point you have to see if closeness increases or decreases below or above.
For example, if we consider letters - A is close to B but far from Z - then the pre-existing elements can be kept sorted, say: A, B, E, G, K, M, Q, Z. To insert say 'F', you start by comparing with the middle element, [3] G, and the one following that: [4] K. You establish that F is closer to G than K, so the best match is either at G or to the left, and we move halfway into the unexplored region to the left... 3/2=[1] B, followed by E, and we find E's closer to F, so the match is either at E or to its right. Halving the space between our earlier checks at [3] and [1], we test at [2] and find it equally-distant, so insert it in between.
EDIT: it may work better in probabilistic situations, and require fewer comparisons, to start at the ends of the spectrum and work your way in (e.g. compare F to A and Z, decide it's closer to A, then see whether A or the halfway point [3] G is closer). Also, it might be good to finish with a comparison to the closest few points on either side of where the binary/interpolation search led you.
ACM Surveys September 2001 carried two papers that might be relevant, at least for background. "Searching in Metric Spaces", lead author Chavez, and "Searching in High Dimensional Spaces - Index Structures for Improving the Performance of Multimedia Databases", lead author Bohm. From memory, if all you have is the triangle inequality, you can use it to some effect, but if you can trim your data down to a sensible number of dimensions, you can do better by using a search structure that knows about this dimensional structure.
Facebook has this thing where it puts you and all of your friends in a graph, then slowly moves everyone around until people are grouped together based on mutual friends and so on.
It looked to me like they just made anything <0.5 an attractive force, anything >0.5 a repulsive force, and moved people with every iteration based on the net force. After a couple hundred iterations, it was looking pretty darn good.
Note: this is not an algorithm, it is a heuristic. In the Facebook implementation I saw, two people were not able to reach equilibrium and kept dancing around each other. It turned out they were actually the same person with two different accounts.
Also, it took about 15 minutes on a decent computer and ~100 nodes. YMMV.
It looks suspiciously like a Nearest Neighbor Search problem (also called a similarity search).

Concurrent binary chop algorithm

Is there a way (or is it even theoretically possible) to implement a binary search algorithm concurrently? I'm guessing the answer may well be no for two reasons:
Despite lots of Googling I haven't found a concurrent implementation anywhere
Each iterative cycle of the binary chop depends on the values from the previous one, so even if each iteration were a separate thread, it would have to block until the previous one completed, making the whole thing sequential.
However, I'd like some clarification on this front (and if it is possible, any links or examples?)
At first, it looks like binary search is completely nonparallel. But notice that there are only three possible outcomes:
You hit the element
The element searched for is before the element you hit
The element is after
So we start three parallel processes:
Hit the element
Assume the element is before, search here
Assume the element is after, search there
As soon as we know the result of the first of these, we can kill the speculative search that is not going to find the element. At the same time, the process that was searching in the correct half has already descended one level further, so the search rate is doubled: the current speedup is 2 out of a possible 3.
Naturally, this approach can be generalized if you have more than 3 cores at your disposal. An important aside is that this way of thinking is what is done inside hardware. Look up carry-lookahead adders for instance.
I think you can figure out the answer! To parallelize, there must be some work that can be divided. In the case of binary search, there is nothing that can be divided and parallelized: binary search probes the middle of an array of values, and that single step cannot be divided. And so on, until it finds the solution.
What in your opinion could be parallelized?
If you have n worker threads, you can split the array in n segments and run n binary searches concurrently, combining the results when they are ready. Apart from this cheap trick, I can see no obvious way to introduce parallelism.
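A quick sketch of that cheap trick, in Python for consistency with the rest of the page (names are illustrative, and note that CPython's GIL means plain threads won't actually speed up pure-Python comparisons; treat this as the shape of the idea rather than a benchmarkable implementation):

import bisect
from multiprocessing.dummy import Pool   # thread-based Pool from the standard library

def search_segment(args):
    arr, lo, hi, target = args
    i = bisect.bisect_left(arr, target, lo, hi)   # binary search restricted to [lo, hi)
    return i if i < hi and arr[i] == target else -1

def concurrent_chop(arr, target, workers=4):
    bounds = [(len(arr) * k // workers, len(arr) * (k + 1) // workers)
              for k in range(workers)]
    pool = Pool(workers)
    results = pool.map(search_segment, [(arr, lo, hi, target) for lo, hi in bounds])
    pool.close()
    hits = [r for r in results if r != -1]
    return hits[0] if hits else -1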
You could always try a not-quite-binary search: essentially, if you have n cores then you can split the array into n+1 pieces. From there, you check each of the n "cut points" and see whether the value is larger or smaller than each cut point; with n = 4, for example, this leaves you with a fifth of the original search space rather than half, because you can select a smaller section.
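Sketched sequentially below (names are illustrative): the n probe comparisons inside each round are exactly the part you would hand to n cores in a real implementation.

def k_ary_search(arr, target, n_cores=4):
    lo, hi = 0, len(arr)                       # search the half-open range [lo, hi)
    while lo < hi:
        # n_cores cut points split [lo, hi) into n_cores + 1 pieces
        cuts = sorted(set(lo + (hi - lo) * (i + 1) // (n_cores + 1)
                          for i in range(n_cores)))
        for c in cuts:
            if arr[c] == target:
                return c
        # keep only the piece that can still contain the target
        new_lo, new_hi = lo, hi
        for c in cuts:
            if arr[c] < target:
                new_lo = c + 1
            else:
                new_hi = c
                break
        lo, hi = new_lo, new_hi
    return -1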