I have this image:
Humans can tell that two lines should be fitted through the points. A naive algorithm would put a single, roughly horizontal best-fit line through all of them. Is there an algorithm that best fits a series of points while ignoring distant outliers?
There are robust estimation techniques for fitting a model to noisy data, such as RANSAC. You would need to fit one line, exclude all the points that belong to that line, and then fit the second line to the remaining points.
Straight from David Forsyth's web page (author of the book: Forsyth, David A. and Jean Ponce (2002). Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference), the following is Algorithm 15.2:
Hypothesize k lines (perhaps uniformly at random)
    or
hypothesize an assignment of lines to points, then fit lines using this assignment

Until convergence
    allocate each point to the closest line
    refit lines
end
In your case k is 2.
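To make the allocate/refit loop concrete, here is a minimal sketch for k = 2 (all names are my own; it assumes 2D points in a NumPy array, uses a total-least-squares fit for each line, and initializes from random point pairs rather than full RANSAC hypotheses):

```python
import numpy as np

def fit_line(pts):
    # Total-least-squares line through pts: returns (centroid, unit direction).
    c = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - c)  # first right-singular vector = main direction
    return c, vt[0]

def point_line_dists(pts, line):
    # Perpendicular distance of each point to the line (c, d), d a unit vector.
    c, d = line
    diff = pts - c
    along = diff @ d
    return np.linalg.norm(diff - np.outer(along, d), axis=1)

def fit_k_lines(pts, k=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothesize k lines from random point pairs.
    lines = []
    for _ in range(k):
        i, j = rng.choice(len(pts), size=2, replace=False)
        d = pts[j] - pts[i]
        lines.append((pts[i], d / np.linalg.norm(d)))
    labels = None
    for _ in range(iters):
        # Allocate each point to the closest line...
        dists = np.stack([point_line_dists(pts, ln) for ln in lines])
        new_labels = dists.argmin(axis=0)
        if labels is not None and np.array_equal(labels, new_labels):
            break                      # converged: allocation stopped changing
        labels = new_labels
        # ...then refit every line to the points allocated to it.
        lines = [fit_line(pts[labels == m]) if np.sum(labels == m) >= 2 else lines[m]
                 for m in range(k)]
    return lines, labels
```

Running several random restarts and keeping the allocation with the smallest total residual makes this much more robust to bad initial hypotheses.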
The Hough transform is suitable for this task. Basically, each point votes for the existence of all lines that pass through it (in a line-parameter space, e.g. rho-theta for distance from the origin and angle). If the parameter space is discretized, you'll get a peak for each of the lines present in your data. The outliers will have voted for parameters that receive few votes from other points, so those cells will have low counts in the parameter space.
The image below (from Wikipedia) illustrates the concept in the ideal case (the points actually lie on exact lines). With real data, the peaks will be fuzzier, but you'll still be able to distinguish them from the outliers. The pros of this method are that you do not have to hypothesize how many lines there are, and that it works well for many types of images/data. The cons are that it may fail if there are many non-linear distractors, such as in natural scenes containing many curved objects.
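For illustration, a minimal rho-theta accumulator could look like the sketch below (my own naming and bin counts; it assumes a non-degenerate set of 2D points):

```python
import numpy as np

def hough_accumulator(points, n_theta=180, n_rho=200):
    # Each point (x, y) votes for all lines through it, parameterized as
    #   rho = x*cos(theta) + y*sin(theta)
    pts = np.asarray(points, dtype=float)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rho_max = np.hypot(*np.abs(pts).max(axis=0))   # upper bound on |rho|
    acc = np.zeros((n_theta, n_rho), dtype=int)
    for x, y in pts:
        rhos = x * np.cos(thetas) + y * np.sin(thetas)
        bins = np.round((rhos + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
        acc[np.arange(n_theta), bins] += 1         # one vote per discretized line
    return acc, thetas, rho_max
```

Peaks can then be read off with e.g. np.unravel_index(acc.argmax(), acc.shape); with two lines in the data, the two strongest well-separated peaks give their parameters, while outlier votes stay spread thin.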
I have:
- a set of points of known size (in my case, only 6 points)
- a line characterized by x = s + t * r, where x, s and r are 3D vectors
I need to find the point closest to the given line. The actual distance does not matter to me.
I had a look at several different questions that seem related (including this one) and I know how to solve this on paper from my high school math classes. But I cannot find a solution that avoids calculating every distance, and I am sure there has to be a better/faster way. Performance is absolutely crucial in my application.
One more thing: All numbers are integers (coordinates of points and elements of s and r vectors). Again, for performance reasons I would like to keep the floating-point math to a minimum.
You have to process every point at least once to know its distance, so unless you want to repeat the process many times with different lines, computing the distance of every point is unavoidable. The algorithm therefore has to be at least O(n).
Since you don't care about the actual distance, we can simplify the point-to-line distance computation. The exact squared distance is computed by (source):
d^2 = |r⨯(p-s)|^2 / |r|^2
where ⨯ is the cross product and |r|^2 is the squared length of vector r. Since |r|^2 is constant for all points, we can omit it from the distance computation without changing the result:
d^2 = |r⨯(p-s)|^2
Compare these relative squared distances and keep the minimum. The advantage of this formula is that you can do everything with integers, since you mentioned that all coordinates are integers.
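As a sketch (names are my own), the whole loop in pure integer arithmetic could look like this; note that in C or C++ you would have to watch for overflow in the products, which Python's arbitrary-precision integers hide:

```python
def cross(a, b):
    # 3D cross product, integer arithmetic only.
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def closest_point_to_line(points, s, r):
    # Minimizes |r x (p - s)|^2, which orders points exactly like the true
    # distance because the constant factor |r|^2 is dropped.
    best, best_d2 = None, None
    for p in points:
        diff = (p[0] - s[0], p[1] - s[1], p[2] - s[2])
        c = cross(r, diff)
        d2 = c[0] * c[0] + c[1] * c[1] + c[2] * c[2]   # still an integer
        if best_d2 is None or d2 < best_d2:
            best, best_d2 = p, d2
    return best
```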
I'm afraid you can't get away with computing fewer than 6 distances (if you could, at least one point would be left out of consideration, possibly the nearest one).
See if it makes sense to preprocess: is the line fixed while the points vary? Consider rotating the coordinate system to make the line horizontal.
As there are so few points, it is doubtful that this is your bottleneck. Measure where the hot spots are, redesign algorithms and data representations, turn up compiler optimization, and only then compile to assembly and hand-tune that. Strictly in that order.
Jon Bentley's "Writing Efficient Programs" (sadly long out of print) and "Programming Pearls" (2nd edition) are full of advice on practical programming.
So, I have two points, say A and B, each with a known (x, y) coordinate and a speed vector in the same coordinate system. I want to write a function that generates a set of arcs (radius and angle) that lead from state A to state B.
The angle difference is known, since I can get it from the two unit speed vectors. Say I move a certain distance along (radius = r, angle = theta); then I arrive in exactly the target situation. Does it have a unique solution? I only need one solution, or even an approximation.
Of course I can solve it with one particular circle plus a straight line (radius = infinity), but that's not what I want to do. I think there's a library that has a function for this, since it's quite a common problem.
A biarc is a smooth curve consisting of two circular arcs. Given two points with tangents, it is almost always possible to construct a biarc passing through them (with matching tangents).
This is a very basic routine in geometric modelling, and it is indispensable for smoothly approximating an arbitrary curve (Bézier, NURBS, etc.) with arcs. Approximation with arcs and lines is heavily used in CAM, because modellers handle NURBS without a problem, but machine controllers usually understand only lines and arcs. So I strongly suggest reading up on this topic.
In particular, here is a great article on biarcs; I seriously advise reading it. It even contains some working code and an interactive demo.
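To give a flavor of the construction from that article, here is a hedged 2D sketch of the equal-tangent-length biarc (my own naming; t1 and t2 must be unit vectors, and degenerate inputs such as a pure U-turn with parallel tangents are not handled):

```python
import numpy as np

def perp(u):
    return np.array([-u[1], u[0]])

def arc_from(a, ta, b):
    # Arc starting at a with unit tangent ta and ending at b.
    # Returns (center, radius); radius = inf encodes a straight segment.
    n = perp(ta)
    denom = 2.0 * (n @ (b - a))
    if abs(denom) < 1e-12:
        return None, np.inf                      # collinear: use a line segment
    r = ((b - a) @ (b - a)) / denom              # signed radius along the normal
    return a + r * n, abs(r)

def biarc(p1, t1, p2, t2):
    # Join two arcs from (p1, t1) to (p2, t2) with equal tangent lengths d.
    p1, t1, p2, t2 = (np.asarray(x, dtype=float) for x in (p1, t1, p2, t2))
    v, t = p2 - p1, t1 + t2
    denom = 2.0 * (1.0 - t1 @ t2)
    vt = v @ t
    if abs(denom) < 1e-12:
        d = (v @ v) / (4.0 * (v @ t1))           # t1 == t2: quadratic degenerates
    else:
        d = (-vt + np.sqrt(vt * vt + denom * (v @ v))) / denom
    pm = (p1 + p2 + d * (t1 - t2)) / 2.0         # joint point of the two arcs
    tm = (v - d * t) / (2.0 * d)                 # unit tangent at the joint
    return pm, arc_from(p1, t1, pm), arc_from(pm, tm, p2)
```

Here d solves |v - d*(t1 + t2)| = 2d, the condition for the joint tangent tm to be a unit vector; allowing unequal tangent lengths instead gives the one-parameter family of biarcs that the article explores.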
If we have K sets of potentially overlapping triangles, what is a computationally efficient way of computing a new, non-overlapping set of triangles?
For example, consider this problem:
Here we have 3 triangle sets A, B, C, with some mutual overlap, and wish to obtain the non-overlapping sets A', B', C', AB, AC, BC, ABC, where for example the triangles in AC would contain the surfaces where there is exclusive overlap among A and C; and A' would contain the surfaces of A which do not overlap any other set.
I (also) propose a two-step approach.
1. Find the intersection points of all triangle sides.
As pointed out in the comments, this is a well-researched problem, typically approached with line-sweep methods. Here is a very nice overview; look especially at the Bentley-Ottmann algorithm.
2. Triangulate with Constrained Delaunay.
I think Polygon Triangulation as suggested by @Claudiu cannot solve your problem, as it cannot guarantee that all original edges are included. Therefore, I suggest you look at Constrained Delaunay triangulations. These allow you to specify edges that must be included in your triangulation, even if they would not be included in an unconstrained Delaunay or polygon triangulation. Furthermore, there are implementations that allow you to specify a non-convex border of your triangulation, outside of which no triangles are generated. This also seems to be a requirement in your case.
Implementing Constrained Delaunay is non-trivial. There is, however, a somewhat dated but very nice C implementation available from a CMU researcher (including a command line tool). See here for the theory of this specific algorithm. This algorithm also supports specification of a border. Note that the linked algorithm can do more than just Constrained Delaunay (namely quality mesh generation), but it can be configured not to add new points, which amounts to Constrained Delaunay.
Edit: See the comments for another implementation.
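For step 1, if your input is small, a brute-force pairwise check can be simpler than implementing a sweep line. A minimal sketch (my own naming; parallel and collinear overlaps are deliberately skipped):

```python
from itertools import combinations

def seg_intersection(p, p2, q, q2, eps=1e-12):
    # Intersection point of segments p-p2 and q-q2, or None.
    # Solves p + t*r = q + u*s for t, u using 2D cross products.
    r = (p2[0] - p[0], p2[1] - p[1])
    s = (q2[0] - q[0], q2[1] - q[1])
    denom = r[0] * s[1] - r[1] * s[0]            # cross(r, s)
    if abs(denom) < eps:
        return None                              # parallel or collinear
    qp = (q[0] - p[0], q[1] - p[1])
    t = (qp[0] * s[1] - qp[1] * s[0]) / denom
    u = (qp[0] * r[1] - qp[1] * r[0]) / denom
    if 0.0 <= t <= 1.0 and 0.0 <= u <= 1.0:
        return (p[0] + t * r[0], p[1] + t * r[1])
    return None

def all_intersections(segments):
    # O(n^2) pairwise test: fine for small inputs, Bentley-Ottmann for large ones.
    pts = []
    for (a, b), (c, d) in combinations(segments, 2):
        x = seg_intersection(a, b, c, d)
        if x is not None:
            pts.append(x)
    return pts
```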
If you want something a bit more straightforward, faster to implement, and significantly less code... I'd recommend just doing some simple polygon clipping like the old software rendering algorithms used to do (especially since you're only dealing with triangles as your input). As briefly mentioned by a couple of other people, it involves splitting each triangle along every other segment that intersects it.
Triangles are easy, because splitting a triangle at a given plane always results in just 1 or 2 new ones (2 or 3 total). If your data set is rather large, you could introduce a quad-tree or another form of spatial organization in order to find the intersecting triangles faster as the new ones get added.
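A sketch of that splitting step (my own naming; 2D points as tuples, triangle cut by the infinite line through a and b, with each side fan-triangulated):

```python
def split_triangle(tri, a, b, eps=1e-9):
    # Split triangle tri (3 vertices) by the infinite line through a and b.
    # Returns (front, back): lists of sub-triangles on each side of the line.
    nx, ny = b[1] - a[1], a[0] - b[0]                   # normal of line a->b
    dist = lambda v: nx * (v[0] - a[0]) + ny * (v[1] - a[1])
    front, back = [], []                                # clipped polygons
    for i in range(3):
        v, w = tri[i], tri[(i + 1) % 3]
        dv, dw = dist(v), dist(w)
        if dv >= -eps: front.append(v)                  # vertex on front side
        if dv <= eps:  back.append(v)                   # vertex on back side
        if (dv > eps and dw < -eps) or (dv < -eps and dw > eps):
            t = dv / (dv - dw)                          # edge v-w crosses the line
            x = (v[0] + t * (w[0] - v[0]), v[1] + t * (w[1] - v[1]))
            front.append(x)
            back.append(x)
    fan = lambda poly: [(poly[0], poly[i], poly[i + 1]) for i in range(1, len(poly) - 1)]
    return fan(front), fan(back)
```

A crossed triangle yields one triangle on one side and a quad (fanned into two triangles) on the other, matching the 2-or-3-total count above.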
Granted, this would generate more polygons than the suggested Constrained Delaunay algorithm. But many of those algorithms don't do well with overlapping shapes and would require you to know your silhouette segments, so you'd be doing much of the same work anyhow.
And if fewer resulting triangles is a requirement, you can always do a merging pass at the end (adding neighbor information during the clipping to speed that portion up).
Anyway, good luck!
Your example is a special case of what computational geometers call "an arrangement." The CGAL library has extensive and efficient arrangement-handling routines. If you check this part of the documentation, you'll see that you can declare an empty arrangement, then insert triangles to divide the 2D plane into disjoint faces. As others have said, you'll then need to triangulate the faces that aren't already triangles. Happily, CGAL also provides the routines to do this. This example of constrained Delaunay triangulation is a good place to start.
CGAL attempts to use the most efficient algorithms available that are practical to implement. In this case it looks like you can achieve O((n + k) log n) for an arrangement with n edges (3 times the number of triangles in your case) and k intersections. The algorithm uses a general technique called "sweep line": a vertical line is swept left to right, with "events" computed and processed along the way. Events are edge endpoints and intersections. As each event is processed, a cell of the arrangement is updated.
Delaunay algorithms are typically O(n log n) for n vertices. There are several common algorithms, easily looked up or found in the CGAL references.
Even if you can't use CGAL in your work (e.g. for licensing reasons), the documentation is full of sources on the underlying algorithms: arrangements and constrained Delaunay algorithms.
Beware, however, that both arrangements and triangulations are notoriously hard to implement correctly due to floating-point error. Robust versions often depend on rational arithmetic (available in CGAL).
To expand a bit on the comment from Ted Hopp, this should be possible by first computing a planar subdivision in which each bounded face of the output is associated with one of the sets A', B', C', AB, AC, BC, ABC, or "none". The second phase is then to triangulate the (possibly non-convex) faces in each set.
Step 1 could be performed in O((N + K) log N) time using a variation of the Bentley-Ottmann sweep line algorithm in which the current set is maintained as part of the algorithm's state. This can be determined from the line segments that have already been crossed and their direction.
Once that's done, the disjoint polygons for each set can then be broken into monotone pieces in O(N log N) time which in turn can be triangulated in O(N) time.
If you haven't already, pick up a copy of "Computational Geometry: Algorithms and Applications" by de Berg et al.
I can think of two approaches.
A more general approach is to treat your input as just a set of line segments and split the problem into two parts:
Polygon Detection. Take the set of lines your initial triangles make and derive a set of non-overlapping polygons. This paper offers an O((N + M)^4) approach, where N is the number of line segments and M the number of intersections, which does seem a bit slow, unfortunately...
Polygon Triangulation. Take each polygon from step 1 and triangulate it. This takes O(n log* n), which is practically O(n).
Another approach is to devise a custom algorithm: solve the problem for intersecting two triangles, apply it to the first two input triangles, then, for each new triangle, apply the algorithm to all the current triangles against the new one. Even two triangles aren't that straightforward, though, as there are several cases (this might not be exhaustive):
- No points of one triangle are in the other:
  - no intersection
  - Star of David overlap
  - two perpendicular spikes
- One point of one triangle is contained in the other
- Each triangle contains one point of the other
- Two points of one triangle are in the other
- Three points of one are in the other, i.e. one is fully contained
etc... no, it doesn't seem like that is the right approach. Leaving it here anyway for posterity.
I want to use string similarity functions to find corrupted data in my database.
I came upon several of them:
- Jaro
- Jaro-Winkler
- Levenshtein
- Euclidean
- Q-gram
I wanted to know: what is the difference between them, and in what situations do they work best?
Expanding on my wiki-walk comment in the errata, and noting some of the ground-floor literature on the comparability of algorithms that apply to similar problem spaces, let's explore the applicability of these algorithms before determining whether they're numerically comparable.
From Wikipedia, Jaro-Winkler:
In computer science and statistics, the Jaro–Winkler distance
(Winkler, 1990) is a measure of similarity between two strings. It is
a variant of the Jaro distance metric (Jaro, 1989, 1995) and
mainly used in the area of record linkage (duplicate
detection). The higher the Jaro–Winkler distance for two strings is,
the more similar the strings are. The Jaro–Winkler distance metric is
designed and best suited for short strings such as person names. The
score is normalized such that 0 equates to no similarity and 1 is an
exact match.
Levenshtein distance:
In information theory and computer science, the Levenshtein distance
is a string metric for measuring the amount of difference between two
sequences. The term edit distance is often used to refer specifically
to Levenshtein distance.
The Levenshtein distance between two strings is defined as the minimum
number of edits needed to transform one string into the other, with
the allowable edit operations being insertion, deletion, or
substitution of a single character. It is named after Vladimir
Levenshtein, who considered this distance in 1965.
Euclidean distance:
In mathematics, the Euclidean distance or Euclidean metric is the
"ordinary" distance between two points that one would measure with a
ruler, and is given by the Pythagorean formula. By using this formula
as distance, Euclidean space (or even any inner product space) becomes
a metric space. The associated norm is called the Euclidean norm.
Older literature refers to the metric as Pythagorean metric.
And Q- or n-gram encoding:
In the fields of computational linguistics and probability, an n-gram
is a contiguous sequence of n items from a given sequence of text or
speech. The items in question can be phonemes, syllables, letters,
words or base pairs according to the application. n-grams are
collected from a text or speech corpus.
The two core
advantages of n-gram models (and algorithms that use
them) are relative simplicity and the ability to scale up – by simply
increasing n a model can be used to store more context with a
well-understood space–time tradeoff, enabling small experiments to
scale up very efficiently.
The trouble is that these algorithms solve different problems and have different domains of applicability within the broad space of sequence-comparison methods, so they are not interchangeable ways of extracting a usable similarity metric from your data. In fact, not all of these are even metrics, as some of them don't satisfy the triangle inequality.
Instead of going out of your way to define a dubious scheme to detect data corruption, do this properly: use checksums and parity bits for your data. Don't try to solve a much harder problem when a simpler solution will do.
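For instance, a sketch using a CRC per stored value (the column and helper names here are hypothetical):

```python
import zlib

def checksum(data: bytes) -> int:
    # CRC-32 detects accidental corruption far more reliably than
    # any string-similarity heuristic.
    return zlib.crc32(data)

# Hypothetical usage: store the checksum alongside each row on write,
#   crc = checksum(value.encode("utf-8"))
# and on read, recompute and compare:
#   if checksum(value.encode("utf-8")) != stored_crc:
#       flag_row_as_corrupted()
```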
String similarity helps in a lot of different ways. For example:
- Google's "did you mean" results are calculated using string similarity.
- String similarity is used to correct OCR errors.
- String similarity is used to correct keyboard input errors.
- String similarity is used to find the best-matching alignment of two DNA sequences in bioinformatics.
But one size does not fit all: every string similarity algorithm is designed for a specific use, even though most of them are similar. For example, Levenshtein distance measures how many single-character edits are needed to make two strings equal:
kitten → sitten
Here the distance is 1 character change. You may give different weights to deletion, insertion, and substitution. For example, OCR-oriented metrics give lower weight to substituting visually similar characters, keyboard-oriented metrics give lower weight to substituting physically adjacent keys, and bioinformatics similarity typically allows a lot of insertions.
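As a minimal sketch, here is the weighted variant of the classic dynamic program (my own naming; unit weights give the standard Levenshtein distance, so levenshtein("kitten", "sitten") == 1):

```python
def levenshtein(a, b, w_ins=1, w_del=1, w_sub=1):
    # Rolling-row dynamic program over the (len(a)+1) x (len(b)+1) edit table.
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * w_del]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + w_del,                        # delete ca
                           cur[j - 1] + w_ins,                     # insert cb
                           prev[j - 1] + (0 if ca == cb else w_sub)))  # substitute
        prev = cur
    return prev[-1]
```

Replacing the flat w_sub with a per-character-pair table (visual similarity for OCR, key adjacency for typing) is exactly how the domain-specific variants above are built.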
Your second candidate is relevant here: as quoted above, the "Jaro-Winkler distance metric is designed and best suited for short strings such as person names."
Therefore you should keep your actual problem in mind:
I want to use string similarity functions to find corrupted data in my database.
How is your data corrupted? Is it user error, similar to keyboard input errors? Is it similar to OCR errors? Or something else entirely?
Any idea how to solve such problems (in C++), i.e. which is the best algorithm to use?
Say you have an n x n rectangular area of black and white (0 and 1) pixels, and you're looking for the biggest white rectangle in this area.
I would write something simple like below:
- First pass: create a set of one-line segments for each pixel row.
- Second pass, aggregate rectangles:
  - for each segment, iterate over rows to find the largest rectangle containing it;
  - if you use another segment in the process, mark it as used, no need to try it again;
  - at any point, keep only the largest rectangle found.
That's only a first draft of a possible solution. It should be rewritten using a more formal algorithmic syntax and many details should be provided. Each step hides pitfalls to avoid if you want to be efficient. But it should not be too hard to code.
If I did not miss something, what I described above should basically be O(n^4) in the worst case, with the first pass O(n^2) used to find horizontal segments (it could be quite fast with a very small loop) and the second pass probably much less than O(n^4) in practice (it depends on segment sizes; it is really nb_total_segments x nb_segments_per_line x nb_overlapping_segments).
That doesn't look bad to me. I can't see any obvious way to do it with better O() complexity (but of course there may be one; O(n^4) is not that good).
If you provide some details on the input structure and the expected result, it may even be some fun to code.
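For comparison (this is a different, standard technique, not the segment scheme above): treating each row as the base of a histogram of consecutive white pixels solves the same problem in O(n^2). A minimal sketch, assuming 1 encodes white:

```python
def largest_white_rectangle(grid):
    # Largest all-white rectangle in a 0/1 grid, O(rows * cols) overall.
    if not grid:
        return 0
    heights = [0] * len(grid[0])
    best = 0
    for row in grid:
        # Column heights: length of the white run ending at this row.
        heights = [h + 1 if v == 1 else 0 for h, v in zip(heights, row)]
        # Largest rectangle under this histogram, via a monotonic stack
        # (a trailing 0 flushes the stack at the end of the row).
        stack = []
        for i, h in enumerate(heights + [0]):
            while stack and heights[stack[-1]] >= h:
                top = stack.pop()
                width = i - (stack[-1] + 1 if stack else 0)
                best = max(best, heights[top] * width)
            stack.append(i)
    return best
```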
What you ask for is known as blob filtering in the computer vision world.
http://en.wikipedia.org/wiki/Blob_extraction
http://en.wikipedia.org/wiki/Connected_component_labeling
http://www.aforgenet.com/framework/features/blobs_processing.html