How to implement k-means algorithm on string data - data-mining

I am trying to implement K-means algorithm on the below data-set.It's stragiht-forward to calculate distance between any two numeric attributes but how do I calculate distance between two strings and also how do I sum up all the distances(i.e the distance between string attributes and the distance between numeric attributes.) Please kindly advise me.Thank you.

K-means is designed for Euclidean distance. You cannot just plug in arbitrary other distance functions. This may cause k-means to no longer converge.
The required property is that the mean must minimize the variances. If you cannot guarantee this property (and what is the mean of a string anyway?) then you lose guaranteed convergence.
Technically, k-means isn't even based on Euclidean distance, but it minimizes variances, which happen to be the same as squared Euclidean distances; and if you minimize these squares, you also minimize Euclidean distance. But what the algorithm really aims at minimizing is Var(Attribute 1, Cluster 1) + Var(Attribute 2, Cluster 1) + ... + Var(Attribute n, Cluster k).
You might want to look into k-medians, which by using a medoid instead of the mean, avoids both the need to be able to compute a mean and can give convergence guarantees for arbitrary distances as far as I know.
However, you might want to look into truly distance based algorithms, including the various density based clustering algorithms which usually also are distance-based.

To calculate the distance between strings, you can use the Levenshtein distance (aka edit distance).
To normalize the values between the string and numeric attributes, you can can try to state the attributes as percentages: find the min and max value of each type of attribute, and then for a given data instance, calculate its percentage within the respective range.

Related

Finding the best algorithm for nearest neighbor search in a 2D plane with moving points

I am looking for an efficient way to perform nearest neighbor searches within a specified radius in a two-dimensional plane. According to Wikipedia, space-partitioning data structures, such as :
k-d trees,
r-trees,
octrees,
quadtrees,
cover trees,
metric trees,
BBD trees
locality-sensitive hashing,
and bins,
are often used for organizing points in a multi-dimensional space and can provide O(log n) performance for search and insert operations. However, in my case, the points in the two-dimensional plane are moving at each iteration, so I need to update the tree accordingly. Rebuilding the tree from scratch at each iteration seems easier, but I would like to avoid it if possible because the points only move slightly between iterations.
I have read that k-d trees are not naturally balanced, which could be an issue in my case. R-trees, on the other hand, are better suited for storing rectangles. Bin algorithms, on the other hand, are easy to implement and provide near-linear search performance within local bins.
I am working on an autonomous agent simulation where 1,000,000 agents are rendered in the GPU, and the CPU is responsible for computing the next movement of each agent. Each agent is influenced by other agents within its line of sight, or in other words, other agents within a circular sector of angle θ and radius r. So here specific requirements for my use case:
Search space is a 2-d plane,
Each object is a point identified with the x,y coordinate.
All points are frequently updated by a small factor.
Cannot afford any O(n^2) algorithms.
Search within a radius (circular sector)
Search for all candidates within the search surface.
Given these considerations, what would be the best algorithms for my use case?
I think you could potentially solve this by doing a sort of scheduling approach. If you know that no object will move more than d distance in each iteration, and you want to know which objects are within X distance of each other on each iteration, then given the distances between all objects you know that on the next iteration the only potential pairs of objects that would change their neighbor status would be those with a distance between X-d and X+d. The iteration after that it would be X-2d and X+2d and so on.
So I'm thinking that you could do an initial distance calculation between all pairs of objects, and then based on each difference you can create an NxN matrix where the value in each cell is which iteration you will need to re-check their distance. Then when you re-check those during that iteration, you would update their values in this matrix for the next iteration that they need to be checked.
The only problem is whether calculating an initial NxN distance matrix is feasible.

How to find the best combination to minimize the total distance in all groups of points

Given 4n+k points with the position (x,y) (or (x,y,z) for a 3D case), where n=1,2,3,...; k∈{0,1,2,3}.
Group the points into n-k groups of 4 points, and k groups of 5 points.
A group centroid is the mean position of the 4(or 5) points in the group.
How to effectively get the best combination to minimize the sum of the distance of each point to its own group centroid?
Brutal enumeration is the only way I've achieve to get the best combination. However brutal force only works when n is quite small because of the computational limitation.
I also tried K-Means clustering and genetic algorithm, but neither of them nor the combination of these two algorithms can guarantee the best combination.
As the problem supposedly is NP-hard you won't be able to find an algorithm that guarantees to find the optimum and that runs in acceptable time on large data.
You'll have to settle for an approximation such as kmeans.

Handle very large distance matrix in C (or C++ if it could help)

I am implementing this clustering algorithm http://www.sciencemag.org/content/344/6191/1492.full (free access version) in C in my software and I need to build a distance matrix, but in some cases, the size of the dataset (after redundancy removal) is huge (n > 1 500 000 and it is even larger, going up to 4 000 000 on more complex cases). My problem is, even allocating the upper triangular matrix would be ( (1500000*1500000) - 1500000) * 0.5 * sizeof(float) =~ 5.5e12 Bytes. So, memory allocation fails (even on our computing nodes with 256 GB of RAM) and writing to disk is not an option in this case.
Beside cutting down the size (which I will look) of the dataset to cluster, anybody has an idea of a technique I could use to approximate and store this amount of information ?
N.B. Like I said in the title, I am using C and I can also use C++. Also, if anybody has another clustering algorithm (where the number of clusters is determined with the algorithm itself) to use, please suggest it to me.
Thanks in advance for your time,
You probably have to step back and reconsider your algorithm.
First, perhaps you don't need to have distance matrix between all pairs of data points. Perhaps you could group together similar data points into data bins and then create a matrix of distances between bins.
That is, start by computing pairwise distances between points, but keep only relatively small distances and pointers to "the other" point. Kind of a very sparse matrix of shorter distances. This is straightforward to do in parallel.
Then create data bins that contain groups of points with mutually small distances between them. For example, if you threshold "short" distances in such manner that bins would hold on average, say, 50 data points you'd get 1500000/50=30000 bins.
Then go through your data again and compute distances between bins. That would produce 30000^2 distances, which is a matrix of about 4GB. In addition you still have 30000 with 50^2 distances within bins, which is another 300MB. This amount of data is quite manageable.
If replacing the distance between data points with a distance between the corresponding bins is sufficient precision for your application that would work. It all depends on the kind of data you are dealing with and the precision requirements of your application.

what kind of algorithm for generating height-map from contour line?

I'm looking for interpolating some contour lines to generating a 3D view. The contours are not stored in a picture, coordinates of each point of the contour are simply stored in a std::vector.
for convex contours :
, it seems (I didn't check by myself) that the height can be easily calculates (linear interpolation) by using the distance between the two closest points of the two closest contours.
my contours are not necessarily convex :
, so it's more tricky... actualy I don't have any idea what kind of algorithm I can use.
UPDATE : 26 Nov. 2013
I finished to write a Discrete Laplace example :
you can get the code here
What you have is basically the classical Dirichlet problem:
Given the values of a function on the boundary of a region of space, assign values to the function in the interior of the region so that it satisfies a specific equation (such as Laplace's equation, which essentially requires the function to have no arbitrary "bumps") everywhere in the interior.
There are many ways to calculate approximate solutions to the Dirichlet problem. A simple approach, which should be well suited to your problem, is to start by discretizing the system; that is, you take a finite grid of height values, assign fixed values to those points that lie on a contour line, and then solve a discretized version of Laplace's equation for the remaining points.
Now, what Laplace's equation actually specifies, in plain terms, is that every point should have a value equal to the average of its neighbors. In the mathematical formulation of the equation, we require this to hold true in the limit as the radius of the neighborhood tends towards zero, but since we're actually working on a finite lattice, we just need to pick a suitable fixed neighborhood. A few reasonable choices of neighborhoods include:
the four orthogonally adjacent points surrounding the center point (a.k.a. the von Neumann neighborhood),
the eight orthogonally and diagonally adjacent grid points (a.k.a. the Moore neigborhood), or
the eight orthogonally and diagonally adjacent grid points, weighted so that the orthogonally adjacent points are counted twice (essentially the sum or average of the above two choices).
(Out of the choices above, the last one generally produces the nicest results, since it most closely approximates a Gaussian kernel, but the first two are often almost as good, and may be faster to calculate.)
Once you've picked a neighborhood and defined the fixed boundary points, it's time to compute the solution. For this, you basically have two choices:
Define a system of linear equations, one per each (unconstrained) grid point, stating that the value at each point is the average of its neighbors, and solve it. This is generally the most efficient approach, if you have access to a good sparse linear system solver, but writing one from scratch may be challenging.
Use an iterative method, where you first assign an arbitrary initial guess to each unconstrained grid point (e.g. using linear interpolation, as you suggest) and then loop over the grid, replacing the value at each point with the average of its neighbors. Then keep repeating this until the values stop changing (much).
You can generate the Constrained Delaunay Triangulation of the vertices and line segments describing the contours, then use the height defined at each vertex as a Z coordinate.
The resulting triangulation can then be rendered like any other triangle soup.
Despite the name, you can use TetGen to generate the triangulations, though it takes a bit of work to set up.

Compare similarity algorithms

I want to use string similarity functions to find corrupted data in my database.
I came upon several of them:
Jaro,
Jaro-Winkler,
Levenshtein,
Euclidean and
Q-gram,
I wanted to know what is the difference between them and in what situations they work best?
Expanding on my wiki-walk comment in the errata and noting some of the ground-floor literature on the comparability of algorithms that apply to similar problem spaces, let's explore the applicability of these algorithms before we determine if they're numerically comparable.
From Wikipedia, Jaro-Winkler:
In computer science and statistics, the Jaro–Winkler distance
(Winkler, 1990) is a measure of similarity between two strings. It is
a variant of the Jaro distance metric (Jaro, 1989, 1995) and
mainly[citation needed] used in the area of record linkage (duplicate
detection). The higher the Jaro–Winkler distance for two strings is,
the more similar the strings are. The Jaro–Winkler distance metric is
designed and best suited for short strings such as person names. The
score is normalized such that 0 equates to no similarity and 1 is an
exact match.
Levenshtein distance:
In information theory and computer science, the Levenshtein distance
is a string metric for measuring the amount of difference between two
sequences. The term edit distance is often used to refer specifically
to Levenshtein distance.
The Levenshtein distance between two strings is defined as the minimum
number of edits needed to transform one string into the other, with
the allowable edit operations being insertion, deletion, or
substitution of a single character. It is named after Vladimir
Levenshtein, who considered this distance in 1965.
Euclidean distance:
In mathematics, the Euclidean distance or Euclidean metric is the
"ordinary" distance between two points that one would measure with a
ruler, and is given by the Pythagorean formula. By using this formula
as distance, Euclidean space (or even any inner product space) becomes
a metric space. The associated norm is called the Euclidean norm.
Older literature refers to the metric as Pythagorean metric.
And Q- or n-gram encoding:
In the fields of computational linguistics and probability, an n-gram
is a contiguous sequence of n items from a given sequence of text or
speech. The items in question can be phonemes, syllables, letters,
words or base pairs according to the application. n-grams are
collected from a text or speech corpus.
The two core
advantages of n-gram models (and algorithms that use
them) are relative simplicity and the ability to scale up – by simply
increasing n a model can be used to store more context with a
well-understood space–time tradeoff, enabling small experiments to
scale up very efficiently.
The trouble is these algorithms solve different problems that have different applicability within the space of all possible algorithms to solve the longest common subsequence problem, in your data or in grafting a usable metric thereof. In fact, not all of these are even metrics, as some of them don't satisfy the triangle inequality.
Instead of going out of your way to define a dubious scheme to detect data corruption, do this properly: by using checksums and parity bits for your data. Don't try to solve a much harder problem when a simpler solution will do.
String similarity helps in a lot of different ways. For example
google's did you mean results are calculated using string similarity.
string similarity is used to correct OCR errors.
string similarity is used to correct keyboard entering errors.
string similarity is used to find most matching sequence of two DNAs in bioinformatics.
But as one size does not fit all. Every string similarity algorithm is designed for a specific usage though most of them are similar. For example Levenshtein_distance is about how many char you change to make two strings equal.
kitten → sitten
Here distance is 1 character change. You may give different weights to deletion, addition and substitution. For example OCR errors and keyboard errors give less weight for some changes. OCR ( some chars are very similar to others ), keyboard some chars are very near to each other. Bioinformatic string similarity allows a lot of insertion.
Your second example of "Jaro–Winkler distance metric is designed and best suited for short strings such as person names"
Therefore you should keep in your mind about your problem.
I want to use string similarity functions to find corrupted data in my database.
How your data is corrupted? Is it a user error , similar to keyboard input error? Or is it similar to OCR errors? Or something else entirely?