I am implementing this clustering algorithm http://www.sciencemag.org/content/344/6191/1492.full (free access version) in C in my software and I need to build a distance matrix, but in some cases, the size of the dataset (after redundancy removal) is huge (n > 1 500 000 and it is even larger, going up to 4 000 000 on more complex cases). My problem is, even allocating the upper triangular matrix would be ( (1500000*1500000) - 1500000) * 0.5 * sizeof(float) =~ 5.5e12 Bytes. So, memory allocation fails (even on our computing nodes with 256 GB of RAM) and writing to disk is not an option in this case.
Beside cutting down the size (which I will look) of the dataset to cluster, anybody has an idea of a technique I could use to approximate and store this amount of information ?
N.B. Like I said in the title, I am using C and I can also use C++. Also, if anybody has another clustering algorithm (where the number of clusters is determined with the algorithm itself) to use, please suggest it to me.
Thanks in advance for your time,
You probably have to step back and reconsider your algorithm.
First, perhaps you don't need to have distance matrix between all pairs of data points. Perhaps you could group together similar data points into data bins and then create a matrix of distances between bins.
That is, start by computing pairwise distances between points, but keep only relatively small distances and pointers to "the other" point. Kind of a very sparse matrix of shorter distances. This is straightforward to do in parallel.
Then create data bins that contain groups of points with mutually small distances between them. For example, if you threshold "short" distances in such manner that bins would hold on average, say, 50 data points you'd get 1500000/50=30000 bins.
Then go through your data again and compute distances between bins. That would produce 30000^2 distances, which is a matrix of about 4GB. In addition you still have 30000 with 50^2 distances within bins, which is another 300MB. This amount of data is quite manageable.
If replacing the distance between data points with a distance between the corresponding bins is sufficient precision for your application that would work. It all depends on the kind of data you are dealing with and the precision requirements of your application.
I've been doing some operations with 2D points expressed in homogeneous coordinates (x, y, w). Sometimes one of the coordinates becomes very large, and this can easily affect subsequent results.
For example, determining intersections can be calculated easily with a vector x-product. This can produce large numbers. Eg. (50, 100, 1) x (-100, 50, 1) = (50, -150, 12500)
I feel these results should be somehow normalised. In the example above, simply by dividing all coordinates by 12500 seems sensible. In general I can see 2 ways:
divide by the coordinate with the largest absolute value (may not be w), or
divide by w (if w != 0) so that every point is expressed as either (x, y, 0) or (x, y, 1).
So my question is, which way is better and why?
I'm using c# with float values, if that's of any practical relevance.
Downscaling by the max absolute value of the components is the safer option among all correct ones - you never have to worry about overflows as long as the max > 1.0. Dividing by w is only required to convert to Euclidean points.
Incidentally, working with floats is rarely a good idea in internal calculations, although it may make sense for storing final results that correspond to physical quantities you measure. When doing geometrical calculations, in my experience, truncation errors often propagate fast enough, even with well conditioned algorithms, to make calculated results in single precision rather worthless.
right now I am stuck solving the following "semi"-mathematical Problem.
I would like to partition an n-dimensinal restricted space (a hypercube to be precise)
D={(x_1, ...,x_n), x_i \in IR and -limits<=x_i<=limits \forall i<=n} Into smaller cubes.
Meaning I would like to specify n,limits,m where m would be the number of partitions per side of the cube - 2*limits/m would be the length of the small cubes and I would get m^n such cubes.
Now I would like to return a vector of vectors containing some distinct coordinates of these small cubes. (or perhaps one could represent the cubes as objects which are characterized by a vector pointing to the "left" outer corner ? )
Basically I have no idea whether something like that is even doable using C++. Implementing this for fixed n does not pose a problem. But I would like to enable the user to have free choice of the dimension.
Background: Something like that would be priceless in optimization. Where one would partition the space into smaller ones and use e.g. a genetic algorithms on each of the subspaces and later compare the results. Thus huge initial Populations could be avoided and the search results drastically improved.
Also I am just curious whether sth. like that is doable :)
My Suggestion: Use B+ Trees ?
Let m be the number of partitions per dimension, i.e. per edge, of the hypercube D.
Then there are m^n different subspaces S of D, like you say. Let the subspaces S be uniquely represented by integer coordinates S=[y_1,y_2,...,y_n] where the y_i are integers in the range 1, ..., m. In Cartesian coordinates, then, S=(x_1,x_2,...,x_n) where Delta*(y_i-1)-limits <= x_i < Delta*y_i-limits, and Delta=2*limits/m.
The "left outer corner" or origin of S you were looking for is just the point corresponding to the smallest x_i, i.e. the point (Delta*(y_1-1)-limits, ..., Delta*(y_n-1)-limits). Instead of representing the different S by this point, it makes a lot more sense (and will be faster in a computer) to represent them using the integer coordinates above.
Short version: how to most efficiently represent and add two random variables given by lists of their realizations?
Mildly longer version:
for a workproject, I need to add several random variables each of which is given by a list of values. For example, the realizations of rand. var. A are {1,2,3} and the realizations of B are {5,6,7}. Hence, what I need is the distribution of A+B, i.e. {1+5,1+6,1+7,2+5,2+6,2+7,3+5,3+6,3+7}. And I need to do this kind of adding several times (let's denote this number of additions as COUNT, where COUNT might reach 720) for different random variables (C, D, ...).
The problem: if I use this stupid algorithm of summing each realization of A with each realization of B, the complexity is exponential in COUNT. Hence, for the case where each r.v. is given by three values, the amount of calculations for COUNT=720 is 3^720 ~ 3.36xe^343 which will last till the end of our days to calculate:) Not to mention that in real life, the lenght of each r.v. is gonna be 5000+.
Solutions:
1/ The first solution is to use the fact that I am OK with rounding, i.e. having integer values of realizations. Like this, I can represent each r.v. as a vector and for at the index corresponding to a realization I have a value of 1 (when the r.v. has this realization once). So for a r.v. A and a vector of realizations indexed from 0 to 10, the vector representing A would be [0,1,1,1,0,0,0...] and the representation for B would be [0,0,0,0,0,1,1,1,0,0,10]. Now I create A+B by going through these vectors and do the same thing as above (sum each realization of A with each realization of B and codify it into the same vector structure, quadratic complexity in vector length). The upside of this approach is that the complexity is bound. The problem of this approach is that in real applications, the realizations of A will be in the interval [-50000,50000] with a granularity of 1. Hence, after adding two random variables, the span of A+B gets to -100K, 100K.. and after 720 additions, the span of SUM(A, B, ...) gets to [-36M, 36M] and even quadratic complexity (compared to exponential complexity) on arrays this large will take forever.
2/ To have shorter arrays, one could possibly use a hashmap, which would most likely reduce the number of operations (array accesses) involved in A+B as the assumption is that some non-trivial portion of the theoreical span [-50K, 50K] will never be a realization. However, with continuing summing of more and more random variables, the number of realizations increases exponentially while the span increases only linearly, hence the density of numbers in the span increases over time. And this would kill the hashmap's benefits.
So the question is: how can I do this problem efficiently? The solution is needed for calculating a VaR in electricity trading where all distributions are given empirically and are like no ordinary distributions, hence formulas are of no use, we can only simulate.
Using math was considered as the first option as half of our dept. are mathematicians. However, the distributions that we're going to add are badly behaved and the COUNT=720 is an extreme. More likely, we are going to use COUNT=24 for a daily VaR. Taking into account the bad behaviour of distributions to add, for COUNT=24 the central limit theorem would not hold too closely (the distro of SUM(A1, A2, ..., A24) would not be close to normal). As we're calculating possible risks, we'd like to get a number as precise as possible.
The intended use is this: you have hourly casflows from some operation. The distribution of cashflows for one hour is the r.v. A. For the next hour, it's r.v. B, etc. And your question is: what is the largest loss in 99 percent of cases? So you model the cashflows for each of those 24 hours and add these cashflows as random variables so as to get a distribution of the total casfhlow over the whole day. Then you take the 0.01 quantile.
Try to reduce the number of passes required to make the whole addition, possibly reducing it to a single pass for every list, including the final one.
I don't think you can cut down on the total number of additions.
In addition, you should look into parallel algorithms and multithreading, if applicable.
At this point, most processors are able to perform additions in parallel, given proper instrucions (SSE), which will make the additions many times faster(still not a cure for the complexity problem).
As you said in your question, you're going to need an awful lot of computation to get the exact answer. So it's not going to happen.
However, as you're dealing with random values, it would be possible to apply some mathmatics to the problem. Wouldn't the result of all these additions result in something that approaches the normal distribution? For example, consider rolling a single dice. Each number has equal probability so the realisations don't follow a normal distribution (actually, they probably do, there was a program on BBC4 last week about it and it showed that lottery balls had a normal distribution to their appearance). However, if you roll two dice and sum them, then the realisations do follow a normal distribution. So I think the result of your computation is going to approximate a normal distribution so it becomes a problem of finding the average value and the sigma value for a given set of inputs. You can workout the upper and lower bounds for each input as well as their averages and I'm sure a bit of Googling will provide methods for applying functions to normal distributions.
I guess there is a corollary question and that is what the results are used for? Knowing how the results are used will inform the decision on how the results are created.
Ignoring the programmatic solutions, you can cut down the total number of additions quite significantly as your data set grows.
If we define four groups W, X, Y and Z, each with three elements, by your own maths this leads to a large number of operations:
W + X => 9 operations
(W + X) + Y => 27 operations
(W + X + Y) + Z => 81 operations
TOTAL: 117 operations
However, if we assume a strictly-ordered definition of your "add" operation so that two sets {a,b} and {c,d} always result in {a+c,a+d,b+c,b+d} then your operation is associative. That means that you can do this:
W + X => 9 operations
Y + Z => 9 operations
(W + X) + (Y + Z) => 81 operations
TOTAL: 99 operations
This is a saving of 18 operations, for a simple case. If you extend the above to 6 groups of 3 members, the total number of operations can be dropped from 1089 to 837 - almost 20% saving. This improvement is more pronounced the more data you have (more sets or more elements will give more savings).
Further, this opens the problem to better parallelisation: if you have 200 groups to process, you can start by combining the 100 pairs in parallel, then the 50 pairs or results, then 25, etc. This will allow a large degree of parallelism that should give you much better performance. (For example, 720 sets would be added in ~10 parallel operations as each parallel add will allow increasing COUNT by a factor of 2.)
I'm absolutely no expert on this, but it would seem an ideal problem for using the parallel procesing capability of a typical GPU - my understanding is that something like CUDA would make short work of processing all these calculations in parallel.
EDIT: If your real question is "what's your largest loss" then this is a much easier problem. Given that every value in the ultimate set is the sum of one value from each "component" set, your biggest loss will generally be found by combining the lowest value from each component set. Finding these lower values (one value per set) is a much simpler job, and you then only need sum together that limited set of values.
There are basically two methods. An approximative one and an exact one...
Approximative method models the sum of random variables by a lot of samplings. Basically, having random variables A, B we randomly sample from each r.v. 50K times, add the sampled values (here SSE can help a lot) and we have a distribution of A+B. This is how mathematicians would do this in Mathematica.
Exact method utilizes something Dan Puzey proposed, namely summing only some small portion of each r.v.'s density. Let's say we have random variables with the following "densities" (where each value is of the same likelihood for simplicity sake)
A = {-5,-3,-2}
B = {+0,+1,+2}
C = {+7,+8,+9}
The sum of A+B+C is going to be
{2,3,3,4,4,4,4,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,8,8,8,9}
and if I want to know the whole distribution precisely, I have no other choice than summing each elem of A with each elem of B and then each elem of this sum with each elem of C. However, if I only want the 99% VaR of this sum, i.e. 1% percentile of this sum, I only have to sum the smallest elements of A,B,C.
More precisely, I will take nA,nB,nC smallest elements from each distribution. To determine nA,nB,nC let's set these to 1 first. Then, increase nA by one if A[nA] = min( A[nA], B[nB], C[nC]) (counting on that A,B,C are sorted). This way, I can get the nA, nB, nC smallest elements of A,B,C which I will have to sum together (each with each other) and take the X-th smallest sum (where X is 1% multiplied by total combination count of sums, i.e. 3*3*3 for A,B,C). This also tells when to stop increasing nA,nB,nC - stop when nA*nB*nC > X.
However, like this I am doing the same redundancy again, i.e. I am calculating the whole distribution of A+B+C left of the 1% percentile. Even this will be MUCH shorter than calculating the whole distro of A+B+C, however. But I believe there should be a simple iterative algo to tell exaclty the the given VaR number in O(a*b) where a is the number of added r.v.s and b is the max number of elements in the density of each r.v.
I will be glad for any comments on whether I am correct.
I'm given m places (x,y coordinates).
I have n requests of finding the closest place to a given point P(x,y); (The minimum Euclidian distance)
How can i solve this problem below O(n*m) where n is the number of requests and m the number of places? I could use squared Euclidian distances but it's still n*m.
Try a kd-tree. A high performance library implementation can be found here.
Note: I'm pointing you to an approximate nearest-neighbors search which is optimized for high dimensions. This may be slightly overkill for your application.
Edit:
For a 2d kd-tree, the build time would be O(m*log(m)) and the query time would be O(n*sqrt(m)). This should end up being a net win over the naive solution if your number of queries n, exceeds log(m).
The library means you don't have to implement it so the complexity shouldn't be an issue.
If you want to generalize to high dimension extremely fast querying, check out locality sensitive hashing.
Interesting. To reduce the effect of n, I wonder if perhaps it would help to save the result of each request as you encounter and handle it. A clever result table might shortcut the need to calculate sqrt( x2 + y2) in solving subsequent requests.
The Nearest-Neighbor-Problem, eh? I found Robert Sedgewick Std Book very useful in these cases. He describes Nearest Neighbour Search, too.