High-dimensional clustering from aggregates of observations

High-dimensional clustering from aggregates of observations - data-mining

I have fallen into this weird high-dimensional clustering problem. Here is an analogy to explain it.
Imagine that 2^10 people enter into a forest, and we want to know how many bird species live there.
These birds differ from each other in, say, 128 dimensions, and all dimensions are binary. That is: either a bird has large beak or small beak, either it has a blue wing or it doesn't, etc. (Each bird species can be represented by 128 bits)
My problem is that when the guys get off the forest, we only have the aggregates of their observations:
"I saw 8 birds, 3 had blue beaks (5 didn't), 4 had blue wings (4 didn't), 1 had a large beak (7 didn't), etc". They do not report on the individual characteristics of their observations, but only on the aggregates of their observations.
There are two additional constraints:
i) all species are observed at least once;
ii) The number of species is small (~2^5).
Of course, we can compile the aggregate of their aggregates (of 3000 observations, 357 birds had large beaks, etc..). But what about the clusters?
So the questions are:
How can we find out how many species live there?
How can we find out the characteristics of each species?

Since 2^128 = 340282366920938463463374607431768211456, you would need a pretty high sample size to draw valid conclusions. Every bird observed could easily be unique.

If x an aggregate observation of a set of birds by a person, then you can approximate it by the matrix product Dz where D is a matrix whose columns represent characteristics of individual birds, and z is a vector of the counts of each bird.
If you assume that only a small number of birds are observed, then this acts as a constraint on the magnitude of z.
This problem is very similar to the sparse dictionary learning problem.
Here are a couple of links that both describe sparse dictionary learning (and related problems) and provide software to solve it: http://spams-devel.gforge.inria.fr/ and http://www.ux.uis.no/~karlsk/dle/index.html

Related

Algorithm Design: Best Way to Represent a 2D Grid, with Boundary Digits, in C++?

I like working on algorithms in my spare time to improve my algorithm design skills. I tackle Jane Street's monthly puzzles as my 'monthly challenge'. I've previously developed algorithms to solve their October Puzzle, and have solved their November puzzle by hand.
I solved their November puzzle (Hooks #6) by hand, but only because I'm not sure how to solve it (and future puzzles) that involve a grid with a numbered border, computationally. I'm not sure how I'd go about setting the foundation this type of problem.
For instance, many of their problems involve a 2D grid with numbers on the border of the grid. Furthermore, a recurring theme is that whatever is in the grid must meet multiple conditions that involve looking at that number from different sides of the grid. For example, if I have the following 2 by 2 grid, with 4 numbers outside its boundaries,
_ _
5| | 45
5|_ _| 15
Place four numbers in the grid such that, when you
look at the grid from the left, at least one number
in that row is the border number.
In the case of the top left of the 2 by 2 grid,
looking at it from the left means the number 5 must be in either (0,0) or (0,1).
In addition, when looking at that row from the right, the product
of the numbers in the row must equal the boundary number on the right.
In the case of the top right of the 2 by 2 grid,
looking at it from the right means the number 9 must be in either (0,0)
or (0,1), as 9 * 5 = 45.
Hence, the first row in the 2 by 2 grid can either be 5 and 9, or 9 and 5.
One of the solutions for this problem, by hand, is
(0,0) = 5, (0,1) = 9, (1,0) = 5, (1,1) = 3
but how can I go about this computationally?
How can I go about translating these grid-like problems with differing conditions based on the position one "looks" at the grid into code?
Thanks!

I'm not convinced these puzzles were meant to be solved via code. They seem idiosyncratic and complicated enough that coding them would be time-consuming.
That said, the November puzzle in particular seems to have rather limited options for "fixing" a number placement. I would consider a backtracking algorithm that keeps a complete board state and has ready methods that evaluate if a particular row or column is not breaking a rule, as well as the "free square" rule.
Then try each possible placement of the numbers given by the black indicators, ordered -- there aren't that many, considering they concatenate squares -- and call the evaluation on the affected rows and columns. Given the constraints, I think wrong branches would likely terminate quickly.
It seems to me it's more or less the best we can do since there don't seem to be clear heuristics to indicate a branch is more likely to succeed.

If you are looking for a data structure to represent one filled grid, I would recommend a struct row containing numbers left and right and a std::vector of numbers. A grid would be a vector of rows. You can write methods that allow you to pass it functions that check conditions on the rows.
But solving these problems in a generic way seems complicated to me. Different rules can mean very different approaches to solving them. Of course, if the instances are always this small, one can probably just try all (reasonable) possible fillings of the grid. But this will very fast become infeasible.
You can maybe implement somewhat generic algorithms, if there are rules that are similar to each other. For example a fixed value for the sum of all numbers in a row is a very similar problem to having a fixed value for the product.
But without constraining the possible rules and finding some similarities in them, you will have to write specific solver code for each and every rule.

How to find 2 lines given only points?

I have this image:
Humans can tell that two lines can be fitted through the points. A naive algorithm would put a horizontal best fit line. Is there an algorithm that best fits a series of points while ignoring distant outliers?

There are robust estimation techniques to fit a model to noisy data, such as RANSAC. You would need to fit one line, exclude all the points that belong to that line, and the fit the second line to the remaining points.

Straight from David Forsyth web page (author of the book: Forsyth, David A. and Jean Ponce (2002). Computer Vision: A Modern Approach. Prentice
Hall Professional Technical Reference) the following is Algorithm 15.2:
Hypothesize k lines (perhaps uniformly at random)
or
hypothesize an assignment of lines to points and then fit lines using
this assignment
Until convergence
allocate each point to the closest line
refit lines
end
In your case k is 2.

The Hough transform is suitable for this task. Basically, each point votes for the existence of all lines that pass through it (in a line-parameter-space, e.g. rho-theta for distance from origin and angle). If the parameter space is discretized, then you'll get peaks for each of the lines present in your data. The outliers will have voted for parameters that have little votes from other points, so they will have low count in the parameter-space.
The image below (from Wikipedia) illustrates the concept in the ideal case (the points actually lie on exact lines). With read data, the peaks will be fuzzier, but you'll still be able to distinguish them from the outliers. The pros of this method is that you do not have to hypothesize how many lines there are, and it works well for many types of images/data. The cons are that it may fail if there are many non-linear distractors, such as in natural scenes containing many curved objects.

Algorithm to assemble a simplified jigsaw puzzle where all edges are identified

Are there any kind of algorithms out there that can assist and accelerate in the construction of a jigsaw puzzle where the edges are already identified and each edge is guaranteed to fit exactly one other edge (or no edges if that piece is a corner or border piece)?
I've got a data set here that is roughly represented by the following structure:
struct tile {
int a, b, c, d;
};
tile[SOME_LARGE_NUMBER] = ...;
Each side (a, b, c, and d) is uniquely indexed within the puzzle so that only one other tile will match an edge (if that edge has a match, since corner and border tiles might not).
Unfortunately there are no guarantees past that. The order of the tiles within the array is random, the only guarantee is that they're indexed from 0 to SOME_LARGE_NUMBER. Likewise, the side UIDs are randomized as well. They all fall within a contiguous range (where the max of that range depends on the number of tiles and the dimensions of the completed puzzle), but that's about it.
I'm trying to assemble the puzzle in the most efficient way possible, so that I can ultimately address the completed puzzle using rows and columns through a two dimensional array. How should I go about doing this?

The tile[] data defines an undirected graph where each node links with 2, 3 or 4 other nodes. Choose a node with just 2 links and set that as your origin. The two links from this node define your X and Y axes. If you follow, say, the X axis link, you will arrive at a node with 3 links — one pointing back to the origin, and two others corresponding to the positive X and Y directions. You can easily identify the link in the X direction, because it will take you to another node with 3 links (not 4).
In this way you can easily find all the pieces along one side until you reach the far corner, which only has two links. Of all the pieces found so far, the only untested links are pointing in the Y direction. This makes it easy to place the next row of pieces. Simply continue until all the pieces have been placed.

This might be not what you are looking for, but because you asked for "most efficient way possible", here is a relatively recent scientific solution.
Puzzles are a complex combinatorial problem (NP-complete) and require some help from Academia to solve them efficiently. State of the art algorithms was recently beaten by genetic algorithms.
Depending on your puzzle sizes (and desire to study scientific stuff ;)) you might be interested in this paper: A Genetic Algorithm-Based Solver for Very Large Jigsaw Puzzles . GAs would work around in surprising ways some of the problems you encounter in classic algorithms.
Note that genetic algorithms are embarrassingly parallel, so there is a straightforward way to do calculations on parallel machines, such as multi-core CPUs, GPUs (CUDA/OpenCL) and even distributed/cloud frameworks. Which makes them hundreds to thousands times faster. GPU-accelerated GAs unlock puzzle sizes unavailable for conventional algorithms.

Fast adding random variables in C++

Short version: how to most efficiently represent and add two random variables given by lists of their realizations?
Mildly longer version:
for a workproject, I need to add several random variables each of which is given by a list of values. For example, the realizations of rand. var. A are {1,2,3} and the realizations of B are {5,6,7}. Hence, what I need is the distribution of A+B, i.e. {1+5,1+6,1+7,2+5,2+6,2+7,3+5,3+6,3+7}. And I need to do this kind of adding several times (let's denote this number of additions as COUNT, where COUNT might reach 720) for different random variables (C, D, ...).
The problem: if I use this stupid algorithm of summing each realization of A with each realization of B, the complexity is exponential in COUNT. Hence, for the case where each r.v. is given by three values, the amount of calculations for COUNT=720 is 3^720 ~ 3.36xe^343 which will last till the end of our days to calculate:) Not to mention that in real life, the lenght of each r.v. is gonna be 5000+.
Solutions:
1/ The first solution is to use the fact that I am OK with rounding, i.e. having integer values of realizations. Like this, I can represent each r.v. as a vector and for at the index corresponding to a realization I have a value of 1 (when the r.v. has this realization once). So for a r.v. A and a vector of realizations indexed from 0 to 10, the vector representing A would be [0,1,1,1,0,0,0...] and the representation for B would be [0,0,0,0,0,1,1,1,0,0,10]. Now I create A+B by going through these vectors and do the same thing as above (sum each realization of A with each realization of B and codify it into the same vector structure, quadratic complexity in vector length). The upside of this approach is that the complexity is bound. The problem of this approach is that in real applications, the realizations of A will be in the interval [-50000,50000] with a granularity of 1. Hence, after adding two random variables, the span of A+B gets to -100K, 100K.. and after 720 additions, the span of SUM(A, B, ...) gets to [-36M, 36M] and even quadratic complexity (compared to exponential complexity) on arrays this large will take forever.
2/ To have shorter arrays, one could possibly use a hashmap, which would most likely reduce the number of operations (array accesses) involved in A+B as the assumption is that some non-trivial portion of the theoreical span [-50K, 50K] will never be a realization. However, with continuing summing of more and more random variables, the number of realizations increases exponentially while the span increases only linearly, hence the density of numbers in the span increases over time. And this would kill the hashmap's benefits.
So the question is: how can I do this problem efficiently? The solution is needed for calculating a VaR in electricity trading where all distributions are given empirically and are like no ordinary distributions, hence formulas are of no use, we can only simulate.
Using math was considered as the first option as half of our dept. are mathematicians. However, the distributions that we're going to add are badly behaved and the COUNT=720 is an extreme. More likely, we are going to use COUNT=24 for a daily VaR. Taking into account the bad behaviour of distributions to add, for COUNT=24 the central limit theorem would not hold too closely (the distro of SUM(A1, A2, ..., A24) would not be close to normal). As we're calculating possible risks, we'd like to get a number as precise as possible.
The intended use is this: you have hourly casflows from some operation. The distribution of cashflows for one hour is the r.v. A. For the next hour, it's r.v. B, etc. And your question is: what is the largest loss in 99 percent of cases? So you model the cashflows for each of those 24 hours and add these cashflows as random variables so as to get a distribution of the total casfhlow over the whole day. Then you take the 0.01 quantile.

Try to reduce the number of passes required to make the whole addition, possibly reducing it to a single pass for every list, including the final one.
I don't think you can cut down on the total number of additions.
In addition, you should look into parallel algorithms and multithreading, if applicable.
At this point, most processors are able to perform additions in parallel, given proper instrucions (SSE), which will make the additions many times faster(still not a cure for the complexity problem).

As you said in your question, you're going to need an awful lot of computation to get the exact answer. So it's not going to happen.
However, as you're dealing with random values, it would be possible to apply some mathmatics to the problem. Wouldn't the result of all these additions result in something that approaches the normal distribution? For example, consider rolling a single dice. Each number has equal probability so the realisations don't follow a normal distribution (actually, they probably do, there was a program on BBC4 last week about it and it showed that lottery balls had a normal distribution to their appearance). However, if you roll two dice and sum them, then the realisations do follow a normal distribution. So I think the result of your computation is going to approximate a normal distribution so it becomes a problem of finding the average value and the sigma value for a given set of inputs. You can workout the upper and lower bounds for each input as well as their averages and I'm sure a bit of Googling will provide methods for applying functions to normal distributions.
I guess there is a corollary question and that is what the results are used for? Knowing how the results are used will inform the decision on how the results are created.

Ignoring the programmatic solutions, you can cut down the total number of additions quite significantly as your data set grows.
If we define four groups W, X, Y and Z, each with three elements, by your own maths this leads to a large number of operations:
W + X => 9 operations
(W + X) + Y => 27 operations
(W + X + Y) + Z => 81 operations
TOTAL: 117 operations
However, if we assume a strictly-ordered definition of your "add" operation so that two sets {a,b} and {c,d} always result in {a+c,a+d,b+c,b+d} then your operation is associative. That means that you can do this:
W + X => 9 operations
Y + Z => 9 operations
(W + X) + (Y + Z) => 81 operations
TOTAL: 99 operations
This is a saving of 18 operations, for a simple case. If you extend the above to 6 groups of 3 members, the total number of operations can be dropped from 1089 to 837 - almost 20% saving. This improvement is more pronounced the more data you have (more sets or more elements will give more savings).
Further, this opens the problem to better parallelisation: if you have 200 groups to process, you can start by combining the 100 pairs in parallel, then the 50 pairs or results, then 25, etc. This will allow a large degree of parallelism that should give you much better performance. (For example, 720 sets would be added in ~10 parallel operations as each parallel add will allow increasing COUNT by a factor of 2.)
I'm absolutely no expert on this, but it would seem an ideal problem for using the parallel procesing capability of a typical GPU - my understanding is that something like CUDA would make short work of processing all these calculations in parallel.
EDIT: If your real question is "what's your largest loss" then this is a much easier problem. Given that every value in the ultimate set is the sum of one value from each "component" set, your biggest loss will generally be found by combining the lowest value from each component set. Finding these lower values (one value per set) is a much simpler job, and you then only need sum together that limited set of values.

There are basically two methods. An approximative one and an exact one...
Approximative method models the sum of random variables by a lot of samplings. Basically, having random variables A, B we randomly sample from each r.v. 50K times, add the sampled values (here SSE can help a lot) and we have a distribution of A+B. This is how mathematicians would do this in Mathematica.
Exact method utilizes something Dan Puzey proposed, namely summing only some small portion of each r.v.'s density. Let's say we have random variables with the following "densities" (where each value is of the same likelihood for simplicity sake)
A = {-5,-3,-2}
B = {+0,+1,+2}
C = {+7,+8,+9}
The sum of A+B+C is going to be
{2,3,3,4,4,4,4,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,8,8,8,9}
and if I want to know the whole distribution precisely, I have no other choice than summing each elem of A with each elem of B and then each elem of this sum with each elem of C. However, if I only want the 99% VaR of this sum, i.e. 1% percentile of this sum, I only have to sum the smallest elements of A,B,C.
More precisely, I will take nA,nB,nC smallest elements from each distribution. To determine nA,nB,nC let's set these to 1 first. Then, increase nA by one if A[nA] = min( A[nA], B[nB], C[nC]) (counting on that A,B,C are sorted). This way, I can get the nA, nB, nC smallest elements of A,B,C which I will have to sum together (each with each other) and take the X-th smallest sum (where X is 1% multiplied by total combination count of sums, i.e. 3*3*3 for A,B,C). This also tells when to stop increasing nA,nB,nC - stop when nA*nB*nC > X.
However, like this I am doing the same redundancy again, i.e. I am calculating the whole distribution of A+B+C left of the 1% percentile. Even this will be MUCH shorter than calculating the whole distro of A+B+C, however. But I believe there should be a simple iterative algo to tell exaclty the the given VaR number in O(a*b) where a is the number of added r.v.s and b is the max number of elements in the density of each r.v.
I will be glad for any comments on whether I am correct.

What is exactly a measure in star schema in data warehouse design?

Star schema consists of dimension and fact tables.
Fact tables contain foreign keys for each dimension and in addition to that, it contains "measures". What exactly comprises of this measure ?
Is it some aggregate function's answer that is stored ?

Basically yes.
If you had a simple grid
Salary Januari Februari March April May June
Q1 Q2
Me 1100 1100 1100 1100 1500 1500
Collegue1 2000 2000 2000 0 0 0
Time is a hierarchical dimension with two levels (shown).
The other dimension shown is 'EmployeeID'. The other dimension (not shown) could be in the PointOfView (e.g. Budget/Actual).
The Amount (1100, e.g.) is the Measure and it constitutes your facts (the non-identifying parts of the facts). The dimensions define consolidation functions for each measure on the various levels (E.g. Amount(Q1) == SUM(Amount(January...March))). Note that the consolidation will behave differently depending on the measure (e.g. the income tax % will not be summed, but somehow consolidated: how exactly is the art of OLAP Cube design).
(trivia: you can have calculated measures, that use MDX to query e.g. the deviation of Amount in comparison the the preceding Quarter, the Average salary acoss the whole quarter etc.; it will be pretty clear that again, the consolidation formulas require thought).
At this point you will start to see that designing the consolidation rules depends on the order in which the rules are calculated (if the formula for 'salary deviation %' is is evaluated FIRST and then consolidated, you need to average it; however if the raw SALARY measure is consolidated (summed) to the Q1,Q2 level first, then the derived Measure can be calculated like it was at the lowest level.
Now things become more fun when deciding how to store the cube. Basically two ways exist:
precalculate all cells (including all consolidations in all scenarios)
calculate on the fly
It won't surprise anyone that most OLAP engines have converged on hybrid methods (HOLAP), where significant parts of frequently accessed consolidation levels are pre-calculated and stored, and other parts are calculated on the fly.
Some will store the underlying data in a standard RDBMS (ROLAP) other won't (OLAP). The engines focused on high performance tend to keep all data in precalculated cubes (only resorting to 'many small sub-cubes' for very sparse dimensions).
Well, anywyas, this was a bit of a rant. I liked rambling off what I once learned when doing datawarehousing and OLAP

Fact and measure are synonyms afaik. Facts are data: sales, production, deliveries, etc. Dimensions are information tied to the fact (time, location, department).

Measures are one of two kinds of things.
Measures. Measurements. Numbers with units. Dollars, weights, volumes, sizes, etc. Measurements.
Aggregates. Sums (or sometimes averages) of data. It might be data in the warehouse: pre-computed aggregates for performance reasons. Or it might be data that can't be acquired (or isn't needed) because it's too detailed. Too high volume or something.
The most important thing about a fact table is that the non-key measures are actual measurements with units.

If it would be an adjacent tree model it would be the title-field or any other field that contains the data.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js