Data mining, curse of dimensionality

Let's say I have a dataset represented as a matrix X of dimensions n x m, with m large. I would like to quickly reduce the dimension m while approximately preserving the pairwise distances between the rows of X. One way to accomplish this would be to:
create a mapping matrix A by initializing it with all 0's, randomly choosing 1/6 of its values to be +1 and 1/6 of its values to be -1, and then multiplying X by A.
Am I right or wrong?

If you are preserving the distances, then you are also preserving the curse of dimensionality. The distances will still all be too similar to be useful...
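For concreteness, here is a minimal C++ sketch (identifiers are my own) of the projection the question describes: fill A with zeros, set 1/6 of the entries to +1 and 1/6 to -1, and multiply. The sqrt(3/k) scale factor is an Achlioptas-style addition that keeps expected squared distances unchanged; it is not part of the original question.

#include <cmath>
#include <random>
#include <vector>

// Sketch: project the n x m matrix X (row-major) down to n x k with a sparse
// random matrix A whose entries are +1 with probability 1/6, -1 with
// probability 1/6, and 0 otherwise, as described in the question.
std::vector<double> random_project(const std::vector<double>& X,
                                   int n, int m, int k, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> coin(0.0, 1.0);

    // Build A (m x k, row-major).
    std::vector<double> A(static_cast<std::size_t>(m) * k, 0.0);
    for (double& a : A) {
        double r = coin(rng);
        if (r < 1.0 / 6.0)      a = +1.0;
        else if (r < 2.0 / 6.0) a = -1.0;
    }

    // Y = sqrt(3/k) * X * A; the scale factor is my addition (Achlioptas-style
    // scaling), chosen so that expected squared distances are preserved.
    const double scale = std::sqrt(3.0 / k);
    std::vector<double> Y(static_cast<std::size_t>(n) * k, 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < k; ++j) {
            double s = 0.0;
            for (int t = 0; t < m; ++t)
                s += X[static_cast<std::size_t>(i) * m + t] *
                     A[static_cast<std::size_t>(t) * k + j];
            Y[static_cast<std::size_t>(i) * k + j] = scale * s;
        }
    return Y;
}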

Related

Understanding Curve Global Approximation algorithm

Problem description
I am trying to understand and implement the Curve Global Approximation, as proposed here:
https://pages.mtu.edu/~shene/COURSES/cs3621/NOTES/INT-APP/CURVE-APP-global.html
To implement the algorithm it is necessary to calculate the basis function coefficients, as described here:
https://pages.mtu.edu/~shene/COURSES/cs3621/NOTES/spline/B-spline/bspline-curve-coef.html
I have trouble wrapping my head around some of the details.
First, there is some trouble with the variable nomenclature. Specifically, I am tripped up by the fact that u appears both as the function parameter and as an element of the knot vector. Currently I assume that I first decide how many knots I want in my knot vector for the approximation; let us say I want 10.
I assume this is what goes in as input to the coefficient calculation algorithm?
The reason this tripped me up is the sentence:
Let u be in knot span [u_k, u_{k+1})
If the input parameter u were one of the elements of the knot vector, there would be no need for an interval. So I assume u is actually one of the parameters (t_k?) defined earlier.
Is that assumption correct?
Most important question: I am trying to get my N to work with the first of the two links, i.e. the implementation of the global curve approximation. Looking at the matrix dimensions (where the P, Q, and N dimensions are mentioned), it seems that N is supposed to have n rows and h - 1 columns. That means N has as many rows as there are data points and a number of columns equal to the curve degree minus one. However, when I look at the implementation details of N in the second link, a row of N is initialized with n elements. I refer to this:
Initialize N[0..n] to 0; // initialization
But I also need to calculate N for all parameters, which in turn correspond to the data points. So the resulting matrix has dimension (n x n). This does not correspond to the previously mentioned (n x (h - 1)).
To go further, in the link describing the approximation algorithm, N is used to calculate Q. However, directly after that I am asked to calculate N, which I supposedly already had; how else would I have calculated Q? Is this even the same N? Do I have to calculate a new N for the desired number of control points?
Conclusion
If somebody has any helpful insight on this, please do share. I aim to implement this in C++ with Eigen, for its usefulness w.r.t. solving M * P = Q and matrix calculations in general. Currently I am at a loss, though. Everything seems more or less clear, except for N, and especially its dimensions and whether it needs to be calculated multiple times or not.
The 2nd link is telling you how to compute the basis functions of a B-spline curve at parameter u, where the B-spline curve is defined by its degree, knot vector [u0, ..., um], and control points. So, for your first question, if you want to have 10 knots in your knot vector, then a typical knot vector will look like:
[0, 0, 0, 0, 0.3, 0.7, 1, 1, 1, 1]
This will be a B-spline curve of degree 3 with 6 control points.
For your 2nd question: the input parameter u is generally not one of the knots [u0, u1, ..., um]. The input parameter u is simply the parameter at which we would like to evaluate the B-spline curve. The value of u varies from 0 to 1 (assuming the knot vector also ranges from 0 to 1).
For your 3rd question: N (in the first link) is a matrix in which each element is an N_{i,p}(t_j). So the N[] array computed in the 2nd link is actually one row vector of the matrix N in the first link.
I hope my answers have cleared up some of your confusion.
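To make the row-vector interpretation concrete, here is a rough C++ sketch (names and structure are my own, not the notation of the linked pages) that computes one row of N with the Cox-de Boor recursion, assuming a clamped knot vector, degree p, and h + 1 control points. Evaluating it at every parameter t_j (one per data point) and stacking the rows gives the full matrix N used in the approximation.

#include <vector>

// Evaluate all basis functions N_{0,p}(u), ..., N_{h,p}(u) of a degree-p
// B-spline with h+1 control points and a clamped knot vector "knots"
// (knots.size() must be h + p + 2). The returned vector is one row of the
// matrix N from the global approximation algorithm.
std::vector<double> basisRow(double u, int p, int h,
                             const std::vector<double>& knots) {
    std::vector<double> N(h + 1, 0.0);

    // Degree 0: N_{i,0}(u) = 1 exactly when knots[i] <= u < knots[i+1].
    for (int i = 0; i <= h; ++i)
        if (u >= knots[i] && u < knots[i + 1]) N[i] = 1.0;
    if (u >= knots[h + 1]) N[h] = 1.0;   // convention for u at the right end

    // Cox-de Boor recursion, raising the degree one step at a time.
    // Updating in ascending i keeps the needed degree-(d-1) value in N[i+1].
    for (int d = 1; d <= p; ++d)
        for (int i = 0; i <= h; ++i) {
            double left = 0.0, right = 0.0;
            double den1 = knots[i + d] - knots[i];
            double den2 = knots[i + d + 1] - knots[i + 1];
            if (den1 > 0.0) left = (u - knots[i]) / den1 * N[i];
            if (den2 > 0.0 && i + 1 <= h)
                right = (knots[i + d + 1] - u) / den2 * N[i + 1];
            N[i] = left + right;
        }
    return N;
}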

Transposing dimensions after a reshape: when it is required?

tf.reshape can change the dimensions of a tensor, but is there any guarantee about how the data will be ordered along each dimension?
Suppose I have a tensor A[objects, x, t] that describes some value across different objects, positions, and times, and another, B[x, t], that spans only positions and times. I usually have to do reshapes in order to copy B along dimension 0, like this:
B_res = tf.tile( tf.reshape( B , [1, X_SIZE, T_SIZE]), tf.pack([OBJECT_SIZE,1,1]))
some_op = tf.mul( A, B_res )
The problem I see is that when X_SIZE == T_SIZE, reshape does not have any way of knowing how to arrange B's data; for all I know, it could be aligning the data from dimension 0 of B along dimension 2 of B_res!
Are there any rules for how reshape orders data? I want to know whether a few tf.transpose operations are required after a certain tf.reshape.
In memory, arrays/tensors are really just 1D objects. So say you need to store a 10x10 array. In memory this is just a 1D array of length 100 where the first 10 elements correspond to the first row. This is known as row-major ordering. Say you want to reshape it to the shape 20x5. The underlying memory is not changed by the reshape, so the first ten elements now make up rows 1 and 2. Transpose, on the other hand, physically reorders the memory so as to maintain row-major ordering while changing the location of the dimensions.
Also, I think you are tiling unnecessarily. You should read up on broadcasting. I think you could do the following:
some_op = A * tf.reshape(B, [1, X_SIZE, T_SIZE])
In this case it will automatically broadcast B along the first dimension of A.
Note: I am not actually sure whether TensorFlow uses row-major or column-major ordering, but the same concepts still apply.
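The row-major argument is easy to check outside TensorFlow. A small plain-C++ sketch (my own example, not TensorFlow code): a "tensor" is just a flat buffer plus an indexing rule, so reinterpreting a 2x6 view as a 3x4 view changes only the rule, never the order of the twelve stored values.

#include <cstdio>

int main() {
    int buf[12];
    for (int i = 0; i < 12; ++i) buf[i] = i;   // 0, 1, ..., 11 in memory

    // Viewed as 2x6 (row-major): element (r, c) lives at buf[r * 6 + c].
    std::printf("2x6 view, row 0: ");
    for (int c = 0; c < 6; ++c) std::printf("%d ", buf[0 * 6 + c]);

    // "Reshaped" to 3x4: same buffer, element (r, c) is now buf[r * 4 + c],
    // so the first four stored values become row 0. A transpose, by contrast,
    // would have to move values around in memory.
    std::printf("\n3x4 view, row 0: ");
    for (int c = 0; c < 4; ++c) std::printf("%d ", buf[0 * 4 + c]);
    std::printf("\n");
    return 0;
}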

What exactly happens after running the PCA algorithm (Principal Component Analysis)

At the moment I'm working on an image processing project, but I have a conceptual question regarding PCA.
What exactly happens to the matrix after applying PCA to the matrix of an image?
I have not been able to understand this from reading the literature on the subject.
Given an M x N matrix, is the result a matrix M' x N', where M' < M, N' < N, and M' x N' is proportional to M x N?
I'm no expert in PCA but I'll try to explain what I understand.
After applying PCA to the matrix of an image, you get the eigenvectors of that matrix, which represent a number of invariant axes of the matrix. These vectors are all orthogonal to each other.
By measuring how dispersed your original data are along these vectors, you can tell how they are distributed. This can be useful, for example, if you wish to perform pattern categorization based on how the data are distributed along these "axes".
Although not strictly accurate, you can imagine that PCA helps you draw "axes" through the blob of data present in your matrix, where the new "origin" of the axes is the center of your data.
The best part is that the data are spread out the most along the first eigenvector, followed by the second eigenvector, and so on.
I hope I did not confuse you.
There are a number of good references about PCA on Quora in addition to Stack Overflow.
Here are a few examples:
https://www.quora.com/What-is-an-intuitive-explanation-for-PCA
http://www.quora.com/How-to-explain-PCA-in-laymans-terms
Again, I'm no expert and welcome others to correct/educate both rwvaldivia and me.
The concept of PCA is closely related to linear algebra, the domain of mathematics to which matrices belong. A common way to view a matrix is as a set of vectors: an M x N matrix is just M vectors in an N-dimensional space.
Now, a general concept in linear algebra is that the choice of basis vectors is pretty arbitrary. If you choose another basis, you convert your matrix by multiplying it with the old basis expressed in the new basis (itself an N x N matrix).
PCA is a method to find a basis which isn't arbitrary but specific to your matrix. In particular, it orders the basis vectors by the amount to which they're present in your set of vectors. If all your vectors point roughly in the same direction, that direction will be the first basis vector. If they're all roughly in the same plane, the major basis vectors for that plane will be your first two vectors. But remember: you'll generally get a full basis (unless your matrix is degenerate); it's up to you to decide how many of the Components are Principal.
Now here's the real question: what exactly is "the matrix of the image"? You generally can't treat a 1024 x 768 image as a set of 1024 vectors in 768-dimensional space. Sure, you can perform the PCA operation and you will get a result matrix, but what does that even mean? It contains the basis vectors of your input matrix, but that output has no meaning as an image, precisely because your input is not really a set of vectors.
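To make the "set of vectors plus a new basis" view concrete, here is a rough sketch using Eigen (my choice of library, not something from the question): treat the M x N matrix as M points in N-dimensional space, center them, and take the eigenvectors of the covariance matrix as the ordered principal axes.

#include <Eigen/Dense>

// Sketch: return the principal axes of the rows of X (M points in N-dim
// space), as columns ordered from largest to smallest variance.
Eigen::MatrixXd pcaBasis(const Eigen::MatrixXd& X) {
    // Shift the "origin" to the mean of the data, as described above.
    Eigen::MatrixXd centered = X.rowwise() - X.colwise().mean();

    // N x N covariance matrix of the points.
    Eigen::MatrixXd cov =
        (centered.transpose() * centered) / double(X.rows() - 1);

    // Eigenvectors of the symmetric covariance matrix are the orthogonal axes.
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(cov);

    // Eigen reports eigenvalues in increasing order; reverse the columns so
    // the first axis is the direction of greatest spread.
    return es.eigenvectors().rowwise().reverse();
}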

Partition an n-dimensional "square" space into cubes

Right now I am stuck on the following "semi"-mathematical problem.
I would like to partition an n-dimensional restricted space (a hypercube, to be precise)
D = { (x_1, ..., x_n) : x_i in IR and -limits <= x_i <= limits for all i <= n }
into smaller cubes. Meaning I would like to specify n, limits, and m, where m is the number of partitions per side of the cube; 2*limits/m would then be the edge length of the small cubes, and I would get m^n such cubes.
Now I would like to return a vector of vectors containing some distinct coordinates of these small cubes. (Or perhaps one could represent the cubes as objects characterized by a vector pointing to their "left" outer corner?)
Basically I have no idea whether something like that is even doable in C++. Implementing this for a fixed n does not pose a problem, but I would like to give the user free choice of the dimension.
Background: something like that would be priceless in optimization, where one could partition the search space into smaller subspaces, run e.g. a genetic algorithm on each of them, and later compare the results. Huge initial populations could thus be avoided and the search results drastically improved.
Also, I am just curious whether something like that is doable :)
My suggestion: use B+ trees?
Let m be the number of partitions per dimension, i.e. per edge, of the hypercube D.
Then there are m^n different subspaces S of D, like you say. Let each subspace S be uniquely represented by integer coordinates S = [y_1, y_2, ..., y_n], where the y_i are integers in the range 1, ..., m. In Cartesian coordinates, S then consists of the points (x_1, x_2, ..., x_n) with Delta*(y_i - 1) - limits <= x_i < Delta*y_i - limits, where Delta = 2*limits/m.
The "left outer corner" or origin of S you were looking for is just the point corresponding to the smallest x_i, i.e. the point (Delta*(y_1-1)-limits, ..., Delta*(y_n-1)-limits). Instead of representing the different S by this point, it makes a lot more sense (and will be faster in a computer) to represent them using the integer coordinates above.

Fitting data to a 3rd degree polynomial

I'm currently writing a C++ program where I have vectors of independent and dependent data that I would like to fit to a cubic function. However, I'm having trouble generating a polynomial that can fit my data.
Part of the problem is that I can't use various numerical packages, such as GSL (long story); it's possible that it might even be overkill for my case. I don't need a very generalized solution for least squares fitting. I specifically want to fit my data to a cubic function. I do have access to Sony's vector library, which supports 4x4 matrices and can calculate their inverses, among other things.
While prototyping this in Scilab, I used a function like:
function p = polyfit(x, y, n)
    // Build the m x (n+1) design matrix aa with columns 1, x, x^2, ..., x^n
    m = length(x);
    aa = zeros(m, n+1);
    aa(:,1) = ones(m,1);
    for k = 2:n+1
        aa(:,k) = x.^(k-1);
    end
    // Least-squares solve of aa * p = y via left division
    p = aa\y;
endfunction
Unfortunately, this doesn't map well to my current environment. The above example needs to support a matrix of dimensions M x (N+1). In my case, that's M x 4, where M depends on how much sample data I have. There's also the problem of left division: I would need a matrix library that supports the inverse of matrices of arbitrary dimensions.
Is there an algorithm for least squares where I can avoid having to calculate aa\y, or at least limit it to a 4x4 matrix? I suppose I'm trying to rewrite the above algorithm for the simpler case of fitting a cubic polynomial. I'm not looking for a code solution, but if someone can point me in the right direction, I'd appreciate it.
Here is the page I am working from, although that page itself doesn't address your question directly. The summary of my answer would be:
If you can't work with Nx4 matrices directly, then do those matrix computations "manually" until you have reduced the problem to something that involves only 4x4 or smaller matrices. In this answer I'll outline how to do the specific matrix computations you need "manually."
--
Let's suppose you have a bunch of data points (x1,y1)...(xn,yn) and you are looking for the cubic equation y = ax^3 + bx^2 + cx + d that fits those points best.
Then, following the link above, you'd write this matrix equation:
[ x1^3  x1^2  x1  1 ]   [ a ]   [ y1 ]
[ x2^3  x2^2  x2  1 ] * [ b ] = [ y2 ]
[       ...         ]   [ c ]   [ .. ]
[ xn^3  xn^2  xn  1 ]   [ d ]   [ yn ]
I'll write A, x, and B for those matrices. Then, following my link above, you'd like to multiply by the transpose of A, which gives you the 4x4 matrix AT * A that you can invert. In equations, the plan is:
A * x = B ..................... [what we started with]
(AT * A) * x = AT * B ......... [multiply by AT]
x = (AT * A)^-1 * AT * B ...... [multiply by the inverse of AT * A]
You said you are happy with inverting 4x4 matrices, so if we can code a way to get at these matrices without actually using matrix objects, we should be okay.
So, here is a method, although it might be a little bit too much like making your own matrix library for your taste. :)
1. Write an explicit equation for each of the 16 entries of the 4x4 matrix AT * A. The (i,j)-th entry (I'm starting with (0,0)) is given by
   x1^i * x1^j + x2^i * x2^j + ... + xN^i * xN^j.
2. Invert that 4x4 matrix using your matrix library. That is (AT * A)^-1.
3. Now all we need is AT * B, which is a 4x1 matrix. Its i-th entry is given by x1^i * y1 + x2^i * y2 + ... + xN^i * yN.
4. Multiply our hand-created 4x4 matrix (AT * A)^-1 by our hand-created 4x1 matrix AT * B to get the 4x1 matrix of least-squares coefficients for your cubic.
Good luck!
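For what it's worth, here is a rough C++ sketch of steps 1 and 3 above (function and variable names are mine): build the 4x4 matrix AT * A and the 4x1 vector AT * B entry by entry in one pass over the data, without ever storing the Nx4 matrix A. Inverting the 4x4 result and doing the final multiply (steps 2 and 4) is then left to whatever small-matrix library is available.

#include <array>
#include <cstddef>
#include <vector>

// Assemble AT*A (4x4) and AT*B (4x1) for the cubic least-squares problem
// directly from the data points, without forming the Nx4 matrix A.
void assembleNormalEquations(const std::vector<double>& x,
                             const std::vector<double>& y,
                             std::array<std::array<double, 4>, 4>& AtA,
                             std::array<double, 4>& AtB) {
    for (auto& row : AtA) row.fill(0.0);
    AtB.fill(0.0);
    for (std::size_t k = 0; k < x.size(); ++k) {
        // Powers x^0 .. x^3 of the k-th data point.
        const double p[4] = {1.0, x[k], x[k] * x[k], x[k] * x[k] * x[k]};
        for (int i = 0; i < 4; ++i) {
            AtB[i] += p[i] * y[k];                 // sum of x^i * y
            for (int j = 0; j < 4; ++j)
                AtA[i][j] += p[i] * p[j];          // sum of x^(i+j)
        }
    }
}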
Yes, we can limit the problem to computing with "a 4x4 matrix". The least squares fit of a cubic, even for M data points, only requires the solution of four linear equations in four unknowns. Assuming all the x-coordinates are distinct the coefficient matrix is invertible, so in principle the system can be solved by inverting the coefficient matrix. We assume that M is more than 4, as would typically be the case for least squares fits.
Here's a write-up for Maple, Fitting a cubic to data, that hides almost completely the details of what is being solved. The first-order minimum conditions (setting the first derivatives of the sum-of-squares error with respect to the coefficients to zero) give us the four linear equations, often called the normal equations.
You can "assemble" these four equations in code, then apply your matrix inverse or a more sophisticated solution strategy. Obviously you need to have the data points stored in some form. One possibility is two linear arrays, one for the x-coordinates and one for the y-coordinates, both of length M the number of data points.
NB: I'm going to discuss this matrix assembly in terms of 1-based array subscripts. The polynomial coefficients are actually one application where 0-based array subscripts make things cleaner and simpler, but rewriting it in C or any other language that favors 0-based subscripts is left as an exercise for the reader.
The linear system of normal equations is most easily expressed in matrix form by referring to an Mx4 array A whose entries are powers of x-coordinate data:
A(i,j) = x-coordinate of ith data point raised to power j-1
Let A' denote the transpose of A, so that A'A is a 4x4 matrix.
If we let d be a column of length M containing the y-coordinates of data points (in the given order), then the system of normal equations is just this:
A'A u = A' d
where u = [p0,p1,p2,p3]' is the column of coefficients for the cubic polynomial with least squares fit:
P(x) = p0 + p1*x + p2*x^2 + p3*x^3
Your objections seem to center on a difficulty in storing and/or manipulating the Mx4 array A or its transpose. Therefore my answer will focus on how to assemble matrix A'A and column A'd without explicitly storing all of A at one time. In other words we will be doing the indicated matrix-matrix and matrix-vector multiplications implicitly to get a 4x4 system that you can solve:
C u = f
If you think about the entry C(i,j) being the product of the ith row of A' with the jth column of A, plus the fact that the ith row of A' is really just the transpose of the ith column of A, it should be clear that:
C(i,j) = SUM x^(i+j-2) over all data points
This is certainly one place where the exposition would be simplified by using 0-based subscripts!
It might make sense to accumulate the entries of matrix C, which depend only on the value of i+j (C is a so-called Hankel matrix), in a linear array of length 7 such that:
W(k) = SUM x^k over all data points
where k = 0,..,6. The 4x4 matrix C has a "striped" structure that means only these seven values appear. Looping over the list of x-coordinates of data points, you can accumulate the appropriate contributions of each power of each data point in the appropriate entry of W.
A similar strategy can be used to assemble the column f = A' d, namely to loop over the data points and accumulate the following four summations:
f(k) = SUM (x^k)*y over all data points
where k = 0,1,2,3. [Of course in the above sums the values x,y are the coordinates for a common data point.]
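A rough sketch of that accumulation in C++ (identifiers are mine): one pass over the data fills W(0..6) and f(0..3); the 4x4 coefficient matrix is then C(i,j) = W(i+j) with 0-based subscripts, and the system C u = f can be handed to whatever 4x4 solver you have.

#include <array>
#include <cstddef>
#include <vector>

// Accumulate W(k) = SUM x^k (k = 0..6) and f(k) = SUM (x^k)*y (k = 0..3)
// over all data points in a single pass; C[i][j] = W[i + j] afterwards.
void assembleHankelSums(const std::vector<double>& x,
                        const std::vector<double>& y,
                        std::array<double, 7>& W, std::array<double, 4>& f) {
    W.fill(0.0);
    f.fill(0.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        double p = 1.0;                    // running power x^k
        for (int k = 0; k <= 6; ++k) {
            W[k] += p;
            if (k <= 3) f[k] += p * y[i];
            p *= x[i];
        }
    }
}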
Caveats: This satisfies the goal of working only with a 4x4 matrix. However one typically tries to avoid the explicit formation of the matrix of coefficients for the normal equations because these matrices are often what in numerical analysis is called ill-conditioned. In particular the cases where x-coordinates are closely spaced can cause difficulty when one tries to solve the system by inverting the matrix of coefficients.
A more sophisticated approach to solving these normal equations would be the conjugate gradient method on the normal equations, which can be done with code that computes the matrix-vector products A u and A' v one entry at a time (using what we say above about entries of A).
The accuracy of the conjugate gradient method is often satisfactory because of its natural iterative approach, esp. when one can compute the required dot-products with a little extra precision.
You should never do full matrix inversion for stability reasons. Do LU decomposition and forward-back substitution. The other solutions are spot on otherwise.
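As a concrete (and entirely optional) sketch of that suggestion, here is a compact 4x4 solver: Gaussian elimination with partial pivoting, which is LU factorization with the forward and back substitution folded in, applied to the system C u = f from the previous answer. Any library LU routine would do the same job.

#include <array>
#include <cmath>
#include <utility>

// Solve the 4x4 system C u = f by elimination with partial pivoting.
// C and f are taken by value because they are modified in place.
std::array<double, 4> solve4x4(std::array<std::array<double, 4>, 4> C,
                               std::array<double, 4> f) {
    for (int k = 0; k < 4; ++k) {
        // Pivot: bring the largest remaining entry in column k up to row k.
        int piv = k;
        for (int i = k + 1; i < 4; ++i)
            if (std::fabs(C[i][k]) > std::fabs(C[piv][k])) piv = i;
        std::swap(C[k], C[piv]);
        std::swap(f[k], f[piv]);
        // Eliminate column k below the pivot.
        for (int i = k + 1; i < 4; ++i) {
            double m = C[i][k] / C[k][k];
            for (int j = k; j < 4; ++j) C[i][j] -= m * C[k][j];
            f[i] -= m * f[k];
        }
    }
    // Back substitution.
    std::array<double, 4> u{};
    for (int i = 3; i >= 0; --i) {
        double s = f[i];
        for (int j = i + 1; j < 4; ++j) s -= C[i][j] * u[j];
        u[i] = s / C[i][i];
    }
    return u;
}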