Using Principal Components Analysis (PCA) on binary data

I am using PCA on binary attributes to reduce the dimensions (attributes) of my problem. The initial dimensionality was 592, and after PCA it is 497. I have used PCA before on numeric attributes in another problem, and there it reduced the dimensionality to a much greater extent (about half of the initial dimensions). I believe that binary attributes reduce the power of PCA, but I do not know why. Could you please explain why PCA does not work as well on binary data as it does on numeric data?
Thank you.

The principal components of 0/1 data can fall off slowly or rapidly,
and so can the PCs of continuous data; it depends on the data.
Can you describe your data?
The following picture is intended to compare the PCs of continuous image data
vs. the PCs of the same data quantized to 0/1: in this case, inconclusive.
Look at PCA as a way of getting an approximation to a big matrix,
first with one term: approximate A ≈ c U V^T, i.e. each entry A_ij ≈ c U_i V_j.
Consider this a bit, with A say 10k x 500: U is 10k long, V is 500 long.
The top row of the approximation is c U_1 V, the second row is c U_2 V ...
all the rows are proportional to V.
Similarly, the leftmost column is c U V_1 ...
all the columns are proportional to U.
But if all the rows are similar (proportional to each other),
they can't get near an A matrix whose rows look like 0100010101 ...
With more terms, A ≈ c_1 U_1 V_1^T + c_2 U_2 V_2^T + ...,
we can get nearer to A: the faster the higher-order c_i fall off, the fewer terms we need.
(Of course, all 500 terms recreate A exactly, to within roundoff error.)
The top row is "lena", a well-known 512 x 512 matrix,
with 1-term and 10-term SVD approximations.
The bottom row is lena discretized to 0/1, again with 1 term and 10 terms.
I thought that the 0/1 lena would be much worse -- comments, anyone?
(U V^T is also written U ⊗ V, called a "dyad" or "outer product".)
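To make the sum-of-dyads picture concrete, here is a small numpy sketch (my own illustration, not part of the original answer; the matrix and variable names are made up): it builds rank-1 and rank-10 SVD approximations of a continuous matrix and of its 0/1-quantized version, and prints the relative reconstruction error, which directly reflects how fast the c_i fall off.

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 50))        # stand-in for a continuous data matrix
B = (A > 0.5).astype(float)      # the same data quantized to 0/1

def rank_k_approx(M, k):
    # sum of the first k dyads c_i U_i V_i^T from the SVD of M
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

for k in (1, 10):
    for name, M in (("continuous", A), ("0/1", B)):
        err = np.linalg.norm(M - rank_k_approx(M, k)) / np.linalg.norm(M)
        print(name, "rank", k, "relative error", round(err, 3))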
(The Wikipedia articles on Singular value decomposition and Low-rank approximation
are a bit math-heavy.
An AMS column by David Austin, We Recommend a Singular Value Decomposition,
gives some intuition on SVD / PCA -- highly recommended.)

Related

PCA doesn't reduce the dimensionality of my data

I would like to apply PCA on heatmaps of 18 dimensions.
dim(heatmaps)=(224,224,18)
Since PCA only takes data with dim <= 2, I reshape my heatmaps as follows:
heatmaps=heatmaps.reshape(-1,18)
heatmaps.shape
(50176, 18)
Now, I would like to apply PCA and take the first components that preserve 95% of the variance.
from sklearn.decomposition import PCA
pca = PCA(n_components=18)
reduced_heatmaps = pca.fit_transform(heatmaps)
However, the dimension of reduced_heatmaps remains the same as the original heatmaps, (50176, 18).
My question is as follows:
How do I reduce the dimensionality of my heatmaps while preserving 95% of the variance?
Strange thing
pca.explained_variance_ratio_.cumsum()
array([ 0.05744624, 0.11482341, 0.17167621, 0.22837643, 0.284996 ,
0.34127299, 0.39716828, 0.45296374, 0.50849681, 0.56382308,
0.61910508, 0.67425335, 0.72897448, 0.78361028, 0.83813329,
0.89247688, 0.94636864, 1. ])
It means I need to keep 17 of my 18 components to preserve 95% of the variance.
What is wrong?
EDIT: following the suggestions of Eric Yang
heatmaps=heatmaps.reshape(18,-1)
heatmaps.shape
(18,50176)
Then I apply PCA as follows:
pca = PCA(n_components=11)
reduced_heatmaps=pca.fit_transform(heatmaps)
pca.explained_variance_ratio_.cumsum()
which gives the following:
array([ 0.21121199, 0.33070526, 0.44827572, 0.55748779, 0.64454442,
0.72588593, 0.7933346 , 0.85083687, 0.89990991, 0.9306283 ,
0.9596194 ], dtype=float32)
11 components are needed to explain 95% of the variance of my data.
reduced_heatmaps.shape
(18, 11)
Hence we go from (18,50176) to (18, 11)
Thank you for your help
The ability to reduce your variance is a function of your data. If you have an N-dimensional Gaussian with each dimension distributed N(0,1), each dimension will explain 1/N of your variance, so your ability to reduce dimensions via PCA would be minimal. So the results of PCA do not seem to be incorrect.
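To see that point numerically, here is a small sketch of my own (not part of the original answer), in the same sklearn setup as the question: on isotropic N(0,1) data every component explains roughly 1/N of the variance, so the cumulative curve climbs linearly, much like the one posted above.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(50176, 18)               # 18 independent N(0,1) dimensions
pca = PCA(n_components=18).fit(X)
print(pca.explained_variance_ratio_)         # every entry is close to 1/18 ~ 0.056
print(pca.explained_variance_ratio_.cumsum())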
Now, based on a superficial understanding of your problem, you have 18 images that are 224x224, correct? If that is correct, then your dimensionality is 224x224, not 18. So you'd want to ask: what is the minimum number of pixels in my images that explains the difference between my 18 images? (However, I could be wrong if that is not the assumption, and what you actually have is 18 channels for one image.)
There is one other possibility, in which you have a series of similar images (so your dimensionality really is 18) and you're looking for the eigenimages. If the images are too different, you will see only a minimal reduction in dimensionality.

Precomputed distances for spectral clustering with scikit-learn

I'm struggling to make sense of the spectral clustering documentation here.
Specifically:
If you have an affinity matrix, such as a distance matrix, for which 0 means identical elements, and high values means very dissimilar elements, it can be transformed in a similarity matrix that is well suited for the algorithm by applying the Gaussian (RBF, heat) kernel:
np.exp(- X ** 2 / (2. * delta ** 2))
For my data, I have a complete distance matrix of size (n_samples, n_samples) where large entries represent dissimilar pairs, small values represent similar pairs and zero represents identical entries. (I.e. the only zeros are along the diagonal).
So all I need to do is build the SpectralClustering object with affinity = "precomputed" and then pass the transformed distance matrix to fit_predict.
I'm stuck on the suggested transformation equation. np.exp(- X ** 2 / (2. * delta ** 2)).
What is X here? The (n_samples, n_samples) distance matrix?
If so, what is delta? Is it just X.max()-X.min()?
Calling np.exp(- X ** 2 / (2. * (X.max()-X.min()) ** 2)) seems to do the right thing. I.e. big entries become relatively small, and small entries relatively big, with all the entries between 0 and 1. The diagonal is all 1's, which makes sense, since each point is most similar to itself.
But I'm worried. I think if the author had wanted me to use np.exp(- X ** 2 / (2. * (X.max()-X.min()) ** 2)) he would have told me to use just that, instead of throwing delta in there.
So I guess my question is just this. What's delta?
Yes, X in this case is the matrix of distances. delta is a scale parameter that you can tune as you wish. It controls the "tightness", so to speak, of the distance/similarity relation, in the sense that a small delta increases the relative dissimilarity of faraway points.
Notice that delta plays the same role as the gamma parameter of the RBF kernel mentioned earlier in the doc link you give (here gamma = 1 / (2 * delta ** 2)): both are free parameters which can be used to tune the clustering results.
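For what it's worth, here is a sketch of the full pipeline on synthetic data (the median-distance choice of delta is just one common starting point, not something the docs prescribe; all names below are mine):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (30, 2)),     # two well-separated blobs
                    rng.normal(3, 0.3, (30, 2))])
X = squareform(pdist(points))                        # (n_samples, n_samples) distance matrix

delta = np.median(X[X > 0])                          # tunable scale parameter
affinity = np.exp(-X ** 2 / (2. * delta ** 2))       # Gaussian (RBF) similarity

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(labels)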

Eigen sum of matrices resulting in NaN and -inf values

I am having a strange issue when using Eigen (Tuxfamily) in my software (in C++).
I am analyzing a 3D volume image by calculating a Hessian matrix for each pixel.
The volume (approx 800x800x600) is divided into subvolumes, and for each subvolume I sum up all the obtained matrices and then divide by their number to obtain the average (and then I do the same, summing up all the averages and dividing by the number of subvolumes, to obtain the average for the full volume).
The matrices are of type Matrix3d.
The problem is that for most of the sums (and obviously for the averages as well) I obtain something like:
Elements analyzed : 28215
Elements summed : 28215
Subvolume sum :
5143.76 | nan | -2778.05
5402.07 | 16011.9 | -inf
-2778.05 | -8716.86 | 7059.32
I sum them this way:
for (int i = 0; i < (int)OuterVector.size(); i++) {
    AverageProduct += OuterVector[i];
}
Due to the nature of the matrices I know that they should be symmetric, so the correct value is calculated for some of the entries. Any idea why the others might be failing? (And consider that it is always the same two positions of the matrix giving me nan and -inf.)
OK, using a mix of the suggestions you guys gave me in the comments, I tried a couple of fixes and solved the problem.
When I was creating the Eigen::Matrix3d object, I was not initializing its values, so as soon as I added the first OuterVector[i] those two entries were going wild (the (0,1) entry was going to nan and the (1,2) entry to inf). Strange that it happened only for those two specific entries, and in the same identical way every time.
So doing (at initialization time)
Matrix3d AverageProduct;
AverageProduct << 0, 0, 0, 0, 0, 0, 0, 0, 0;
was enough to fix it.

PCA in Matlab - Are the Principal Components re-arranged?

I am trying to do a PCA on some volatility data, and let's just say I can propose a model as the following:
volatility = beta0 + beta1*x + beta2*x^2
where x are some observations, say for example, moneyness and so on.
So in Matlab, what I did was to set Y = [ones x x^2] and then run pca(Y),
and for some reason, the first row of my coefficient matrix is always something like 0 0 1, i.e., 0 everywhere except the last column, and the latent output always shows the highest value in the first row as well, no matter how I change the model.
Obviously, it can't be the case that the last term in every single model is explained well by the last term in the equation. And if I remove the constant term from Y (i.e., Y = [x x^2]), then the first row of the coefficient matrix becomes something more normal (i.e., non-zero values everywhere).
So my questions are:
Is my way of doing PCA right?
Does PCA automatically rearrange the principal components, so that the first row of the coefficient matrix being all zeros except a 1 in the last column may not necessarily correspond to the last term in the equation?
And if my way is wrong, what is the correct way of doing it?
From Matlab's documentation for princomp:
COEFF = princomp(X) performs principal components analysis (PCA) on
the n-by-p data matrix X, and returns the principal component
coefficients, also known as loadings. Rows of X correspond to
observations, columns to variables. COEFF is a p-by-p matrix, each
column containing coefficients for one principal component. The
columns are in order of decreasing component variance.
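Not Matlab, but here is a small Python/sklearn sketch of the same setup (my own illustration) that shows what is going on: the constant ones column has zero variance after centering, so it can only load on the last, zero-variance component, and the components are always returned sorted by decreasing variance.

import numpy as np
from sklearn.decomposition import PCA

x = np.linspace(0.5, 1.5, 100)
Y = np.column_stack([np.ones_like(x), x, x ** 2])    # [ones x x^2]

pca = PCA(n_components=3).fit(Y)
print(pca.components_.T)          # rows = variables, like Matlab's COEFF; first row ~ [0 0 1]
print(pca.explained_variance_)    # sorted in decreasing order; the last one is ~0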

How to get ALL data from 2D Real to Complex FFT in Cuda

I am trying to do a 2D Real To Complex FFT using CUFFT.
I realize that I will do this and get W/2+1 complex values back (W being the "width" of my H*W matrix).
The question is: what if I want to build out a full H*W version of this matrix after the transform? How do I go about copying some values from the H*(W/2+1) result matrix back into a full-size matrix to get both halves and the DC value in the right place?
Thanks
I'm not familiar with CUDA, so take that into consideration when reading my response. I am familiar with FFTs and signal processing in general, though.
It sounds like you start out with an H (rows) x W (cols) matrix, and that you are doing a 2D FFT that essentially does an FFT on each row, and you end up with an H x (W/2+1) matrix. A W-wide FFT returns W values, but the CUDA function only returns W/2+1 of them because the spectrum of real data is conjugate-symmetric (Hermitian), so the negative-frequency data is redundant.
So, if you want to reproduce the missing W/2-1 points, simply mirror the positive frequencies as their complex conjugates. For instance, if one of the rows is as follows:
Index Data
0 12 + i
1 5 + 2i
2 6
3 2 - 3i
...
The 0 index is your DC power, the 1 index is the lowest positive-frequency bin, and so forth. You would thus make your closest-to-DC negative-frequency bin the conjugate, 5 - 2i, the next closest 6 (real values are their own conjugates), and so on. Where you put those values in the array is up to you. I would do it the way Matlab does it, with the negative-frequency data after the positive-frequency data.
I hope that makes sense.
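Not CUDA, but here is a numpy sketch of that reconstruction (my own illustration): np.fft.rfft2 returns the same H x (W/2+1) half-spectrum as the CUFFT R2C transform, and the missing half follows from the Hermitian symmetry F[ky, kx] = conj(F[(-ky) mod H, (-kx) mod W]).

import numpy as np

H, W = 8, 8
data = np.random.rand(H, W)

half = np.fft.rfft2(data)                      # shape (H, W//2 + 1), like CUFFT R2C

full = np.zeros((H, W), dtype=complex)
full[:, :W // 2 + 1] = half                    # copy the stored half
for ky in range(H):
    for kx in range(W // 2 + 1, W):            # fill the redundant half by conjugate symmetry
        full[ky, kx] = np.conj(full[(-ky) % H, (-kx) % W])

print(np.allclose(full, np.fft.fft2(data)))    # True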
There are two ways this can be achieved. You will have to write your own kernel to achieve either of these.
1) You will need to take the conjugate of the (half) data you get to find the other half.
2) Since you want the full results anyway, it would be best if you convert the input data from real to complex (by padding with 0 imaginary parts) and perform a complex-to-complex transform.
From practice I have noticed that there is not much of a difference in speed either way.
I actually searched the nVidia forums and found a kernel that someone had written that did just what I was asking, and that is what I used. If you search the CUDA forum for "redundant results fft" or similar, you will find it.