I'm new to PCA. I have a dataset with 64 features and I am trying to find the most important features using PCA. When I run PCA so that it explains 90% of the variance in my dataset, I get about 40 principal components. My question is: how can I get the feature importance based on all these principal components? The picture (pic1) shows the number of principal components that explain 90% of the variance.
Should I sum the values of all principal components for each feature and then sort them in descending order?
Just run a regression model and check the values of the importance statistics associated with each feature. Check out this article for a discussion of feature importance.
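On the question's own idea of combining loadings across components: summing raw loadings is fragile because the sign of each principal axis is arbitrary, so positive and negative values can cancel. One common heuristic (an assumption here, not something the answer above proposes) is to sum the squared loadings, weighted by each component's explained variance ratio. A minimal sketch with NumPy, where `pca_feature_scores`, `var_target` and the synthetic data are all hypothetical names and values made up for illustration:

```python
import numpy as np

def pca_feature_scores(X, var_target=0.90):
    """Score each feature by its variance-weighted squared loadings."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    # Rows of Vt are the principal axes; s holds the singular values.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = s**2 / np.sum(s**2)                  # explained variance ratio per PC
    # Smallest number of PCs whose cumulative ratio reaches the target.
    n_keep = int(np.searchsorted(np.cumsum(ratio), var_target)) + 1
    n_keep = min(n_keep, len(ratio))
    # Squared loadings sum to 1 within each PC, so weighting by the ratio
    # gives each feature's share of the retained variance.
    scores = (Vt[:n_keep] ** 2 * ratio[:n_keep, None]).sum(axis=0)
    return scores, n_keep

# Synthetic data (illustrative only): six features with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 1.0, 0.5, 0.1])
scores, n_keep = pca_feature_scores(X)
ranking = np.argsort(scores)[::-1]               # most to least influential
```

Note that this measures how much each feature contributes to the retained variance, not how useful it is for predicting anything. On unstandardized data the features with the largest scale will dominate, so standardizing first is usually worth considering.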
I have conducted a PCA (in Matlab) on a set of thousands of points of spatial data. I have also calculated the variance explained across the full dataset by each principal component (i.e. PC or eigenvector) by dividing its eigenvalue by the sum of all eigenvalues. As an example, PC 15 accounts for 2% of the variance in the entire dataset; however, there is a subset of points in this dataset for which I suspect PC 15 accounts for a much higher % of their variance (e.g. 80%).
My question is this: is there a way to calculate the variance explained by a given PC from my existing analysis for only a subset of points (i.e. 1000 pts from the full dataset of 500k+)? I know that I could run another PCA on just the subset, but for my purposes I need to continue to use the PCs from my original analysis. Any ideas on how to do this would be very helpful.
Thanks!
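One possible approach, sketched below in Python with NumPy rather than Matlab: project the subset onto the PCs from the full analysis, take the variance of the scores along each PC, and divide by the subset's total variance. This assumes that all PCs are retained (so the fractions sum to 1) and that "variance of the subset" means variance around the subset's own mean; both are assumptions, and `subset_variance_explained` and the synthetic data are hypothetical.

```python
import numpy as np

def subset_variance_explained(X_full, subset_idx):
    """Fraction of a subset's variance lying along each full-data PC."""
    mean = X_full.mean(axis=0)
    # Principal axes of the FULL dataset (rows of Vt).
    _, _, Vt = np.linalg.svd(X_full - mean, full_matrices=False)
    # Scores of the subset points in the full-data PC basis.
    scores = (X_full[subset_idx] - mean) @ Vt.T
    # Variance along each PC, measured around the subset's own mean.
    var_per_pc = scores.var(axis=0, ddof=1)
    return var_per_pc / var_per_pc.sum()

# Synthetic data (illustrative only): the first 100 points vary almost
# entirely along the direction that is the weakest PC of the full dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) * np.array([3.0, 2.0, 1.0])
X[:100] = rng.normal(size=(100, 3)) * np.array([0.1, 0.1, 1.0])
frac = subset_variance_explained(X, np.arange(100))
```

If the variance should instead be measured around the full-data mean, replace the `var` call with the mean of the squared scores; the two can differ a lot when the subset sits far from the centre of the data.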
I've created a number of models using Google AutoML and I want to make sure I'm interpreting the output data correctly. This is for a linear regression model predicting website conversion rates on any given day.
First, the model reports a model feature importance once training has completed. This seems to tell me which feature was most important in predicting the target value, but not necessarily whether it contributes most to larger changes in that value?
Secondly, we have a set of local feature weights, which I think tell me the contribution each feature has made to a prediction. So if the bounce rate feature has a weight of -0.002, can we say that the bounce rate for that row decreased the prediction by 0.002? Is there a correct way to aggregate these weights, or is it just the range?
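On aggregating per-row weights: one common convention for local attributions in general (used, for example, in SHAP summary plots) is the mean absolute value per feature, optionally reported next to the signed mean. Whether this matches how AutoML computes its own model-level importance is not something this sketch assumes. The function name, feature names and numbers below are hypothetical.

```python
import numpy as np

def aggregate_attributions(local_weights, feature_names):
    """Rank features by mean absolute local weight across rows."""
    W = np.asarray(local_weights, dtype=float)   # shape: (rows, features)
    mean_abs = np.abs(W).mean(axis=0)            # typical size of the effect
    mean_signed = W.mean(axis=0)                 # typical direction of the effect
    order = np.argsort(mean_abs)[::-1]           # largest effect first
    return [(feature_names[i], mean_abs[i], mean_signed[i]) for i in order]

# Made-up local weights for three rows and two features.
weights = [[-0.002, 0.010],
           [-0.004, -0.008],
           [-0.003, 0.009]]
summary = aggregate_attributions(weights, ["bounce_rate", "sessions"])
```

The absolute value matters because a feature that pushes predictions up on some rows and down on others can have a signed mean near zero while still being influential.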
Is it valid to run a PCA on data that is comprised of proportions? For example, I have data on the proportion of various food items in the diet of different species. Can I run a PCA on this type of data or should I transform the data or do something else beforehand?
I had a similar question. You should search for "compositional data analysis". There are transformations you can apply to proportions in order to analyze them with multivariate techniques such as PCA. You can also find "robust" PCA algorithms to run your analysis in R. Let us know if you find an appropriate solution to your specific problem.
I don't think so.
PCA will give you "impossible" answers. You might get principal components with values that proportions can't have, like negative values or values greater than 1. How would you interpret this component?
In technical terms, the support of your data is a subset of the support of PCA. Say you have $k$ classes. Then:
the support for PCA vectors is $\mathbb{R}^k$
the support for your proportion vectors is the $k$-dimensional simplex. By simplex I mean the set of vectors $p$ of length $k$ such that:
$0 \le p_i \le 1$ where $i = 1, ..., k$
$\sum_{i=1}^k{p_i} = 1$
One way around this is to find a one-to-one mapping from the $k$-simplex to all of $\mathbb{R}^k$. If there is one, you could map your proportions to $\mathbb{R}^k$, do PCA there, then map the PCA vectors back to the simplex.
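The answer does not name a specific mapping. One standard choice from compositional data analysis is the centred log-ratio (CLR) transform, swapped in here as an assumption. Note that it maps the simplex onto the $(k-1)$-dimensional subspace of $\mathbb{R}^k$ whose coordinates sum to zero, not onto all of $\mathbb{R}^k$, and that it requires strictly positive proportions (zeros need separate handling). A minimal sketch with made-up data:

```python
import numpy as np

def clr(P):
    """Centred log-ratio: log of each part minus the row's mean log."""
    logP = np.log(P)                              # requires P > 0 everywhere
    return logP - logP.mean(axis=1, keepdims=True)

def clr_inverse(Z):
    """Map log-ratio coordinates back to the simplex (a softmax)."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))  # shift for numerical stability
    return E / E.sum(axis=1, keepdims=True)

# Synthetic compositions (illustrative only): 50 rows of 4 proportions.
rng = np.random.default_rng(0)
P = rng.dirichlet(alpha=[2.0, 3.0, 5.0, 1.0], size=50)

Z = clr(P)                                        # leave the simplex
Zc = Z - Z.mean(axis=0)
_, s, Vt = np.linalg.svd(Zc, full_matrices=False) # PCA in log-ratio space

# Move one unit along the first principal axis from the centre,
# then map the result back to a valid composition.
step = Z.mean(axis=0) + 1.0 * Vt[0]
composition = clr_inverse(step[None, :])[0]
```

Because the inverse is a softmax, any point reached in log-ratio space maps back to positive proportions that sum to 1, which avoids the "impossible" values described above.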
But I'm not sure the simplex is a self-contained linear space. If you add two elements of the simplex, the components of the result sum to 2, so you don't get an element of the simplex :/
A better approach, I think, is clustering, e.g. with Gaussian mixtures or spectral clustering. This is related to PCA. But a nice property of clustering is that you can express any element of your data as a "convex combination" of the clusters. If you analyze your proportion data and find clusters, they (unlike PCA vectors) will be within the simplex space, and any mixture of them will be, too.
I also recommend looking into nonnegative matrix factorization (NMF). This is like PCA but, as the name suggests, it constrains both factor matrices to be nonnegative, so neither the components nor their weights can be negative. It's very useful for inferring structure in strictly positive data, like proportions. But NMF does not give you a basis for simplex space.
I am using t-SNE for exploratory data analysis. I am using this instead of PCA because PCA is linear and t-SNE is non-linear.
With PCA, it's really straightforward to know how many dimensions are required to capture the necessary variance.
How do I know how many dimensions are required for my data using t-SNE?
I have read a popular and very useful article on t-SNE, but it doesn't discuss dimensionality:
https://distill.pub/2016/misread-tsne/
Currently I am following the caffe imagenet example but applying it to my own training dataset. My dataset has about 2000 classes, with about 10 to 50 images per class. I am classifying vehicle images, and the images were cropped to the front of the vehicle, so the images within each class have the same size and (almost) the same view angle.
I've tried the imagenet schema, but it didn't work well: after about 3000 iterations the accuracy was down to 0. So I am wondering, is there a practical guide on how to tune the schema?
You can delete the last layer of the imagenet model, add your own last layer with a different name (sized to fit your number of classes), give that layer a higher learning rate, and set a lower overall learning rate. There is an official example here: http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html
However, if the accuracy was 0, you should check the model parameters first; perhaps it's an overflow.