Clarification re Principle Component Analysis - pca

I do understand the principle component analysis. I know how to do it and what it actually does. I have applied PCA and my best result has shown to be two components. I do understand that each of my inputs are now contributing partially in each component. What I do not understand is how to feed the result of PCA (in my case 2 components ) to a machine learning model?
How do we input them?
For example when I want to run a NN on my features, I just can navigate to where they are stored and import them, but my PCA analysis has been run in SPSS and all it shows me is the contribution of my features on each component.
What should I import to my NN model?

PCA is a method of feature extraction, which is used to avoid the problem of co-linearity. For example, if several variables are highly correlated because "they measure the same thing", then PCA can extract a measure of "that thing" (technically: a component), which is called a score. Your data set of, say, 100 measured variables may reduce to, say, 10 significant components. Then you can use the scores your test persons have achieved in those 10 components to do for example a multi-dimensional regression, a cluster analysis or a discriminance analysis. This will result in more valid results than performing the analysis directly on the 100 variables.
So the procedure is to sort the eigenvalues (and -vectors) by size, identify the number of significant components p (e.g., by scree-plot), set up the projection matrix F (eigenvectors corresponding to the largest q eigenvalues in columns) and multiply it with the data matrix D. This will give you the score matrix C (dimension n times q, with n the number of test persons), which you can use as input for whatever method you want to use next.

Related

Transition between PCA and factor analysis Eviews

I have 8 financial variables that vary across sectors (see attached file). In total I have 30 observations.
I want to run a PCA analysis in order two scores (one score for profitability and the other for indebtness). In my financial variables, I have Income, Capex, Ebitda.... that represent the profitability of the industry and Gearing and CFO coeff represent the indebtness of the industry.
I first run the correlation matrix. Then, view==> principal components. I want, now, to keep the first three principal components. So, if i understand well, in order to run the factor specification, I have to save the components.
So, I go to proc->make principal components. However, from that moment on, I am lost. How can eviews knows that I want to keep just three principal components. Is it under the "scores series"?

Train program to understand high and low value in machine learning

I am generating alerts by reading dataset for KPI (key performance indicator) . My algorithm is looking into historical data and based on that I am able to capture if there's sudden spike in data. But I am generating false alarms . For example KPI1 is historically at .5 but reaches value 12, which is kind of spike .
Same way KPI2 also reaches from .5 to 12. But I know that KPI reaching from .5 to 12 is not a big deal and I need not to capture that . same way KPI2 reaching from .5 to 12 is big deal and I need to capture that.
I want to train my program to understand what is high value , low value or normal value for each KPI.
Could you experts tell me which is best ML algorithm is for this and any package in python I need to explore?
This is the classification problem. You can use classic logistic regression algorithm to classify any given sample into either high value, low value or normal value.
Quoting from the Wikipedia,
In statistics, multinomial logistic regression is a classification
method that generalizes logistic regression to multiclass problems,
i.e. with more than two possible discrete outcomes. That is, it is
a model that is used to predict the probabilities of the different
possible outcomes of a categorically distributed dependent variable,
given a set of independent variables (which may be real-valued,
binary-valued, categorical-valued, etc.)
To perform multi-class classification in python, sklearn library can be useful.
http://scikit-learn.org/stable/modules/multiclass.html

Random Forest with more features than data points

I am trying to predict whether a particular service ticket raised by client needs a code change.
I have training data.
I have around 17k data points with problem description and tag (Y for code change required and N for no code change)
I did TF-IDF and it gave me 27k features. So I tried to fit RandomForestClassifier (sklearn python) with this 17k x 27k matrix.
I am getting very low scores on test set while training accuracy is very high.
Precision on train set: 89%
Precision on test set: 21%
Can someone suggest any workarounds?
I am using this model now:
sklearn.RandomForestClassifier(n_jobs=3,n_estimators=100,class_weight='balanced',max_features=None,oob_score=True)
Please help!
EDIT:
I have 11k training data with 900 positives (skewed). I tried LinearSVC sparsify but didn't work as well as Truncated SVD (Latent Semantic Indexing). maxFeatures=None performs better on the test set than without it.
I have also tried SVM, logistic (l2 and l1), ExtraTrees. RandonForest still is working best.
Right now, going at 92% precision on positives but recall is 3% only
Any other suggestions would be appreciated!
Update:
Feature engineering helped a lot. I pulled features out of the air (len of chars, len of words, their, difference, ratio, day of week the problem was of reported, day of month, etc) and now I am at 19-20% recall with >95% accuracy.
Food for your thoughts on using word2vec average vectors as deep features for the free text instead of tf-idf or bag of words ???
[edited]
Random forest handles more features than data points quite fine. RF is e.g. used for micro-array studies with e.g. a 100:5000 data point/feature ratio or in single-nucleotide_polymorphism(SNP) studies with e.g 5000:500,000 ratio.
I do disagree with the diagnose provided by #ncfirth, but the suggested treatment of variable selection may help anyway.
Your default random forest is not badly overfitted. It is just not meaningful to pay any attention to a non-cross validated training set prediction performance for a RF model, because any sample will end in the terminal nodes/leafs it has itself defined. But the overall ensemble model is still robust.
[edit] If you would change the max_depth or min_samples_split, the training precision would probably drop, but that is not the point. The non-cross validated training error/precision of a random forest model or many other ensemble models simply does not estimate anything useful.
[I did before edit confuse max_features with n_estimators, sry I mostly use R]
Setting max_features="none" is not random forest, but rather 'bagged trees'. You may benefit from a somewhat lower max_features which improve regularization and speed, maybe not. I would try lowering max_features to somewhere between 27000/3 and sqrt(27000), the typical optimal range.
You may achieve better test set prediction performance by feature selection. You can run one RF model, keep the top ~5-50% most important features and then re-run the model with fewer features. "L1 lasso" variable selection as ncfirth suggests may also be a viable solution.
Your metric of prediction performance, precision, may not be optimal in case unbalanced data or if the cost of false-negative and false-positive is quite different.
If your test set is still predicted much worse than the out-of-bag cross-validated training set, you may have problems with your I.I.D. assumptions that any supervised ML model rely on or you may need to wrap the entire data processing in an outer cross-validation loop, to avoid over optimistic estimation of prediction performance due to e.g. the variable selection step.
Seems like you've overfit on your training set. Basically the model has learnt noise on the data rather than the signal. There are a few ways to combat this, but it seems fairly obvious that you're model has overfit because of the incredibly large number of features you're feeding it.
EDIT:
It seems I was perhaps too quick to jump to the conclusion of overfitting, however this may still be the case (left as an exercise to the reader!). However feature selection may still improve the generalisability and reliability of your model.
A good place to start for removing features in scikit-learn would be here. Using sparsity is a fairly common way to perform feature selection:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
import numpy as np
# Create some data
X = np.random.random((1800, 2700))
# Boolean labels as the y vector
y = np.random.random(1800)
y = y > 0.5
y = y.astype(bool)
lsvc = LinearSVC(C=0.05, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print X_new.shape
Which returns a new matrix of shape (1800, 640). You can tune the number of features selected by altering the C parameter (called the penalty parameter in scikit-learn but sometimes called the sparsity parameter).

Individual score contributions in ML estimation

I've estimated a model via maximum likelihood in Stata and was surprised to find that estimated standard errors for one particular parameter are drastically smaller when clustering observations. I take it from the Stata manual on robust standard error estimation in ML that this can happen if the contributions of individual observations to the score (the derivative of the log-likelihood) tend to cancel each other within clusters.
I would now like to dig a little deeper into what exactly is happening and would therefore like to have a look at these score contributions. As far as I can see, however, Stata only gives me the total sum as e(gradient). Is there any way to pry the individual summands out of Stata?
If you have written your own command, you can create a new variable containing these scores using the ml score command. Official Stata commands and most finished user written commands will often have score as an option for predict, which does the same thing but with an easier syntax.
These will give you the score of the log likelihood ($\ell$) with respect to the linear predictor, $x\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \elipses$. To get the derivative of the log likelihood with respect to an individual parameter, say $\beta_1$, you just use the chain rule:
$\frac{\partial \ell}{\partial \beta_1} = \frac{\partial \ell }{\partial x\beta} \frac{\partial x\beta}{\partial \beta_1}$
The scores returned by Stata are $ \frac{\partial \ell }{\partial x\beta}$, and $\frac{\partial x\beta}{\partial \beta_1} = x_1$.
So, to get the score for $\beta_1$ you just multiply the score returned by Stata and $x_1$.

Regression Tree Forest in Weka

I'm using Weka and would like to perform regression with random forests. Specifically, I have a dataset:
Feature1,Feature2,...,FeatureN,Class
1.0,X,...,1.4,Good
1.2,Y,...,1.5,Good
1.2,F,...,1.6,Bad
1.1,R,...,1.5,Great
0.9,J,...,1.1,Horrible
0.5,K,...,1.5,Terrific
.
.
.
Rather than learning to predict the most likely class, I want to learn the probability distribution over the classes for a given feature vector. My intuition is that using just the RandomForest model in Weka would not be appropriate, since it would be attempting to minimize its absolute error (maximum likelihood) rather than its squared error (conditional probability distribution). Is that intuition right? Is there a better model to be using if I want to perform regression rather than classification?
Edit: I'm actually thinking now that in fact it may not be a problem. Presumably, classifiers are learning the conditional probability P(Class | Feature1,...,FeatureN) and the resulting classification is just finding the c in Class that maximizes that probability distribution. Therefore, a RandomForest classifier should be able to give me the conditional probability distribution. I just had to think about it some more. If that's wrong, please correct me.
If you want to predict the probabilities for each class explicitly, you need different input data. That is, you would need to replace the value to predict. Instead of one data set with the class label, you would need n data sets (for n different labels) with aggregated data for each unique feature vector. Your data would look something like
Feature1,...,Good
1.0,...,0.5
0.3,...,1.0
and
Feature1,...,Bad
1.0,...,0.8
0.3,...,0.1
and so on. You would need to learn one model for each class and run them separately on any data to be classified. That is, for each label you learn a model to predict a number that is the probability of being in that class, given a feature vector.
If you don't need the probabilities to be predicted explicitly, have a look at the Bayesian classifiers in Weka, which make use of probabilities in the models that they learn.