Retrieving PCA variable coefficients from PCA components

I am trying to explain the final score in the final assessment (the predicted variable) using the continuous-assessment scores in seven subjects (the predictor variables), so naturally I proceed with a PCA.
I get two principal components: the first is driven largely by three subjects and the second by two other subjects.
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 6.30728154 90.1040220 90.10402
## Dim.2 0.25481341 3.6401916 93.74421
## Dim.3 0.14504576 2.0720823 95.81630
## Dim.4 0.10560431 1.5086330 97.32493
## Dim.5 0.08744117 1.2491595 98.57409
## Dim.6 0.05931222 0.8473174 99.42141
## Dim.7 0.04050159 0.5785942 100.00000
The first two components explain 93% of the variance, but I also want to know the contribution of each subject to each component, so I made a contribution chart (the computation is sketched below).
As the chart shows, the PCA is not reducing the dimensionality as much as I expected: only two subjects could be eliminated, and going from 7 variables to 5 isn't good enough. Is this a problem with the data, which contains only 95 observations?
Please help out, thanks
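
For reference, here is a minimal R sketch of how such per-variable contributions can be computed with base prcomp. The scores frame is a simulated stand-in for the real 95 x 7 table of subject marks, and the contribution measure (squared loadings rescaled to percentages per component) is the quantity that contribution charts typically display:

set.seed(7)
scores <- as.data.frame(matrix(rnorm(95 * 7), ncol = 7,
                               dimnames = list(NULL, paste0("subject", 1:7))))

p <- prcomp(scores, scale. = TRUE)

# Eigenvalues and percentage of variance, as in the table above
eig <- p$sdev^2
round(cbind(eigenvalue = eig, variance.percent = 100 * eig / sum(eig)), 4)

# Contribution (%) of each subject to each component: each column of the
# rotation matrix has unit norm, so its squared entries sum to 1
contrib <- 100 * p$rotation^2
round(contrib[, 1:2], 2)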


Using a list of regressors and storing the values of betas

I have a list of circumstances and effects:
I want to generate a matrix, betas, containing the estimated coefficient values. I am going to run the loop 10 times, because I am in fact going to bootstrap my observations.
So far I have tried:
local circumstances height weight
local effort training diet
foreach i in 1 10 {
    reg outcome `circumstances' `effects'
    * store in column i the values of betas of circumstances
    * store in column i the values of betas of effort
}
Does anyone know what the code should look like in order to store those values?
Thank you
The pseudocode would first store in "column 1" the first lot of betas and then overwrite them (column 1) with the second lot of betas. Then it would do the same again for column 10 with the first lot of betas and the second lot of betas. That is a long way from anything that makes sense. Nothing in your pseudocode takes bootstrap samples from the dataset, although perhaps you are intending to add code for that later.
Stata doesn't really work with any idea of column numbers, although the idea makes sense to Mata.
Unless there are very specific reasons -- which you would need to spell out -- there is no need to write your own code ab initio for bootstrapping, as the whole point of bootstrap is to do that for you.
Here is complete code for a reproducible example of bootstrapping a silly regression:
sysuse auto, clear
bootstrap b_weight=_b[weight] b_price=_b[price], reps(1000) seed(2803): regress mpg weight price
See also the help for bootstrap to learn about its other options, including saving().
10 repetitions would be regarded as absurdly small for the number of bootstrap samples.
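
For comparison only, here is the same bookkeeping sketched in R with the boot package (simulated data; the names height, weight, training, diet, and outcome are borrowed from the question purely for illustration). boot() stores one row of betas per replication, which is exactly the bookkeeping the pseudocode above was trying to hand-roll:

library(boot)
set.seed(2803)

# Hypothetical stand-in data using the question's variable names
d <- data.frame(height = rnorm(100), weight = rnorm(100),
                training = rnorm(100), diet = rnorm(100))
d$outcome <- 1 + 0.5 * d$height + 0.3 * d$training + rnorm(100)

# statistic() returns the values to collect for each bootstrap sample
betas <- function(data, idx) {
  coef(lm(outcome ~ height + weight + training + diet, data = data[idx, ]))
}

b <- boot(d, betas, R = 1000)
head(b$t)   # one row per replication, one column per coefficient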

How to access by() output?

I have a large data.frame containing different forest sites, tree species and their dimensions. For some trees I have height and dbh data, for some I only have dbh. I need to calculate missing heights for additional evaluation. Height is site and species specific which is why I used the by() function on a with_height subset:
tmp <- with(with_height,
            by(with_height, with_height[, 1:2],  # columns 1:2 are site and species
               function(x) lm(height ~ log(dbh), data = x)))
This works out and creates a large list (1144 unnamed elements, 9.8Mb).
How do I access this list? I need either the lm() or the coefficients for each real combination of site and species (without NULL/ZERO responses if a species did not occur).
I found that
tmp[[1]]$coefficients
returns
(Intercept)    log(dbh)
  -16.36298    11.18222
But how do I know which site-species combination this relates to? And is there a way to do this for all real site-species combinations simultaneously?
I have already spent hours on this question and would be very thankful for any help and advice!
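
One possible approach, as a minimal sketch: by() keeps the grouping factors as dimnames on its result, in the same order as expand.grid(), so every list element can be matched back to its site-species combination and the NULL elements can be dropped. The with_height frame here is simulated purely so the sketch runs; replace it with the real data:

set.seed(1)
with_height <- data.frame(
  site    = rep(c("A", "B"), each = 20),
  species = rep(c("oak", "pine"), times = 20),
  dbh     = runif(40, 10, 60)
)
with_height$height <- 5 + 10 * log(with_height$dbh) + rnorm(40)

tmp <- by(with_height, with_height[, 1:2],
          function(x) lm(height ~ log(dbh), data = x))

# Elements of tmp are stored in the same order as expand.grid(dimnames(tmp)),
# so this labels every fitted model with its site-species combination
combos <- expand.grid(dimnames(tmp), stringsAsFactors = FALSE)
keep   <- !sapply(tmp, is.null)                # drop combinations that never occur
coefs  <- cbind(combos[keep, ], t(sapply(tmp[keep], coef)))
coefs   # one row of (Intercept, log(dbh)) per real combination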

Transition between PCA and factor analysis in EViews

I have 8 financial variables that vary across sectors (see attached file). In total I have 30 observations.
I want to run a PCA in order to obtain two scores (one score for profitability and the other for indebtedness). Among my financial variables, Income, Capex, Ebitda, and so on represent the profitability of the industry, while Gearing and CFO coeff represent its indebtedness.
I first compute the correlation matrix. Then, View ==> Principal Components. Now I want to keep the first three principal components. So, if I understand correctly, in order to run the factor specification, I have to save the components.
So I go to Proc -> Make Principal Components. However, from that point on, I am lost. How does EViews know that I want to keep just three principal components? Is it under the "scores series"?

Clarification re Principal Component Analysis

I understand principal component analysis: I know how to run it and what it actually does. I have applied PCA, and my best result turned out to be two components. I understand that each of my inputs now contributes partially to each component. What I do not understand is how to feed the result of PCA (in my case, two components) into a machine learning model.
How do we input them?
For example, when I want to run a NN on my features, I can just navigate to where they are stored and import them; but my PCA was run in SPSS, and all it shows me is the contribution of my features to each component.
What should I import into my NN model?
PCA is a method of feature extraction, which is used to avoid the problem of collinearity. For example, if several variables are highly correlated because "they measure the same thing", then PCA can extract a measure of "that thing" (technically: a component); the value each case takes on a component is called a score. Your data set of, say, 100 measured variables may reduce to, say, 10 significant components. You can then use the scores your test persons achieved on those 10 components to run, for example, a multi-dimensional regression, a cluster analysis, or a discriminant analysis. This will give more valid results than performing the analysis directly on the 100 variables.
So the procedure is: sort the eigenvalues (and eigenvectors) by size, identify the number of significant components q (e.g., from a scree plot), set up the projection matrix F (the eigenvectors corresponding to the largest q eigenvalues, in columns), and multiply the data matrix D by it. This gives you the score matrix C = DF (dimension n times q, with n the number of test persons), which you can use as input for whatever method you want to apply next.
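
To make the recipe concrete, here is a minimal R sketch of exactly these steps on simulated data (the names D and C follow the notation above; the projection matrix is written F_ because F is shorthand for FALSE in R):

set.seed(42)
n <- 100                               # number of test persons
D <- scale(matrix(rnorm(n * 6), n))    # n x m data matrix (m = 6 variables), centred

e  <- eigen(cov(D))                    # eigenvalues/-vectors, already sorted by size
q  <- 2                                # number of significant components (scree plot)
F_ <- e$vectors[, 1:q]                 # m x q projection matrix: top-q eigenvectors
C  <- D %*% F_                         # n x q score matrix, C = DF

head(C)   # these scores are what you feed into the regression / NN / clustering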

Block bootstrap with indicator variable for each block

I want to run block bootstrap, where the blocks are countries, and include country indicator variables. I thought the following would work.
regress mvalue kstock i.country, vce(bootstrap, cluster(country))
But I get the following error.
. regress mvalue kstock i.country, vce(bootstrap, cluster(country))
(running regress on estimation sample)
Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxx 50
insufficient observations to compute bootstrap standard errors
no results will be saved
r(2000);
It seems that this should work. If the block bootstrap picks the same country for every block, then it seems it should just drop the intercept.
Is my error coding or conceptual? Here is some code using the Grunfeld data.
webuse grunfeld, clear
xtset, clear
generate country = int((company - 1) / 2) + 1
regress mvalue kstock i.country, vce(bootstrap, cluster(country))
The problem here is not your coding but conceptual: you cannot identify every coefficient in every regression in every bootstrap sample, because not all "countries" are included in the dataset for each bootstrap repetition. You can diagnose what is going on with the vce( , noisily) sub-option:
. regress mvalue kstock i.country, vce(bootstrap, cluster(country) noisily)
Errors are generated because some coefficients are missing when the regression runs on particular bootstrap samples. In each regression you can see that some country dummies are being omitted due to collinearity. This should be expected and makes a lot of sense -- a country dummy will be 0 for all observations in the bootstrap sample if that country was not drawn!
If you are really trying to estimate the coefficients on the country dummies, you are going to have to find an approach other than bootstrapping with K clusters, where K is the number of countries. If you don't care about the dummy coefficients, you could use another command that simply absorbs the fixed effects and only reports the coefficients on the other independent variables (e.g., areg or xtreg). One way to think about what is going on is that it is analogous to this:
. bootstrap, cluster(country) idcluster(bscountry) noisily: regress mvalue kstock i.bscountry
With the idcluster() option, each country that is drawn in a bootstrap sample is given its own ID number. If a country is drawn twice, there are two dummies. (The coefficients for the two dummies naturally turn out to be identical or near-identical.) However, the coefficients in this output are completely meaningless, because bscountry "2" will be a different country in different bootstrap iterations. Since you would ignore any output on the dummies, you might as well use a model like areg or xtreg, since they run more quickly.
Although there are many applications where bootstrapping with clusters works fine, the problem here is the inclusion of cluster dummies in the regression. This all raises the question of whether the exercise makes any sense at all. If you are trying to estimate the coefficients on the country dummies, it certainly does not. Otherwise, the solutions above might be OK, but it is hard to say without knowing your research question.
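
To see the mechanism outside Stata, here is a minimal R sketch (simulated data; the names country, kstock, mvalue, and bscountry are borrowed from the example above purely for illustration). One cluster-bootstrap draw resamples whole countries, so any country that is not drawn contributes no rows and its dummy cannot be estimated, while re-ID'ing the drawn blocks mimics idcluster():

set.seed(123)
# Simulated panel: 10 countries with 20 observations each
d <- data.frame(country = rep(1:10, each = 20), kstock = rnorm(200))
d$mvalue <- 1 + 0.5 * d$kstock + rnorm(200)

# One block-bootstrap draw: sample whole countries with replacement
drawn <- sample(unique(d$country), replace = TRUE)
boot_sample <- do.call(rbind, lapply(seq_along(drawn), function(i) {
  block <- d[d$country == drawn[i], ]
  block$bscountry <- i                 # new ID per drawn block, like idcluster()
  block
}))

setdiff(unique(d$country), drawn)      # countries absent from this draw
fit <- lm(mvalue ~ kstock + factor(bscountry), data = boot_sample)
coef(fit)   # the bscountry dummies mean different countries in each draw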