PC1 and PC2 values: original values - pca

I just ran a principal component analysis in R on the iris data set. This has been discussed several times in the past, but I am a little confused by the output.
I used prcomp, and this is the output for the loadings:
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
Here are the first 6 rows of the scores:
           PC1        PC2         PC3          PC4
[1,] -2.257141 -0.4784238  0.12727962  0.024087508
[2,] -2.074013  0.6718827  0.23382552  0.102662845
[3,] -2.356335  0.3407664 -0.04405390  0.028282305
[4,] -2.291707  0.5953999 -0.09098530 -0.065735340
[5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
[6,] -2.068701 -1.4842053 -0.02687825  0.006586116
Here are the first 6 rows of the original values:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4
Could someone explain how we get the PC1 score of -2.25 for row 1?
thanks.

As per the documentation (?prcomp), the PC scores are the data — centred and scaled, if requested — multiplied by the rotation matrix. So, let's do that calculation for row 1 and PC 1 to check. In this example, I use a PCA object imaginatively called pca.
First, we centre the first row of data, iris[1, 1:4], using pca$center and then scale using pca$scale. Finally, we multiply by the loadings for PC 1, pca$rotation[, 1], and sum the result.
# Perform PCA
pca <- prcomp(iris[, 1:4], center = TRUE, scale = TRUE)
# Calculate PC1 score for first row of 'iris'
sum(pca$rotation[,1] * (iris[1, 1:4] - pca$center) / pca$scale)
#> [1] -2.257141
Created on 2019-01-23 by the reprex package (v0.2.1.9000)
As expected, we get -2.257141.
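The same arithmetic can be checked outside R. The following is a minimal Python sketch, not part of the original answer. It uses row 1 and the PC1 loadings from the question, and it assumes the column means and sample standard deviations of the four iris measurements (the values prcomp stores in pca$center and pca$scale), hard-coded here in rounded form.

```python
# Row 1 of iris and the PC1 loadings, both copied from the question.
row1 = [5.1, 3.5, 1.4, 0.2]
pc1_loadings = [0.5210659, -0.2693474, 0.5804131, 0.5648565]

# Assumed values: column means and sample standard deviations of iris[, 1:4],
# i.e. what prcomp reports as pca$center and pca$scale (rounded).
center = [5.843333, 3.057333, 3.758000, 1.199333]
scale = [0.8280661, 0.4358663, 1.7652982, 0.7622377]

# Standardise each measurement, weight it by its loading, and add up.
standardised = [(x - m) / s for x, m, s in zip(row1, center, scale)]
pc1_score = sum(w * z for w, z in zip(pc1_loadings, standardised))

print(round(pc1_score, 4))  # close to the -2.257141 reported by prcomp
```

Because the inputs are rounded, the result agrees with prcomp only to about five decimal places.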

Define a new variable name based on a condition within R dplyr universe (mutate, if, ifelse)

I want to add a new variable in a dplyr workflow and define the variable name based on a condition. There is a lot of discussion out there about conditional mutating with ifelse() to define the values of a given variable, but not about how to define the name conditionally.
Something like:
Test <- 'A'
Test_results <- c(1.1, 33, 343, 2.22, 2.4)
##
iris<- iris%>%
dplyr::mutate(
ifelse(Test=='A',
Test_A=Test_results,
ifelse(Test=='B',
Test_B=Test_results,
no_Test='no_results')) )
Desired output (given that Test <- 'A') is:
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Test_A
1 5.1 3.5 1.4 0.2 setosa 1.1
2 4.9 3.0 1.4 0.2 setosa 33
3 4.7 3.2 1.3 0.2 setosa 343
4 4.6 3.1 1.5 0.2 setosa 2.22
5 5.0 3.6 1.4 0.2 setosa 2.4
...
If Test <- 'B' the result should be:
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Test_B
1 5.1 3.5 1.4 0.2 setosa 1.1
2 4.9 3.0 1.4 0.2 setosa 33
3 4.7 3.2 1.3 0.2 setosa 343
4 4.6 3.1 1.5 0.2 setosa 2.22
5 5.0 3.6 1.4 0.2 setosa 2.4
...
The variable "Test" is defined somewhere in the user's cockpit and affects multiple nested scripts (so no hard-coding, please).
dplyr::rename_at should work. Create a column with the tests and rename it with a function that depends on Test.
library(dplyr)

Test <- 'A'
Test_results <- c(1.1, 33, 343, 2.22, 2.4)

iris %>%
  head(n = 5) %>%
  mutate(Test_results = Test_results) %>%
  # Rename the new column according to the value of Test
  rename_at('Test_results', \(x) case_when(Test %in% c('A', 'B') ~ paste0('Test_', Test),
                                           TRUE ~ 'no_results'))
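The idea behind this answer is language-independent: work out the column name first, then attach the values under that name. As a rough Python sketch only (the helper result_column_name and the list-of-dicts layout are hypothetical, not part of dplyr or the answer):

```python
def result_column_name(test):
    """Return the column name for a given test label (hypothetical helper)."""
    return f"Test_{test}" if test in ("A", "B") else "no_results"


test = "A"
test_results = [1.1, 33, 343, 2.22, 2.4]

# First five Sepal.Length values of iris, standing in for the full data frame.
rows = [{"Sepal.Length": v} for v in [5.1, 4.9, 4.7, 4.6, 5.0]]

# Decide the name once, then add the values under that name.
name = result_column_name(test)
for row, value in zip(rows, test_results):
    row[name] = value

print(rows[0])
```

Changing test to "B" yields a Test_B column, and any other label falls back to no_results, mirroring the case_when() call above.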

looping flextable from a list does not create tables in Word

I have a series of tables that I created from a list. I'm using flextable() in R Markdown to output them to a Word document, and I'd like to put each table on its own page. The workaround from this earlier posted question doesn't fit my issue. Below is my reproducible code:
```{r echo = FALSE, message = FALSE, warning=FALSE}
library(tidyverse)
library(flextable)
my_list <- iris %>%
group_by(Species) %>%
split(f = as.factor(.$Species))
for (i in 1:length(my_list)) {
myft <- flextable(my_list[[i]]) %>%
set_caption(paste0("Table ", as.roman(i),".\n", names(my_list[i]), ".")) %>%
padding(padding = 1.5, part = "all")
print(myft)
cat("\n\n")
}
```
What I get in my output is this:
## a flextable object.
## col_keys: `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`
## header has 1 row(s)
## body has 50 row(s)
## original dataset sample:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
##
##
## a flextable object.
## col_keys: `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`
## header has 1 row(s)
## body has 50 row(s)
## original dataset sample:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 7.0 3.2 4.7 1.4 versicolor
## 2 6.4 3.2 4.5 1.5 versicolor
## 3 6.9 3.1 4.9 1.5 versicolor
## 4 5.5 2.3 4.0 1.3 versicolor
## 5 6.5 2.8 4.6 1.5 versicolor
##
##
## a flextable object.
## col_keys: `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`
## header has 1 row(s)
## body has 50 row(s)
## original dataset sample:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 6.3 3.3 6.0 2.5 virginica
## 2 5.8 2.7 5.1 1.9 virginica
## 3 7.1 3.0 5.9 2.1 virginica
## 4 6.3 2.9 5.6 1.8 virginica
## 5 6.5 3.0 5.8 2.2 virginica
Thanks for the help!
Here's how I solved the above issue based on David Gohel's reply.
```{r results='asis'}
library(tidyverse)
library(flextable)

# Split iris into one data frame per species
my_list <- iris %>%
  group_by(Species) %>%
  split(f = as.factor(.$Species))

for (i in seq_along(my_list)) {
  myft <- flextable(my_list[[i]]) %>%
    set_caption(paste0("Table ", as.roman(i), ".\n", names(my_list[i]), ".")) %>%
    padding(padding = 1.5, part = "all")
  cat("\n\n\\pagebreak\n") # added this bit to break the page before each table
  flextable_to_rmd(myft)
}
```

Boxplots lose "box" nature when plotting weighted data

I have the following data in Stata:
input drug halflife hl_weight
3 2.95 0.0066
2 6.00 0.0004
5 13.60 0.0006
1 2.82 0.0331
4 8.80 0.0001
4 1.24 0.0075
2 6.25 0.1123
4 17.20 0.0002
5 14.50 0.0020
4 5.50 0.0016
5 13.30 0.0003
4 8.26 0.0201
4 16.50 0.0103
4 11.40 0.0016
4 5.90 0.0005
4 3.99 0.0100
4 2.80 0.0073
4 3.00 0.0133
4 3.17 0.0061
4 4.95 0.1404
end
I am trying to create boxplots of drug halflives using the command below:
graph box halflife [aweight=hl_weight], over(drug)
When I include the weight option, some of the resulting box plots consist of multiple dots instead of the typical interquartile range and median:
Why does this happen and how can I fix it?
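This happens because of the weighting. Weighted quartiles are driven by the observations that carry most of the weight, so when one or two observations dominate a group, the box can shrink to almost nothing and the remaining points are drawn as outside values. In drug 4, for example, the observation with halflife 4.95 has weight 0.1404 out of a group total of 0.2190, about 64%.
I do not think there is anything to fix here. You could use the nooutsides option of graph box to hide the dots, but I would not recommend it.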
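To see how the weights can collapse a box, here is a small Python sketch that computes weighted quartiles for drug 4 from the data in the question. The quantile rule used (the smallest value whose cumulative weight share reaches p) is an assumption chosen for illustration; Stata's exact rule may differ in how it handles ties and interpolation.

```python
# (halflife, weight) pairs for drug 4, copied from the question's data.
drug4 = [
    (8.80, 0.0001), (1.24, 0.0075), (17.20, 0.0002), (5.50, 0.0016),
    (8.26, 0.0201), (16.50, 0.0103), (11.40, 0.0016), (5.90, 0.0005),
    (3.99, 0.0100), (2.80, 0.0073), (3.00, 0.0133), (3.17, 0.0061),
    (4.95, 0.1404),
]


def weighted_quantile(pairs, p):
    """Smallest value whose cumulative weight share reaches p (assumed rule)."""
    total = sum(weight for _, weight in pairs)
    running = 0.0
    for value, weight in sorted(pairs):
        running += weight
        if running >= p * total:
            return value
    return max(value for value, _ in pairs)


q1, median, q3 = (weighted_quantile(drug4, p) for p in (0.25, 0.50, 0.75))
print(q1, median, q3)
```

Under this rule all three quartiles land on 4.95, because that single observation holds most of the group's weight. The interquartile range is then zero, so every other observation lies beyond the whiskers and is plotted as a dot.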

Stata- calculating conditional means, subtracting them, dividing the difference

My dataset looks like this:
Pizzas Hamburgers Type
10.7 5.6 1
9.6 6.7 2
13.4 4.1 3
7.2 3.7 4
Here is what I need to do (this is essentially calculating a Wald estimator in econometrics, if you are familiar with it; if not, no biggie):
1. Create new categories so that an observation of type 1 is 'first' and an observation of type 2, 3, or 4 is 'other'.
2. Calculate the averages of pizzas and hamburgers by 'first' and 'other'.
3. Subtract the means between 'first' and 'other'.
4. Divide the differences.
There must be more structure than this to the problem; otherwise it's school arithmetic. This may get you started, but I think you need to show more substance about your data structure and larger goals. In a larger dataset, collapse may be a good idea, depending on what you want to do with the results.
clear
input Pizzas Hamburgers Type
10.7 5.6 1
9.6 6.7 2
13.4 4.1 3
7.2 3.7 4
end
gen First = Type == 1
egen MeanPizzas = mean(Pizzas), by(First)
egen MeanHamb = mean(Hamburgers), by(First)
sort First
gen DiffMeanPizzas = MeanPizzas[1] - MeanPizzas[_N]
gen DiffMeanHamb = MeanHamb[1] - MeanHamb[_N]
tabdisp First, c(Mean* Diff*)
--------------------------------------------------------------------------
First | MeanPizzas MeanHamb DiffMeanPizzas DiffMeanHamb
----------+---------------------------------------------------------------
0 | 10.06667 4.833333 -.6333332 -.7666669
1 | 10.7 5.6 -.6333332 -.7666669
--------------------------------------------------------------------------
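The Stata code above stops at the differences; the question's last step divides them. As a minimal Python sketch of the whole calculation on the four rows shown (the direction of the division, pizzas over hamburgers, is an assumption, since the question does not say which difference is the numerator):

```python
from statistics import mean

# (Pizzas, Hamburgers, Type) rows from the question.
rows = [(10.7, 5.6, 1), (9.6, 6.7, 2), (13.4, 4.1, 3), (7.2, 3.7, 4)]

first = [r for r in rows if r[2] == 1]
other = [r for r in rows if r[2] != 1]

# Difference in group means, "other" minus "first", matching the sign in the table.
diff_pizzas = mean([r[0] for r in other]) - mean([r[0] for r in first])
diff_hamb = mean([r[1] for r in other]) - mean([r[1] for r in first])

# Assumed orientation: pizzas difference over hamburgers difference.
ratio = diff_pizzas / diff_hamb

print(round(diff_pizzas, 4), round(diff_hamb, 4), round(ratio, 4))
```

The two differences reproduce the DiffMeanPizzas and DiffMeanHamb columns, and their ratio is about 0.826.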

Filter Pandas DataFrame by group with tag values

I want to filter a DataFrame by group. The NaN values that follow a are supposed to belong to a (it works something like a tag), and the NaNs that follow b belong to b.
I have a short example:
In [1]: import pandas as pd
   ...: from numpy import nan
In [2]: df = pd.DataFrame({'group1': ['a',nan,nan,nan,nan,'b',nan,nan,nan,nan],
'value1': [0.4,1.1,2,3,4,5,6,7,8,8.8],
'value2': [6.4, 6.9,7.1,8,9,10,11,12,13,14]
})
My desired output would be:
In [3]: df[df.group1 == 'a']
Out[3]:
group1 value1 value2
0 a 0.4 6.4
1 NaN 1.1 6.9
2 NaN 2.0 7.1
3 NaN 3.0 8.0
4 NaN 4.0 9.0
I'd appreciate any hint!
You can use ffill to forward-fill the column:
>>> df[df['group1'].ffill() == 'a']
group1 value1 value2
0 a 0.4 6.4
1 NaN 1.1 6.9
2 NaN 2.0 7.1
3 NaN 3.0 8.0
4 NaN 4.0 9.0
However, the better solution may be to forward-fill the column in the original DataFrame:
>>> df['group1'] = df['group1'].ffill()
>>> df[df['group1'] == 'a']
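
For readers who want to see what forward-filling does to the tag column, here is a plain-Python sketch that uses only the standard library. It is an illustration of the idea, not how pandas implements it, and None stands in for NaN.

```python
from itertools import accumulate

group1 = ["a", None, None, None, None, "b", None, None, None, None]
value1 = [0.4, 1.1, 2, 3, 4, 5, 6, 7, 8, 8.8]

# Carry the last seen tag forward over the gaps (None stands in for NaN).
filled = list(accumulate(group1, lambda last, cur: last if cur is None else cur))

# Keep the rows whose carried-forward tag is 'a'.
rows_a = [v for tag, v in zip(filled, value1) if tag == "a"]

print(filled)
print(rows_a)
```

The first five entries of filled become 'a' and the last five 'b', so filtering on 'a' returns the first five rows, as in the desired output.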