How to plot PCA with paired data? - pca

I am currently working with genetic data from different patients. Until now I have always used PCA to compare independent groups (for example Sick vs Control, Treatment vs Control, etc.).
But now I have paired data, meaning that there is a relationship between the samples of the different groups. The typical example is having a group of subjects and measuring each of them before and after treatment.
I did this PCA with the ThermoFisher program, but I would like to do it in R. This is the output of the ThermoFisher program: B (before treatment), P (post-treatment).
I tried looking for an example on Google, but I didn't find one.
An example dataset would be something like this:
data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
  paste("P_BT", 1:5, sep=""),  # patients 1-5 before treatment
  paste("P_AT", 1:5, sep=""))  # the same patients after treatment
rownames(data.matrix) <- paste("gene", 1:100, sep="")
for (i in 1:100) {
  bt.values <- rpois(5, lambda=sample(x=10:1000, size=1))
  at.values <- rpois(5, lambda=sample(x=10:1000, size=1))
  data.matrix[i,] <- c(bt.values, at.values)
}
head(data.matrix)
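One way this could be sketched in base R (this is not a reproduction of the ThermoFisher output): run `prcomp` on the samples and draw an arrow from each patient's before-treatment point to the same patient's after-treatment point. The log2 transform, the vectorised simulation with one Poisson rate per gene, and names such as `counts` and `scores` are illustrative assumptions.

```r
set.seed(1)

n_genes    <- 100
n_patients <- 5

# Simulated counts: genes in rows, columns P_BT1..P_BT5 (before)
# followed by P_AT1..P_AT5 (after). One Poisson rate per gene,
# recycled down every column.
gene_lambda <- sample(10:1000, n_genes, replace = TRUE)
counts <- matrix(rpois(n_genes * 2 * n_patients, lambda = gene_lambda),
                 nrow = n_genes)
colnames(counts) <- c(paste0("P_BT", 1:n_patients),
                      paste0("P_AT", 1:n_patients))
rownames(counts) <- paste0("gene", 1:n_genes)

# prcomp() expects samples in rows, so transpose first.
# log2(x + 1) is an assumed transform for count-like data.
pca <- prcomp(t(log2(counts + 1)), center = TRUE, scale. = FALSE)

scores <- data.frame(
  PC1     = pca$x[, 1],
  PC2     = pca$x[, 2],
  patient = rep(1:n_patients, times = 2),
  time    = rep(c("B", "P"), each = n_patients)
)

before <- scores[scores$time == "B", ]
after  <- scores[scores$time == "P", ]

# Points: circle = before, triangle = post; colour = patient.
plot(scores$PC1, scores$PC2,
     pch = ifelse(scores$time == "B", 16, 17),
     col = scores$patient,
     xlab = "PC1", ylab = "PC2", main = "PCA with paired samples")

# One arrow per patient, from the B sample to the P sample.
arrows(before$PC1, before$PC2, after$PC1, after$PC2,
       col = before$patient, length = 0.1)
legend("topright", legend = c("B (before)", "P (post)"), pch = c(16, 17))
```

With real data the only requirement is that the before and after rows can be matched on a patient identifier, so that each arrow joins two samples from the same subject.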


Unpredictable figure size in RMD chunk in code loop

I have an rmd chunk of r code with a loop. The structure of the code is like this:
```{r echo=FALSE, results="asis", out.width="100%"}
## out.width="100%"
## fig.width=12
## fig.height=(6+2*ceiling(6/4))
section_number <- 3
i <- 1 ## for testing
while (i <= length(target_var_list)) {
  target_var <- target_var_list[i]
  data_segments <- data_segments(wrangled_devices, target_var)
  # Code
  exposure_chart_data <- monkeyr::get_exposure_chart_data(wrangled_obs, wrangled_devices, target_var)
  exposure_plot <- monkeyr::get_exposure_plot(exposure_chart_data, target_var)
  # knitr::opts_chunk$set(fig.height=(6+2*ceiling(data_segments/4)))
  print(exposure_plot)
  # print(exposure_plot, fig.height=(12+2*ceiling(data_segments/4)))
  section_number <- section_number + 1
  cat("\n\n\n")
  i <- i + 1
}
```
I have commented out a few attempts I made to control the width and height of the plot. And I have commented out 2 attempts I made to control the knitr behaviour on a per plot basis.
The problem is that I can't find a reliable way to control the plot size that accommodates different lengths of the target_var_length.
It is possible to control the height at chunk level, but it is then fixed and won't respond to each element in the loop. Here are some visualisations. What I would like is for each individual bar to be the same size in every case. So the case with 3 values would be 75% as wide as the case with 4, and the case with 7 would span 2 rows, so it would be twice the height of the case with 4. Do you see what I mean?
After quite a few hours of messing around with different approaches, here are some insights and an answer.
knitr::opts_chunk$set
I expected this to take effect on execution and change the chunk options for whatever elements follow. To change the plot height based on the number of rows / column in a facetted plot, I tried this:
knitr::opts_chunk$set(fig.height=(6+2*ceiling(data_segments/4)))
However it has no effect. The documentation bears this out. This actually sets the default chunk settings for subsequent chunks, and has no effect whatsoever on the current chunk. I encountered another function:
knitr::opts_current$set(fig.height=(6+2*ceiling(data_segments/4)))
The documentation all but warns you off using this, and in any case I found that it didn't achieve the expected result either.
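The height formula used in both attempts can be checked in isolation before worrying about how to pass it to knitr. This is only a sketch: `facet_fig_height` and its parameter names are made up for illustration, and it assumes 4 facets per row as in the expression above.

```r
# Figure height for a faceted plot: a fixed base plus a fixed amount
# per facet row, mirroring 6 + 2 * ceiling(data_segments / 4).
facet_fig_height <- function(n_segments, per_row = 4, base = 6, row_height = 2) {
  base + row_height * ceiling(n_segments / per_row)
}

# Heights for plots with 3, 4 and 7 segments.
heights <- facet_fig_height(c(3, 4, 7))
print(heights)
```

Because of the constant base of 6, a two-row plot (7 segments) comes out at 10 rather than twice the one-row height of 8. A formula without the constant, such as a fixed height multiplied by the number of rows, scales proportionally.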
Blind Hope
I considered the possibility that I was overthinking this and left it up to blind hope by removing all efforts to control the height. Sometimes things just work out you know! ... They didn't.
Using an rmd child chunk
This is the approach that I finally got to work. It's a slightly horrible hack. My first effort was to create a separate rmd file for each plot:
```{r echo=FALSE, results="asis", out.width="100%", fig.height=(6+2*ceiling(data_segments/4))}
print(myPlot)
```
But that meant creating lots of new rmd files, and I have a major problem with how messy that would get. So I cleaned it up by using a single rmd file for any plot and wrapped the code that calls it in a function.
resize_plot <- function(resizePlot, resizeHeight) {
  resizePlot <- resizePlot
  resizeHeight <- resizeHeight
  res <- knitr::knit_child('resizePlot.rmd', quiet = TRUE)
  cat(res, sep = '\n')
}
Now to insert a custom height plot I just call my new function:
resize_plot(exposure_plot, 3.25*ceiling(data_segments/4))
And the single rmd file just looks like this:
```{r echo=FALSE, results="asis", out.width="100%", fig.height=resizeHeight}
print(resizePlot)
```
And bingo - it looks perfect!
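A possible variation, not part of the answer as originally written: `knitr::knit_child()` has an `envir` argument, and passing `envir = environment()` asks knitr to evaluate the child document in the calling function's environment, so that `resizePlot` and `resizeHeight` are looked up there rather than in the global environment. The `child` parameter is a hypothetical addition for illustration, and the function is meant to be called from inside a chunk with `results="asis"` while the parent document is being knitted.

```r
# Hypothetical variant of resize_plot(): the child file path is a parameter,
# and the child is evaluated in this function's environment so that its
# chunk option fig.height=resizeHeight and print(resizePlot) can see the
# arguments of this call.
resize_plot <- function(resizePlot, resizeHeight, child = "resizePlot.rmd") {
  res <- knitr::knit_child(child, envir = environment(), quiet = TRUE)
  cat(res, sep = "\n")
}
```

The call site stays the same, for example `resize_plot(exposure_plot, 3.25*ceiling(data_segments/4))`.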

how to extract fitted values after multiple imputation

busan<-subset(influ_busan, select = c(CNT,temp_min,temp_diff,humid_mean,hpa_mean,rad_mean,wind_mean,o3))
new_busan<-mice(busan, seed=12345, m=5)
lm_busan <- with(new_busan,lm(CNT~temp_min+temp_diff+humid_mean+hpa_mean+rad_mean+wind_mean+o3))
summary(lm_busan)
busan_predict<-data.frame(fitted.values(lm_busan))
This is a simplified version of my syntax. I use multiple imputation for the NA values, and after the imputation I want to extract the fitted values. However, I can't extract them. How can I extract the fitted values?
You can do this via the extract_imputations function from my version of mice; hopefully it will be incorporated into the main mice version shortly:
see: https://github.com/stefvanbuuren/mice/pull/51
devtools::install_github("alexwhitworth/mice")
library(mice)
new_busan <- mice(busan, seed= 12345, m=2)
busan_predict <- extract_imputations(busan, new_busan$imp, j= 1)
busan_predict <- extract_imputations(busan, new_busan$imp, j= 2)
Edit: Apparently, I didn't read the mice documentation thoroughly enough. This functionality already exists in mice as mice::complete.
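For the fitted values themselves, a minimal sketch under some assumptions: the `nhanes` dataset that ships with mice stands in for `busan`, and the model formula is illustrative. The object returned by `with()` on a `mids` object stores one fitted model per imputation in `$analyses`, so `fitted()` can be applied to each of them.

```r
library(mice)

# nhanes ships with mice and contains missing values; it stands in for busan.
imp <- mice(nhanes, m = 5, seed = 12345, printFlag = FALSE)

# with() fits the model once per imputed dataset and returns a "mira"
# object; the individual lm fits are stored in $analyses.
fit <- with(imp, lm(bmi ~ age + chl))

# One column of fitted values per imputation.
fitted_mat <- sapply(fit$analyses, fitted)

# A simple summary: average the fitted values across the imputations.
fitted_avg <- rowMeans(fitted_mat)

# A single completed dataset can be pulled out with complete().
completed_1 <- complete(imp, 1)
head(data.frame(completed_1, fitted_avg))
```

Averaging across imputations is only one simple way to summarise the per-imputation fitted values; keeping the whole matrix preserves the between-imputation variability.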

How to use regular expressions properly on a SQL files?

I have a lot of undocumented and uncommented SQL queries. I would like to extract some information within the SQL-statements. Particularly, I'm interested in DB-names, table names and if possible column names. The queries have usually the following syntax.
SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'
Usually, the statements involve several DBs and tables. I would like to extract only the DBs and tables, without any other information. I wondered whether it is possible to first extract the information that comes after FROM, JOIN and LEFT JOIN. There it is usually db.table; the letters such as o, t and s are aliases for the referenced tables. I suppose they are difficult to capture. What I tried without any success is something like:
gsub(".*FROM \\s*|WHERE|ORDER|GROUP.*", "", vec)
This assumes that each statement ends with WHERE/where or ORDER/order or GROUP... but it doesn't work out as expected.
You haven't indicated which database system you are using, but virtually all such systems have introspection facilities that would let you get this information a lot more easily and reliably than attempting to parse SQL statements. The following code, which assumes SQLite, can likely be adapted to your situation by getting a list of your databases, looping over them, and using dbConnect to connect to each one in turn, running code such as this:
library(gsubfn)
library(RSQLite)
con <- dbConnect(SQLite()) # use in memory database for testing
# create two tables for purposes of this test
dbWriteTable(con, "BOD", BOD, row.names = FALSE)
dbWriteTable(con, "iris", iris, row.names = FALSE)
# get all table names and columns
tabinfo <- Map(function(tab) names(fn$dbGetQuery(con, "select * from $tab limit 0")),
dbListTables(con))
dbDisconnect(con)
giving an R list whose names are the table names and whose entries are the column names:
> tabinfo
$BOD
[1] "Time" "demand"
$iris
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
or perhaps long form output is preferred:
setNames(stack(tabinfo), c("column", "table"))
giving:
        column table
1         Time   BOD
2       demand   BOD
3 Sepal.Length  iris
4  Sepal.Width  iris
5 Petal.Length  iris
6  Petal.Width  iris
7      Species  iris
You could use the stringi package for this.
library(stringi)
# Your string vector
myString <- "SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'"
# Three stringi functions are used:
# stri_extract_all_regex extracts the strings that have FROM or JOIN followed by some text up to the next whitespace
# stri_replace_all_regex replaces each FROM or JOIN followed by a space with an empty string
# stri_unique keeps only the unique strings
t <- stri_unique(stri_replace_all_regex(stri_extract_all_regex(myString, "((FROM|JOIN) [^\\s]+)", simplify = TRUE),
                                        "(FROM|JOIN) ", ""))
> t
[1] "mydb.table1" "mydb.sometable" "otherdb.sometable"
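The same idea can also be sketched in base R, splitting each reference into its database and table part. This assumes every reference is written as `db.table` directly after FROM or JOIN; quoted identifiers, subqueries and references without a database prefix are not handled. The names `sql`, `refs` and `result` are illustrative.

```r
sql <- "SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'"

# FROM or JOIN (case-insensitive), whitespace, then db.table.
pattern <- "(?i)\\b(?:FROM|JOIN)\\s+(\\w+)\\.(\\w+)"
matches <- regmatches(sql, gregexpr(pattern, sql, perl = TRUE))[[1]]

# Drop the leading keyword so only "db.table" remains.
refs <- sub("(?i)^(?:FROM|JOIN)\\s+", "", matches, perl = TRUE)

# Split on the literal dot and keep each db/table pair once.
parts  <- do.call(rbind, strsplit(refs, ".", fixed = TRUE))
result <- unique(data.frame(db = parts[, 1], table = parts[, 2],
                            stringsAsFactors = FALSE))
print(result)
```

Aliases such as `m`, `o` and `s` are ignored here because the pattern stops at the end of the table name.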

Rpart - accuracy of bigrams

Good evening, everyone!
I am facing a problem in R. I have a dataset containing Amazon reviews of the Playstation 4 and I would like to create a prediction model with the help of rpart and also would like to have the accuracy of this model.
The reviews have been successfully loaded to R, a corpus has been created and some preprocessing tasks have been applied:
library(RWeka)
library(tm)
library(rpart)
corpus <- Corpus(VectorSource(tr.review.ps4$reviewText))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)
The bigrams and a term document matrix are created with the following code:
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
txtTdmBi <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer, bounds = list(global=c(10, Inf))))
Then sparse-terms are deleted and a matrix is created:
dtm <- removeSparseTerms(txtTdmBi, 0.999)
dtmsparse <- as.data.frame(as.matrix(txtTdmBi))
The original dataset consists of 7561 objects. Therefore a training and test set is created as follows:
train <- dtmsparse[1:6500,]
test <- dtmsparse[6501:7561,]
Then the training is done. $overall refers to the star rating from one to five.
train$overall <- tr.review.ps4[1:6500,]$overall
When using unigrams the prediction model is created as follows:
model <- rpart(overall ~., data = train, method= 'class')
However, this is not working in my case because - I guess - the connection to the original review dataset has to be established. But how? I don't have an idea.
When I enter this code I get the following error output:
Error in terms.formula(formula, data = data) :
Can anyone help me? Thanks a lot.
Best regards
Paul
Today I was still searching for a solution to my problem. Luckily, I found the mistake.
The error message occurred because the TermDocumentMatrix was in the wrong orientation: it has terms as rows and documents as columns, while rpart needs one row per document.
I had to transpose the matrix with the following code:
txtTdmBi.t=t(txtTdmBi)
Finally it worked.
Best regards
Paul
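To illustrate why the transpose matters, and how the accuracy asked about in the question could be computed, here is a minimal sketch on made-up data. The small count matrix stands in for `as.matrix(txtTdmBi)`, the bigram names and the `overall` labels are synthetic, and the 150/50 split is arbitrary.

```r
library(rpart)
set.seed(1)

n_docs <- 200

# Toy term-document matrix: terms (bigrams) in rows, documents in columns.
tdm <- matrix(rpois(3 * n_docs, lambda = 1), nrow = 3,
              dimnames = list(c("great_game", "stopped_working", "fast_delivery"),
                              NULL))

# rpart needs one row per document, so transpose before building the data frame.
dtm_df <- as.data.frame(t(tdm))

# Synthetic class label, only so that there is something to predict.
dtm_df$overall <- factor(ifelse(dtm_df$great_game > dtm_df$stopped_working,
                                "high", "low"))

train <- dtm_df[1:150, ]
test  <- dtm_df[151:200, ]

model <- rpart(overall ~ ., data = train, method = "class")

# Accuracy: share of test documents whose predicted class matches the label.
pred     <- predict(model, newdata = test, type = "class")
accuracy <- mean(pred == test$overall)
print(accuracy)
```

After the transpose, the number of rows equals the number of documents, which is what allows a per-review column such as `overall` to be attached.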

How do you combine multiple boxplots from a List of data-frames?

This is a repost from the Statistics portion of Stack Exchange. I asked the question there and was advised to ask it here. So here it is.
I have a list of data-frames. Each data-frame has a similar structure. There is only one column in each data-frame that is numeric. Because of my data-requirements it is essential that each data-frame has different lengths. I want to create a boxplot of the numerical values, categorized over the attributes in another column. But the boxplot should include information from all the data-frames.
I hope it is a clear question. I will post sample data soon.
Sam,
I'm assuming this is a follow up to this question? Maybe your sample data will illustrate the nuances of your needs better (the "categorized over attributes in another column" part), but the same melting approach should work here.
library(ggplot2)
library(reshape2)
#Fake data
a <- data.frame(a = rnorm(10))
b <- data.frame(b = rnorm(100))
c <- data.frame(c = rnorm(1000))
#In a list
myList <- list(a,b,c)
#In a melting pot
df <- melt(myList)
#Separate boxplots for each data.frame
qplot(factor(variable), value, data = df, geom = "boxplot")
#All values plotted together as one boxplot
qplot(factor(1), value, data = df, geom = "boxplot")
a<-data.frame(c(1,2),c("x","y"))
b<-data.frame(c(3,4,5),c("a","b","c"))
boxplot(c(a[1],b[1]))
With the "1"s I select the column I want out of each data frame.
A data frame cannot have columns of different lengths (every column must have the same number of rows), but you can tell boxplot to plot multiple datasets in parallel.
Using the melt() function (from reshape2) and base R boxplot:
library(reshape2)
#Fake data
a <- data.frame(a = rnorm(10))
b <- data.frame(b = rnorm(100))
c <- data.frame(c = rnorm(100) + 5)
#In a list
myList <- list(a,b,c)
#In a melting pot
df <- melt(myList)
# plot using base R boxplot function
boxplot(value ~ variable, data = df)
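For the case described in the question, where each data frame has one numeric column plus a categorical column, a base R sketch without reshape2 is to stack the data frames with `rbind` and group by the categorical column. The column names `value` and `group` and the helper `make_df` are assumptions for illustration, and all data frames are assumed to share the same column names.

```r
set.seed(1)

# Fake data: data frames of different lengths with the same two columns.
make_df <- function(n) {
  data.frame(value = rnorm(n),
             group = rep(c("x", "y"), length.out = n),
             stringsAsFactors = FALSE)
}
df_list <- list(make_df(10), make_df(50), make_df(100))

# Stack all data frames and remember which list element each row came from.
combined <- do.call(rbind, df_list)
combined$source <- rep(seq_along(df_list),
                       times = vapply(df_list, nrow, integer(1)))

# One box per category, pooling the values from all data frames.
bp <- boxplot(value ~ group, data = combined, xlab = "group", ylab = "value")

# One box per category within each original data frame.
boxplot(value ~ group + source, data = combined, las = 2)
```

The `source` column is optional; it is only needed when the boxes should also be separated by the data frame they came from.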