How do you combine multiple boxplots from a list of data frames?

This is a repost from the statistics site on Stack Exchange. I asked the question there and was advised to ask it here instead.
I have a list of data frames. Each data frame has a similar structure, and only one column in each is numeric. Because of my data requirements, the data frames necessarily have different lengths. I want to create a boxplot of the numeric values, grouped by the attributes in another column, and the boxplot should include the data from all the data frames.
I hope the question is clear. I will post sample data soon.

Sam,
I'm assuming this is a follow up to this question? Maybe your sample data will illustrate the nuances of your needs better (the "categorized over attributes in another column" part), but the same melting approach should work here.
library(ggplot2)
library(reshape2)
#Fake data
a <- data.frame(a = rnorm(10))
b <- data.frame(b = rnorm(100))
c <- data.frame(c = rnorm(1000))
#In a list
myList <- list(a,b,c)
#In a melting pot
df <- melt(myList)
#Separate boxplots for each data.frame
qplot(factor(variable), value, data = df, geom = "boxplot")
#All values plotted together as one boxplot
qplot(factor(1), value, data = df, geom = "boxplot")

a<-data.frame(c(1,2),c("x","y"))
b<-data.frame(c(3,4,5),c("a","b","c"))
boxplot(c(a[1],b[1]))
The 1s select the column I want from each data frame.
A single data frame cannot hold columns of different lengths (every column must have the same number of rows), but you can tell boxplot to plot multiple datasets side by side.

Using the melt() function and base R boxplot:
library(reshape2) # provides melt()
#Fake data
a <- data.frame(a = rnorm(10))
b <- data.frame(b = rnorm(100))
c <- data.frame(c = rnorm(100) + 5)
#In a list
myList <- list(a,b,c)
#In a melting pot
df <- melt(myList)
# plot using base R boxplot function
boxplot(value ~ variable, data = df)
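For the "categorized over the attributes in another column" part of the question, one possible sketch is to row-bind the data frames and let the boxplot formula do the grouping. It assumes every data frame carries the same two columns; the names value and group are hypothetical, since no sample data was posted.

```r
set.seed(1)
# Hypothetical data: three data frames of different lengths,
# each with one numeric column and one grouping column.
myList <- list(
  data.frame(value = rnorm(10), group = sample(c("x", "y"), 10, replace = TRUE)),
  data.frame(value = rnorm(25), group = sample(c("x", "y"), 25, replace = TRUE)),
  data.frame(value = rnorm(40), group = sample(c("x", "y"), 40, replace = TRUE))
)
# Stack all rows into one data frame (requires identical column names).
combined <- do.call(rbind, myList)
# Remember which list element each row came from, in case it is needed later.
combined$source <- rep(seq_along(myList), times = vapply(myList, nrow, integer(1)))
# One box per group, pooling the values from every data frame.
boxplot(value ~ group, data = combined)
```

Because the rows are stacked rather than placed side by side, the differing lengths of the data frames are not a problem.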


How to update fillColor palette to selected input in shiny map?

I am having trouble transitioning my map from static to reactive so a user can select what data they want to look at. Somehow I'm not successfully connecting the input to the dataframe. My data is from a shapefile and looks roughly like this:
NAME Average Rate geometry
1 Alcona 119.7504 0.1421498 MULTIPOLYGON (((-83.88711 4...
2 Alger 120.9212 0.1204398 MULTIPOLYGON (((-87.11602 4...
3 Allegan 128.4523 0.1167062 MULTIPOLYGON (((-85.54342 4...
4 Alpena 114.1528 0.1410852 MULTIPOLYGON (((-83.3434 44...
5 Antrim 124.8554 0.1350004 MULTIPOLYGON (((-84.84877 4...
6 Arenac 127.8809 0.1413534 MULTIPOLYGON (((-83.7555 43...
In the server section below, you can see that I tried to use reactive to get the selected variable and when I write print(select) it does print the correct variable name, but when I try to put it into the colorNumeric() function it's clearly not being recognized. The map I get is all just the same shade of blue instead of different shades based on the value of the variable in that county.
ui <- fluidPage(
fluidRow(
selectInput(inputId="var",
label="Select variable",
choices=list("Average"="Average",
"Rate"="Rate"),
selected=1)
),
fluidRow(
leafletOutput("map")
)
)
server <- function(input, output, session) {
# Data sources
counties <- st_read("EITC_counties.shp") %>%
st_transform(crs="+init=epsg:4326")
counties_clean <- select(counties, NAME, X2020_Avg., X2020_Takeu)
counties_clean <- counties_clean %>%
rename("Average"="X2020_Avg.",
"Rate"="X2020_Takeu")
# Map
variable <- reactive({
input$var
})
output$map <- renderLeaflet({
select <- variable()
print(select)
pal <- colorNumeric(palette = "Blues", domain = counties_clean$select, na.color = "black")
color_pal <- counties_clean$select
leaflet()%>%
setView( -84.51, 44.18, zoom=5) %>%
addPolygons(data=counties_clean, layerId=~NAME,
weight = 1, smoothFactor=.5,
fillOpacity=.7,
fillColor=~pal(color_pal()),
highlightOptions = highlightOptions(color = "white",
weight = 2,
bringToFront = TRUE)) %>%
addProviderTiles(providers$CartoDB.Positron)
})
}
shinyApp(ui, server)
I've tried making the reaction into an event and also using the observe function using a leaflet proxy but it only produced errors. I also tried to skip the reactive definition and just put input$var directly into the palette (counties_clean$input$var), but it similarly did not show any color variation.
When I previously created a static map setting the palette using counties_clean$Average it came out correctly, but replacing Average with a user input is where I appear to be going wrong. Thanks in advance for any guidance you can provide and please let me know if I can share any additional clarification.
Unfortunately, your code is not reproducible without the data, but the mistake is most likely in this line
color_pal <- counties_clean$select
This line extracts a column literally named select from your data. No such column exists, so it returns NULL.
What you want instead is the column whose name is stored in select, so try:
color_pal <- counties_clean[[select]]
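The difference between the two lookups can be checked on a plain data frame. The sketch below uses a small stand-in for counties_clean without the geometry column; the values are made up for illustration.

```r
# Toy stand-in for counties_clean (values are illustrative only).
counties_clean <- data.frame(
  NAME    = c("Alcona", "Alger", "Allegan"),
  Average = c(119.75, 120.92, 128.45),
  Rate    = c(0.142, 0.120, 0.117)
)
select <- "Rate"  # what input$var would hold after the user picks "Rate"

by_dollar  <- counties_clean$select     # looks for a column literally named "select"
by_bracket <- counties_clean[[select]]  # looks up the column named by the string

is.null(by_dollar)  # TRUE: there is no such column
by_bracket          # the Rate values
```

The same lookup is probably needed for the domain argument of colorNumeric(), which also uses counties_clean$select in the question. And since color_pal is then a plain vector rather than a reactive, it would presumably be referenced as ~pal(color_pal), without the trailing parentheses.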

How to plot PCA with paired data?

I am currently working with genetic data from different patients. To date I have always worked with PCAs by comparing independent groups. Example: (Sick Vs Control, Treatment Vs Control etc.)
But now I have paired data. I mean that there exists a relationship between the samples of different groups. The typical example is having a group of subjects and measuring each of them before and after treatment.
I did this PCA with the ThermoFisher program, but I would like to do it in R. This is the output of the ThermoFisher program: B (before treatment), P (post-treatment).
I tried looking for an example on Google, but I didn't find one.
An example would be something like this:
data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
paste("P_BT", 1:5, sep=""),
paste("P_AT", 1:5, sep=""))
rownames(data.matrix) <- paste("gene", 1:100, sep="")
for (i in 1:100) {
wt.values <- rpois(5, lambda=sample(x=10:1000, size=1))
ko.values <- rpois(5, lambda=sample(x=10:1000, size=1))
data.matrix[i,] <- c(wt.values, ko.values)
}
head(data.matrix)
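One possible starting point, offered only as a sketch, is to run prcomp() on the transposed matrix and join each subject's two samples with a line segment. It assumes that P_BT1 and P_AT1 belong to the same patient (and so on for 2 to 5), which the question does not state, and the log2 transform is just one common choice for count-like data.

```r
set.seed(42)
# Same simulated layout as above: 100 genes x 10 samples (5 before, 5 after).
data.matrix <- matrix(nrow = 100, ncol = 10)
colnames(data.matrix) <- c(paste0("P_BT", 1:5), paste0("P_AT", 1:5))
rownames(data.matrix) <- paste0("gene", 1:100)
for (i in 1:100) {
  data.matrix[i, ] <- c(rpois(5, lambda = sample(10:1000, 1)),
                        rpois(5, lambda = sample(10:1000, 1)))
}

# prcomp() expects samples in rows, so transpose; log2 tames the count scale.
pca <- prcomp(t(log2(data.matrix + 1)), center = TRUE)
scores <- pca$x[, 1:2]

before <- paste0("P_BT", 1:5)
after  <- paste0("P_AT", 1:5)

# Colour by time point: the first five rows are "before", the last five "after".
plot(scores, col = rep(c("steelblue", "firebrick"), each = 5), pch = 19,
     xlab = "PC1", ylab = "PC2")
text(scores, labels = rownames(scores), pos = 3, cex = 0.7)
# Join each patient's before/after points to make the pairing visible.
segments(scores[before, 1], scores[before, 2],
         scores[after, 1], scores[after, 2], col = "grey50")
```

The segments only visualise the pairing; they do not adjust the PCA itself for patient effects.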

How to use regular expressions properly on SQL files?

I have a lot of undocumented and uncommented SQL queries. I would like to extract some information from the SQL statements. In particular, I'm interested in DB names, table names and, if possible, column names. The queries usually have the following syntax.
SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'
Usually, the statements involve several DBs and tables. I would like to extract only the DBs and tables, without any other information. I wondered whether it is possible to first extract what follows FROM, JOIN, and LEFT JOIN. There it is usually db.table; letters such as o, t, and s are aliases for the referenced tables, and I suppose they are difficult to capture. What I tried without any success is something like:
gsub(".*FROM \\s*|WHERE|ORDER|GROUP.*", "", vec)
This assumes that each statement ends with WHERE/where, ORDER/order, or GROUP..., but it doesn't work as expected.
You haven't indicated which database system you are using but virtually all such systems have introspection facilities that would allow you to get this information a lot more easily and reliably than attempting to parse SQL statements. The following code which supposes SQLite can likely be adapted to your situation by getting a list of your databases and then looping over the databases and using dbConnect to connect to each one in turn running code such as this:
library(gsubfn)
library(RSQLite)
con <- dbConnect(SQLite()) # use in memory database for testing
# create two tables for purposes of this test
dbWriteTable(con, "BOD", BOD, row.names = FALSE)
dbWriteTable(con, "iris", iris, row.names = FALSE)
# get all table names and columns
tabinfo <- Map(function(tab) names(fn$dbGetQuery(con, "select * from $tab limit 0")),
dbListTables(con))
dbDisconnect(con)
giving an R list whose names are the table names and whose entries are the column names:
> tabinfo
$BOD
[1] "Time" "demand"
$iris
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
or perhaps long form output is preferred:
setNames(stack(tabinfo), c("column", "table"))
giving:
column table
1 Time BOD
2 demand BOD
3 Sepal.Length iris
4 Sepal.Width iris
5 Petal.Length iris
6 Petal.Width iris
7 Species iris
You could use the stringi package for this.
library(stringi)
# Your string vector
myString <- "SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'"
# Three stringi functions are used:
# stri_extract_all_regex extracts every FROM or JOIN followed by the text up to the next whitespace
# stri_replace_all_regex replaces the leading FROM or JOIN (and the space after it) with an empty string
# stri_unique keeps only the unique strings
tabs <- stri_unique(stri_replace_all_regex(stri_extract_all_regex(myString, "((FROM|JOIN) [^\\s]+)", simplify = TRUE),
"(FROM|JOIN) ", ""))
> tabs
[1] "mydb.table1" "mydb.sometable" "otherdb.sometable"
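The same idea can be sketched in base R, with an extra step that splits each reference into its database and table parts. This assumes every reference has the form db.table and follows FROM or JOIN directly; subqueries, quoted identifiers, and unqualified table names are not handled.

```r
sql <- "SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'"

# Find every FROM/JOIN keyword together with the identifier that follows it.
hits <- regmatches(sql, gregexpr("(FROM|JOIN)\\s+[A-Za-z0-9_.]+", sql,
                                 ignore.case = TRUE))[[1]]
# Drop the keyword and keep each db.table reference once.
refs <- unique(sub("^(FROM|JOIN)\\s+", "", hits, ignore.case = TRUE))

# Split "db.table" on the literal dot into two columns.
parts <- do.call(rbind, strsplit(refs, ".", fixed = TRUE))
result <- data.frame(db = parts[, 1], table = parts[, 2])
result
```

The aliases (m, o, t, s) are never captured because the character class stops at the first whitespace after the table name.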

Rpart - accuracy of bigrams

Good evening, everyone!
I am facing a problem in R. I have a dataset containing Amazon reviews of the Playstation 4 and I would like to create a prediction model with the help of rpart and also would like to have the accuracy of this model.
The reviews have been successfully loaded to R, a corpus has been created and some preprocessing tasks have been applied:
library(RWeka)
library(tm)
library(rpart)
corpus <- Corpus(VectorSource(tr.review.ps4$reviewText))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)
The bigrams and a term document matrix are created with the following code:
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
txtTdmBi <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer, bounds = list(global=c(10, Inf))))
Then sparse-terms are deleted and a matrix is created:
dtm <- removeSparseTerms(txtTdmBi, 0.999)
dtmsparse <- as.data.frame(as.matrix(txtTdmBi))
The original dataset consists of 7561 objects. Therefore a training and test set is created as follows:
train <- dtmsparse[1:6500,]
test <- dtmsparse[6501:7561,]
Then the training is done. $overall refers to the star rating from one to five.
train$overall <- tr.review.ps4[1:6500,]$overall
When using unigrams the prediction model is created as follows:
model <- rpart(overall ~., data = train, method= 'class')
However, this is not working in my case because - I guess - the connection to the original review dataset has to be established. But how? I don't have an idea.
When I am entering this code I get following error-output:
Error in terms.formula(formula, data = data) :
Can anyone help me? Thanks a lot.
Best regards
Paul
Today I was still searching for a solution to my problem. Luckily, I found the mistake.
The error message occurred because the TermDocumentMatrix was oriented the wrong way (terms in rows, documents in columns).
I had to transpose the matrix with the following code:
txtTdmBi.t=t(txtTdmBi)
Finally it worked.
Best regards
Paul
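Why the orientation matters can be seen on a toy matrix. The sketch below uses a plain base-R matrix as a stand-in for the term-document matrix; the bigrams, counts, and ratings are invented, and make.names() is applied only as a precaution so that column names containing spaces are safe to use in a model formula.

```r
# Toy term-document matrix: 3 bigrams (rows) x 4 reviews (columns).
tdm <- matrix(c(1, 0, 2, 0,
                0, 1, 0, 1,
                3, 0, 0, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("great game", "fast shipping", "bad controller"),
                              paste0("doc", 1:4)))
dim(tdm)  # 3 4: slicing rows here would select bigrams, not reviews

# Transpose so that each row is a review and each column a bigram.
docs <- as.data.frame(t(tdm))
names(docs) <- make.names(names(docs))

# Now one rating per row can be attached (hypothetical ratings).
docs$overall <- factor(c(5, 4, 5, 1))
dim(docs)  # 4 4: four reviews, three bigram columns plus the rating
```

With the untransposed matrix, train <- dtmsparse[1:6500,] picks terms rather than reviews, so the ratings no longer line up with the rows.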

add column with row wise mean over selected columns using dplyr

I have a data frame which contains several variables which got measured at different time points (e.g., test1_tp1, test1_tp2, test1_tp3, test2_tp1, test2_tp2,...).
I am now trying to use dplyr to add a new column to a data frame that calculates the row wise mean over a selection of these columns (e.g., mean over all time points for test1).
I struggle even with the syntax for calculating the mean over explicitly named columns. What I tried without success was:
data %>% ... %>% mutate(test1_mean = mean(test1_tp1, test1_tp2, test1_tp3, na.rm = TRUE))
I would further like to use regex/wildcards to select the column names, so something like
data %>% ... %>% mutate(test1_mean = mean(matches("test1_.*"), na.rm = TRUE))
You can use starts_with inside select to find all columns starting with a certain string.
data %>%
mutate(test1 = select(., starts_with("test1_")) %>%
rowMeans(na.rm = TRUE))
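With dplyr 1.0.0 or later, a similar result can be sketched with across(), which also accepts a regular expression through matches(). The data frame below is hypothetical and only mimics the column naming scheme from the question.

```r
library(dplyr)

# Hypothetical data following the naming scheme in the question.
data <- data.frame(
  test1_tp1 = c(1, 4, NA),
  test1_tp2 = c(2, 5, 8),
  test1_tp3 = c(3, NA, 8),
  test2_tp1 = c(10, 20, 30)
)

out <- data %>%
  # across() returns the selected columns as a data frame,
  # which rowMeans() then averages row by row.
  mutate(test1_mean = rowMeans(across(matches("^test1_tp[0-9]+$")), na.rm = TRUE))

out$test1_mean
```

The anchored pattern is a deliberate choice: a looser selector such as starts_with("test1_") would also pick up test1_mean itself if the pipeline were run a second time on its own output.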
Here's how you could do it in dplyr - I use the iris data as an example:
iris %>% mutate(sum.Sepal = rowSums(.[grep("^Sepal", names(.))]))
This computes rowwise sums of all columns that start with "Sepal". You can use rowMeans instead of rowSums the same way.
Not a dplyr solution, but you can try:
cols_2sum <- grepl('^test1_', colnames(data))
# drop = FALSE keeps a data frame even if only one column matches
rowMeans(data[, cols_2sum, drop = FALSE], na.rm = TRUE)