add column with row-wise mean over selected columns using dplyr - regex

I have a data frame which contains several variables that were measured at different time points (e.g., test1_tp1, test1_tp2, test1_tp3, test2_tp1, test2_tp2, ...).
I am now trying to use dplyr to add a new column to the data frame that contains the row-wise mean over a selection of these columns (e.g., the mean over all time points for test1).
I struggle even with the syntax for calculating the mean over explicitly named columns. What I tried without success was:
data %>% ... %>% mutate(test1_mean = mean(test1_tp1, test1_tp2, test1_tp3, na.rm = TRUE))
I would further like to use regex/wildcards to select the column names, so something like:
data %>% ... %>% mutate(test1_mean = mean(matches("test1_.*"), na.rm = TRUE))

You can use starts_with inside select to find all columns starting with a certain string.
data %>%
  mutate(test1 = select(., starts_with("test1_")) %>%
           rowMeans(na.rm = TRUE))
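A minimal sketch of how this behaves on toy data that uses the question's column names (hypothetical values):
library(dplyr)
# Hypothetical data with the question's naming scheme
data <- data.frame(test1_tp1 = c(1, 2, NA),
                   test1_tp2 = c(3, 4, 5),
                   test1_tp3 = c(5, NA, 7))
data %>%
  mutate(test1_mean = select(., starts_with("test1_")) %>%
           rowMeans(na.rm = TRUE))
# test1_mean comes out as 3, 3, 6 (NAs are skipped because of na.rm = TRUE)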

Here's how you could do it in dplyr - I use the iris data as an example:
iris %>% mutate(sum.Sepal = rowSums(.[grep("^Sepal", names(.))]))
This computes rowwise sums of all columns that start with "Sepal". You can use rowMeans instead of rowSums the same way.

Not a dplyr solution, but you can try:
cols_2sum <- grepl('test1',colnames(data))
rowMeans(data[,cols_2sum])

Related

How to update fillColor palette to selected input in shiny map?

I am having trouble transitioning my map from static to reactive so a user can select what data they want to look at. Somehow I'm not successfully connecting the input to the dataframe. My data is from a shapefile and looks roughly like this:
NAME Average Rate geometry
1 Alcona 119.7504 0.1421498 MULTIPOLYGON (((-83.88711 4...
2 Alger 120.9212 0.1204398 MULTIPOLYGON (((-87.11602 4...
3 Allegan 128.4523 0.1167062 MULTIPOLYGON (((-85.54342 4...
4 Alpena 114.1528 0.1410852 MULTIPOLYGON (((-83.3434 44...
5 Antrim 124.8554 0.1350004 MULTIPOLYGON (((-84.84877 4...
6 Arenac 127.8809 0.1413534 MULTIPOLYGON (((-83.7555 43...
In the server section below, you can see that I tried to use reactive to get the selected variable. When I write print(select) it prints the correct variable name, but when I put it into the colorNumeric() function it's clearly not being recognized: the map I get is all the same shade of blue instead of different shades based on the value of the variable in that county.
ui <- fluidPage(
  fluidRow(
    selectInput(inputId = "var",
                label = "Select variable",
                choices = list("Average" = "Average",
                               "Rate" = "Rate"),
                selected = 1)
  ),
  fluidRow(
    leafletOutput("map")
  )
)

server <- function(input, output, session) {
  # Data sources
  counties <- st_read("EITC_counties.shp") %>%
    st_transform(crs = "+init=epsg:4326")
  counties_clean <- select(counties, NAME, X2020_Avg., X2020_Takeu)
  counties_clean <- counties_clean %>%
    rename("Average" = "X2020_Avg.",
           "Rate" = "X2020_Takeu")

  # Map
  variable <- reactive({
    input$var
  })

  output$map <- renderLeaflet({
    select <- variable()
    print(select)
    pal <- colorNumeric(palette = "Blues", domain = counties_clean$select, na.color = "black")
    color_pal <- counties_clean$select
    leaflet() %>%
      setView(-84.51, 44.18, zoom = 5) %>%
      addPolygons(data = counties_clean, layerId = ~NAME,
                  weight = 1, smoothFactor = .5,
                  fillOpacity = .7,
                  fillColor = ~pal(color_pal()),
                  highlightOptions = highlightOptions(color = "white",
                                                      weight = 2,
                                                      bringToFront = TRUE)) %>%
      addProviderTiles(providers$CartoDB.Positron)
  })
}

shinyApp(ui, server)
I've tried making the reactive into an event and also using the observe function with a leaflet proxy, but it only produced errors. I also tried to skip the reactive definition and just put input$var directly into the palette (counties_clean$input$var), but it similarly did not show any color variation.
When I previously created a static map setting the palette using counties_clean$Average it came out correctly, but replacing Average with a user input is where I appear to be going wrong. Thanks in advance for any guidance you can provide and please let me know if I can share any additional clarification.
Unfortunately, your code is not reproducible without the data, but the mistake is most likely in this line:
color_pal <- counties_clean$select
What this line does is extract a column literally named select from your data. That column does not exist, so it returns NULL.
What you want instead is to extract the column whose name is stored in select, so try:
color_pal <- counties_clean[[select]]
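To see the difference on a small hypothetical data frame (not the question's data):
df <- data.frame(Average = c(1, 2, 3), Rate = c(0.1, 0.2, 0.3))
select <- "Average"
df$select     # looks for a column literally named "select": returns NULL
df[[select]]  # looks up the column named by the value of select: returns 1 2 3
Note that the domain argument of colorNumeric() in your code also uses counties_clean$select, so it needs the same fix.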

How to filter databases in list based on column with different name

I have a list that includes different databases (data frames) with different information.
The first column of every database contains the information that I need to create graphs. I need to filter each database based on an external vector of values matched against its first column.
For example:
mtcars2 <- mtcars %>% rownames_to_column("cars_model") %>% as.data.frame()
mtcars3 <- mtcars %>% rownames_to_column("cars_model_second") %>% as.data.frame()
list_two_database <- list(mtcars2, mtcars3)
model_to_select <- c("Fiat 128", "Honda Civic", "Lotus Europa")
Is there a way to filter the list based on THE FIRST COLUMN OF EACH DATABASE included in the list (cars_model and cars_model_second) WITHOUT RENAMING THE COLUMN ITSELF?
The goal is to obtain a list that includes the two databases, each containing only the three models.
Thank you in advance
The following works by extracting the first column name as a string first_col and then converting this string into a form that can be used within dplyr:
mtcars2 <- mtcars %>% rownames_to_column("cars_model") %>% as.data.frame()
mtcars3 <- mtcars %>% rownames_to_column("cars_model_second") %>% as.data.frame()
list_two_database <- list(mtcars2, mtcars3)
model_to_select <- c("Fiat 128", "Honda Civic", "Lotus Europa")
func = function(df){
  first_col = colnames(df)[1]
  filter(df, !!sym(first_col) %in% model_to_select)
}
lapply(list_two_database, func)
Notes:
sym(.) is used to turn a text string into a symbol
!! (unquote) only works inside dplyr verbs and tells dplyr to evaluate the symbol as a column name
Used together you have something like:
var = "my_col"
df %>% filter(!!sym(var) == 1)
Which is equivalent to df %>% filter(my_col == 1)
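As a quick sanity check on the example list above (a sketch, assuming the answer's code has been run), each filtered element should end up with one row per model in model_to_select:
filtered <- lapply(list_two_database, func)
sapply(filtered, nrow)
# expected: 3 3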

kableExtra: set table width on ioslides_presentation

I am using kableExtra in an ioslides_presentation R Markdown file.
The table, however, does not fit the width of the slide:
I am using the following code to generate the table:
library(knitr)
library(kableExtra)
data %>%
kable() %>%
kable_styling(c("striped", "hover", "condensed", "responsive"),full_width = FALSE)
I've also tried the argument full_width = TRUE, but was not successful.
How can I force the table to fit the slide width automatically?
It turns out that it is possible to add a scroll bar to the table.
This can be done using the scroll_box function as follows:
library(knitr)
library(kableExtra)
data %>%
kable() %>%
kable_styling(c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width = "100%")
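If a scroll bar is not what you are after, another option (not part of the original answer, so treat it as a sketch) is to shrink the table via the font_size argument of kable_styling so that it fits the slide:
library(knitr)
library(kableExtra)
data %>%
  kable() %>%
  kable_styling(c("striped", "hover", "condensed", "responsive"),
                font_size = 12)  # hypothetical size; adjust until the table fits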

Transforming/reshaping a dataset : Ranking College football teams for the last 50 years

I need some help transforming my dataset. I would appreciate any help or feedback.
I have data of College football scores for the last 50 years. I currently have a data frame like in picture 1, and I need to get a data frame similar to picture 2. The data frame I'm trying to get needs to have a concatenated list of all teams played each year, and also two columns that keep track of wins and losses respectively. The concatenated list must be specific to each year. So basically a data frame like picture 2, but with data for every year separately.
This is the code to get a cleaned-up data frame similar to the one I have in picture 1.
# Make generic data frame and get data
practice = data.frame('a' = character(), 'b' = character(), 'c' = numeric(),
                      'd' = character(), 'e' = numeric(), 'f' = character())
widths = c(10, 28, 5, 28, 3, 19)
years = 1960:2010
for (i in years){
  football_page = paste('http://homepages.cae.wisc.edu/~dwilson/rsfc/history/howell/cf', i, 'gms.txt', sep = '')
  get_data = read.fwf(football_page, widths)
  practice = rbind(practice, get_data)
}
heading = list('DATE', 'AWAY TEAM', 'AWAY SCORE', 'HOME TEAM', 'HOME SCORE', 'LOCATION')
colnames(practice) = heading
# Fixing season dates
practice = cbind('SEASON' = numeric(nrow(practice)), practice)
fix_date = matrix(0, nrow = nrow(practice))
for (j in 1:nrow(fix_date)){
  fix_date[j, 1] = substr(practice[j, 2], 7, 10)
}
fix_date = as.numeric(fix_date)
practice$SEASON = fix_date
for (j in 1:nrow(practice)){
  if (grepl('01/.......', practice[j, 2]))
    practice[j, 1] = practice[j, 1] - 1
}
# fix names
practice[, 3] = gsub(' ', '', practice[, 3])
practice[, 5] = gsub(' ', '', practice[, 5])
# drop location and columns
practice = practice[, -7]
practice = practice[, -2]
The data set is called practice.
I can't thoroughly test it without a sample of your data or something like it, but I think this would do the job.
# Create a function to get win and loss counts by season for a single team
teamsum <- function(teamname) {
  require(dplyr)
  df <- practice %>%
    # Reduce the data set to games involving a single team (note the backticks
    # around column names that contain spaces)
    filter(`AWAY TEAM` == teamname | `HOME TEAM` == teamname) %>%
    # Create a 0/1 indicator for whether or not that team won each of those games. Note
    # that ties will get treated as losses here; you could change that with a more
    # complicated set of if/else statements
    mutate(team = teamname,
           win = ifelse((`AWAY TEAM` == teamname & `AWAY SCORE` > `HOME SCORE`) |
                          (`HOME TEAM` == teamname & `HOME SCORE` > `AWAY SCORE`), 1, 0)) %>%
    # Group the data by season for the summing to follow
    group_by(SEASON) %>%
    # Reduce the data to a table with counts of wins and losses by season
    summarise(wins = sum(win),
              losses = n() - sum(win)) %>%
    # Add the team name as an id column to that summary table. In dplyr piping, '.' is
    # the object created by the preceding step in the pipeline -- here, that summary
    # table of wins and losses.
    cbind(team = rep(teamname, nrow(.)), .)
  return(df)
}
# Apply that function to a vector of unique team names to make a list with
# tables of win & loss counts by season for each team in the original data.
# This version assumes that every team was the home team at least once.
teamlist <- lapply(unique(practice[,"HOME TEAM"]), teamsum)
# Merge the elements of that list into a single data frame. You could rbind, too.
df <- Reduce(function(...) merge(...), teamlist)
Another dplyr answer
I used your code to get the dataset and then duplicated the team columns to use as a key for reshaping it. You can probably use the same concept to achieve the objective in base R.
library(dplyr)
library(tidyr)
practice_2 <- practice %>%
  mutate(home = `HOME TEAM`,
         away = `AWAY TEAM`) %>%
  # transform dataset to long format with `tidyr::gather()`
  gather(LOC, TEAM, 6:7) %>%
  group_by(SEASON, TEAM) %>%
  mutate(won = ifelse(LOC == "home",
                      as.numeric(`HOME SCORE` > `AWAY SCORE`),
                      as.numeric(`AWAY SCORE` > `HOME SCORE`)),
         lost = ifelse(LOC == "home",
                       as.numeric(`HOME SCORE` <= `AWAY SCORE`),
                       as.numeric(`AWAY SCORE` <= `HOME SCORE`)),
         op = ifelse(LOC == "home", `AWAY TEAM`, `HOME TEAM`)) %>%
  summarise(WINS = sum(won, na.rm = TRUE),
            LOSSES = sum(lost, na.rm = TRUE),
            OPPONENTS = list(unique(op)))
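If you would rather have the opponents as a single comma-separated string than as a list-column (a sketch; it would replace the OPPONENTS line in the summarise() above), paste() with collapse does the concatenation:
# Minimal illustration with a hypothetical vector of opponents
op <- c("Michigan", "Ohio State", "Michigan")
paste(unique(op), collapse = ", ")
# [1] "Michigan, Ohio State"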

How do you combine multiple boxplots from a List of data-frames?

This is a repost from the Statistics portion of the Stack Exchange. I asked the question there and was advised to ask it here, so here it is.
I have a list of data frames. Each data frame has a similar structure. There is only one column in each data frame that is numeric. Because of my data requirements, it is essential that the data frames have different lengths. I want to create a boxplot of the numerical values, categorized over the attributes in another column, but the boxplot should include information from all the data frames.
I hope it is a clear question. I will post sample data soon.
Sam,
I'm assuming this is a follow-up to this question? Maybe your sample data will illustrate the nuances of your needs better (the "categorized over attributes in another column" part), but the same melting approach should work here.
library(ggplot2)
library(reshape2)
#Fake data
a <- data.frame(a = rnorm(10))
b <- data.frame(b = rnorm(100))
c <- data.frame(c = rnorm(1000))
#In a list
myList <- list(a,b,c)
#In a melting pot
df <- melt(myList)
#Separate boxplots for each data.frame
qplot(factor(variable), value, data = df, geom = "boxplot")
#All values plotted together as one boxplot
qplot(factor(1), value, data = df, geom = "boxplot")
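qplot() is the older ggplot2 shortcut; an equivalent sketch with ggplot(), using the melted df from above, would be:
library(ggplot2)
# One boxplot per original data frame ('variable' identifies the source column)
ggplot(df, aes(x = factor(variable), y = value)) +
  geom_boxplot()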
a <- data.frame(c(1,2), c("x","y"))
b <- data.frame(c(3,4,5), c("a","b","c"))
boxplot(c(a[1], b[1]))
With the "1"s I select the column I want out of each data frame.
A data frame cannot have columns of different lengths (it has to have the same number of rows for each column), but you can tell boxplot to plot multiple datasets in parallel.
Using the melt() function and base R boxplot:
library(reshape2)
#Fake data
a <- data.frame(a = rnorm(10))
b <- data.frame(b = rnorm(100))
c <- data.frame(c = rnorm(100) + 5)
#In a list
myList <- list(a,b,c)
#In a melting pot
df <- melt(myList)
# plot using base R boxplot function
boxplot(value ~ variable, data = df)