Transforming/reshaping a dataset : Ranking College football teams for the last 50 years - regex

I need some help transforming my dataset. I would appreciate any help or feedback.
I have data of College football scores for the last 50 years. I currently have a data frame like in picture 1, and I need to get a data frame similar to picture 2. The data frame I'm trying to get needs to have a concatenated list of all teams played each year, and also two columns that keep track of wins an losses respectively. The concatenated list must be specific for each year. So basically a data frame like picture 2, but has data for every year separately.
This is the code to get a cleaned up dataframe similar to the one I have in picture one.
# Make generic data frame and get data
practice = data.frame('a'=character(), 'b'=character(), 'c'= numeric(), 'd'=character(), 'e'= numeric(), 'f'=character())
widths = c(10, 28, 5, 28, 3, 19)
years = 1960:2010
for (i in years){
football_page = paste('http://homepages.cae.wisc.edu/~dwilson/rsfc/history/howell/cf', i, 'gms.txt',sep = '')
get_data = read.fwf(football_page, widths)
practice = rbind(practice, get_data)
}
heading = list('DATE', 'AWAY TEAM', 'AWAY SCORE', 'HOME TEAM', 'HOME SCORE', 'LOCATION')
colnames(practice) = heading
# Fixing season dates
practice = cbind('SEASON'=numeric(nrow(practice)),practice)
fix_date = matrix(0, nrow = nrow(practice))
for (j in 1:nrow(fix_date)){
fix_date[j,1] = substr(practice[j,2],7,10)
}
fix_date = as.numeric(fix_date)
practice$SEASON = fix_date
for (j in 1:nrow(practice)){
if (grepl('01/.......', practice[j,2]))
practice[j,1] = practice[j,1]-1
}
#fix names
practice[,3]=gsub(' ','',practice[,3])
practice[,5]=gsub(' ','',practice[,5])
#drop location and columns
practice = practice[, -7]
practice = practice[, -2]
The data set is called practice.

I can't thoroughly test it without a sample of your data or something like it, but I think this would do the job.
# Create a function to get win and loss counts by season for a single team
teamsum <- function(teamname) {
require(dplyr)
df <- practice %>%
# Reduce the data set to games involving a single team
filter(AWAY TEAM==teamname | HOME TEAM==teamname) %>%
# Create a 0/1 indicator for whether or not that team won each of those games. Note
# that ties will get treated as losses here; you could change that with a more
# complicated set of if/else statements
mutate(team = teamname,
win = ifelse((AWAY TEAM==teamname & AWAY SCORE > HOME SCORE) |
(HOME TEAM==teamname & HOME SCORE > AWAY SCORE), 1, 0)) %>%
# Group the data by season for the summing to follow
group_by(SEASON) %>%
# Reduce the data to a table with counts of wins and losses by season
summarise(wins = sum(win),
losses = n() - sum(win)) %>%
# Add the team name as an id column to that summary table. In dplyr piping, '.' is
# the object created by the preceding step in the pipeline -- here, that summary
# table of wins and losses.
cbind(team = rep(teamname, nrow(.)), .) %>%
return(df)
}
# Apply that function to a vector of unique team names to make a list with
# tables of win & loss counts by season for each team in the original data.
# This version assumes that every team was the home team at least once.
teamlist <- lapply(unique(practice[,"HOME TEAM"]), teamsum)
# Merge the elements of that list into a single data frame. You could rbind, too.
df <- Reduce(function(...) merge(...), teamlist)

Another dplyr answer
I used your code to get the dataset and then I duplicated the team columns to use as key for reshaping the dataset, you can probably use the same concept to achieve the objective in base R.
library(dplyr)
library(tidyr)
practice_2 <- practice %>%
mutate(home = `HOME TEAM`,
away = `AWAY TEAM`) %>%
# transform dataset to long format with `tidyr::gather()`
gather(LOC, TEAM, 6:7) %>%
group_by(SEASON, TEAM) %>%
mutate(won = ifelse(LOC == "home",
as.numeric(`HOME SCORE` > `AWAY SCORE`),
as.numeric(`AWAY SCORE` > `HOME SCORE`)),
lost = ifelse(LOC == "home",
as.numeric(`HOME SCORE` <= `AWAY SCORE`),
as.numeric(`AWAY SCORE` <= `HOME SCORE`)),
op = ifelse(LOC == "home", `AWAY TEAM`, `HOME TEAM`)) %>%
summarise(WINS = sum(won, na.rm = TRUE),
LOSSES = sum(lost, na.rm = TRUE),
OPPONENTS = list(unique(op)))

Related

How to update fillColor palette to selected input in shiny map?

I am having trouble transitioning my map from static to reactive so a user can select what data they want to look at. Somehow I'm not successfully connecting the input to the dataframe. My data is from a shapefile and looks roughly like this:
NAME Average Rate geometry
1 Alcona 119.7504 0.1421498 MULTIPOLYGON (((-83.88711 4...
2 Alger 120.9212 0.1204398 MULTIPOLYGON (((-87.11602 4...
3 Allegan 128.4523 0.1167062 MULTIPOLYGON (((-85.54342 4...
4 Alpena 114.1528 0.1410852 MULTIPOLYGON (((-83.3434 44...
5 Antrim 124.8554 0.1350004 MULTIPOLYGON (((-84.84877 4...
6 Arenac 127.8809 0.1413534 MULTIPOLYGON (((-83.7555 43...
In the server section below, you can see that I tried to use reactive to get the selected variable and when I write print(select) it does print the correct variable name, but when I try to put it into the colorNumeric() function it's clearly not being recognized. The map I get is all just the same shade of blue instead of different shades based on the value of the variable in that county.
ui <- fluidPage(
fluidRow(
selectInput(inputId="var",
label="Select variable",
choices=list("Average"="Average",
"Rate"="Rate"),
selected=1)
),
fluidRow(
leafletOutput("map")
)
)
server <- function(input, output, session) {
# Data sources
counties <- st_read("EITC_counties.shp") %>%
st_transform(crs="+init=epsg:4326")
counties_clean <- select(counties, NAME, X2020_Avg., X2020_Takeu)
counties_clean <- counties_clean %>%
rename("Average"="X2020_Avg.",
"Rate"="X2020_Takeu")
# Map
variable <- reactive({
input$var
})
output$map <- renderLeaflet({
select <- variable()
print(select)
pal <- colorNumeric(palette = "Blues", domain = counties_clean$select, na.color = "black")
color_pal <- counties_clean$select
leaflet()%>%
setView( -84.51, 44.18, zoom=5) %>%
addPolygons(data=counties_clean, layerId=~NAME,
weight = 1, smoothFactor=.5,
fillOpacity=.7,
fillColor=~pal(color_pal()),
highlightOptions = highlightOptions(color = "white",
weight = 2,
bringToFront = TRUE)) %>%
addProviderTiles(providers$CartoDB.Positron)
})
}
shinyApp(ui, server)
I've tried making the reaction into an event and also using the observe function using a leaflet proxy but it only produced errors. I also tried to skip the reactive definition and just put input$var directly into the palette (counties_clean$input$var), but it similarly did not show any color variation.
When I previously created a static map setting the palette using counties_clean$Average it came out correctly, but replacing Average with a user input is where I appear to be going wrong. Thanks in advance for any guidance you can provide and please let me know if I can share any additional clarification.
Unfortunately, your code is not reproducible without the data, but the mistake is most likely in this line
color_pal <- counties_clean$select
What this line does, is to extract a column named select from your data. This column is not existing, so it will return NULL.
What you want though, is to extract a column whose name is given by the content of select, so you want to try:
color_pal <- counties_clean[[select]]

decision trees using R, rpart, fragile families

So, I am utilizing the fragile families challenge for my dataset to see which individual and family level predictors predict adolescent academic performance (measured by GPA). Information about my dataset:
FFCWS is a longitudinal panel study in which baseline interviews were conducted in 1998-
2000 with both the mothers and the fathers. Follow-up interviews were conducted when children were aged 1, 3, 5, 9, and 15. Interviews with the parent, primary caregiver(s),
teachers, and children were conducted either in-home or via telephone (FFCWS, 2021). In the
15th year, children/adolescents are asked to report their grades in four subjects- history,
mathematics, English, and science. These grades are averaged for each student to measure their individual academic performance at age 15. A series of individual-level and family-level
predictors that are known to impact the academic performance as mentioned earlier, are also captured at different time points in the life of the child.
I am very new to machine learning and need some guidance. In order to do this, I first create a dataset that contains all the theoretically relevant variables. It is 4,898x15. My final datasets look like this (all are continuous except:
final <- ffc %>% select(Gender, PPVT, WJ10, Grit, Self-control, Attention, Externalization, Anxiety, Depression, PCG_Income, PCG_Education, Teen_Mom, PCG_Exp, School_connectedness, GPA)
Then, I split into test and train as follows:
final_split <- initial_split(final, prop = .7) final_train <- training(final_split) final_test <- testing(final_split)
Next, I run the models:
train <- rpart(GPA ~.,method = "anova", data = final_train, control=rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10)) test <- rpart(GPA ~.,method = "anova", data = final_test, control=rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
Next, I visualize cross validation results:
rpart.plot(train, type = 3, digits = 3, fallen.leaves = TRUE) rpart.plot(test, type = 3, digits = 3, fallen.leaves = TRUE)
Next, I run predictions:
pred_train <- predict(train, ffc.final1_train) pred_test <- predict(test, ffc.final1_test)
Next, I calculate accuracy:
MAE <- function(actual, predicted) {mean(abs(actual - predicted)) } MAE(train$GPA, pred_train) MAE(test$GPA, pred_test)
Following are my questions:
Now, I am not sure if I should use rpart or random forest or XG Boost so my first question is that how do I decide which algorithm to use. I decided upon rpart but I want to have a sound reasoning for the same.
Are these steps in the right order? What is the point of splitting my dataset into training and testing? I ultimately get two trees (one for train and the other for test). Which ones should I be using? What do I make out of these? A step-by-step procedure after understanding my dataset would be quite helpful. Thanks!

How to filter rows based on column value while running a loop in pandas dataframe

I have a dataset with 13 features and 1 label column with only two outcomes Income =< 50k or > 50k.
I am trying to see the distribution of values for each feature for the entire dataset vs the same feature but only with >50k cases to see how the distribution changes or not for that given subset.
if i do:
filtertable = table[table[column] == criteria]
that works well to get the subset
However when used inside a function:
def comparacion(tabla, columna, criterio):
completa = {}
criteria = {}
datos = tabla[tabla[columna] == criterio] #<- here is the problem
datos = tabla.drop(columna, axis=1)
titulos = datos.columns
for tit in titulos:
completa[tit] =
(tabla[tit].value_counts().astype(float))/len(tabla[tit])
criteria[tit] =
(datos[tit].value_counts().astype(float))/len(datos[tit])
return completa, criteria
For some reason the filtering does not work, any ideas what could it be the problem?

add column with row wise mean over selected columns using dplyr

I have a data frame which contains several variables which got measured at different time points (e.g., test1_tp1, test1_tp2, test1_tp3, test2_tp1, test2_tp2,...).
I am now trying to use dplyr to add a new column to a data frame that calculates the row wise mean over a selection of these columns (e.g., mean over all time points for test1).
I struggle even with the syntax for calculating the mean over explicitly named columns. What I tried without success was:
data %>% ... %>% mutate(test1_mean = mean(test1_tp1, test1_tp2, test1_tp3, na.rm = TRUE)
I would further like to use regex/wildcards to select the column names, so something like
data %>% ... %>% mutate(test1_mean = mean(matches("test1_.*"), na.rm = TRUE)
You can use starts_with inside select to find all columns starting with a certain string.
data %>%
mutate(test1 = select(., starts_with("test1_")) %>%
rowMeans(na.rm = TRUE))
Here's how you could do it in dplyr - I use the iris data as an example:
iris %>% mutate(sum.Sepal = rowSums(.[grep("^Sepal", names(.))]))
This computes rowwise sums of all columns that start with "Sepal". You can use rowMeans instead of rowSums the same way.
Not a dplyr solution, but you can try:
cols_2sum <- grepl('test1',colnames(data))
rowMeans(data[,cols_2sum])

How do you combine multiple boxplots from a List of data-frames?

This is a repost from the Statistics portion of the Stack Exchange. I had asked the question there, I was advised to ask this question here. So here it is.
I have a list of data-frames. Each data-frame has a similar structure. There is only one column in each data-frame that is numeric. Because of my data-requirements it is essential that each data-frame has different lengths. I want to create a boxplot of the numerical values, categorized over the attributes in another column. But the boxplot should include information from all the data-frames.
I hope it is a clear question. I will post sample data soon.
Sam,
I'm assuming this is a follow up to this question? Maybe your sample data will illustrate the nuances of your needs better (the "categorized over attributes in another column" part), but the same melting approach should work here.
library(ggplot2)
library(reshape2)
#Fake data
a <- data.frame(a = rnorm(10))
b <- data.frame(b = rnorm(100))
c <- data.frame(c = rnorm(1000))
#In a list
myList <- list(a,b,c)
#In a melting pot
df <- melt(myList)
#Separate boxplots for each data.frame
qplot(factor(variable), value, data = df, geom = "boxplot")
#All values plotted together as one boxplot
qplot(factor(1), value, data = df, geom = "boxplot")
a<-data.frame(c(1,2),c("x","y"))
b<-data.frame(c(3,4,5),c("a","b","c"))
boxplot(c(a[1],b[1]))
With the "1"'s i select the column i want out of the data-frame.
A data-frames can not have different column-lengths (has to have same number of rows for each column), but you can tell boxplot to plot multiple datasets in parallel.
Using the melt() function and base R boxplot:
#Fake data
a <- data.frame(a = rnorm(10))
b <- data.frame(b = rnorm(100))
c <- data.frame(c = rnorm(100) + 5)
#In a list
myList <- list(a,b,c)
#In a melting pot
df <- melt(myList)
# plot using base R boxplot function
boxplot(value ~ variable, data = df)