Decision trees using R and rpart (Fragile Families data)

So, I am using the Fragile Families Challenge data to see which individual- and family-level predictors predict adolescent academic performance (measured by GPA). Some information about my dataset:
FFCWS is a longitudinal panel study in which baseline interviews were conducted in 1998-2000 with both the mothers and the fathers. Follow-up interviews were conducted when the children were aged 1, 3, 5, 9, and 15. Interviews with the parent, primary caregiver(s), teachers, and children were conducted either in-home or via telephone (FFCWS, 2021). In the year-15 wave, children/adolescents are asked to report their grades in four subjects: history, mathematics, English, and science. These grades are averaged for each student to measure their individual academic performance at age 15. A series of individual-level and family-level predictors that are known to affect academic performance, as mentioned earlier, are also captured at different time points in the child's life.
I am very new to machine learning and need some guidance. To do this, I first create a dataset that contains all the theoretically relevant variables; it is 4,898 x 15. My final dataset looks like this (all variables are continuous except:
final <- ffc %>%
  select(Gender, PPVT, WJ10, Grit, `Self-control`, Attention, Externalization,
         Anxiety, Depression, PCG_Income, PCG_Education, Teen_Mom, PCG_Exp,
         School_connectedness, GPA)
Then, I split into test and train as follows:
final_split <- initial_split(final, prop = .7)
final_train <- training(final_split)
final_test <- testing(final_split)
Next, I run the models:
train <- rpart(GPA ~ ., method = "anova", data = final_train,
               control = rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
test <- rpart(GPA ~ ., method = "anova", data = final_test,
              control = rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
Next, I visualize cross validation results:
rpart.plot(train, type = 3, digits = 3, fallen.leaves = TRUE)
rpart.plot(test, type = 3, digits = 3, fallen.leaves = TRUE)
Next, I run predictions:
pred_train <- predict(train, final_train)
pred_test <- predict(test, final_test)
Next, I calculate accuracy:
MAE <- function(actual, predicted) { mean(abs(actual - predicted)) }
MAE(final_train$GPA, pred_train)
MAE(final_test$GPA, pred_test)
Following are my questions:
Now, I am not sure whether I should use rpart, random forest, or XGBoost, so my first question is: how do I decide which algorithm to use? I decided on rpart, but I want to have sound reasoning for that choice.
Are these steps in the right order? What is the point of splitting my dataset into training and testing? I ultimately get two trees (one for train and the other for test). Which one should I be using? What do I make of these? A step-by-step procedure, given my dataset, would be quite helpful. Thanks!

Related

Multinomial Logit Fixed Effects: Stata and R

I am trying to run a multinomial logit with year fixed effects in mlogit in Stata (panel data: year-country), but I do not get standard errors for some of the models. When I run the same model using multinom in R I get both coefficients and standard errors.
I do not use Stata frequently, so I may be missing something or I may be running different models in Stata and R and should not be comparing them in the first place. What may be happening?
So a few details about the simple version of the model of interest:
I created a data example to show what the problem is:
Dependent variable (I will call it DV1) with 3 categories, -1, 0, 1 (unordered, with 0 as the reference)
Independent variables: 2 continuous variables, 3 binary variables, and the interaction of 2 of the 3 binary variables
Years: 1995-2003
Number of observations in the model: 900
In R I run the code and get coefficients and standard errors as below.
R version of code creating data and running the model:
## Fabricate example data
library(fabricatr)
data <- fabricate(
  N = 900,
  id = rep(1:900, 1),
  IV1 = draw_binary(0.5, N = N),
  IV2 = draw_binary(0.5, N = N),
  IV3 = draw_binary(0.5, N = N),
  IV4 = draw_normal_icc(mean = 3, N = N, clusters = id, ICC = 0.99),
  IV5 = draw_normal_icc(mean = 6, N = N, clusters = id, ICC = 0.99))
library(AlgDesign)
DV = gen.factorial(c(3), 1, center=TRUE, varNames=c("DV"))
year = gen.factorial(c(9), 1, center=TRUE, varNames=c("year"))
DV = do.call("rbind", replicate(300, DV, simplify = FALSE))
year = do.call("rbind", replicate(100, year, simplify = FALSE))
year[year==-4]= 1995
year[year==-3]= 1996
year[year==-2]= 1997
year[year==-1]= 1998
year[year==0]= 1999
year[year==1]= 2000
year[year==2]= 2001
year[year==3]= 2002
year[year==4]= 2003
data1=cbind(data, DV, year)
data1$DV1 = relevel(factor(data1$DV), ref = "0")
## Save data as csv file (to use in Stata)
library(foreign)
write.csv(data1, "datafile.csv", row.names=FALSE)
## Run multinom
library(nnet)
model1 <- multinom(DV1 ~ IV1 + IV2 + IV3 + IV4 + IV5 + IV1*IV2 + as.factor(year), data = data1)
Results from R
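The coefficients and standard errors shown there can be pulled straight out of the fitted multinom object (a small sketch; multinom itself does not report p-values, so the usual Wald z approximation is added here):
s <- summary(model1)
s$coefficients        # coefficient matrix (one row per non-reference outcome)
s$standard.errors     # matching standard errors
# Wald z statistics and two-sided p-values (normal approximation)
z <- s$coefficients / s$standard.errors
p <- 2 * pnorm(-abs(z))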
When I run the model using mlogit (without fixed effects) in Stata I get both coefficients and standard errors.
So I tried including year fixed effects in the model using Stata three different ways and none worked:
femlogit: "factor-variable and time-series operators not allowed" (depvar and indepvars may not contain factor variables or time-series operators)
mlogit: the fe option is not allowed; using i.year omits certain variables and/or does not give me standard errors and only shows coefficients (example in code below)
* Read file
import delimited using "datafile.csv", clear case(preserve)
* Run regression
mlogit DV1 IV1 IV2 IV3 IV4 IV5 IV1##IV2 i.year, base(0) iterate(1000)
Stata results
xtmlogit: does not run; error message: "total number of permutations is 2,389,461,218; this many permutations require a considerable amount of memory and can result in long run times; use option force to proceed anyway, or consider using option rsample()"
Fixed effects and non-linear models (such as logits) are an awkward combination. In a linear model you can simply add dummies/demean to get rid of a group-specific intercept, but in a non-linear model none of that works. I mean you could do it technically (which I think is what the R code is doing) but conceptually it is very unclear what that actually does.
Econometricians have spent a lot of time working on this, which has led to some work-arounds, usually referred to as conditional logit. IIRC this is what's implemented in femlogit. I think the mistake in your code is that you tried to include the fixed effects through a dummy specification (i.year). Instead, you should xtset your data and then run femlogit without the dummies.
xtset year
femlogit DV1 IV1 IV2 IV3 IV4 IV5 IV1##IV2
Note that these conditional logit models can be very slow. Personally, I'm more a fan of running two one-vs-all linear regressions (first with outcome 1 coded as 1 and 0/-1 coded as 0, then with outcome -1 coded as 1 and 0/1 coded as 0; see the sketch below). However, opinions are divided (Wooldridge appears to be a fan too, many others very much not so).
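For concreteness, a minimal sketch of that one-vs-all idea in R, using the simulated data1 from the question (this only illustrates the suggestion above, not the conditional-logit route; the year effects enter as plain dummies, which is unproblematic in a linear model, and the y_pos/y_neg names are made up here):
# outcome 1 vs {0, -1}, then outcome -1 vs {0, 1}
data1$y_pos <- as.numeric(data1$DV == 1)
data1$y_neg <- as.numeric(data1$DV == -1)
lpm_pos <- lm(y_pos ~ IV1 * IV2 + IV3 + IV4 + IV5 + factor(year), data = data1)
lpm_neg <- lm(y_neg ~ IV1 * IV2 + IV3 + IV4 + IV5 + factor(year), data = data1)
summary(lpm_pos)
summary(lpm_neg)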

Transforming/reshaping a dataset: ranking college football teams for the last 50 years

I need some help transforming my dataset. I would appreciate any help or feedback.
I have data of college football scores for the last 50 years. I currently have a data frame like in picture 1, and I need to get a data frame similar to picture 2. The data frame I'm trying to get needs to have a concatenated list of all teams played each year, and also two columns that keep track of wins and losses respectively. The concatenated list must be specific to each year. So basically a data frame like picture 2, but with data for every year separately.
This is the code to get a cleaned up dataframe similar to the one I have in picture one.
# Make generic data frame and get data
practice = data.frame('a'=character(), 'b'=character(), 'c'= numeric(), 'd'=character(), 'e'= numeric(), 'f'=character())
widths = c(10, 28, 5, 28, 3, 19)
years = 1960:2010
for (i in years){
football_page = paste('http://homepages.cae.wisc.edu/~dwilson/rsfc/history/howell/cf', i, 'gms.txt',sep = '')
get_data = read.fwf(football_page, widths)
practice = rbind(practice, get_data)
}
heading = list('DATE', 'AWAY TEAM', 'AWAY SCORE', 'HOME TEAM', 'HOME SCORE', 'LOCATION')
colnames(practice) = heading
# Fixing season dates
practice = cbind('SEASON'=numeric(nrow(practice)),practice)
fix_date = matrix(0, nrow = nrow(practice))
for (j in 1:nrow(fix_date)){
fix_date[j,1] = substr(practice[j,2],7,10)
}
fix_date = as.numeric(fix_date)
practice$SEASON = fix_date
for (j in 1:nrow(practice)){
if (grepl('01/.......', practice[j,2]))
practice[j,1] = practice[j,1]-1
}
#fix names
practice[,3]=gsub(' ','',practice[,3])
practice[,5]=gsub(' ','',practice[,5])
#drop the location and date columns
practice = practice[, -7]
practice = practice[, -2]
The data set is called practice.
I can't thoroughly test it without a sample of your data or something like it, but I think this would do the job.
# Create a function to get win and loss counts by season for a single team
teamsum <- function(teamname) {
  require(dplyr)
  df <- practice %>%
    # Reduce the data set to games involving a single team
    filter(`AWAY TEAM` == teamname | `HOME TEAM` == teamname) %>%
    # Create a 0/1 indicator for whether or not that team won each of those games. Note
    # that ties will get treated as losses here; you could change that with a more
    # complicated set of if/else statements
    mutate(team = teamname,
           win = ifelse((`AWAY TEAM` == teamname & `AWAY SCORE` > `HOME SCORE`) |
                        (`HOME TEAM` == teamname & `HOME SCORE` > `AWAY SCORE`), 1, 0)) %>%
    # Group the data by season for the summing to follow
    group_by(SEASON) %>%
    # Reduce the data to a table with counts of wins and losses by season
    summarise(wins = sum(win),
              losses = n() - sum(win)) %>%
    # Add the team name as an id column to that summary table. In dplyr piping, '.' is
    # the object created by the preceding step in the pipeline -- here, that summary
    # table of wins and losses.
    cbind(team = rep(teamname, nrow(.)), .)
  return(df)
}
# Apply that function to a vector of unique team names to make a list with
# tables of win & loss counts by season for each team in the original data.
# This version assumes that every team was the home team at least once.
teamlist <- lapply(unique(practice[,"HOME TEAM"]), teamsum)
# Combine the elements of that list into a single data frame (all = TRUE keeps every
# team's rows rather than doing an inner join). You could rbind, too.
df <- Reduce(function(...) merge(..., all = TRUE), teamlist)
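For what it's worth, a minimal alternative for that last step (assuming teamlist is the list created above) is to simply stack the per-team tables instead of merging them:
library(dplyr)
# stack all per-team summary tables into one data frame
df <- bind_rows(teamlist)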
Another dplyr answer
I used your code to get the dataset and then duplicated the team columns to use as keys for reshaping the dataset; you can probably use the same concept to achieve the objective in base R.
library(dplyr)
library(tidyr)
practice_2 <- practice %>%
  mutate(home = `HOME TEAM`,
         away = `AWAY TEAM`) %>%
  # transform dataset to long format with `tidyr::gather()`
  gather(LOC, TEAM, 6:7) %>%
  group_by(SEASON, TEAM) %>%
  mutate(won = ifelse(LOC == "home",
                      as.numeric(`HOME SCORE` > `AWAY SCORE`),
                      as.numeric(`AWAY SCORE` > `HOME SCORE`)),
         lost = ifelse(LOC == "home",
                       as.numeric(`HOME SCORE` <= `AWAY SCORE`),
                       as.numeric(`AWAY SCORE` <= `HOME SCORE`)),
         op = ifelse(LOC == "home", `AWAY TEAM`, `HOME TEAM`)) %>%
  summarise(WINS = sum(won, na.rm = TRUE),
            LOSSES = sum(lost, na.rm = TRUE),
            OPPONENTS = list(unique(op)))
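If the concatenated list is wanted as a single text field rather than a list-column, a small follow-up step on practice_2 would be (the name practice_3 is just for illustration):
# collapse each opponents list into one comma-separated string
practice_3 <- practice_2 %>%
  mutate(OPPONENTS = sapply(OPPONENTS, paste, collapse = ", "))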

PCA for dimensionality reduction before Random Forest

I am working on binary class random forest with approximately 4500 variables. Many of these variables are highly correlated and some of them are just quantiles of an original variable. I am not quite sure if it would be wise to apply PCA for dimensionality reduction. Would this increase the model performance?
I would like to be able to know which variables are more significant to my model, but if I use PCA, I would be only able to tell what PCs are more important.
Many thanks in advance.
My experience is that PCA before RF is not a great advantage, if any. Principal component regression (PCR) is, e.g., when PCA assists in regularizing the training features before OLS linear regression, and that is very much needed for sparse data sets. As RF itself already performs a good/fair regularization without assuming linearity, it is not necessarily an advantage. That said, I found myself writing a PCA-RF wrapper for R two weeks ago. The code includes a simulated data set of 100 features comprising only 5 true linear components. Under such circumstances it is in fact a small advantage to pre-filter with PCA.
The code is a seamless implementation, such that all RF parameters are simply passed on to randomForest. The loading vectors are saved in the model fit for use during prediction.
"I would like to be able to know which variables are more significant to my model, but if I use PCA, I would be only able to tell what PCs are more important."
The easy way is to run the model without PCA, obtain the variable importances, and expect to find something similar for the PCA-RF; a minimal sketch of this follows below.
The tedious way is to wrap the PCA-RF in a new bagging scheme with your own variable importance code. It could be done in 50-100 lines or so.
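A minimal sketch of the easy way (assuming a numeric predictor matrix and response such as the simulated train$x / train$y used further down; importance = TRUE is needed to get the permutation importances):
library(randomForest)
# plain RF without PCA, keeping variable importances
rf_plain <- randomForest(x = train$x, y = train$y, ntree = 500, importance = TRUE)
importance(rf_plain)   # importance table (%IncMSE and IncNodePurity for regression)
varImpPlot(rf_plain)   # quick visual ranking of the original variables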
The source-code suggestion for PCA-RF:
#wrap PCA around randomForest, forward any other arguments to randomForest
#define as new S3 model class
train_PCA_RF = function(x,y,ncomp=5,...) {
f.args=as.list(match.call()[-1])
pca_obj = princomp(x)
rf_obj = do.call(randomForest,c(alist(x=pca_obj$scores[,1:ncomp]),f.args[-1]))
out=mget(ls())
class(out) = "PCA_RF"
return(out)
}
#print method
print.PCA_RF = function(object) print(object$rf_obj)
#predict method
predict.PCA_RF = function(object,Xtest=NULL,...) {
print("predicting PCA_RF")
f.args=as.list(match.call()[-1])
if(is.null(f.args$Xtest)) stop("cannot predict without newdata parameter")
sXtest = predict(object$pca_obj,Xtest) #scale Xtest as Xtrain was scaled before
return(do.call(predict,c(alist(object = object$rf_obj, #class(x)="randomForest" invokes method predict.randomForest
newdata = sXtest), #newdata input, see help(predict.randomForest)
f.args[-1:-2]))) #any other parameters are passed to predict.randomForest
}
# simulate data to test the train and predict methods #
make.component.data = function(
inter.component.variance = .9,
n.real.components = 5,
nVar.per.component = 20,
nObs=600,
noise.factor=.2,
hidden.function = function(x) apply(x,1,mean),
plot_PCA =T
){
Sigma=matrix(inter.component.variance,
ncol=nVar.per.component,
nrow=nVar.per.component)
diag(Sigma) = 1
x = do.call(cbind,replicate(n = n.real.components,
expr = {mvrnorm(n=nObs,
mu=rep(0,nVar.per.component),
Sigma=Sigma)},
simplify = FALSE)
)
if(plot_PCA) plot(prcomp(x, center = TRUE, scale. = TRUE))
y = hidden.function(x)
ynoised = y + rnorm(nObs,sd=sd(y)) * noise.factor
out = list(x=x,y=ynoised)
pars = ls()[!ls() %in% c("x","y","Sigma")]
attr(out,"pars") = mget(pars) #attach all pars as attributes
return(out)
}
An example run:
#start script------------------------------
#source above from separate script
#test
library(MASS)
library(randomForest)
Data = make.component.data(nObs=600)#plots PC variance
train = list(x=Data$x[ 1:300,],y=Data$y[1:300])
test = list(x=Data$x[301:600,],y=Data$y[301:600])
rf = randomForest (train$x, train$y,ntree =50) #regular RF
rf2 = train_PCA_RF(train$x, train$y,ntree= 50,ncomp=12)
rf
rf2
pred_rf = predict(rf ,test$x)
pred_rf2 = predict(rf2,test$x)
cat("rf, R^2:",cor(test$y,pred_rf )^2,"PCA_RF, R^2", cor(test$y,pred_rf2)^2)
cor(test$y,predict(rf ,test$x))^2
cor(test$y,predict(rf2,test$x))^2
pairs(list(trueY = test$y,
native_rf = pred_rf,
PCA_RF = pred_rf2)
)
You can have a look here to get a better idea. The link says to use PCA for smaller datasets! Some of my colleagues have used random forests for the same purpose when working with genomes. They had ~30,000 variables and a large amount of RAM.
Another thing I found is that random forests use up a lot of memory and you have 4,500 variables, so maybe you could apply PCA to the individual trees.

Parsing text file into a Data Frame

I have a text file which has information, like so:
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: A1QA985ULVCQOB
review/profileName: Carleen M. Amadio "Lady Dragonfly"
review/helpfulness: 2/2
review/score: 5.0
review/time: 1314057600
review/summary: Fun for adults too!
review/text: I really enjoy these scissors for my inspiration books that I am making (like collage, but in books) and using these different textures these give is just wonderful, makes a great statement with the pictures and sayings. Want more, perfect for any need you have even for gifts as well. Pretty cool!
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: ALCX2ELNHLQA7
review/profileName: Barbara
review/helpfulness: 0/0
review/score: 5.0
review/time: 1328659200
review/summary: Making the cut!
review/text: Looked all over in art supply and other stores for "crazy cutting" scissors for my 4-year old grandson. These are exactly what I was looking for - fun, very well made, metal rather than plastic blades (so they actually do a good job of cutting paper), safe ("blunt") ends, etc. (These really are for age 4 and up, not younger.) Very high quality. Very pleased with the product.
I want to parse this into a data frame with productId, title, price, etc. as columns and the data as the rows. How can I do this in R?
A quick and dirty approach:
mytable <- read.table(text=mytxt, sep = ":")
mytable$id <- rep(1:2, each = 10)
res <- reshape(mytable, direction = "wide", timevar = "V1", idvar = "id")
There will be issues if there are other colons in the data. It also assumes that there is an equal number (10) of variables for each case.
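A somewhat more robust sketch that sidesteps the extra-colon problem by splitting each line only at the first colon (the file name reviews.txt and the blank-line-separated record layout are assumptions based on the sample above):
lines  <- readLines("reviews.txt")
lines  <- lines[nzchar(trimws(lines))]           # drop blank separator lines
keys   <- sub(":.*$", "", lines)                 # text before the first colon
values <- trimws(sub("^[^:]*:", "", lines))      # text after the first colon
records <- cumsum(keys == "product/productId")   # a new record starts at each productId
res <- reshape(data.frame(id = records, key = keys, value = values),
               direction = "wide", timevar = "key", idvar = "id")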

Identify subsequent event windows (or occurrences) for each individual

This question is in the context of twoway line with the by() option, but I think the bigger problem is how to identify a second (and all subsequent) event windows without a priori knowing every event window.
Below I generate some data with five countries over the 1990s and 2000s. In all countries an event occurs in 1995 and in Canada only the event repeats in 2005. I would like to plot outcome over the five years centered on each event in each country. If I do this using twoway line and by(), then Canada plots twice in the same plot window.
clear
set obs 100
generate year = 1990 + mod(_n, 20)
generate country = "United Kingdom" in 1/20
replace country = "United States" in 21/40
replace country = "Canada" in 41/60
replace country = "Australia" in 61/80
replace country = "New Zealand" in 81/100
generate event = (year == 1995) ///
| ((year == 2005) & (country == "Canada"))
generate time_to_event = 0 if (event == 1)
generate outcome = runiform()
encode country, generate(countryn)
xtset countryn year
forvalue i = 1/2 {
replace time_to_event = `i' if (l`i'.event == 1)
replace time_to_event = -`i' if (f`i'.event == 1)
}
twoway line outcome time_to_event, ///
by(country) name(orig, replace)
A manual solution adds an occurrence variable that numbers each event occurrence by country, then adds occurrence to the by() option.
generate occurrence = 1 if !missing(time_to_event)
replace occurrence = 2 if ///
(inrange(year, 2005 - 2, 2005 + 2) & (country == "Canada"))
twoway line outcome time_to_event, ///
by(country occurrence) name(attempt, replace)
This works great in the play data, but in my real data there are many more countries and many more events. I can manually code this occurrence variable, but that is tedious (and now I'm really curious if there's a tool or logic that works :) ).
Is there a logic to automate identifying windows? Or one that at least works with twoway line? Thanks!
You have generated a variable time_to_event which is -2 .. 2 in a window and missing otherwise. You can use tsspell from SSC, installed by
ssc inst tsspell
to label such windows. Windows are defined by spells or runs of observations all non-missing on that time_to_event:
tsspell, cond(time_to_event < .)
tsspell requires a prior tsset and generates three variables explained in its help. You can then renumber windows by using one of those variables, _seq (the sequence number within each spell, numbered from 1 up):
gen _spell2 = (_seq > 0) * sum(_seq == 1)
and then label spells distinctly by using country and the spell identifier for each spell from _spell, another variable produced by tsspell:
egen gspell = group(country _spell) if _spell2, label
My code assumes that windows are disjoint and cannot overlap, but that seems to be one of your assumptions too. Some techniques for handling spells are given in http://www.stata-journal.com/sjpdf.html?articlenum=dm0029. That article does not mention tsspell, which in essence is an implementation of its principles. I started explaining the principles, but the article got long enough before I could explain the program. As the help of tsspell is quite detailed, I doubt that a sequel paper is needed, or at least that it will be written.
(LATER) This code also assumes that windows don't touch. Solving that problem suggests a more direct approach not involving tsspell at all:
bysort country (year) : gen w_id = (time_to_event < .) * sum(time_to_event == -2)
egen w_label = group(country w_id) if w_id, label