R: Regress all variables that match certain pattern - regex

Is there a way in R to add all variables that match a certain pattern to a regression? For example, my dataset has a bunch of holiday variables with the prefix h_, and other variables with other prefixes such as a_.
Is there a way to do something like this:
lm(homicide ~ w_* + a_*, data= df)

To programmatically construct a formula, have a look at reformulate().
Here's an example that uses grep() to find all variables that begin with a "d" and then uses reformulate() to plug them in as the regressor variables on the RHS of a formula object.
vv <- grep("^d.*", names(mtcars), value=TRUE)
ff <- reformulate(termlabels=vv, response="mpg")
lm(ff, data=mtcars)
#
# Call:
# lm(formula = ff, data = mtcars)
#
# Coefficients:
# (Intercept) disp drat
# 21.84488 -0.03569 1.80203
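The same idea extends to several prefixes at once. The following is a minimal sketch on a made-up data frame: the column names (h_xmas, h_easter, a_temp, other) and the random data are assumptions for illustration only, standing in for the holiday and a_ variables from the question.

```r
# Toy data: column names and values are hypothetical
set.seed(1)
df <- data.frame(
  homicide = rnorm(20),
  h_xmas   = rnorm(20),
  h_easter = rnorm(20),
  a_temp   = rnorm(20),
  other    = rnorm(20)
)

# Select every column whose name starts with "h_" or "a_"
vv <- grep("^(h_|a_)", names(df), value = TRUE)

# Build homicide ~ h_xmas + h_easter + a_temp and fit the model
ff <- reformulate(termlabels = vv, response = "homicide")
fit <- lm(ff, data = df)
```

Because the regex is anchored with ^, a column such as other is left out even though it contains an "h" and an "e".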

A string can be turned into a formula.
data(iris)
fmla <- as.formula(paste("Species ~",
paste(grep("Width", names(iris), value = TRUE), collapse = " + ")))
glm(fmla, data = iris, family = binomial(link = "logit"))

Related

How can I extract model components from glmerMod objects if $ operator not defined for this S4 class

I'd like to extract the model call for a list of models. The list includes both glm and glmm.
It seems like this works for glm objects, but not for glmerMod objects
gm1 <- glm(cbind(incidence, size - incidence) ~ period,
data = cbpp, family = binomial)
gm1$call
>glm(formula = cbind(incidence, size - incidence) ~ period, family = binomial,
data = cbpp)
gm2 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)
gm2$call
>Error in gm2$call : $ operator not defined for this S4 class
I can see when I click on the glmerMod objects in the rstudio environment pane that the object does appear to have a call value associated with it.
Is there a way I could extract this information from both types of models using the same function? This would be my preference because I'm hoping to set it up as an lapply call to apply to a list of models.
models <- list(gm1, gm2)
calls <- lapply(models, function(x) x$call)
You don't need summary() at all.
getCall(lmm) and getCall(lm) both work. (If you dig into this you'll see that the default method for getCall is just getElement(., "call"); getElement is a utility function that looks for an element as either a list element ($...) or an S4 slot (@...).)
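To illustrate the point without depending on lme4, here is a minimal sketch that uses a made-up S4 class with a call slot as a stand-in for a glmerMod object. The class name ToyModel and the function name toy_fit are hypothetical; only getCall() and lapply() are the real tools being shown.

```r
library(methods)

# Hypothetical S4 class with a "call" slot, standing in for a merMod object
setClass("ToyModel", representation(call = "call"))
toy <- new("ToyModel", call = quote(toy_fit(y ~ x)))

# An ordinary S3 model, where the call is a list element
fit_lm <- lm(mpg ~ wt, data = mtcars)

# getCall() works for both, so one lapply() covers the whole list
models <- list(fit_lm, toy)
calls <- lapply(models, getCall)
```

With real models, the same lapply(models, getCall) line should apply to the list(gm1, gm2) from the question.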
You can get the call with summary(gm2)$call
I imagine something like this would work:
library(lme4)
head(iris)
lmm<-lmer(Petal.Length~Petal.Width+(1|Species),data=iris)
lm<-lm(Petal.Length~Petal.Width+Species,data=iris)
lm$call
summary(lmm)$call
# inherits() also copes with objects whose class vector has more than one element (e.g. glm)
getcall <- function(x) {
if (inherits(x, c("lmerMod", "glmerMod"))) summary(x)$call else x$call
}
getcall(lm)
## lm(formula = Petal.Length ~ Petal.Width + Species, data = iris)
getcall(lmm)
## lmer(formula = Petal.Length ~ Petal.Width + (1 | Species), data = iris)
Now add your lapply code:
models <- list(lm, lmm)
calls <- lapply(models, getcall)
calls
## [[1]]
## lm(formula = Petal.Length ~ Petal.Width + Species, data = iris)
## [[2]]
## lmer(formula = Petal.Length ~ Petal.Width + (1 | Species), data = iris)

gsub in columns value in dataframe

I have a file with multiple columns. I am showing the two columns I am interested in:
Probe.Set.ID Entrez.Gene
A01157cds_s_at 50682
A03913cds_s_at 29366
A04674cds_s_at 24860 /// 100909612
A07543cds_s_at 24867
A09811cds_s_at 25662
---- ----
A16585cds_s_at 25616
I need to replace /// with "\t" (tab), and the output should look like this:
A01157cds_s_at;50682
A03913cds_s_at;29366
A04674cds_s_at;24860 100909612
Also, I need to drop the rows with "----".
Here is a slightly different approach using dplyr:
data <- data.frame(Probe.Set.ID = c("A01157cds_s_at",
"A03913cds_s_at",
"A04674cds_s_at",
"A07543cds_s_at",
"A09811cds_s_at",
"----",
"A16585cds_s_at"),
Entrez.Gene = c("50682",
"29366",
"24860 /// 100909612",
"24867",
"25662",
"----",
"25616")
)
if(!require(dplyr)) install.packages("dplyr")
library(dplyr)
data %>%
filter(Entrez.Gene != "----") %>%
mutate(new_column = paste(Probe.Set.ID,
gsub("///", "\t", Entrez.Gene),
sep = ";"
)
) %>% select(new_column)
Looks like you will want to subset the data, then paste the two columns together, then use gsub to replace the '///'. Here is what I came up with, with dat being the data frame containing the two columns.
dat = dat[dat$Probe.Set.ID != "----",] # removes the rows with "---"
dat = paste0(dat$Probe.Set.ID, ";", dat$Entrez.Gene) # pastes the columns together and adds the ";"
dat = gsub("///","\t",dat) # replaces the "///" with a tab
Also, use cat() to view the tab as an actual tab rather than "\t". I got that from here: How to replace specific characters of a string with tab in R. This will output a character vector as opposed to a data.frame. You can convert back with data.frame(), but then you cannot use cat() to view it.
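Put together, the base R steps might look like the sketch below. It uses a three-row subset of the data from the question, and the pattern " */// *" is a small variation (an assumption about the desired output) that also swallows the spaces around the slashes so the tab is the only separator left.

```r
# Three-row subset of the question's data
dat <- data.frame(
  Probe.Set.ID = c("A01157cds_s_at", "A04674cds_s_at", "----"),
  Entrez.Gene  = c("50682", "24860 /// 100909612", "----"),
  stringsAsFactors = FALSE
)

# Drop the placeholder rows
keep <- dat$Probe.Set.ID != "----"

# Replace "///" (plus surrounding spaces) with a tab, then join with ";"
genes <- gsub(" */// *", "\t", dat$Entrez.Gene[keep])
out <- paste(dat$Probe.Set.ID[keep], genes, sep = ";")

# cat() prints the tab as whitespace; print() would show "\t"
cat(out, sep = "\n")
```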
We can use dplyr and tidyr here.
library(dplyr)
library(tidyr)
df <- data.frame(
  col1 = c('A01157cds_s_at', 'A03913cds_s_at', 'A04674cds_s_at', 'A07543cds_s_at', '----'),
  col2 = c('50682', '29366', '24860 /// 100909612', '24867', '----'))
# Assign the result; the pipeline does not modify df in place
res <- df %>% filter(col1 != '----') %>%
  separate(col2, c('col2_first', 'col2_second'), ' /// ', remove = TRUE, fill = 'right') %>%
  unite(col1_new, c(col1, col2_first), sep = ';', remove = TRUE)
res
## col1_new col2_second
## 1 A01157cds_s_at;50682 <NA>
## 2 A03913cds_s_at;29366 <NA>
## 3 A04674cds_s_at;24860 100909612
## 4 A07543cds_s_at;24867 <NA>
filter removes the observations with col1 == '----'.
separate splits col2 into two columns, namely col2_first and col2_second
unite concatenates col1 and col2_first with ; as separator.

Lookup table with subset/grepl in R

I'm analyzing a set of URLs and values extracted using a crawler. While I could extract substrings from the URL, I'd really rather not bother with the regex to do so. Is there a simple way to do a lookup-table-style replacement using subset/grepl without resorting to dplyr (i.e., doing a conditional mutate on the variables)?
My current process:
test <- data.frame(
url = c('google.com/testing/duck', 'google.com/evaluating/dog', 'google.com/analyzing/cat'),
content = c(1, 2, 3),
subdir = NA
)
test[grepl('testing', test$url), ]$subdir <- 'testing'
test[grepl('evaluating', test$url), ]$subdir <- 'evaluating'
test[grepl('analyzing', test$url), ]$subdir <- 'analyzing'
Obviously, this is a little clumsy and doesn't scale well. With dplyr, I'd be able to do something with conditionals like:
library(dplyr)
library(magrittr) # provides %<>%
test %<>% as_tibble() %>%
  mutate(subdir = ifelse(
    grepl('testing', url), # match against url; subdir is still NA here
    'test r',
    ifelse(
      grepl('evaluating', url),
      'eval r',
      ifelse(
        grepl('analyzing', url),
        'anal r',
        NA
      ))))
But, again, really goofy and I don't want to incur a package dependency if at all possible. Is there any way to do regex-based subsetting with some sort of lookup table?
Edit: Just a few clarifications:
For extracting subdirectories, yes, regex would be most efficient; however, I was hoping for a more general pattern that could match a dictionary-like struct of strings with other, arbitrary values.
Of course, nested ifelse is ugly and prone to error—just wanted to get a quick-and-dirty example with dplyr up.
Edit 2: Thought I'd loop back and post what I ended up with based upon BondedDust's approach. Decided to practice some mapping and non-standard eval while at it:
test <- data.frame(
url = c(
'google.com/testing/duck',
'google.com/testing/dog',
'google.com/testing/cat',
'google.com/evaluating/duck',
'google.com/evaluating/dog',
'google.com/evaluating/cat',
'google.com/analyzing/duck',
'google.com/analyzing/dog',
'google.com/analyzing/cat',
'banana'
),
content = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
subdir = NA
)
# List used for key/value lookup, names can be regex
lookup <- c(
"testing" = "Testing is important",
"Eval.*" = 'eval in R',
"analy(z|s)ing" = 'R is fun'
)
# Dumb test for error handling:
# lookup <- c('test', 'hey')
# Defining new lookup function
regexLookup <- function(data, dict, searchColumn, targetColumn, ignore.case = TRUE){
# Basic check—need to separate errors/handling
if(is.null(names(dict)) || is.null(dict[[1]])) {
stop("Not a valid replacement value; use a key/value store for `dict`.")
}
# Non-standard eval for the column names; not sure if I should
# add safety/type checks for these
searchColumn <- eval(substitute(searchColumn), data)
targetColumn <- deparse(substitute(targetColumn))
# Define find-and-replace utility
findAndReplace <- function (key, val){
data[grepl(key, searchColumn, ignore.case = ignore.case), targetColumn] <- val
data <<- data
}
# Map over the key/value store
mapply(findAndReplace, names(dict), dict)
# Return result, with non-matching rows preserved
return(data)
}
regexLookup(test, lookup, url, subdir, ignore.case = FALSE)
for (target in c('testing','evaluating','analyzing') ) {
test[grepl(target, test$url),'subdir' ] <- target }
test
url content subdir
1 google.com/testing/duck 1 testing
2 google.com/evaluating/dog 2 evaluating
3 google.com/analyzing/cat 3 analyzing
The vector of targets could have instead been the name of a vector that is in the workspace.
targets <- c('testing','evaluating','analyzing')
for( target in targets ) { ...}
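For the dictionary-style lookup the question asks about, the same loop can run over a named vector, with the names acting as regex keys and the values as replacements. This is a minimal sketch in base R; the replacement labels are taken from the question's dplyr example, and the NA_character_ initialisation is an assumption made so the column has a character type from the start.

```r
test <- data.frame(
  url = c('google.com/testing/duck', 'google.com/evaluating/dog', 'google.com/analyzing/cat'),
  content = c(1, 2, 3),
  subdir = NA_character_,
  stringsAsFactors = FALSE
)

# Names are regex patterns, values are what to write into subdir
lookup <- c(testing = 'test r', evaluating = 'eval r', analyzing = 'anal r')

for (pattern in names(lookup)) {
  test$subdir[grepl(pattern, test$url)] <- lookup[[pattern]]
}
```

Rows that match none of the patterns keep their NA, and if a URL matches several patterns the last one in lookup wins.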
Try this:
test$subdir<-gsub('.*\\/(.*)\\/.*','\\1',test$url)

Subset all 3 digit numbers and collapse them with a separator in a data frame. R

I'm formatting a data set so each entry has the adegenet format for codominant markers, such as:
Loci1
###/###
208/210
200/204
198/208
where # represents any digit (the number is an allele size in base pairs). My data has some homozygous entries (all 3-digit integers with no separator) that have the form:
Loci1
###
208
198
I intend to paste the 3 digit string to itself with sep='/' to produce the first format. I've tried to use grep to subset these homozygous entries by finding all non ###/### and negating the match using the table matching such as:
a <- grep('\\b\\d{3}?[/]\\d{3}', score$Loci1, value =T ) # Subset all ###/###/
score[!(a %in% 1:nrow(score$Loci1)), ] # works but only on vectors...
After the subset I could paste. The problem arises when I apply this to a data frame. grep seems to treat the data frame as a list (which in part it is) and returns columns that have a match.
So in short, how can I go from ### to ###/### in a data frame?
Self-contained example of the data:
score2 <- NULL
set.seed(9)
Loci1 <- NULL
Loci2 <- NULL
Loci3 <- NULL
for (i in 1:5) Loci1 <- append(Loci1, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci2 <- append(Loci2, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci3 <- append(Loci3, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
score2 <- data.frame(Loci1, Loci2, Loci3, stringsAsFactors = F)
# Keep only the first allele to create homozygous (###) entries
score2[2,3] <- strsplit(score2[2,3], split = '/')[[1]][1]
score2[5,2] <- strsplit(score2[3,3], split = '/')[[1]][1]
score2[1,1] <- strsplit(score2[1,1], split = '/')[[1]][1]
score2[c(1, 4),c(2,3)] <- NA
score2
You could just replace the 3 digit items with the separator and a copy:
sub("^(...)$", "\\1/\\1", Loci1)
Use lapply with an anonymous function:
data.frame( lapply(score2, function(x) sub("^(...)$", "\\1/\\1", x) ) )
Loci1 Loci2 Loci3
1 251/251 <NA> <NA>
2 251/329 320/257 260/260
3 275/242 278/329 281/320
4 269/266 <NA> <NA>
5 296/326 281/281 326/314
(Not sure what the "paste-part" was supposed to refer to, but I think this was the intent of your question)
If the numeric values could have a varying number of digits then use a pattern argument like "^([0-9]{1,9})$"
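As a small illustration of the difference between the patterns, the sketch below runs a digit-only version on a made-up vector (the values are hypothetical). Note that "..." matches any three characters, while "[0-9]{3}" only matches three digits.

```r
# Hypothetical allele strings, including a missing value
x <- c("208", "200/204", NA, "198", "abc")

# Duplicate only entries that consist of exactly three digits
out <- sub("^([0-9]{3})$", "\\1/\\1", x)
```

Here "abc" is left alone, whereas the "^(...)$" pattern would turn it into "abc/abc"; NA stays NA in both cases.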
An option using grep/paste,
m1 <- as.matrix(score2)
indx <- grep('^...$', m1)
m1[indx] <- paste(m1[indx], m1[indx], sep="/")
as.data.frame(m1)
# Loci1 Loci2 Loci3
#1 251/251 <NA> <NA>
#2 251/329 320/257 260/260
#3 275/242 278/329 281/320
#4 269/266 <NA> <NA>
#5 296/326 281/281 326/314
Or without converting to matrix, this can be done using lapply
score2[] <- lapply(score2, function(x) ifelse(grepl('^...$', x),
paste(x, x, sep="/"),x))

read.dta convert.dates not working?

I have a Stata dataset, call it dataset.dta. I want to read it in R. I am using the package foreign. Problem is it fails to parse/convert Stata dates to R dates.
It goes something like this:
df <- read.dta( 'dataset.dta', convert.dates = TRUE )
# Check attributes
attr( df, "formats")
"%9s" "%8.0g" "%12.0g" "%12.0g" "%9.0g" "%21s" "%31s" "%td" "%td"
# Last two columns are dates i.e. %td
str( df )
... # Only showing last two columns
$ start_sample: num 15494 14246 14246 14670 14245 ...
$ end_sample : num 18262 18262 18262 18262 18262 ...
I was expecting Date class for these, instead of num. When I look into the source code of read.dta I find this.
if (convert.dates) {
ff <- attr(rval, "formats")
dates <- grep("%-*d", ff)
base <- structure(-3653, class = "Date")
for (v in dates) rval[[v]] <- base + rval[[v]]
}
Changing the third line here to dates <- grep( "%*d", ff) seems to take care of the issue. I changed the regex. I'm using Stata version 13.0.
Am I missing something? Is this just a bug, or am I doing something woefully wrong here?
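To see what each pattern actually matches, it can help to run them against a vector of format strings. In the sketch below, ff is a made-up vector modeled on the attributes shown above; the "%-td" and "not_a_date" entries are artificial additions for illustration, and the stricter pattern "^%-?t?d" is a hypothetical alternative, not what the foreign package uses.

```r
# Hypothetical format strings; the last entry is deliberately not a format
ff <- c("%9s", "%8.0g", "%12.0g", "%21s", "%td", "%-td", "%d", "not_a_date")

# Pattern from the quoted read.dta source: "%", any number of "-", then "d"
orig_hits <- grep("%-*d", ff)

# Pattern from the question: zero or more "%", then "d",
# which matches any string that contains a "d"
loose_hits <- grep("%*d", ff)

# Assumed stricter alternative: optional "-" and optional "t" between "%" and "d"
strict_hits <- grep("^%-?t?d", ff)
```

The original pattern only finds the plain "%d" entry, the loose one also picks up the non-format string, and the stricter one finds the three date formats only.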
Two quick fixes/hacks. The first is
#### Convert to dates ####
datelookup <- format(seq(as.Date("1960-01-01"), as.Date("2015-12-31"), by = "1 day"))
df$start_sample_dates <- datelookup[ df$start_sample + 1]
df$end_sample_dates <- datelookup[ df$end_sample + 1]
Stata uses 01/01/1960 as the base. The second is
#### Stealing from foreign package ####
ff <- attr(df, "formats")
dates <- grep("%*d", ff)
base <- structure(-3653, class = "Date")
for (v in dates) df[[v]] <- base + df[[v]]
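Why structure(-3653, class = "Date")? R stores dates as days since 1970-01-01, and Stata's base date of 1960-01-01 is 3653 days earlier. See the comment by Dimitriy V. Masterov above. This issue could be specific to Stata version 13.0; see the comment by dickoa above. Thanks for your help.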
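As an alternative sketch to adding a hand-built base date, as.Date() accepts an origin argument, so the Stata day counts can be converted directly. The values below are taken from the str() output in the question; the variable names are illustrative.

```r
# Day counts as read from the .dta file (days since 1960-01-01)
stata_days <- c(15494, 14246, 18262)

# Convert by telling as.Date() where Stata's day 0 is
r_dates <- as.Date(stata_days, origin = "1960-01-01")

# The base date used inside read.dta gives the same result
base <- structure(-3653, class = "Date")
same <- identical(as.numeric(base + stata_days), as.numeric(r_dates))
```

Both routes give identical dates; for example, 18262 corresponds to 2009-12-31.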