str_detect removing some but not all strings with specified ending - regex

I'd like to remove, within a pipe, any string that ends in either of two suffixes; in this example they are ".o" and ".t". Some of the strings get removed, but not all of them, and I can't figure out why. I suspect something is wrong in the 'pattern =' argument.
ex1 <- structure(list(variables = structure(1:18, .Label = c("canopy15",
    "canopy16", "DistanceToRoad", "DistanceToEdge", "EdgeDistance",
    "TrailDistance", "CARCOR.o", "EUOALA.o", "FAGGRA.o", "LINBEN.o",
    "MALSP..o", "PRUSER.o", "ROSMUL.o", "RUBPHO.o", "VIBDEN.o", "ACERUB.t",
    "FAGGRA.t", "NYSSYL.t"), class = "factor")), row.names = c(NA,
    -18L), class = "data.frame")
ex1 %>%
  dplyr::filter(stringr::str_detect(string = variables,
                                    pattern = c("\\.o$", "\\.t$"),
                                    negate = TRUE))
##output
# variables
# 1 canopy15
# 2 canopy16
# 3 DistanceToRoad
# 4 DistanceToEdge
# 5 EdgeDistance
# 6 TrailDistance
# 7 EUOALA.o
# 8 LINBEN.o
# 9 PRUSER.o
# 10 RUBPHO.o
# 11 FAGGRA.t

The pattern has multiple elements, so it is recycled across the rows: "\\.o$" is checked against one row, "\\.t$" against the next, and so on. Try this instead:
ex1 %>%
  dplyr::filter(stringr::str_detect(string = variables,
                                    pattern = "\\.(o|t)$",
                                    negate = TRUE))
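With a single pattern, every row is tested against the same regex, so all of the suffixed variables are dropped:
##output
# variables
# 1 canopy15
# 2 canopy16
# 3 DistanceToRoad
# 4 DistanceToEdge
# 5 EdgeDistance
# 6 TrailDistance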

For those not as well-versed in regular expressions, here is a simpler answer.
library(tidyverse)
ex1 %>% filter(str_detect(string = variables, pattern = "\\.t$", negate = TRUE),
               str_detect(string = variables, pattern = "\\.o$", negate = TRUE))

Related

How do I include a regular expression in a function in R?

Below I wrote a function which searches for specific regular expressions within a vector. The function always searches for the patterns "Beer" or "Wine". Now I would like to pass the regular expressions I am searching for (in my case "Beer" and "Wine") as additional arguments to the function. How can I do this?
x <- c("Beer","Wine","wine","Beer","Beef","Potato","Vacation")
Thirsty <- function(x) {
  Beer <- grepl("Beer", x, ignore.case = TRUE)
  Beer <- as.numeric(Beer == "TRUE")
  Wine <- grepl("Wine", x, ignore.case = TRUE)
  Wine <- as.numeric(Wine == "TRUE")
  Drink <- Beer + Wine
  Drink <- as.numeric(Drink == "0")
  Drink <- abs(Drink - 1)
}
y <- Thirsty(x)
y
This can be done with the following code:
x <- c("Beer","Wine","wine","Beer","Beef","Potato","Vacation")
drinks <- c("Beer","Wine")
Thirsty <- function(x, drinks) {
  Reduce("|", lapply(drinks, function(p) grepl(p, x, ignore.case = TRUE)))
}
y <- Thirsty(x,drinks)
y
lapply loops over the possibilities in drinks and produces a list of logical vectors, one for each drink. These are combined into a single vector by Reduce.
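Running the lapply step on its own shows the intermediate list that Reduce then ORs together elementwise:
lapply(drinks, function(p) grepl(p, x, ignore.case = TRUE))
# [[1]]   matches of "Beer"
# [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
#
# [[2]]   matches of "Wine"
# [1] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE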
I would simply try to concatenate the match patterns with |
strings = c("Beer","Wine","wine","Beer","Beef","Potato","Vacation")
thirstStrings = c("beer", "wine")
matchPattern = paste0(thirstStrings, collapse = "|") #"beer|wine"
grep(pattern = matchPattern, x = strings, ignore.case = T)
# [1] 1 2 3 4
You can easily wrap that in a function
Thirsty = function(x, matchStrings){
  matchPattern = paste0(matchStrings, collapse = "|") # "beer|wine"
  grep(pattern = matchPattern, x = x, ignore.case = T)
}
Thirsty(strings, thirstStrings) # [1] 1 2 3 4
This should also work
Thirsty = function(vec, ...) {
  pattern = paste0(unlist(list(...)), collapse = "|")
  stringr::str_detect(tolower(vec), pattern)
}
> Thirsty (x, "beer", "wine")
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE

R - extract all strings matching pattern and create relational table

I am looking for a shorter and prettier solution (possibly in the tidyverse) to the following problem. I have a data.frame "data":
id string
1 A 1.001 xxx 123.123
2 B 23,45 lorem ipsum
3 C donald trump
4 D ssss 134, 1,45
What I want to do is extract all numbers (no matter whether the delimiter is "." or ","; in this case I assume the string "134, 1,45" can be extracted into two numbers: 134 and 1.45) and create a data.frame "output" looking similar to this:
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
I managed to do this (code below), but the solution is pretty ugly and not very efficient (two for loops). Could someone suggest a better way to do this (preferably using dplyr)?
# data
data <- data.frame(id = c("A", "B", "C", "D"),
                   string = c("1.001 xxx 123.123",
                              "23,45 lorem ipsum",
                              "donald trump",
                              "ssss 134, 1,45"),
                   stringsAsFactors = FALSE)
# creating empty data.frame
len <- length(unlist(sapply(data$string, function(x) gregexpr("[0-9]+[,|.]?[0-9]*", x))))
output <- data.frame(id = rep(NA, len), string = rep(NA, len))
# main solution
start = 0
for (i in 1:dim(data)[1]) {
  tmp_len <- length(unlist(gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i])))
  for (j in (start + 1):(start + tmp_len)) {
    output[j, 1] <- data$id[i]
    output[j, 2] <- regmatches(data$string[i], gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i]))[[1]][j - start]
  }
  start = start + tmp_len
}
# further modifications
output$string <- gsub(",", ".", output$string)
output$string <- as.numeric(ifelse(substring(output$string, nchar(output$string), nchar(output$string)) == ".",
                                   substring(output$string, 1, nchar(output$string) - 1),
                                   output$string))
output
1) Base R This uses relatively simple regular expressions and no packages.
In the first 2 lines of code, replace any comma followed by a space with a space, and then replace all remaining commas with a dot. After these two lines s will be: c("1.001 xxx 123.123", "23.45 lorem ipsum", "donald trump", "ssss 134 1.45")
In the next 4 lines of code, trim whitespace from the beginning and end of each string field and split the string field on whitespace, producing a list. grep out those elements consisting only of digits and dots. (The regular expression ^[0-9.]*$ matches the start of a word followed by zero or more digits or dots followed by the end of the word, so only words containing only those characters are matched.) Replace any zero-length components with NA. Finally, add data$id as the names. After these 4 lines are run, the list L will be list(A = c("1.001", "123.123"), B = "23.45", C = NA, D = c("134", "1.45")).
In the last line of code, convert the list L to a data frame with the appropriate names.
s <- gsub(", ", " ", data$string)
s <- gsub(",", ".", s)
L <- strsplit(trimws(s), "\\s+")
L <- lapply(L, grep, pattern = "^[0-9.]*$", value = TRUE)
L <- ifelse(lengths(L), L, NA)
names(L) <- data$id
with(stack(L), data.frame(id = ind, string = values))
giving:
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
2) magrittr This variation of (1) writes it as a magrittr pipeline.
library(magrittr)
data %>%
  transform(string = gsub(", ", " ", string)) %>%
  transform(string = gsub(",", ".", string)) %>%
  transform(string = trimws(string)) %>%
  with(setNames(strsplit(string, "\\s+"), id)) %>%
  lapply(grep, pattern = "^[0-9.]*$", value = TRUE) %>%
  replace(lengths(.) == 0, NA) %>%
  stack() %>%
  with(data.frame(id = ind, string = values))
3) dplyr/tidyr This is an alternate pipeline solution using dplyr and tidyr. unnest converts to long form, id is made factor so that we can later use complete to recover id's that are removed by subsequent filtering, the filter removes junk rows and complete inserts NA rows for each id that would otherwise not appear.
library(dplyr)
library(tidyr)
data %>%
  mutate(string = gsub(", ", " ", string)) %>%
  mutate(string = gsub(",", ".", string)) %>%
  mutate(string = trimws(string)) %>%
  mutate(string = strsplit(string, "\\s+")) %>%
  unnest() %>%
  mutate(id = factor(id)) %>%
  filter(grepl("^[0-9.]*$", string)) %>%
  complete(id)
4) data.table
library(data.table)
DT <- as.data.table(data)
DT[, string := gsub(", ", " ", string)][,
string := gsub(",", ".", string)][,
string := trimws(string)][,
string := setNames(strsplit(string, "\\s+"), id)][,
list(string = list(grep("^[0-9.]*$", unlist(string), value = TRUE))), by = id][,
list(string = if (length(unlist(string))) unlist(string) else NA_character_), by = id]
DT
Update: Removed the assumption that junk words do not contain a digit or dot. Also added (2), (3) and (4) and made some improvements.
We can replace the "," in between the numbers with "." (using gsub), extract the numbers with str_extract_all (from stringr, into a list), replace the list elements that have length zero with NA, set the names of the list with the 'id' column, stack to convert the list to a data.frame, and rename the columns.
library(stringr)
lst <- str_extract_all(gsub("(?<=[0-9]),(?=[0-9])", ".", data$string, perl = TRUE),
                       "[0-9.]+")
lst <- lapply(lst, function(x) if (length(x) == 0) NA else as.numeric(x))
setNames(stack(setNames(lst, data$id))[2:1], c("id", "string"))
# id string
#1 A 1.001
#2 A 123.123
#3 B 23.45
#4 C NA
#5 D 134
#6 D 1.45
Same idea as Gabor's. I had hoped to use R's built-in parsing of strings (type.convert, used in read.table) rather than writing custom regex substitutions:
sp = setNames(strsplit(data$string, " "), data$id)
spc = lapply(sp, function(x) {
  x = x[grep("[^0-9.,]$", x, invert = TRUE)]
  if (!length(x))
    NA_real_
  else
    mapply(type.convert, x, dec = gsub("[^.,]", "", x), USE.NAMES = FALSE)
})
setNames(rev(stack(spc)), names(data))
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
Unfortunately, type.convert is not robust enough to consider both decimal delimiters at once, so we need this mapply malarkey instead of type.convert(x, dec = "[.,]").
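As a quick standalone illustration (not part of the original answer), type.convert handles one literal decimal marker per call, which is why dec is computed per word above:
type.convert("1,45", dec = ",", as.is = TRUE) # 1.45
type.convert("134", dec = ".", as.is = TRUE)  # 134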

R: Regress all variables that match certain pattern

Is there a way in R to add all variables that match a certain pattern into a regression? For example, I have a bunch of variables in my dataset that correspond to holidays and carry the prefix h_, and I have other variables with other prefixes such as a_.
Is there a way to do something like this:
lm(homicide ~ h_* + a_*, data = df)
To programmatically construct a formula, have a look at reformulate().
Here's an example that uses grep() to find all variables that begin with a "d" and then uses reformulate() to plug them in as the regressor variables on the RHS of a formula object.
vv <- grep("^d.*", names(mtcars), value=TRUE)
ff <- reformulate(termlabels=vv, response="mpg")
lm(ff, data=mtcars)
#
# Call:
# lm(formula = ff, data = mtcars)
#
# Coefficients:
# (Intercept) disp drat
# 21.84488 -0.03569 1.80203
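Applied to the question's prefixes, the same approach would look something like this sketch (assuming df really contains columns named with h_ and a_ prefixes):
vv <- grep("^(h_|a_)", names(df), value = TRUE)
lm(reformulate(termlabels = vv, response = "homicide"), data = df)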
A string can be turned into a formula.
data(iris)
fmla <- as.formula(paste("Species ~",
                         paste(grep("Width", names(iris), value = TRUE), collapse = " + ")))
glm(fmla, data = iris, family = binomial(link = "logit"))
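With iris, the constructed formula is:
fmla
# Species ~ Sepal.Width + Petal.Width
(Since Species has three levels, the binomial glm models the first level, setosa, against the other two.)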

How to include variable values in regular expressions in R

I have 5 files which contain metabolites (details of different bacteria models). I'm writing a function to append a specified number of files. File names look like the following.
[1] "01_iAPECO1_1312_metabolites.csv" "02_iB21_1397_metabolites.csv"
[3] "03_iBWG_1329_metabolites.csv" "04_ic_1306_metabolites.csv"
[5] "05_iE2348C_1286_metabolites.csv"
Below is my function.
start = 3 # defines the starting position of the range
end = 5   # defines the ending position of the range
type = "metabolites" # two types of files - for metabolites and reactions
files <- NULL
if (type == "metabolites") {
  files <- list.files(pattern = "metabolites\\.csv$")
} else if (type == "reactions") {
  files <- list.files(pattern = "reactions\\.csv$")
}
# reading each file within the range and appending them to create one file
for (i in start:end) {
  temp_df <- data.frame(ModelName = character(), Object = character(), stringsAsFactors = F)
  # reading the current file
  temp = rbind(one, temp_df)
}
# writing the appended file
write.csv(temp, "appended.csv", row.names = F, quote = F)
temp_df <- NULL
For example, if I specify start = 3 and end = 5, the code is supposed to read files 03, 04 and 05 and append them. Note: the two integers at the beginning of the file names are used to pick the files referenced by the range. I'm unable to select the required file within the for loop using a regular expression. When I hard-code the number, as below, it works, but I'm looking for a generalized version with i in it:
currentFile = grep("01.+",files)
Any help is appreciated.
For the test data shown below, this returns a vector containing the file names of the files that start with 02, 03, 04 or 05 and end with "reactions.csv".
# create some test files
for(i in 1:5) cat(file = sprintf("%02djunkreactions.csv", i))
# test input
start <- 2
end <- 5
type <- "reactions"
list.files(pattern = paste(sprintf("^%02d.*%s[.]csv$", start:end, type), collapse = "|"))
giving:
[1] "02junkreactions.csv" "03junkreactions.csv" "04junkreactions.csv"
[4] "05junkreactions.csv"
Note: If start and end are both always one digit then a simplification is possible:
list.files(pattern = sprintf("^0[%d-%d].*%s[.]csv$", start, end, type))
You can do this with a cross-join.
library(dplyr)
library(stringi)
start = 3
end = 5
type = "metabolites"
all_files = data_frame(file = list.files())
desired_files = data_frame(
  number = start:end,
  regex = sprintf("^%02.f.*%s", number, type))
all_files %>%
  merge(desired_files) %>%
  filter(stri_detect_regex(file, regex)) %>%
  group_by(number) %>%
  do(read.csv(.$file)) %>%
  write.csv("appended.csv", row.names = F, quote = F)
Are you looking for something like this?
files <- c("01_iAPECO1_1312_metabolites.csv", "02_iB21_1397_metabolites.csv","03_iBWG_1329_metabolites.csv", "04_ic_1306_metabolites.csv","05_iE2348C_1286_metabolites.csv")
for (i in 2:4) print(grep(sprintf("^(%02d){1}_", i), files, value = TRUE))
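Inside the question's loop, the same idea can select one file per iteration. This is only a sketch of the intended append (it assumes each csv can be read with read.csv defaults):
temp_df <- NULL
for (i in start:end) {
  current_file <- grep(sprintf("^%02d_", i), files, value = TRUE)
  temp_df <- rbind(temp_df, read.csv(current_file))
}
write.csv(temp_df, "appended.csv", row.names = FALSE, quote = FALSE)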

Removing duplicates from the data

I already loaded 20 csv files with:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
I combined all of those files into one:
library(plyr) # rbind.fill comes from the plyr package
all_data = do.call(rbind.fill, list_of_data)
The new table has a column called "Accession". After combining, many of the names (Accession) are repeated, and I would like to remove all of the duplicates. Another problem is that some of those "names" are ALMOST the same: the difference is that the name is followed by a dot and a number.
Let me show you how it looks:
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
<-- = same sample, different names. They should be treated as one, so just ignore the dot and the number after it.
I tried this:
library(stringr)
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))
Error in `$<-.data.frame`(`*tmp*`, "CleanedAccession", value = character(0)) :
You can use this command to both strip the suffix and drop the duplicates:
subset(transform(all_data, Accession = sub("\\..*", "", Accession)),
       !duplicated(Accession))
Accession
1 AT3G26450
2 AT5G44520
3 AT4G24770
4 AT2G37220
5 AT3G02520
6 AT5G05270
7 AT1G32060
8 AT3G52380
9 AT2G43910
10 AT2G19760
What about
df <- data.frame(Accession = c("AT3G26450.1",
                               "AT5G44520.2",
                               "AT4G24770.1",
                               "AT2G37220.2",
                               "AT3G02520.1",
                               "AT5G05270.1",
                               "AT1G32060.1",
                               "AT3G52380.1",
                               "AT2G43910.2",
                               "AT2G19760.1",
                               "AT3G26450.2"))
df[!duplicated(unlist(lapply(strsplit(as.character(df$Accession),
                                      ".", fixed = TRUE), "[", 1))), ]